This paper extends the SVM framework for the ordinal regression (classification of ordered classes). The algorithm is easy to understand after the problem is formulated. The key contribution of the paper is showing that the ordinal regression problem can be transformed into regular SVR by using the difference of the feature vectors as the new feature vector. Also the error bound of the algorithm is presented.

I did not know much about SVM and hope I can find some useful online materials to learn it by myself in the future....

## Saturday, June 7, 2008

## Friday, May 30, 2008

### [Reading] The pagerank citation ranking: Bringing order to the web

This is a very famous paper. Almost all people who searched for the answer of "how Google works" will reach this paper. The basic idea is unbelievably simple, just a simple application of the random walk theory. Given an initial positions of many walkers on the web, the final distribution of these walkers is the importance of the web pages. Thanks to the AMMAI course, I can easily think that this is a linear algebra problem.

Because the number of the webpages is very large and we only want one eigenvector, an iterative method is used in the paper. An interesting thing is that the iteration step is not the gradient of the cost function but the authors claim that the convergence is faster. Another issue is that the importance of webpages is l1-normalized. However, I don't really know why this is necessary.

It's interesting to think, how pagerank works in the environment of Web 2.0 now (of course Google must handle this already...). The webpages are now very dynamic. The link structures change very fast and it's almost impossible to build the full linear system (there are billions of webpages now, and each one can easily link to hundred of webpages). Actually maybe Google do not work so well now...at least recently I feel depressed at searching on Google :P

Because the number of the webpages is very large and we only want one eigenvector, an iterative method is used in the paper. An interesting thing is that the iteration step is not the gradient of the cost function but the authors claim that the convergence is faster. Another issue is that the importance of webpages is l1-normalized. However, I don't really know why this is necessary.

It's interesting to think, how pagerank works in the environment of Web 2.0 now (of course Google must handle this already...). The webpages are now very dynamic. The link structures change very fast and it's almost impossible to build the full linear system (there are billions of webpages now, and each one can easily link to hundred of webpages). Actually maybe Google do not work so well now...at least recently I feel depressed at searching on Google :P

## Sunday, May 25, 2008

### [Reading] Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

This paper presents a new method for mining frequent patterns. The main technique is a new data structure, FP tree, for the compact representation of the database. A FP tree can compactly represent all frequent patterns (no information loss) and enable fast query (by traversing the tree and using some range pointers).

There are always some detailed examples in describing the data structure and mining methods, usually appear before the proofs of the lemmas. This makes the paper very easy to understand. Also all pseudo codes are provides and thus it should be simple to implement the algorithm. Even the FP tree always greatly compresses the database. the authors still provide a projection method to reduce the required memory.

Because this method makes no approximation to the problem, the results are identical to the ground truth, and obtained much faster than the previous methods. In summary, this is a very exciting work !

There are always some detailed examples in describing the data structure and mining methods, usually appear before the proofs of the lemmas. This makes the paper very easy to understand. Also all pseudo codes are provides and thus it should be simple to implement the algorithm. Even the FP tree always greatly compresses the database. the authors still provide a projection method to reduce the required memory.

Because this method makes no approximation to the problem, the results are identical to the ground truth, and obtained much faster than the previous methods. In summary, this is a very exciting work !

### [Reading] Mining Association Rules between Sets of Items in Large Databases

This paper presents a system to find the association rules from a large number of transaction records. At the time of publication, the problem and the solution of this field might be very twilit and this should be a very pioneering work.

After properly formulating the problem, the problem can be decomposed into 2 sub-problems, and the trivial solutions become obvious. However, the straightforward algorithm is too expensive, and thus the authors provide many acceleration methods. I personally like the analysis at the memory analysis part.

The main difficulty in reading this paper is that the writing style is very old (or at least very different to the papers in my field). Also the results are obtained from a dataset without ground truth.

After properly formulating the problem, the problem can be decomposed into 2 sub-problems, and the trivial solutions become obvious. However, the straightforward algorithm is too expensive, and thus the authors provide many acceleration methods. I personally like the analysis at the memory analysis part.

The main difficulty in reading this paper is that the writing style is very old (or at least very different to the papers in my field). Also the results are obtained from a dataset without ground truth.

## Saturday, May 17, 2008

### [Reading] Normalized Cuts and Image Segmentation

I read this paper long time ago...and I skipped main derivation at that time :P. The key contribution of this paper is, the authors use an specified energy function to measure the quality of the segment, and then show that this energy minimization problem can be solved by the generalized eigensystem, which can be linked to the spectral graph theory. In my opinion, the forward way (normalized cut --> eigensystem) is much harder to derive than the backward way. If we think in the backward way, we may say that this is another way to interpret the spectral clustering.

Hope the presenters next week can give us a more insightful view to the spectral graph theory...I was not good at linear algebra.

Hope the presenters next week can give us a more insightful view to the spectral graph theory...I was not good at linear algebra.

## Tuesday, May 13, 2008

### [Reading] On Spectral Clustering: Analysis and An Algorithm

This paper presents a simple algorithm of clustering, based on the spectral theory. Compare with other spectral clustering methods (such as normalized cut), the algorithm is much easier to understand and implement. The key idea is that, unlike previous methods, many eigenvectors are used instead of only one.

For the trivial case, it's easy to understand that the eigenspace perfectly represents the clustering result. However, for the general case, the analysis is difficult to understand (at least to me). The authors make 4 assumptions, and also show those assumptions are reasonable in the general cases. Under these assumptions, the performance of the algorithm is guaranteed.

In summary, I seldom like the papers from NIPS....too hard for me...

For the trivial case, it's easy to understand that the eigenspace perfectly represents the clustering result. However, for the general case, the analysis is difficult to understand (at least to me). The authors make 4 assumptions, and also show those assumptions are reasonable in the general cases. Under these assumptions, the performance of the algorithm is guaranteed.

In summary, I seldom like the papers from NIPS....too hard for me...

## Tuesday, April 29, 2008

### [Reading] An Introduction to Graphical Models

This is a very good introduction paper for the graphical model. It elegantly explains why and when we should use the graphical model, what operations are usually used in the graphical model (so you may map your problem into one of them), and what tools are available to run these operations efficiently.

One main reason to use the graphical model is efficient computing. By explicitly expressing the dependency between the variables, the number of summation in inference and decision making can be greatly reduced. Also the approximate inference algorithms can be better understood. The message passing and graph cut methods all direct operate on the nodes and edges of the graphical model.

The paper also mentioned a topic that I didn't enter before, learning the structure of the graphical model. However, I personally think this is an aesthetic problem instead of a computational problem, since the search space is almost infinite and the goal (e.g., suitable for inference or decision making) is hard to formulate as an objective function. Even a rough graphical model is estimated, manual effort is still required to beautify its structure.

One main reason to use the graphical model is efficient computing. By explicitly expressing the dependency between the variables, the number of summation in inference and decision making can be greatly reduced. Also the approximate inference algorithms can be better understood. The message passing and graph cut methods all direct operate on the nodes and edges of the graphical model.

The paper also mentioned a topic that I didn't enter before, learning the structure of the graphical model. However, I personally think this is an aesthetic problem instead of a computational problem, since the search space is almost infinite and the goal (e.g., suitable for inference or decision making) is hard to formulate as an objective function. Even a rough graphical model is estimated, manual effort is still required to beautify its structure.

## Friday, April 25, 2008

### [Reading] Combining Labeled and Unlabeled Data with Co-Training

This paper describes a method to utilize the unlabeled data for training. The basic idea is to split the features into two separate sets, and learn two individual classifiers, and use these classifiers to expend the training data to learn new classifiers.

However, the paper is a little difficult to understand. Their method implies that the co-occurrence in-between or within the feature sets may possibly be found by the new labeled data. However, maybe these co-occurrence can be found by unsupervised clustering (like pLSA). It always involves the parameter setting. Picking too many unlabeled data or reducing too many dimensions can both degrade the performance.

However, the paper is a little difficult to understand. Their method implies that the co-occurrence in-between or within the feature sets may possibly be found by the new labeled data. However, maybe these co-occurrence can be found by unsupervised clustering (like pLSA). It always involves the parameter setting. Picking too many unlabeled data or reducing too many dimensions can both degrade the performance.

### [Reading] Transductive Inference for Text Classification using Support Vector Machines

This paper presents a method to expend the size of the training set for SVM. The key idea is to optimize the hyperplane and the un-seen labels together. In this way, the co-occurrence information missed in the original training data might be explored automatically.

The idea is very simple, and the algorithm for the optimization is easy to implementation (small modification to the original SVM). An interesting thing is that, if we apply pLSA to cluster the topics of all documents first, then the co-occurrence information can be found to some extent. In this case, does TSVM still outperform SVM? Or, maybe the co-occurrence is one thing found by TSVM that the author can think of, but there are other things that are hard to describe are also found by TSVM.

The idea is very simple, and the algorithm for the optimization is easy to implementation (small modification to the original SVM). An interesting thing is that, if we apply pLSA to cluster the topics of all documents first, then the co-occurrence information can be found to some extent. In this case, does TSVM still outperform SVM? Or, maybe the co-occurrence is one thing found by TSVM that the author can think of, but there are other things that are hard to describe are also found by TSVM.

## Friday, April 18, 2008

### [Reading] Improved boosting algorithms using confidence-rated predictions

This paper completes the Adaboost algorithm. In this original Adaboost paper, only the basic concept, combining several weak classifiers into a strong classifier. In this paper, the authors present a more complete story of this technique. It shows that several parameters in Adaboost can be optimized to improve the overall performance. Also it shows several extensions of Adaboost to handle non-binary classification.

However, some of these extensions looks very natural and necessary. In my previous experiments, if the alpha is not optimized, the result of Adaboost can be very bad. Also, this paper is not the best one for the beginner like me...I find some on-line documents are better.

However, some of these extensions looks very natural and necessary. In my previous experiments, if the alpha is not optimized, the result of Adaboost can be very bad. Also, this paper is not the best one for the beginner like me...I find some on-line documents are better.

## Thursday, April 17, 2008

### [Reading] Rapid object detection using a boosted cascade of simple features

This paper describes the most well-known face detection algorithm. Before reading this paper, I thought Adaboost is equivalent to the cascade detection. The cascade detection is an important contribution of this paper.

Because the Adaboost can fuse several weak classifiers, the authors found that using simple Haar-like features is enough for the face detection, and Haar-like features can be efficiently evaluated by integral image (in fact, simple sum-area table in texture mapping). Also, the cascade detection can skip many regions to reduce the computation.

These contributions combine a very powerful system. In fact, many face detection modules in the digital cameras are based on these techniques.

Because the Adaboost can fuse several weak classifiers, the authors found that using simple Haar-like features is enough for the face detection, and Haar-like features can be efficiently evaluated by integral image (in fact, simple sum-area table in texture mapping). Also, the cascade detection can skip many regions to reduce the computation.

These contributions combine a very powerful system. In fact, many face detection modules in the digital cameras are based on these techniques.

## Sunday, March 30, 2008

### [Reading] Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary

This paper presents a new concept to object recognition: model it as a machine translation problem.

Given a training dataset, an EM-alike algorithm is used to learn the correspondence between the "words" between the image (an English document) and the annotation (an French document). The paper assume the training data is perfect: there should be no noise within. However, some thresholding does improve the performance.

In EM, Lagrangian is used to make sure the variables are sum-to-one. However, maybe a simple re-projection is enough. And in fact, re-projection is indeed used in the algorithm (Fig. 4)....can't see the Lagrangian multipliers in the code @@.

In summary, the idea is very nice, but this is a paper more like introducing a concept instead of building a practical system. There are many interesting things to do.

Given a training dataset, an EM-alike algorithm is used to learn the correspondence between the "words" between the image (an English document) and the annotation (an French document). The paper assume the training data is perfect: there should be no noise within. However, some thresholding does improve the performance.

In EM, Lagrangian is used to make sure the variables are sum-to-one. However, maybe a simple re-projection is enough. And in fact, re-projection is indeed used in the algorithm (Fig. 4)....can't see the Lagrangian multipliers in the code @@.

In summary, the idea is very nice, but this is a paper more like introducing a concept instead of building a practical system. There are many interesting things to do.

## Friday, March 28, 2008

### [Reading] Names and Faces in the News Abstract

This paper present a system to put the name and face in the news together. The system uses several techniques: face detection and rectification, kPCA and LDA for dimension reduction, Nystrom approximation for efficient kPCA, K-mean clustering, pruning, and merging. One pity thing is that there is no much room to discuss the alternates of these techniques. The result shown in Fig 3 is very interesting and brings many possible research topics.

I'm very interested in the kPCA method and Nystrom approximation and I'm reading the origin papers (I think I should allocate a period to learn the matrix calculus more deeply).

How to prune the clustering is very important to my current research project. However, the method in this paper is not suitable to my problem :(. A special part is that the pruning is guided by hand labeling...what if the dataset is too large to be labeled? Also the running time is not reported in the paper.

I'm very interested in the kPCA method and Nystrom approximation and I'm reading the origin papers (I think I should allocate a period to learn the matrix calculus more deeply).

How to prune the clustering is very important to my current research project. However, the method in this paper is not suitable to my problem :(. A special part is that the pruning is guided by hand labeling...what if the dataset is too large to be labeled? Also the running time is not reported in the paper.

## Thursday, March 27, 2008

### [Reading] Algorithms for Fast Vector Quantization

This paper presents three different variant of the k-d trees to improved the search performance. The first one introduces a technique called incremental distance to reduce the calculation. In traversing the tree, the distance between the query point and the childs can be easily updated from the distance the query point and the current node (O(1) instead of O(D), where D is the dimension). I found this is very useful. It does not degrade the performance but greatly reduces the computation.

The second method uses a priority queue to store the distances to the all traversed nodes. After a leaf is checked, the algorithm always picks the node with smallest distance from the queue to traverse. The interesting thing is that, this idea is used in the SIFT system but the author did not method this reference.

The third method, although gives good performance, it requires a O(N^2) preprocessing to build the local neighborhood graph. Therefore it may not be suitable for large systems.

Both the first two methods are very exciting, and I'm considering to port the first one into my program. Also in the experiment section, a well-known early termination technique is used. In calculating the distance, we may quit at the i-th dimension if the accumulated distance is larger than the currently optimal one. However, early termination may not be the best choice in the current CPU pipeline since there is a branch is inside the for loop. If we want to optimize it, manual unfolding may be necessary. For example, calculate the first K dimensions at once, do the check, and then move to the next K dimensions.

The second method uses a priority queue to store the distances to the all traversed nodes. After a leaf is checked, the algorithm always picks the node with smallest distance from the queue to traverse. The interesting thing is that, this idea is used in the SIFT system but the author did not method this reference.

The third method, although gives good performance, it requires a O(N^2) preprocessing to build the local neighborhood graph. Therefore it may not be suitable for large systems.

Both the first two methods are very exciting, and I'm considering to port the first one into my program. Also in the experiment section, a well-known early termination technique is used. In calculating the distance, we may quit at the i-th dimension if the accumulated distance is larger than the currently optimal one. However, early termination may not be the best choice in the current CPU pipeline since there is a branch is inside the for loop. If we want to optimize it, manual unfolding may be necessary. For example, calculate the first K dimensions at once, do the check, and then move to the next K dimensions.

## Tuesday, March 25, 2008

### [Reading] Similarity Search in High Dimensions via Hashing

This is an extension of the previous paper I reviewed. To my surprise, this paper is much easier to understand. The derivation, pseudo codes, experimental results, and underlying theories are all provided.

However, I still can't understand the property of the Hamming space :(. Therefore I can hardly point out the weakness of this algorithm.....this is a painful week.

However, I still can't understand the property of the Hamming space :(. Therefore I can hardly point out the weakness of this algorithm.....this is a painful week.

## Monday, March 24, 2008

### [Reading] Approximate Nearest Neighbors:Towards Removing the Curse of Dimentsionality

Reading this makes me understand why most people still use the trivial ANN library. The theorems for the optimality of these algorithm are too hard to explain easily. This is the most painful paper I've read in this course so far, and I'm glad I finish it.

In this paper, 3 fast ANN algorithms are proposed. The last one, locality-sensitive hashing, is very similar to the Pyramid Match Hashing by Grauman and Darrell. The point in R^d is first embedded to the Hamming space, and then efficient histogram (hashing) matching can be performed. My main problem is, is there any good reference book to explain the good properties of the Hamming distance? I think that will be more interesting than this paper.

Besides the endless theorems and proofs in the paper (I re-derived some of them...), I don't know how to present this paper. There is no experimental result. I wish my brain could be so smart so I can publish something without doing any experiment....

In this paper, 3 fast ANN algorithms are proposed. The last one, locality-sensitive hashing, is very similar to the Pyramid Match Hashing by Grauman and Darrell. The point in R^d is first embedded to the Hamming space, and then efficient histogram (hashing) matching can be performed. My main problem is, is there any good reference book to explain the good properties of the Hamming distance? I think that will be more interesting than this paper.

Besides the endless theorems and proofs in the paper (I re-derived some of them...), I don't know how to present this paper. There is no experimental result. I wish my brain could be so smart so I can publish something without doing any experiment....

## Friday, March 14, 2008

### [Reading] Algorithms and Applications for the Approximate Nonnegative Matrix Factorization

This paper presents 3 main methods to obtain the NMF. An interesting thing is that the ALS is the easiest and fastest method which is used in the 1994 pioneering paper, but it is not used in the famous paper in Nature 1999.

Though this paper looks like an overview, its content is a little flat. I can hardly see or remember any convergent property of the NMF methods after reading the paper. Two examples used in the paper are not very common and NMF is not totally successful and the performance of other factorizations are not presented. To me, showing some simple application where the NMF works perfectly is more exciting. One good thing is that a public library is mentioned so maybe we can learn more from the library :).

Another thing is the NMF is getting popular in computer graphics. Many data in graphics are always non-negative: BRDF, visibility, etc. Also if NMF is used, the positive factorizations can be processed on GPU, which is optimized to handle positive numbers.

Though this paper looks like an overview, its content is a little flat. I can hardly see or remember any convergent property of the NMF methods after reading the paper. Two examples used in the paper are not very common and NMF is not totally successful and the performance of other factorizations are not presented. To me, showing some simple application where the NMF works perfectly is more exciting. One good thing is that a public library is mentioned so maybe we can learn more from the library :).

Another thing is the NMF is getting popular in computer graphics. Many data in graphics are always non-negative: BRDF, visibility, etc. Also if NMF is used, the positive factorizations can be processed on GPU, which is optimized to handle positive numbers.

### [Reading] Probabilistic Latent Semantic Indexing

The course puts NMF and pLSA together. I think the professor wants us to find their relationship. Using some simple derivations, the relationship is obvious. Let d denotes the documents, z the topics, and w the words. We have:

However, the main difference is that in the pLSA paper, the factorization process is examined in the probabilistic framework and achieved by EM. In NMF papers, it is usually casted as an optimization problem.

I didn't see the original pLSA paper before but read the LDA directly. I believe it is necessary and beneficial to read pLSA first since it gives many discussions about how and why the model is designed. Also it gives many good application examples.

p(w|d) = \sum_z p(w|z) p(z|d),

where p(w|d) is a |W|-D vector, p(w|z) is a |W|-d vector (Ai), and p(z|d) is a real value. We can put all right-sided vectors of different z's into two matrices, one is |W|x|Z| (H) and the other one is |Z|x1 (Vi). If we consider p(w|d) of different d's are columns in a |W|x|D| matrix (A), then we haveA = HV,

where the entities of H and V are all non-negative. Since A is given and what we want is H and V, this is in fact a NMF, plus a sum to 1 constraint so that each column is a valid pdf. However, the main difference is that in the pLSA paper, the factorization process is examined in the probabilistic framework and achieved by EM. In NMF papers, it is usually casted as an optimization problem.

I didn't see the original pLSA paper before but read the LDA directly. I believe it is necessary and beneficial to read pLSA first since it gives many discussions about how and why the model is designed. Also it gives many good application examples.

## Tuesday, March 11, 2008

### [Reading] Shape Matching and Object Recognition using Shape Contexts

The paper uses a representation, shape context, similar to the edgelet to describe the object. The contour of the object is densely sampled and each sample has a descriptor based on the distance and angle distribution of other samples. (One simple question is that this differential meansurement should be naturally invariant to the image plane rotation, but the authors still used an extra appendix to show this property). The distance metric of two samples can be defined based on chi-square distribution.

Given this discriminative representation, the correspondences between two images can be matched and the image-domain warping that best matches the correspondence is obtained (thin plate spline is used here, and I think there are better ones now). The similarity between two images is then defined. However, equation 12 in the paper is somewhat erroneous: the indices of the min operators are accumulated?? Also the warping distance is not so easy to understand, and it may generate big distance even if the object is only uniformly scaled.

The classification system is based on K-NN, and a K-medoids clustering is used for choosing the prototypes. It gives the best classification rate at the time of 2001. Overall the paper is well written and persuadable. Despite the small problems, it opens a new avenue to the object classification.

Given this discriminative representation, the correspondences between two images can be matched and the image-domain warping that best matches the correspondence is obtained (thin plate spline is used here, and I think there are better ones now). The similarity between two images is then defined. However, equation 12 in the paper is somewhat erroneous: the indices of the min operators are accumulated?? Also the warping distance is not so easy to understand, and it may generate big distance even if the object is only uniformly scaled.

The classification system is based on K-NN, and a K-medoids clustering is used for choosing the prototypes. It gives the best classification rate at the time of 2001. Overall the paper is well written and persuadable. Despite the small problems, it opens a new avenue to the object classification.

### [Reading] A Boundary-Fragment-Model for Object Detection

This paper present a system that uses a new shape representation, boundary-fragment-model (BFM), to perform object detection. In BFM, each edge fragment is associated with a object centroid location. Therefore several fragments should belong to the same object if their associated locations are close in the query image.

Besides the new idea, the whole system combines many other techniques, such as AdaBoost and Chamfer distance. The fragment is generated using growing and agglomerative clustering on medoids.

This is field is not what I'm familiar with, but I think many techniques they use are important. Also it is a good example about how to build a complete system.

Besides the new idea, the whole system combines many other techniques, such as AdaBoost and Chamfer distance. The fragment is generated using growing and agglomerative clustering on medoids.

This is field is not what I'm familiar with, but I think many techniques they use are important. Also it is a good example about how to build a complete system.

## Thursday, March 6, 2008

### [Reading] Nonlinear Dimensionality Reduction by Locally Linear Embedding

This is a very important paper that provides the first efficient and simple method to non-linearly project the high-dimensional signals onto a low-dimensional space. The key concept of the non-linear projection is that the data in fact lay on a manifold in the high-dimension space, which may not be represented by any simple slice of the space (manifold could become folded on the slice).

However, the locally connected graph can approximate the shape of the manifold. As long as the manifold is locally linear, we can obtain a linear reconstruction of each sample from the near neighbors. An optimal weighting set for each sample can be found by solving a least-square problem.

After the weighting set is obtain, the low-dimension representation of the sample on the manifold can then be defined, as long as the same weighting set can still reconstruct the low-dimension representation. Finding the optimal embedding is casted as a eigenvalue problem.

The bold sentences above are the key assumptions of this paper. Unlike Isomap which measure the geodesic distances of all sample pairs, only local distance is preserved. The degree of this effect may be application-dependent. For example, in Action Synopsis, it is shown that non metric replicated multi-dimensional scaling is better than LLE because the global analysis is preserved.

Also another problem is the number of the nearest neighbors (k), which should depends the local sampling rate. Also finding exact k-NN in a high-dimensional space is costly. However, since a effective dimension reduction method is so important, this cost is acceptable.

However, the locally connected graph can approximate the shape of the manifold. As long as the manifold is locally linear, we can obtain a linear reconstruction of each sample from the near neighbors. An optimal weighting set for each sample can be found by solving a least-square problem.

After the weighting set is obtain, the low-dimension representation of the sample on the manifold can then be defined, as long as the same weighting set can still reconstruct the low-dimension representation. Finding the optimal embedding is casted as a eigenvalue problem.

The bold sentences above are the key assumptions of this paper. Unlike Isomap which measure the geodesic distances of all sample pairs, only local distance is preserved. The degree of this effect may be application-dependent. For example, in Action Synopsis, it is shown that non metric replicated multi-dimensional scaling is better than LLE because the global analysis is preserved.

Also another problem is the number of the nearest neighbors (k), which should depends the local sampling rate. Also finding exact k-NN in a high-dimensional space is costly. However, since a effective dimension reduction method is so important, this cost is acceptable.

### [Reading] Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection

This paper formally classify the variation of the facial appearance into different classes. The variation can be due to different persons, different expressions, or different lighting conditions. For face identification, we hope the variations due to different expressions and lightings can be reduced by processing while the variation due to different persons are preserved. The author find that a linear projection obtained by LDA can efficiently achieve this goal. This result is very important since the projection does not only reduce the computation for matching, but also minimize the effect of the facial and lighting variation to the face recognition.

One thing should be noted is that this paper is very clear. It is maybe the best document I've ever read that explains LDA best. One interesting problem is that finding the optimal projection here is a generalized eigenvalue problem: Ax = \lambda Bx, where B and A are n-by-n matrices which could be very large for big images. Though the paper does not discuss the method to compute this, I believe a trick similar to the one used eigenface should be applied here, and I will make another post after I find it. (or if someone tell me directly :P)

Since both eigenface and fisherface are very old, I can't know the state-of-the-art in face recognition. However, these two papers are both very good at showing the applications of dimension reduction techniques.

One thing should be noted is that this paper is very clear. It is maybe the best document I've ever read that explains LDA best. One interesting problem is that finding the optimal projection here is a generalized eigenvalue problem: Ax = \lambda Bx, where B and A are n-by-n matrices which could be very large for big images. Though the paper does not discuss the method to compute this, I believe a trick similar to the one used eigenface should be applied here, and I will make another post after I find it. (or if someone tell me directly :P)

Since both eigenface and fisherface are very old, I can't know the state-of-the-art in face recognition. However, these two papers are both very good at showing the applications of dimension reduction techniques.

## Tuesday, March 4, 2008

### [Reading] Eigenfaces to Recognition

I'm surprised by the content of this paper. I've heard eigenface for a long time, I didn't realize the true importance of this work. (And I find this is another Pentland's great paper)

In this paper an efficient way to compute eigenvectors is used (maybe not the first one if we consider other research fields) and many possible attacks to the system are examined. The first topic is very important because it provides us a way to perform PCA on high-dimensional data, as long as the number of the samples is much smaller than the dimension of the samples. The second topic is about the performance degradation due to scaling, lighting, and rotation (on the image domain, not the rigid motion of the head). The authors also discuss some possible ways to solve these problems. For example, using a multi-scale processing.

Another interesting thing is that no one is talking about neural networks anymore, but obviously it was hot at 1991.

In this paper an efficient way to compute eigenvectors is used (maybe not the first one if we consider other research fields) and many possible attacks to the system are examined. The first topic is very important because it provides us a way to perform PCA on high-dimensional data, as long as the number of the samples is much smaller than the dimension of the samples. The second topic is about the performance degradation due to scaling, lighting, and rotation (on the image domain, not the rigid motion of the head). The authors also discuss some possible ways to solve these problems. For example, using a multi-scale processing.

Another interesting thing is that no one is talking about neural networks anymore, but obviously it was hot at 1991.

## Saturday, March 1, 2008

### [Reading] Scale & Affine Invariant Interest Point Detectors

Though I know this paper for a long time, I didn't attempt to read it before. This paper present a method to detect the interest points which are invariant to scale or affine transformations. That is, for each detected point, a transformation matrix is also extracted which transforms the local patch around the detected point into a canonical form for efficient description and matching.

Though the formulation and the method is elegant, it is too slow for many applications. The detection part requires around 50 times computation than SIFT, and the descriptor used in the paper is not as good as the one used in SIFT. I think that's why most applications only apply SIFT. It's interesting that some people use SIFT and this detector to detect the interest points and then use the SIFT descriptor to describe the detected points. These algorithms find different types of interest points and the union of these points cover a boarder class and improve the results. In some applications this is critical: SIFT sometimes give like 10 matchings between images.

One little question is about how to update the affine transformation. The paper uses the second moment matrix (co-variance) around the local patch and many issues are not discussed: the effect of sampling, window size, etc. Nevertheless, this paper still provides me a good example on the utilization of the continuous optimization technique.

Though the formulation and the method is elegant, it is too slow for many applications. The detection part requires around 50 times computation than SIFT, and the descriptor used in the paper is not as good as the one used in SIFT. I think that's why most applications only apply SIFT. It's interesting that some people use SIFT and this detector to detect the interest points and then use the SIFT descriptor to describe the detected points. These algorithms find different types of interest points and the union of these points cover a boarder class and improve the results. In some applications this is critical: SIFT sometimes give like 10 matchings between images.

One little question is about how to update the affine transformation. The paper uses the second moment matrix (co-variance) around the local patch and many issues are not discussed: the effect of sampling, window size, etc. Nevertheless, this paper still provides me a good example on the utilization of the continuous optimization technique.

## Tuesday, February 26, 2008

### [List] Research on Feature Points

This list is ordered in topics (may not be correct), from old to new. It should be noted that this list is far from completed and I will try to update it from time to time. A good overview of the field of feature detection can be find at Wikipedia (link).

[Parameter and Description Improvement and Optimization]

Last update: 02/27/2007

[Parameter and Description Improvement and Optimization]

- Y. Ke and R. Sukthankar. PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. CVPR2004. (link)
- H. Bay, T. Tuytelaars, L. Van Gool, SURF: Speeded Up Robust Features, ECCV2006. (link)
- S. Winder and M. Brown,
*Learning Local Image Descriptors*, CVPR2007. (link) - G. Hua, M. Brown and S. Winder,
*Discriminant Embedding for Local Image Descriptors*, ICCV2007. (link)

- K. Grauman and T. Darrell. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. ICCV2005. (link)
- K. Grauman and T. Darrell. Pyramid Match Hashing: Sub-Linear Time Indexing Over Partial Correspondences. CVPR2007. (link)

- J. Sivic and A. Zisserman.
*Video Google: Efficient Visual Search of Videos Toward Category-Level Object Recognition.*CVPR 2003. (link) - D. Nistér and H. Stewénius. Scalable Recognition with a Vocabulary Tree. CVPR2006. (link)
- J. Philbin, O. Chum, M. Isard, J. Sivic and A. Zisserman. Object Retrieval with Large Vocabularies and Fast Spatial Matching. CVPR 2007. (link)
- G. Schindler, M. brown, and R. Szeliski. City-Scale Location Recognition. CVPR 2007. (link)
- O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval. ICCV2007. (link)
- P. Quelhas, F. Monay, J.-M. Odobez, and D. Gatica-Perez, A Thousand Words in a Scene. PAMI2007. (link)

- M. Brown and D. G. Lowe. Unsupervised 3D Object Recognition and Reconstruction in Unordered Datasets. 3DIM2005 (link)
- N. Snavely, Ｓ. M. Seitz, and R. Szeliski, Photo Tourism: Exploring Photo Collections in 3D. SIGGRAPH2006. (link)
- M. Goesele, N. Snavely, B. Curless, H. Hoppe, S. M. Seitz, Multi-View Stereo for Community Photo Collections. ICCV2007 (link)
- I. Simon, N. Snavely, and S. M. Seitz. Scene Summarization for Online Image Collections. ICCV2007. (link)

- L. Kennedy, M. Naaman, S. Ahern, R. Nair, and T. Rattenbury. How Flickr Helps us Make Sense of the World: Context and Content in Community-Contributed Media Collections. ACM MM2007. (link)

Last update: 02/27/2007

### [Reading] Distinctive Image Features from Scale-Invariant Keypoints

I read and implemented this paper long time ago (2006) for a project in the course 'digital visual effect'. I use SIFT to register multiple images into a panorama:

The paper presents a keypoint localization and description method, SIFT. Though these two parts can be considered independently, the author did not discuss this issue much. The keypoint keypoint localization is done by detecting the local extrema in the scale-scale representation of the image, and the description is a processed local gradient histogram around the keypoint.

The concept of these two steps are not hard to understand, but the implementation details are insanely complex. There are many parameters in the system, and the author found the possibly optimal values by experiments (these experiments are extended recently and the related papers are listed here). A large portion of the paper is filled with the figures in tuning these parameters. Even though, many implementation details are skipped and it is very difficult to write a SIFT program identical to the author's version.

Nevertheless, SIFT is a huge success and hence enables many new applications (also here). It is designed only for scale- and illumination-invariant, but, however, it is affine-invariant to some degree and is much faster than those truly affine-invariant keypoint detection methods. Although there are numerous new keypoint localization and description methods every years, I still prefer SIFT due to its efficiency, conciseness (the theory part). Most important of all, many small drawbacks in SIFT bring many interesting research topics.

(click the image to the project page, including more results and my SIFT code)

The paper presents a keypoint localization and description method, SIFT. Though these two parts can be considered independently, the author did not discuss this issue much. The keypoint keypoint localization is done by detecting the local extrema in the scale-scale representation of the image, and the description is a processed local gradient histogram around the keypoint.

The concept of these two steps are not hard to understand, but the implementation details are insanely complex. There are many parameters in the system, and the author found the possibly optimal values by experiments (these experiments are extended recently and the related papers are listed here). A large portion of the paper is filled with the figures in tuning these parameters. Even though, many implementation details are skipped and it is very difficult to write a SIFT program identical to the author's version.

Nevertheless, SIFT is a huge success and hence enables many new applications (also here). It is designed only for scale- and illumination-invariant, but, however, it is affine-invariant to some degree and is much faster than those truly affine-invariant keypoint detection methods. Although there are numerous new keypoint localization and description methods every years, I still prefer SIFT due to its efficiency, conciseness (the theory part). Most important of all, many small drawbacks in SIFT bring many interesting research topics.

## Monday, February 25, 2008

### [Reading] Image Retrieval: Ideas, Influences, and Trends of the New Age

This is a very long paper (66 pages, and 12 of them are references). The authors try to mention all new topics in the CBIR field after the millennium. This restriction, however, sometimes hinders the readability since all the previous work before 2000 are ignored and the readers have to check the older overview papers.

The paper contains 4 main topics: the real-world demand for CBIR, the core techniques, the offshoots, and the evaluation-related issues. For the first topic, the authors use 2 3D cubes to describe the possible behaviors of the user and of the system. Although the possible space is large, I personally think some cases should be ruled out in research. For example, designing a efficient system for a user with the intent "browser" is meaningless. All considerations about the system mentioned in the paper are very critical. Each design choice can largely affect the performance and cost. It's interesting that the authors do agree that simple thumbnails may be the best way to present the query results.

For the second issue, the core techniques, the authors first list all novel signature (feature) extraction and similarity (distance) measurement methods. I learn sometime new in this section, such as the Earth mover's distance (don't know why I used to skip Prof. Tomasi's paper at the first time :P) and some shape descriptors etc. One pity thing is that one new avenue, visual word, is less addressed in the paper, maybe it's because the paper is submitted before the emergence of this field. The visual words, quantized interesting point descriptors (e.g., SIFT), in an image can be considered as a bag of word document and used to perform text-alike query. It also enables spatial matching for object identification (like matching a sentence in a document), novel histogram matching methods (like measuring the number of the co-occurrence words in two documents), and new applications. Since the interesting point is the topic in the following course, I will try to arrange a semi-complete list about this topic in the future.

The author also discuss some methods about the clustering, the classification, and the relevance feedback. One particular thing I see through this topics is that the learning (or statistical) methods are becoming more important. The same thread happens in the computer vision fields, due to the invention of new theories and fast solvers. Building a graphical model for a specific problem is a very critical research now. Finally, the authors slightly mention the important of the manifold. It's interesting that manifold is more utilized in computer graphics (CG) than in computer vision or CBIR because many signals in CG , such as human motions and spatially- and temporal-variant BRDF's, are of high dimensions for which the direct distance measurement could be expensive and meaningless. Also the authors did not mention two important papers in 2000, the locally linear embedding and the Isomap, which inspire numerous researchers (maybe it's because they are not after 2000...).

The next topic is the offshoots from CBIR. However, I think most of them are not so meaningful. They can be challenging, but it is hard to say these fields can grow into a mature and independent ones in the future. In some sub-sections, most cited papers are from the authors. Maybe giving more details about a single topic can make me believe they are of real importance.

The final topic, evaluation strategies, is very important. The authors list many existing evaluation methods and datasets with ground truth. From these we can easily see that the number of the dataset is far from satisfaction. One main problem is that unlike the problems in computer vision, video compression, and computer graphics, it is hard to assign the ground truth for problems in CBIR. A open online interactive system could the best way to label the big dataset, but designing this could consume a huge time for small research teams.

This well-organized paper presents ample materials about CBIR. Many start-of-the-art techniques are mentioned here, and also many possible research directions. Finally, one small drawback is that the reference list is alphabetical. Many review papers order the references in topics to reach a better readability.

The paper contains 4 main topics: the real-world demand for CBIR, the core techniques, the offshoots, and the evaluation-related issues. For the first topic, the authors use 2 3D cubes to describe the possible behaviors of the user and of the system. Although the possible space is large, I personally think some cases should be ruled out in research. For example, designing a efficient system for a user with the intent "browser" is meaningless. All considerations about the system mentioned in the paper are very critical. Each design choice can largely affect the performance and cost. It's interesting that the authors do agree that simple thumbnails may be the best way to present the query results.

For the second issue, the core techniques, the authors first list all novel signature (feature) extraction and similarity (distance) measurement methods. I learn sometime new in this section, such as the Earth mover's distance (don't know why I used to skip Prof. Tomasi's paper at the first time :P) and some shape descriptors etc. One pity thing is that one new avenue, visual word, is less addressed in the paper, maybe it's because the paper is submitted before the emergence of this field. The visual words, quantized interesting point descriptors (e.g., SIFT), in an image can be considered as a bag of word document and used to perform text-alike query. It also enables spatial matching for object identification (like matching a sentence in a document), novel histogram matching methods (like measuring the number of the co-occurrence words in two documents), and new applications. Since the interesting point is the topic in the following course, I will try to arrange a semi-complete list about this topic in the future.

The author also discuss some methods about the clustering, the classification, and the relevance feedback. One particular thing I see through this topics is that the learning (or statistical) methods are becoming more important. The same thread happens in the computer vision fields, due to the invention of new theories and fast solvers. Building a graphical model for a specific problem is a very critical research now. Finally, the authors slightly mention the important of the manifold. It's interesting that manifold is more utilized in computer graphics (CG) than in computer vision or CBIR because many signals in CG , such as human motions and spatially- and temporal-variant BRDF's, are of high dimensions for which the direct distance measurement could be expensive and meaningless. Also the authors did not mention two important papers in 2000, the locally linear embedding and the Isomap, which inspire numerous researchers (maybe it's because they are not after 2000...).

The next topic is the offshoots from CBIR. However, I think most of them are not so meaningful. They can be challenging, but it is hard to say these fields can grow into a mature and independent ones in the future. In some sub-sections, most cited papers are from the authors. Maybe giving more details about a single topic can make me believe they are of real importance.

The final topic, evaluation strategies, is very important. The authors list many existing evaluation methods and datasets with ground truth. From these we can easily see that the number of the dataset is far from satisfaction. One main problem is that unlike the problems in computer vision, video compression, and computer graphics, it is hard to assign the ground truth for problems in CBIR. A open online interactive system could the best way to label the big dataset, but designing this could consume a huge time for small research teams.

This well-organized paper presents ample materials about CBIR. Many start-of-the-art techniques are mentioned here, and also many possible research directions. Finally, one small drawback is that the reference list is alphabetical. Many review papers order the references in topics to reach a better readability.

## Saturday, February 23, 2008

### [Reading] How to Give a Good Research Talk

The key point we should learn from this paper is 'If you bore your audience in the first few minutes, you may never get them back.' This happened to me for many times when I was either a speaker or an audience. Now I often try to put some results as a teaser in the beginning of the paper and the slide.

Another point is 'never hide your weakness.' This is not only true for presenting but also for writing a paper. I usually give a negative comment to the paper when I find the authors hide their drawbacks. However, where and how to present these drawbacks is an aesthetic problem. Put it at the end is not always the best solution.

This paper is a little dated so some points are not correct now. The visual aids now become far more important than it was at the time that the paper being published (1993). The computer slide authoring programs are very friendly and efficient now so people do not have to write the slide manually.

Finally, people tend to put their slide on the Internet now. I have organized a short researcher list at here.

Another point is 'never hide your weakness.' This is not only true for presenting but also for writing a paper. I usually give a negative comment to the paper when I find the authors hide their drawbacks. However, where and how to present these drawbacks is an aesthetic problem. Put it at the end is not always the best solution.

This paper is a little dated so some points are not correct now. The visual aids now become far more important than it was at the time that the paper being published (1993). The computer slide authoring programs are very friendly and efficient now so people do not have to write the slide manually.

Finally, people tend to put their slide on the Internet now. I have organized a short researcher list at here.

### [Reading] How to Read a Paper

The author presents his 'three-pass method' in this paper. To my surprise it is very similar to the method I'm using. I guess after reading enough papers, people tend to read in this way :P

Below I describe the each of his steps, and also each of them in my version.

1 [Keshav]. Read the title, the abstract, the introduction, the conclusion, and the reference list.

1 [Mine]. Read the title, abstract, the reference list, the prior work section, and then the introduction. I skip the conclusion because it has very useful function to me in the following steps. Also if there are slides/videos associated with the paper, I would check them first. These materials are very common in my field.

2 [Keshav]. Check the main body of the paper, the figures and the results. Mark the important cited papers to be survey and skip the proof.

2 [Mine]. Check the main body of the paper,

3 [Keshav]. Try to re-implement work presented in the paper. Identify its true contributions, its weakness, and the possible future work.

3 [Mine]. Try to re-implement work presented in the paper. Specifically, I'd try to convert the whole paper into a short pseudo-code and write in down. In this way I can totally ensure the method presented in this paper is feasible, and more or less understand the time complexity of the proposed method (The timing shown in the paper is not always reliable).

The little trick I have is 'using the conclusion section for relaxing myself.' When performing step 2 or 3, you may get tired for various reasons: bad notation, chaos description, etc. However, the content in the conclusion is somewhat predictable, so you can turn to that section to take a break.

The author also present a method to survey a new research topic. It's relatively trivial and therefore I skip it here.

Below I describe the each of his steps, and also each of them in my version.

1 [Keshav]. Read the title, the abstract, the introduction, the conclusion, and the reference list.

1 [Mine]. Read the title, abstract, the reference list, the prior work section, and then the introduction. I skip the conclusion because it has very useful function to me in the following steps. Also if there are slides/videos associated with the paper, I would check them first. These materials are very common in my field.

2 [Keshav]. Check the main body of the paper, the figures and the results. Mark the important cited papers to be survey and skip the proof.

2 [Mine]. Check the main body of the paper,

3 [Keshav]. Try to re-implement work presented in the paper. Identify its true contributions, its weakness, and the possible future work.

3 [Mine]. Try to re-implement work presented in the paper. Specifically, I'd try to convert the whole paper into a short pseudo-code and write in down. In this way I can totally ensure the method presented in this paper is feasible, and more or less understand the time complexity of the proposed method (The timing shown in the paper is not always reliable).

The little trick I have is 'using the conclusion section for relaxing myself.' When performing step 2 or 3, you may get tired for various reasons: bad notation, chaos description, etc. However, the content in the conclusion is somewhat predictable, so you can turn to that section to take a break.

The author also present a method to survey a new research topic. It's relatively trivial and therefore I skip it here.

Subscribe to:
Posts (Atom)