This paper presents a new concept to object recognition: model it as a machine translation problem.

Given a training dataset, an EM-alike algorithm is used to learn the correspondence between the "words" between the image (an English document) and the annotation (an French document). The paper assume the training data is perfect: there should be no noise within. However, some thresholding does improve the performance.

In EM, Lagrangian is used to make sure the variables are sum-to-one. However, maybe a simple re-projection is enough. And in fact, re-projection is indeed used in the algorithm (Fig. 4)....can't see the Lagrangian multipliers in the code @@.

In summary, the idea is very nice, but this is a paper more like introducing a concept instead of building a practical system. There are many interesting things to do.

## Sunday, March 30, 2008

## Friday, March 28, 2008

### [Reading] Names and Faces in the News Abstract

This paper present a system to put the name and face in the news together. The system uses several techniques: face detection and rectification, kPCA and LDA for dimension reduction, Nystrom approximation for efficient kPCA, K-mean clustering, pruning, and merging. One pity thing is that there is no much room to discuss the alternates of these techniques. The result shown in Fig 3 is very interesting and brings many possible research topics.

I'm very interested in the kPCA method and Nystrom approximation and I'm reading the origin papers (I think I should allocate a period to learn the matrix calculus more deeply).

How to prune the clustering is very important to my current research project. However, the method in this paper is not suitable to my problem :(. A special part is that the pruning is guided by hand labeling...what if the dataset is too large to be labeled? Also the running time is not reported in the paper.

I'm very interested in the kPCA method and Nystrom approximation and I'm reading the origin papers (I think I should allocate a period to learn the matrix calculus more deeply).

How to prune the clustering is very important to my current research project. However, the method in this paper is not suitable to my problem :(. A special part is that the pruning is guided by hand labeling...what if the dataset is too large to be labeled? Also the running time is not reported in the paper.

## Thursday, March 27, 2008

### [Reading] Algorithms for Fast Vector Quantization

This paper presents three different variant of the k-d trees to improved the search performance. The first one introduces a technique called incremental distance to reduce the calculation. In traversing the tree, the distance between the query point and the childs can be easily updated from the distance the query point and the current node (O(1) instead of O(D), where D is the dimension). I found this is very useful. It does not degrade the performance but greatly reduces the computation.

The second method uses a priority queue to store the distances to the all traversed nodes. After a leaf is checked, the algorithm always picks the node with smallest distance from the queue to traverse. The interesting thing is that, this idea is used in the SIFT system but the author did not method this reference.

The third method, although gives good performance, it requires a O(N^2) preprocessing to build the local neighborhood graph. Therefore it may not be suitable for large systems.

Both the first two methods are very exciting, and I'm considering to port the first one into my program. Also in the experiment section, a well-known early termination technique is used. In calculating the distance, we may quit at the i-th dimension if the accumulated distance is larger than the currently optimal one. However, early termination may not be the best choice in the current CPU pipeline since there is a branch is inside the for loop. If we want to optimize it, manual unfolding may be necessary. For example, calculate the first K dimensions at once, do the check, and then move to the next K dimensions.

The second method uses a priority queue to store the distances to the all traversed nodes. After a leaf is checked, the algorithm always picks the node with smallest distance from the queue to traverse. The interesting thing is that, this idea is used in the SIFT system but the author did not method this reference.

The third method, although gives good performance, it requires a O(N^2) preprocessing to build the local neighborhood graph. Therefore it may not be suitable for large systems.

Both the first two methods are very exciting, and I'm considering to port the first one into my program. Also in the experiment section, a well-known early termination technique is used. In calculating the distance, we may quit at the i-th dimension if the accumulated distance is larger than the currently optimal one. However, early termination may not be the best choice in the current CPU pipeline since there is a branch is inside the for loop. If we want to optimize it, manual unfolding may be necessary. For example, calculate the first K dimensions at once, do the check, and then move to the next K dimensions.

## Tuesday, March 25, 2008

### [Reading] Similarity Search in High Dimensions via Hashing

This is an extension of the previous paper I reviewed. To my surprise, this paper is much easier to understand. The derivation, pseudo codes, experimental results, and underlying theories are all provided.

However, I still can't understand the property of the Hamming space :(. Therefore I can hardly point out the weakness of this algorithm.....this is a painful week.

However, I still can't understand the property of the Hamming space :(. Therefore I can hardly point out the weakness of this algorithm.....this is a painful week.

## Monday, March 24, 2008

### [Reading] Approximate Nearest Neighbors:Towards Removing the Curse of Dimentsionality

Reading this makes me understand why most people still use the trivial ANN library. The theorems for the optimality of these algorithm are too hard to explain easily. This is the most painful paper I've read in this course so far, and I'm glad I finish it.

In this paper, 3 fast ANN algorithms are proposed. The last one, locality-sensitive hashing, is very similar to the Pyramid Match Hashing by Grauman and Darrell. The point in R^d is first embedded to the Hamming space, and then efficient histogram (hashing) matching can be performed. My main problem is, is there any good reference book to explain the good properties of the Hamming distance? I think that will be more interesting than this paper.

Besides the endless theorems and proofs in the paper (I re-derived some of them...), I don't know how to present this paper. There is no experimental result. I wish my brain could be so smart so I can publish something without doing any experiment....

In this paper, 3 fast ANN algorithms are proposed. The last one, locality-sensitive hashing, is very similar to the Pyramid Match Hashing by Grauman and Darrell. The point in R^d is first embedded to the Hamming space, and then efficient histogram (hashing) matching can be performed. My main problem is, is there any good reference book to explain the good properties of the Hamming distance? I think that will be more interesting than this paper.

Besides the endless theorems and proofs in the paper (I re-derived some of them...), I don't know how to present this paper. There is no experimental result. I wish my brain could be so smart so I can publish something without doing any experiment....

## Friday, March 14, 2008

### [Reading] Algorithms and Applications for the Approximate Nonnegative Matrix Factorization

This paper presents 3 main methods to obtain the NMF. An interesting thing is that the ALS is the easiest and fastest method which is used in the 1994 pioneering paper, but it is not used in the famous paper in Nature 1999.

Though this paper looks like an overview, its content is a little flat. I can hardly see or remember any convergent property of the NMF methods after reading the paper. Two examples used in the paper are not very common and NMF is not totally successful and the performance of other factorizations are not presented. To me, showing some simple application where the NMF works perfectly is more exciting. One good thing is that a public library is mentioned so maybe we can learn more from the library :).

Another thing is the NMF is getting popular in computer graphics. Many data in graphics are always non-negative: BRDF, visibility, etc. Also if NMF is used, the positive factorizations can be processed on GPU, which is optimized to handle positive numbers.

Though this paper looks like an overview, its content is a little flat. I can hardly see or remember any convergent property of the NMF methods after reading the paper. Two examples used in the paper are not very common and NMF is not totally successful and the performance of other factorizations are not presented. To me, showing some simple application where the NMF works perfectly is more exciting. One good thing is that a public library is mentioned so maybe we can learn more from the library :).

Another thing is the NMF is getting popular in computer graphics. Many data in graphics are always non-negative: BRDF, visibility, etc. Also if NMF is used, the positive factorizations can be processed on GPU, which is optimized to handle positive numbers.

### [Reading] Probabilistic Latent Semantic Indexing

The course puts NMF and pLSA together. I think the professor wants us to find their relationship. Using some simple derivations, the relationship is obvious. Let d denotes the documents, z the topics, and w the words. We have:

However, the main difference is that in the pLSA paper, the factorization process is examined in the probabilistic framework and achieved by EM. In NMF papers, it is usually casted as an optimization problem.

I didn't see the original pLSA paper before but read the LDA directly. I believe it is necessary and beneficial to read pLSA first since it gives many discussions about how and why the model is designed. Also it gives many good application examples.

p(w|d) = \sum_z p(w|z) p(z|d),

where p(w|d) is a |W|-D vector, p(w|z) is a |W|-d vector (Ai), and p(z|d) is a real value. We can put all right-sided vectors of different z's into two matrices, one is |W|x|Z| (H) and the other one is |Z|x1 (Vi). If we consider p(w|d) of different d's are columns in a |W|x|D| matrix (A), then we haveA = HV,

where the entities of H and V are all non-negative. Since A is given and what we want is H and V, this is in fact a NMF, plus a sum to 1 constraint so that each column is a valid pdf. However, the main difference is that in the pLSA paper, the factorization process is examined in the probabilistic framework and achieved by EM. In NMF papers, it is usually casted as an optimization problem.

I didn't see the original pLSA paper before but read the LDA directly. I believe it is necessary and beneficial to read pLSA first since it gives many discussions about how and why the model is designed. Also it gives many good application examples.

## Tuesday, March 11, 2008

### [Reading] Shape Matching and Object Recognition using Shape Contexts

The paper uses a representation, shape context, similar to the edgelet to describe the object. The contour of the object is densely sampled and each sample has a descriptor based on the distance and angle distribution of other samples. (One simple question is that this differential meansurement should be naturally invariant to the image plane rotation, but the authors still used an extra appendix to show this property). The distance metric of two samples can be defined based on chi-square distribution.

Given this discriminative representation, the correspondences between two images can be matched and the image-domain warping that best matches the correspondence is obtained (thin plate spline is used here, and I think there are better ones now). The similarity between two images is then defined. However, equation 12 in the paper is somewhat erroneous: the indices of the min operators are accumulated?? Also the warping distance is not so easy to understand, and it may generate big distance even if the object is only uniformly scaled.

The classification system is based on K-NN, and a K-medoids clustering is used for choosing the prototypes. It gives the best classification rate at the time of 2001. Overall the paper is well written and persuadable. Despite the small problems, it opens a new avenue to the object classification.

Given this discriminative representation, the correspondences between two images can be matched and the image-domain warping that best matches the correspondence is obtained (thin plate spline is used here, and I think there are better ones now). The similarity between two images is then defined. However, equation 12 in the paper is somewhat erroneous: the indices of the min operators are accumulated?? Also the warping distance is not so easy to understand, and it may generate big distance even if the object is only uniformly scaled.

The classification system is based on K-NN, and a K-medoids clustering is used for choosing the prototypes. It gives the best classification rate at the time of 2001. Overall the paper is well written and persuadable. Despite the small problems, it opens a new avenue to the object classification.

### [Reading] A Boundary-Fragment-Model for Object Detection

This paper present a system that uses a new shape representation, boundary-fragment-model (BFM), to perform object detection. In BFM, each edge fragment is associated with a object centroid location. Therefore several fragments should belong to the same object if their associated locations are close in the query image.

Besides the new idea, the whole system combines many other techniques, such as AdaBoost and Chamfer distance. The fragment is generated using growing and agglomerative clustering on medoids.

This is field is not what I'm familiar with, but I think many techniques they use are important. Also it is a good example about how to build a complete system.

Besides the new idea, the whole system combines many other techniques, such as AdaBoost and Chamfer distance. The fragment is generated using growing and agglomerative clustering on medoids.

This is field is not what I'm familiar with, but I think many techniques they use are important. Also it is a good example about how to build a complete system.

## Thursday, March 6, 2008

### [Reading] Nonlinear Dimensionality Reduction by Locally Linear Embedding

This is a very important paper that provides the first efficient and simple method to non-linearly project the high-dimensional signals onto a low-dimensional space. The key concept of the non-linear projection is that the data in fact lay on a manifold in the high-dimension space, which may not be represented by any simple slice of the space (manifold could become folded on the slice).

However, the locally connected graph can approximate the shape of the manifold. As long as the manifold is locally linear, we can obtain a linear reconstruction of each sample from the near neighbors. An optimal weighting set for each sample can be found by solving a least-square problem.

After the weighting set is obtain, the low-dimension representation of the sample on the manifold can then be defined, as long as the same weighting set can still reconstruct the low-dimension representation. Finding the optimal embedding is casted as a eigenvalue problem.

The bold sentences above are the key assumptions of this paper. Unlike Isomap which measure the geodesic distances of all sample pairs, only local distance is preserved. The degree of this effect may be application-dependent. For example, in Action Synopsis, it is shown that non metric replicated multi-dimensional scaling is better than LLE because the global analysis is preserved.

Also another problem is the number of the nearest neighbors (k), which should depends the local sampling rate. Also finding exact k-NN in a high-dimensional space is costly. However, since a effective dimension reduction method is so important, this cost is acceptable.

However, the locally connected graph can approximate the shape of the manifold. As long as the manifold is locally linear, we can obtain a linear reconstruction of each sample from the near neighbors. An optimal weighting set for each sample can be found by solving a least-square problem.

After the weighting set is obtain, the low-dimension representation of the sample on the manifold can then be defined, as long as the same weighting set can still reconstruct the low-dimension representation. Finding the optimal embedding is casted as a eigenvalue problem.

The bold sentences above are the key assumptions of this paper. Unlike Isomap which measure the geodesic distances of all sample pairs, only local distance is preserved. The degree of this effect may be application-dependent. For example, in Action Synopsis, it is shown that non metric replicated multi-dimensional scaling is better than LLE because the global analysis is preserved.

Also another problem is the number of the nearest neighbors (k), which should depends the local sampling rate. Also finding exact k-NN in a high-dimensional space is costly. However, since a effective dimension reduction method is so important, this cost is acceptable.

### [Reading] Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection

This paper formally classify the variation of the facial appearance into different classes. The variation can be due to different persons, different expressions, or different lighting conditions. For face identification, we hope the variations due to different expressions and lightings can be reduced by processing while the variation due to different persons are preserved. The author find that a linear projection obtained by LDA can efficiently achieve this goal. This result is very important since the projection does not only reduce the computation for matching, but also minimize the effect of the facial and lighting variation to the face recognition.

One thing should be noted is that this paper is very clear. It is maybe the best document I've ever read that explains LDA best. One interesting problem is that finding the optimal projection here is a generalized eigenvalue problem: Ax = \lambda Bx, where B and A are n-by-n matrices which could be very large for big images. Though the paper does not discuss the method to compute this, I believe a trick similar to the one used eigenface should be applied here, and I will make another post after I find it. (or if someone tell me directly :P)

Since both eigenface and fisherface are very old, I can't know the state-of-the-art in face recognition. However, these two papers are both very good at showing the applications of dimension reduction techniques.

One thing should be noted is that this paper is very clear. It is maybe the best document I've ever read that explains LDA best. One interesting problem is that finding the optimal projection here is a generalized eigenvalue problem: Ax = \lambda Bx, where B and A are n-by-n matrices which could be very large for big images. Though the paper does not discuss the method to compute this, I believe a trick similar to the one used eigenface should be applied here, and I will make another post after I find it. (or if someone tell me directly :P)

Since both eigenface and fisherface are very old, I can't know the state-of-the-art in face recognition. However, these two papers are both very good at showing the applications of dimension reduction techniques.

## Tuesday, March 4, 2008

### [Reading] Eigenfaces to Recognition

I'm surprised by the content of this paper. I've heard eigenface for a long time, I didn't realize the true importance of this work. (And I find this is another Pentland's great paper)

In this paper an efficient way to compute eigenvectors is used (maybe not the first one if we consider other research fields) and many possible attacks to the system are examined. The first topic is very important because it provides us a way to perform PCA on high-dimensional data, as long as the number of the samples is much smaller than the dimension of the samples. The second topic is about the performance degradation due to scaling, lighting, and rotation (on the image domain, not the rigid motion of the head). The authors also discuss some possible ways to solve these problems. For example, using a multi-scale processing.

Another interesting thing is that no one is talking about neural networks anymore, but obviously it was hot at 1991.

In this paper an efficient way to compute eigenvectors is used (maybe not the first one if we consider other research fields) and many possible attacks to the system are examined. The first topic is very important because it provides us a way to perform PCA on high-dimensional data, as long as the number of the samples is much smaller than the dimension of the samples. The second topic is about the performance degradation due to scaling, lighting, and rotation (on the image domain, not the rigid motion of the head). The authors also discuss some possible ways to solve these problems. For example, using a multi-scale processing.

Another interesting thing is that no one is talking about neural networks anymore, but obviously it was hot at 1991.

## Saturday, March 1, 2008

### [Reading] Scale & Affine Invariant Interest Point Detectors

Though I know this paper for a long time, I didn't attempt to read it before. This paper present a method to detect the interest points which are invariant to scale or affine transformations. That is, for each detected point, a transformation matrix is also extracted which transforms the local patch around the detected point into a canonical form for efficient description and matching.

Though the formulation and the method is elegant, it is too slow for many applications. The detection part requires around 50 times computation than SIFT, and the descriptor used in the paper is not as good as the one used in SIFT. I think that's why most applications only apply SIFT. It's interesting that some people use SIFT and this detector to detect the interest points and then use the SIFT descriptor to describe the detected points. These algorithms find different types of interest points and the union of these points cover a boarder class and improve the results. In some applications this is critical: SIFT sometimes give like 10 matchings between images.

One little question is about how to update the affine transformation. The paper uses the second moment matrix (co-variance) around the local patch and many issues are not discussed: the effect of sampling, window size, etc. Nevertheless, this paper still provides me a good example on the utilization of the continuous optimization technique.

Though the formulation and the method is elegant, it is too slow for many applications. The detection part requires around 50 times computation than SIFT, and the descriptor used in the paper is not as good as the one used in SIFT. I think that's why most applications only apply SIFT. It's interesting that some people use SIFT and this detector to detect the interest points and then use the SIFT descriptor to describe the detected points. These algorithms find different types of interest points and the union of these points cover a boarder class and improve the results. In some applications this is critical: SIFT sometimes give like 10 matchings between images.

One little question is about how to update the affine transformation. The paper uses the second moment matrix (co-variance) around the local patch and many issues are not discussed: the effect of sampling, window size, etc. Nevertheless, this paper still provides me a good example on the utilization of the continuous optimization technique.

Subscribe to:
Posts (Atom)