Chia-Kai's AMMAI

[Reading] Support vector learning for ordinal regression

2008-06-07T16:53:00.001+08:00

This paper extends the SVM framework for the ordinal regression (classification of ordered classes). The algorithm is easy to understand after the problem is formulated. The key contribution of the paper is showing that the ordinal regression problem can be transformed into regular SVR by using the difference of the feature vectors as the new feature vector. Also the error bound of the algorithm is presented.

I did not know much about SVM and hope I can find some useful online materials to learn it by myself in the future....

[Reading] The pagerank citation ranking: Bringing order to the web

2008-05-30T22:22:00.002+08:00

This is a very famous paper. Almost all people who searched for the answer of "how Google works" will reach this paper. The basic idea is unbelievably simple, just a simple application of the random walk theory. Given an initial positions of many walkers on the web, the final distribution of these walkers is the importance of the web pages. Thanks to the AMMAI course, I can easily think that this is a linear algebra problem.

Because the number of the webpages is very large and we only want one eigenvector, an iterative method is used in the paper. An interesting thing is that the iteration step is not the gradient of the cost function but the authors claim that the convergence is faster. Another issue is that the importance of webpages is l1-normalized. However, I don't really know why this is necessary.

It's interesting to think, how pagerank works in the environment of Web 2.0 now (of course Google must handle this already...). The webpages are now very dynamic. The link structures change very fast and it's almost impossible to build the full linear system (there are billions of webpages now, and each one can easily link to hundred of webpages). Actually maybe Google do not work so well now...at least recently I feel depressed at searching on Google :P

[Reading] Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

2008-05-25T22:58:00.003+08:00

This paper presents a new method for mining frequent patterns. The main technique is a new data structure, FP tree, for the compact representation of the database. A FP tree can compactly represent all frequent patterns (no information loss) and enable fast query (by traversing the tree and using some range pointers).

There are always some detailed examples in describing the data structure and mining methods, usually appear before the proofs of the lemmas. This makes the paper very easy to understand. Also all pseudo codes are provides and thus it should be simple to implement the algorithm. Even the FP tree always greatly compresses the database. the authors still provide a projection method to reduce the required memory.

Because this method makes no approximation to the problem, the results are identical to the ground truth, and obtained much faster than the previous methods. In summary, this is a very exciting work !

[Reading] Mining Association Rules between Sets of Items in Large Databases

2008-05-25T22:33:00.002+08:00

This paper presents a system to find the association rules from a large number of transaction records. At the time of publication, the problem and the solution of this field might be very twilit and this should be a very pioneering work.

After properly formulating the problem, the problem can be decomposed into 2 sub-problems, and the trivial solutions become obvious. However, the straightforward algorithm is too expensive, and thus the authors provide many acceleration methods. I personally like the analysis at the memory analysis part.

The main difficulty in reading this paper is that the writing style is very old (or at least very different to the papers in my field). Also the results are obtained from a dataset without ground truth.

[Reading] Normalized Cuts and Image Segmentation

2008-05-17T00:04:00.003+08:00

I read this paper long time ago...and I skipped main derivation at that time :P. The key contribution of this paper is, the authors use an specified energy function to measure the quality of the segment, and then show that this energy minimization problem can be solved by the generalized eigensystem, which can be linked to the spectral graph theory. In my opinion, the forward way (normalized cut --> eigensystem) is much harder to derive than the backward way. If we think in the backward way, we may say that this is another way to interpret the spectral clustering.

Hope the presenters next week can give us a more insightful view to the spectral graph theory...I was not good at linear algebra.

[Reading] On Spectral Clustering: Analysis and An Algorithm

2008-05-13T23:35:00.002+08:00

This paper presents a simple algorithm of clustering, based on the spectral theory. Compare with other spectral clustering methods (such as normalized cut), the algorithm is much easier to understand and implement. The key idea is that, unlike previous methods, many eigenvectors are used instead of only one.

For the trivial case, it's easy to understand that the eigenspace perfectly represents the clustering result. However, for the general case, the analysis is difficult to understand (at least to me). The authors make 4 assumptions, and also show those assumptions are reasonable in the general cases. Under these assumptions, the performance of the algorithm is guaranteed.

In summary, I seldom like the papers from NIPS....too hard for me...

[Reading] An Introduction to Graphical Models

2008-04-29T14:29:00.003+08:00

This is a very good introduction paper for the graphical model. It elegantly explains why and when we should use the graphical model, what operations are usually used in the graphical model (so you may map your problem into one of them), and what tools are available to run these operations efficiently.

One main reason to use the graphical model is efficient computing. By explicitly expressing the dependency between the variables, the number of summation in inference and decision making can be greatly reduced. Also the approximate inference algorithms can be better understood. The message passing and graph cut methods all direct operate on the nodes and edges of the graphical model.

The paper also mentioned a topic that I didn't enter before, learning the structure of the graphical model. However, I personally think this is an aesthetic problem instead of a computational problem, since the search space is almost infinite and the goal (e.g., suitable for inference or decision making) is hard to formulate as an objective function. Even a rough graphical model is estimated, manual effort is still required to beautify its structure.

[Reading] Combining Labeled and Unlabeled Data with Co-Training

2008-04-25T11:01:00.002+08:00

This paper describes a method to utilize the unlabeled data for training. The basic idea is to split the features into two separate sets, and learn two individual classifiers, and use these classifiers to expend the training data to learn new classifiers.

However, the paper is a little difficult to understand. Their method implies that the co-occurrence in-between or within the feature sets may possibly be found by the new labeled data. However, maybe these co-occurrence can be found by unsupervised clustering (like pLSA). It always involves the parameter setting. Picking too many unlabeled data or reducing too many dimensions can both degrade the performance.

[Reading] Transductive Inference for Text Classification using Support Vector Machines

2008-04-25T09:59:00.002+08:00

This paper presents a method to expend the size of the training set for SVM. The key idea is to optimize the hyperplane and the un-seen labels together. In this way, the co-occurrence information missed in the original training data might be explored automatically.

The idea is very simple, and the algorithm for the optimization is easy to implementation (small modification to the original SVM). An interesting thing is that, if we apply pLSA to cluster the topics of all documents first, then the co-occurrence information can be found to some extent. In this case, does TSVM still outperform SVM? Or, maybe the co-occurrence is one thing found by TSVM that the author can think of, but there are other things that are hard to describe are also found by TSVM.

[Reading] Improved boosting algorithms using confidence-rated predictions

2008-04-18T16:11:00.003+08:00

This paper completes the Adaboost algorithm. In this original Adaboost paper, only the basic concept, combining several weak classifiers into a strong classifier. In this paper, the authors present a more complete story of this technique. It shows that several parameters in Adaboost can be optimized to improve the overall performance. Also it shows several extensions of Adaboost to handle non-binary classification.

However, some of these extensions looks very natural and necessary. In my previous experiments, if the alpha is not optimized, the result of Adaboost can be very bad. Also, this paper is not the best one for the beginner like me...I find some on-line documents are better.

[Reading] Rapid object detection using a boosted cascade of simple features

2008-04-17T01:19:00.001+08:00

This paper describes the most well-known face detection algorithm. Before reading this paper, I thought Adaboost is equivalent to the cascade detection. The cascade detection is an important contribution of this paper.

Because the Adaboost can fuse several weak classifiers, the authors found that using simple Haar-like features is enough for the face detection, and Haar-like features can be efficiently evaluated by integral image (in fact, simple sum-area table in texture mapping). Also, the cascade detection can skip many regions to reduce the computation.

These contributions combine a very powerful system. In fact, many face detection modules in the digital cameras are based on these techniques.

[Reading] Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary

2008-03-30T15:26:00.003+08:00

This paper presents a new concept to object recognition: model it as a machine translation problem.

Given a training dataset, an EM-alike algorithm is used to learn the correspondence between the "words" between the image (an English document) and the annotation (an French document). The paper assume the training data is perfect: there should be no noise within. However, some thresholding does improve the performance.

In EM, Lagrangian is used to make sure the variables are sum-to-one. However, maybe a simple re-projection is enough. And in fact, re-projection is indeed used in the algorithm (Fig. 4)....can't see the Lagrangian multipliers in the code @@.

In summary, the idea is very nice, but this is a paper more like introducing a concept instead of building a practical system. There are many interesting things to do.

[Reading] Names and Faces in the News Abstract

2008-03-28T23:24:00.002+08:00

This paper present a system to put the name and face in the news together. The system uses several techniques: face detection and rectification, kPCA and LDA for dimension reduction, Nystrom approximation for efficient kPCA, K-mean clustering, pruning, and merging. One pity thing is that there is no much room to discuss the alternates of these techniques. The result shown in Fig 3 is very interesting and brings many possible research topics.

I'm very interested in the kPCA method and Nystrom approximation and I'm reading the origin papers (I think I should allocate a period to learn the matrix calculus more deeply).

How to prune the clustering is very important to my current research project. However, the method in this paper is not suitable to my problem :(. A special part is that the pruning is guided by hand labeling...what if the dataset is too large to be labeled? Also the running time is not reported in the paper.

[Reading] Algorithms for Fast Vector Quantization

2008-03-27T18:34:00.005+08:00

This paper presents three different variant of the k-d trees to improved the search performance. The first one introduces a technique called incremental distance to reduce the calculation. In traversing the tree, the distance between the query point and the childs can be easily updated from the distance the query point and the current node (O(1) instead of O(D), where D is the dimension). I found this is very useful. It does not degrade the performance but greatly reduces the computation.

The second method uses a priority queue to store the distances to the all traversed nodes. After a leaf is checked, the algorithm always picks the node with smallest distance from the queue to traverse. The interesting thing is that, this idea is used in the SIFT system but the author did not method this reference.

The third method, although gives good performance, it requires a O(N^2) preprocessing to build the local neighborhood graph. Therefore it may not be suitable for large systems.

Both the first two methods are very exciting, and I'm considering to port the first one into my program. Also in the experiment section, a well-known early termination technique is used. In calculating the distance, we may quit at the i-th dimension if the accumulated distance is larger than the currently optimal one. However, early termination may not be the best choice in the current CPU pipeline since there is a branch is inside the for loop. If we want to optimize it, manual unfolding may be necessary. For example, calculate the first K dimensions at once, do the check, and then move to the next K dimensions.

[Reading] Similarity Search in High Dimensions via Hashing

2008-03-25T14:38:00.003+08:00

This is an extension of the previous paper I reviewed. To my surprise, this paper is much easier to understand. The derivation, pseudo codes, experimental results, and underlying theories are all provided.

However, I still can't understand the property of the Hamming space :(. Therefore I can hardly point out the weakness of this algorithm.....this is a painful week.

[Reading] Approximate Nearest Neighbors:Towards Removing the Curse of Dimentsionality

2008-03-24T17:26:00.003+08:00

Reading this makes me understand why most people still use the trivial ANN library. The theorems for the optimality of these algorithm are too hard to explain easily. This is the most painful paper I've read in this course so far, and I'm glad I finish it.

In this paper, 3 fast ANN algorithms are proposed. The last one, locality-sensitive hashing, is very similar to the Pyramid Match Hashing by Grauman and Darrell. The point in R^d is first embedded to the Hamming space, and then efficient histogram (hashing) matching can be performed. My main problem is, is there any good reference book to explain the good properties of the Hamming distance? I think that will be more interesting than this paper.

Besides the endless theorems and proofs in the paper (I re-derived some of them...), I don't know how to present this paper. There is no experimental result. I wish my brain could be so smart so I can publish something without doing any experiment....

[Reading] Algorithms and Applications for the Approximate Nonnegative Matrix Factorization

2008-03-14T23:57:00.003+08:00

This paper presents 3 main methods to obtain the NMF. An interesting thing is that the ALS is the easiest and fastest method which is used in the 1994 pioneering paper, but it is not used in the famous paper in Nature 1999.

Though this paper looks like an overview, its content is a little flat. I can hardly see or remember any convergent property of the NMF methods after reading the paper. Two examples used in the paper are not very common and NMF is not totally successful and the performance of other factorizations are not presented. To me, showing some simple application where the NMF works perfectly is more exciting. One good thing is that a public library is mentioned so maybe we can learn more from the library :).

Another thing is the NMF is getting popular in computer graphics. Many data in graphics are always non-negative: BRDF, visibility, etc. Also if NMF is used, the positive factorizations can be processed on GPU, which is optimized to handle positive numbers.

[Reading] Probabilistic Latent Semantic Indexing

2008-03-14T23:43:00.001+08:00

The course puts NMF and pLSA together. I think the professor wants us to find their relationship. Using some simple derivations, the relationship is obvious. Let d denotes the documents, z the topics, and w the words. We have:

p(w|d) = \sum_z p(w|z) p(z|d),

where p(w|d) is a |W|-D vector, p(w|z) is a |W|-d vector (Ai), and p(z|d) is a real value. We can put all right-sided vectors of different z's into two matrices, one is |W|x|Z| (H) and the other one is |Z|x1 (Vi). If we consider p(w|d) of different d's are columns in a |W|x|D| matrix (A), then we have

A = HV,

where the entities of H and V are all non-negative. Since A is given and what we want is H and V, this is in fact a NMF, plus a sum to 1 constraint so that each column is a valid pdf.

However, the main difference is that in the pLSA paper, the factorization process is examined in the probabilistic framework and achieved by EM. In NMF papers, it is usually casted as an optimization problem.

I didn't see the original pLSA paper before but read the LDA directly. I believe it is necessary and beneficial to read pLSA first since it gives many discussions about how and why the model is designed. Also it gives many good application examples.

[Reading] Shape Matching and Object Recognition using Shape Contexts

2008-03-11T22:54:00.002+08:00

The paper uses a representation, shape context, similar to the edgelet to describe the object. The contour of the object is densely sampled and each sample has a descriptor based on the distance and angle distribution of other samples. (One simple question is that this differential meansurement should be naturally invariant to the image plane rotation, but the authors still used an extra appendix to show this property). The distance metric of two samples can be defined based on chi-square distribution.

Given this discriminative representation, the correspondences between two images can be matched and the image-domain warping that best matches the correspondence is obtained (thin plate spline is used here, and I think there are better ones now). The similarity between two images is then defined. However, equation 12 in the paper is somewhat erroneous: the indices of the min operators are accumulated?? Also the warping distance is not so easy to understand, and it may generate big distance even if the object is only uniformly scaled.

The classification system is based on K-NN, and a K-medoids clustering is used for choosing the prototypes. It gives the best classification rate at the time of 2001. Overall the paper is well written and persuadable. Despite the small problems, it opens a new avenue to the object classification.

[Reading] A Boundary-Fragment-Model for Object Detection

2008-03-11T22:52:00.002+08:00

This paper present a system that uses a new shape representation, boundary-fragment-model (BFM), to perform object detection. In BFM, each edge fragment is associated with a object centroid location. Therefore several fragments should belong to the same object if their associated locations are close in the query image.

Besides the new idea, the whole system combines many other techniques, such as AdaBoost and Chamfer distance. The fragment is generated using growing and agglomerative clustering on medoids.

This is field is not what I'm familiar with, but I think many techniques they use are important. Also it is a good example about how to build a complete system.

[Reading] Nonlinear Dimensionality Reduction by Locally Linear Embedding

2008-03-06T22:51:00.002+08:00

This is a very important paper that provides the first efficient and simple method to non-linearly project the high-dimensional signals onto a low-dimensional space. The key concept of the non-linear projection is that the data in fact lay on a manifold in the high-dimension space, which may not be represented by any simple slice of the space (manifold could become folded on the slice).

However, the locally connected graph can approximate the shape of the manifold. As long as the manifold is locally linear, we can obtain a linear reconstruction of each sample from the near neighbors. An optimal weighting set for each sample can be found by solving a least-square problem.

After the weighting set is obtain, the low-dimension representation of the sample on the manifold can then be defined, as long as the same weighting set can still reconstruct the low-dimension representation. Finding the optimal embedding is casted as a eigenvalue problem.

The bold sentences above are the key assumptions of this paper. Unlike Isomap which measure the geodesic distances of all sample pairs, only local distance is preserved. The degree of this effect may be application-dependent. For example, in Action Synopsis, it is shown that non metric replicated multi-dimensional scaling is better than LLE because the global analysis is preserved.

Also another problem is the number of the nearest neighbors (k), which should depends the local sampling rate. Also finding exact k-NN in a high-dimensional space is costly. However, since a effective dimension reduction method is so important, this cost is acceptable.

[Reading] Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection

2008-03-06T12:51:00.002+08:00

This paper formally classify the variation of the facial appearance into different classes. The variation can be due to different persons, different expressions, or different lighting conditions. For face identification, we hope the variations due to different expressions and lightings can be reduced by processing while the variation due to different persons are preserved. The author find that a linear projection obtained by LDA can efficiently achieve this goal. This result is very important since the projection does not only reduce the computation for matching, but also minimize the effect of the facial and lighting variation to the face recognition.

One thing should be noted is that this paper is very clear. It is maybe the best document I've ever read that explains LDA best. One interesting problem is that finding the optimal projection here is a generalized eigenvalue problem: Ax = \lambda Bx, where B and A are n-by-n matrices which could be very large for big images. Though the paper does not discuss the method to compute this, I believe a trick similar to the one used eigenface should be applied here, and I will make another post after I find it. (or if someone tell me directly :P)

Since both eigenface and fisherface are very old, I can't know the state-of-the-art in face recognition. However, these two papers are both very good at showing the applications of dimension reduction techniques.

[Reading] Eigenfaces to Recognition

2008-03-04T21:02:00.003+08:00

I'm surprised by the content of this paper. I've heard eigenface for a long time, I didn't realize the true importance of this work. (And I find this is another Pentland's great paper)

In this paper an efficient way to compute eigenvectors is used (maybe not the first one if we consider other research fields) and many possible attacks to the system are examined. The first topic is very important because it provides us a way to perform PCA on high-dimensional data, as long as the number of the samples is much smaller than the dimension of the samples. The second topic is about the performance degradation due to scaling, lighting, and rotation (on the image domain, not the rigid motion of the head). The authors also discuss some possible ways to solve these problems. For example, using a multi-scale processing.

Another interesting thing is that no one is talking about neural networks anymore, but obviously it was hot at 1991.

[Reading] Scale & Affine Invariant Interest Point Detectors

2008-03-01T14:41:00.002+08:00

Though I know this paper for a long time, I didn't attempt to read it before. This paper present a method to detect the interest points which are invariant to scale or affine transformations. That is, for each detected point, a transformation matrix is also extracted which transforms the local patch around the detected point into a canonical form for efficient description and matching.

Though the formulation and the method is elegant, it is too slow for many applications. The detection part requires around 50 times computation than SIFT, and the descriptor used in the paper is not as good as the one used in SIFT. I think that's why most applications only apply SIFT. It's interesting that some people use SIFT and this detector to detect the interest points and then use the SIFT descriptor to describe the detected points. These algorithms find different types of interest points and the union of these points cover a boarder class and improve the results. In some applications this is critical: SIFT sometimes give like 10 matchings between images.

One little question is about how to update the affine transformation. The paper uses the second moment matrix (co-variance) around the local patch and many issues are not discussed: the effect of sampling, window size, etc. Nevertheless, this paper still provides me a good example on the utilization of the continuous optimization technique.

[List] Research on Feature Points

2008-02-26T23:06:00.014+08:00

This list is ordered in topics (may not be correct), from old to new. It should be noted that this list is far from completed and I will try to update it from time to time. A good overview of the field of feature detection can be find at Wikipedia (link).

[Parameter and Description Improvement and Optimization]

Y. Ke and R. Sukthankar. PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. CVPR2004. (link)
H. Bay, T. Tuytelaars, L. Van Gool, SURF: Speeded Up Robust Features, ECCV2006. (link)
S. Winder and M. Brown, Learning Local Image Descriptors, CVPR2007. (link)
G. Hua, M. Brown and S. Winder, Discriminant Embedding for Local Image Descriptors, ICCV2007. (link)

[Efficient Matching]

K. Grauman and T. Darrell. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. ICCV2005. (link)
K. Grauman and T. Darrell. Pyramid Match Hashing: Sub-Linear Time Indexing Over Partial Correspondences. CVPR2007. (link)

[Visual Word]

J. Sivic and A. Zisserman. Video Google: Efficient Visual Search of Videos Toward Category-Level Object Recognition. CVPR 2003. (link)
D. Nistér and H. Stewénius. Scalable Recognition with a Vocabulary Tree. CVPR2006. (link)
J. Philbin, O. Chum, M. Isard, J. Sivic and A. Zisserman. Object Retrieval with Large Vocabularies and Fast Spatial Matching. CVPR 2007. (link)
G. Schindler, M. brown, and R. Szeliski. City-Scale Location Recognition. CVPR 2007. (link)
O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval. ICCV2007. (link)
P. Quelhas, F. Monay, J.-M. Odobez, and D. Gatica-Perez, A Thousand Words in a Scene. PAMI2007. (link)

[Scene Reconstruction]

M. Brown and D. G. Lowe. Unsupervised 3D Object Recognition and Reconstruction in Unordered Datasets. 3DIM2005 (link)
N. Snavely, Ｓ. M. Seitz, and R. Szeliski, Photo Tourism: Exploring Photo Collections in 3D. SIGGRAPH2006. (link)
M. Goesele, N. Snavely, B. Curless, H. Hoppe, S. M. Seitz, Multi-View Stereo for Community Photo Collections. ICCV2007 (link)
I. Simon, N. Snavely, and S. M. Seitz. Scene Summarization for Online Image Collections. ICCV2007. (link)

[Others]

L. Kennedy, M. Naaman, S. Ahern, R. Nair, and T. Rattenbury. How Flickr Helps us Make Sense of the World: Context and Content in Community-Contributed Media Collections. ACM MM2007. (link)

--
Last update: 02/27/2007