Friday, May 30, 2008

[Reading] The pagerank citation ranking: Bringing order to the web

This is a very famous paper. Almost all people who searched for the answer of "how Google works" will reach this paper. The basic idea is unbelievably simple, just a simple application of the random walk theory. Given an initial positions of many walkers on the web, the final distribution of these walkers is the importance of the web pages. Thanks to the AMMAI course, I can easily think that this is a linear algebra problem.

Because the number of the webpages is very large and we only want one eigenvector, an iterative method is used in the paper. An interesting thing is that the iteration step is not the gradient of the cost function but the authors claim that the convergence is faster. Another issue is that the importance of webpages is l1-normalized. However, I don't really know why this is necessary.

It's interesting to think, how pagerank works in the environment of Web 2.0 now (of course Google must handle this already...). The webpages are now very dynamic. The link structures change very fast and it's almost impossible to build the full linear system (there are billions of webpages now, and each one can easily link to hundred of webpages). Actually maybe Google do not work so well least recently I feel depressed at searching on Google :P

Sunday, May 25, 2008

[Reading] Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

This paper presents a new method for mining frequent patterns. The main technique is a new data structure, FP tree, for the compact representation of the database. A FP tree can compactly represent all frequent patterns (no information loss) and enable fast query (by traversing the tree and using some range pointers).

There are always some detailed examples in describing the data structure and mining methods, usually appear before the proofs of the lemmas. This makes the paper very easy to understand. Also all pseudo codes are provides and thus it should be simple to implement the algorithm. Even the FP tree always greatly compresses the database. the authors still provide a projection method to reduce the required memory.

Because this method makes no approximation to the problem, the results are identical to the ground truth, and obtained much faster than the previous methods. In summary, this is a very exciting work !

[Reading] Mining Association Rules between Sets of Items in Large Databases

This paper presents a system to find the association rules from a large number of transaction records. At the time of publication, the problem and the solution of this field might be very twilit and this should be a very pioneering work.

After properly formulating the problem, the problem can be decomposed into 2 sub-problems, and the trivial solutions become obvious. However, the straightforward algorithm is too expensive, and thus the authors provide many acceleration methods. I personally like the analysis at the memory analysis part.

The main difficulty in reading this paper is that the writing style is very old (or at least very different to the papers in my field). Also the results are obtained from a dataset without ground truth.

Saturday, May 17, 2008

[Reading] Normalized Cuts and Image Segmentation

I read this paper long time ago...and I skipped main derivation at that time :P. The key contribution of this paper is, the authors use an specified energy function to measure the quality of the segment, and then show that this energy minimization problem can be solved by the generalized eigensystem, which can be linked to the spectral graph theory. In my opinion, the forward way (normalized cut --> eigensystem) is much harder to derive than the backward way. If we think in the backward way, we may say that this is another way to interpret the spectral clustering.

Hope the presenters next week can give us a more insightful view to the spectral graph theory...I was not good at linear algebra.

Tuesday, May 13, 2008

[Reading] On Spectral Clustering: Analysis and An Algorithm

This paper presents a simple algorithm of clustering, based on the spectral theory. Compare with other spectral clustering methods (such as normalized cut), the algorithm is much easier to understand and implement. The key idea is that, unlike previous methods, many eigenvectors are used instead of only one.

For the trivial case, it's easy to understand that the eigenspace perfectly represents the clustering result. However, for the general case, the analysis is difficult to understand (at least to me). The authors make 4 assumptions, and also show those assumptions are reasonable in the general cases. Under these assumptions, the performance of the algorithm is guaranteed.

In summary, I seldom like the papers from NIPS....too hard for me...