Friday, March 14, 2008

[Reading] Probabilistic Latent Semantic Indexing

The course puts NMF and pLSA together. I think the professor wants us to find their relationship. With a few simple derivations, the relationship is obvious. Let d denote the documents, z the topics, and w the words. We have:
p(w|d) = \sum_z p(w|z) p(z|d),
where, for a fixed document d, p(w|d) is a |W|-dimensional vector, each p(w|z) is a |W|-dimensional vector, and each p(z|d) is a scalar. We can stack the p(w|z) vectors over the different z's as the columns of a |W|x|Z| matrix H, and collect the scalars p(z|d) into a |Z|x1 vector, one such column per document in a |Z|x|D| matrix V. If we likewise take the p(w|d) vectors of the different d's as the columns of a |W|x|D| matrix A, then we have
A = HV,
where the entries of H and V are all non-negative. Since A is given and what we want are H and V, this is in fact an NMF, plus a sum-to-one constraint so that each column is a valid pdf.
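The dimension bookkeeping above can be sanity-checked numerically. Below is a minimal NumPy sketch (the sizes W, Z, D are illustrative choices of mine): if every column of H and every column of V is a valid pdf, then every column of A = HV is automatically a valid pdf too.

```python
import numpy as np

rng = np.random.default_rng(0)
W, Z, D = 6, 2, 4  # vocabulary size, number of topics, number of documents

# H: column z is p(w|z), a pdf over the |W| words
H = rng.random((W, Z))
H /= H.sum(axis=0)

# V: column d is p(z|d), a pdf over the |Z| topics
V = rng.random((Z, D))
V /= V.sum(axis=0)

# A = HV: column d is p(w|d) = sum_z p(w|z) p(z|d)
A = H @ V

# Each column of A sums to one, i.e. is a valid pdf over words
print(np.allclose(A.sum(axis=0), 1.0))
```

The reason is just the law of total probability applied column by column: summing A[:, d] over w gives sum_z p(z|d) * (sum_w p(w|z)) = sum_z p(z|d) = 1.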

However, the main difference is that the pLSA paper examines the factorization in a probabilistic framework and carries it out with EM, whereas NMF papers usually cast it as an optimization problem.
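To make the EM side concrete, here is a minimal sketch of the standard pLSA updates in NumPy (the counts matrix N and the variable names pwz, pzd are my own; the update equations are the usual E-step p(z|w,d) ∝ p(w|z)p(z|d) and M-step re-estimation from expected counts):

```python
import numpy as np

rng = np.random.default_rng(1)
W, Z, D = 8, 2, 5                                    # illustrative sizes
N = rng.integers(1, 10, size=(W, D)).astype(float)   # word-document counts n(w,d)

# Random pdf initializations for p(w|z) and p(z|d)
pwz = rng.random((W, Z)); pwz /= pwz.sum(axis=0)
pzd = rng.random((Z, D)); pzd /= pzd.sum(axis=0)

for _ in range(50):
    # E-step: p(z|w,d) proportional to p(w|z) p(z|d); shape (Z, W, D)
    joint = pwz.T[:, :, None] * pzd[:, None, :]
    pzwd = joint / joint.sum(axis=0, keepdims=True)

    # M-step: re-estimate both pdfs from the expected counts n(w,d) p(z|w,d)
    expected = N[None, :, :] * pzwd
    pwz = expected.sum(axis=2).T        # sum over d, then transpose to (W, Z)
    pwz /= pwz.sum(axis=0)
    pzd = expected.sum(axis=1)          # sum over w, shape (Z, D)
    pzd /= pzd.sum(axis=0)
```

Each iteration is guaranteed not to decrease the data log-likelihood, which is the usual EM argument; in the NMF view, the same fixed point can be reached by multiplicative updates under the KL-divergence objective.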

I had not read the original pLSA paper before; I went straight to the LDA paper. I believe it is necessary and beneficial to read pLSA first, since it discusses at length how and why the model is designed, and it also gives many good application examples.
