Friday, April 1, 2011

Practical Probabilistic Topic Models for NLP

Latent Dirichlet Allocation (LDA) is a new and very powerful technique for finding the topics in a collection of texts using unsupervised learning. LDA is a probabilistic topic model. It was developed in 2003 and relies on advanced math. This post is a practical guide to getting started with building LDA models and software.

LDA will have a substantial impact on corpus-based natural language processing, since it opens up easy creation of semantic models based on machine learning.

Motivation for topic models


With the Internet we have large amounts of text available. Having the text categorized into topics makes text search much more precise and makes it possible to find similar documents.

Text categorization is not an easy problem:
  • Texts usually deal with more than one topic
  • There is no clear standard for categorization
  • Doing it by hand is infeasible
Nuanced categorization is a hard problem with many moving parts, but in 2003 David M. Blei, Andrew Y. Ng and Michael I. Jordan published an article on a new approach called Latent Dirichlet Allocation. LDA can be implemented based on the research articles, but if you are not a machine learning academic the math is intimidating and the material is still new.

There is actually good material available, but finding all the pieces takes some work. Most things you need are available online for free. Here is a chronological account of what I did to understand LDA and start implementing it.

Need for more sophisticated hierarchical topic models


In 2009 I needed a fine-grained classification of text, using unsupervised or semi-supervised training. I spent a little time thinking about it, and had some ideas about bootstrapped training in a 2-layered hierarchy. It was hackish and complex, and I was not sure how numerically stable it was. I never got around to implementing it.


David Blei


I went to the 4th Annual Machine Learning Symposium in 2009 and asked around for solutions to my problem. Several attendees told me to look at David Blei's work. I did, but he has written a lot of math-heavy articles, so I did not know where to start.

I was lucky to see David Blei give a presentation on LDA, first at the 5th Annual Machine Learning Symposium. David Blei works at Princeton and just exudes brilliance. He gave a lucid, entertaining description of LDA with examples. It was really shocking to see the LDA algorithm find scientific topics on its own with no human intervention.

I saw him give the same talk at the NYC Machine Learning Meetup, and luckily that one was videotaped: here are part 1 and part 2. I watched these videos a few times, and this gave me a good intuition for the algorithm.

I looked through his articles and found a good beginner article, BleiLafferty2009. I read through it several times, but I could not understand it.

I went out and bought the textbook that David Blei recommended: Pattern Recognition and Machine Learning by Christopher M. Bishop. After reading the introductory chapter, I read BleiLafferty2009 again and was able to understand it. On page 10 the essence of the algorithm is described in a small text box.
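
That text box describes LDA's generative story: each topic is a distribution over words, each document draws its own mixture of topics, and every word is produced by first picking a topic and then picking a word from that topic. As a gray-box illustration, here is a minimal sketch of that story in Python with NumPy; the topic count, vocabulary size and Dirichlet priors are made-up toy values:

    import numpy as np

    # Toy sketch of LDA's generative story (illustration values, not a fitted model).
    rng = np.random.RandomState(0)

    K = 3        # number of topics
    V = 10       # vocabulary size
    alpha = 0.1  # document-topic Dirichlet prior
    eta = 0.01   # topic-word Dirichlet prior

    # Each topic is a distribution over the whole vocabulary.
    beta = rng.dirichlet([eta] * V, size=K)          # shape K x V

    def generate_document(n_words):
        theta = rng.dirichlet([alpha] * K)           # this document's mix of topics
        words = []
        for _ in range(n_words):
            z = rng.multinomial(1, theta).argmax()   # pick a topic for this word
            w = rng.multinomial(1, beta[z]).argmax() # pick a word from that topic
            words.append(w)
        return words

    print(generate_document(20))   # a toy document, as a list of word ids

Running this story forward is the easy direction. The hard part, which the math in the articles addresses, is inference: recovering beta and the per-document topic mixtures from nothing but the observed words.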


Software implementation of LDA


There are plenty of open source implementations of LDA. Here are a few observations:

lda-c by David Blei is an implementation in old school C. The code is readable, concise and clean.

lda package for R by Jonathan Chang. It implements many models and has extensive documentation.

Online LDA in Python by Matt Hoffman. Short code, but not too much documentation.

LDA in Apache Mahout, in Java. It has an active development community and works with Hadoop / MapReduce.



No matter what language you prefer, there should be a good implementation available.
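
Whichever implementation you pick, note that the input is usually a bag-of-words file rather than raw text. lda-c, for example, reads one document per line as the number of distinct terms followed by term_id:count pairs (see its README). Here is a small Python sketch, with deliberately naive whitespace tokenization, that writes a toy corpus in that format:

    # Sketch: write documents in the sparse "M id:count id:count ..."
    # bag-of-words format that lda-c reads. Tokenization is naive on purpose.
    from collections import Counter

    def build_vocab(docs):
        vocab = {}
        for doc in docs:
            for token in doc:
                vocab.setdefault(token, len(vocab))
        return vocab

    def write_lda_c(docs, vocab, path):
        with open(path, 'w') as f:
            for doc in docs:
                counts = Counter(vocab[t] for t in doc)
                pairs = ' '.join('%d:%d' % pair for pair in sorted(counts.items()))
                f.write('%d %s\n' % (len(counts), pairs))

    docs = [text.lower().split() for text in
            ["the cat sat on the mat", "the dog ate my homework"]]
    vocab = build_vocab(docs)
    write_lda_c(docs, vocab, 'corpus.dat')

Save the vocabulary too, one term per line in id order, so you can map the discovered topics back to actual words afterwards.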


Practical software considerations


All the implementations looked good. But if you want to use LDA software in practice, robustness, scalability and extensibility are big issues. First you just want the algorithm to run on simple text input. The next day you want the following options (a sketch of such a pipeline follows the list):
  • Better word tokenizer
  • Bigrams and collocations
  • Word stemming
  • LDA on structured text
  • Read from database
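
To give a feel for the first three items, here is a sketch of such a pipeline in Python with NLTK (more on Python below). The example sentence and the PMI scoring are arbitrary illustration choices, and the NLTK 'punkt' and 'stopwords' data files must be downloaded first:

    # Sketch: tokenize, drop stopwords, stem, and score bigram collocations.
    # Assumes nltk.download('punkt') and nltk.download('stopwords') have been run.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    text = "Topic models find the latent topics in large collections of text."

    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    content = [t for t in tokens if t not in set(stopwords.words('english'))]

    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in content]
    print(stems)

    # Collocations: rank bigrams by pointwise mutual information.
    measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens)
    print(finder.nbest(measures.pmi, 5))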


Programming language choice for LDA


Here is a little common-sense advice on the choice of programming language for LDA programming.

C
C is an elegant, simple system programming language.
C is not my first choice of language for text processing.


C++
C++ is a very powerful but also complex language.
NLP lib: The Lemur Project
I would be happy to use C++ for text processing.


C#
C# is a great language.
NLP lib:  SharpNLP. 
You will have to implement LDA yourself or port one of the other implementations. SciPy is being ported to C#, but C# does not have the best numeric libraries yet.


Clojure
Clojure is a moderately sized LISP dialect built on the JVM.
NLP lib: OpenNLP through clojure-opennlp.
LISP is a classic AI language, and from Clojure you can use one of the Java LDA implementations.


Java
Java is a modern object-oriented programming language with access to every imaginable library.
NLP lib: OpenNLP.


Python
Python is an elegant language very well suited for NLP.
NLP lib: NLTK, using NumPy and SciPy


R
R is a fantastic language for statistics, but not so great for low level text processing.
NLP lib:
The R implementation of LDA looks great. I think it is common to do all the preprocessing in another language, say Perl, and then do the rest of the work in R.


Different versions of LDA


There are now a lot of different LDA models geared towards different domains. Let me just mention a few:

Online LDA
Online means that you learn the model in small batches instead of on all the documents at once. This is useful for a continuously running system.
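
The core idea, from Hoffman, Blei and Bach's 2010 paper, is to fit a variational estimate of the topic parameters on each mini-batch as if it were the whole corpus, then blend it into the running estimate with a decaying step size rho_t = (tau0 + t)^(-kappa). A schematic sketch, where estimate_lambda_hat is a hypothetical stand-in for the per-batch variational step:

    # Schematic mini-batch loop for online LDA. Only the update rule for
    # the topic parameters lambda is shown; estimate_lambda_hat is a
    # hypothetical stand-in for the per-batch variational E-step.

    def online_lda(batches, lambda0, estimate_lambda_hat, tau0=1.0, kappa=0.7):
        lam = lambda0                                  # running topic-word parameters
        for t, batch in enumerate(batches):
            rho = (tau0 + t) ** (-kappa)               # decaying step size, kappa in (0.5, 1]
            lam_hat = estimate_lambda_hat(batch, lam)  # estimate from this batch alone
            lam = (1.0 - rho) * lam + rho * lam_hat    # blend into the running estimate
        return lam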

Dynamic LDA
Good for handling text that stretches over a long time interval, say 100 years.

Hierarchical LDA
This will handle topics that are organized in hierarchies.


Gray box approach to LDA


The math needed for LDA is advanced. If you do not succeed in understanding it, I still think that you can learn to use the code, if you are willing to take something on faith and get your hands dirty.

Bedtime Science Stories: My Science Education Blog

I started a science education blog called Bedtime Science Stories. Here is a little excerpt from my first post: Can and should a 3 year old girl be into science?


I have a 3 year old daughter who has taken a bit of an interest in science. We have been talking about science when I put her to bed at night.


Last Sunday I discovered a new book called Battle Hymn of the Tiger Mother by Amy Chua, who is a law professor at Yale. She uses extreme methods to push her 2 daughters to academic excellence. They had to be the best in their class in everything except drama and physical education. Math was a topic that she really drilled them in. Just reading the back cover sent me into a rage, so much so that I decided to start a new blog, Bedtime Science Stories, just to get my anger out.

Science should not be an elite activity. Making it very competitive will make a new generation of kids hate math and science. Understanding our world is a worthwhile activity even if you are not the best in your class.

[Photo: My 3 year old daughter]