LDA (latent Dirichlet allocation) is likely to have a substantial impact on corpus-based natural language processing, since it makes it easy to build semantic models with machine learning.
Motivation for topic models
With the Internet we have huge amounts of text available. Having the text categorized into topics makes text search much more precise and makes it possible to find similar documents.
Text categorization is not an easy problem:
- Texts usually deal with more than one topic
- There is no clear standard for categorization
- Doing it by hand is infeasible
There is actually good material available, but finding all the pieces takes some work. Most things you need are available online for free. Here is a chronological account of what I did to understand LDA and start implementing it.
Need for more sophisticated hierarchical topic models
In 2009 I needed a fine-grained classification of text, using unsupervised or semi-supervised training. I spent a little time thinking about it, and had some idea about bootstrapped training in a two-layer hierarchy. It was hackish and complex, and I was not sure how numerically stable it was. I never got around to implementing it.
I went to the 4th Annual Machine Learning Symposium in 2009 and asked around for solutions to my problem. Several attendees told me to look at David Blei's work. I did, but he has written a lot of math-heavy articles, so I did not know where to start.
I was lucky to see David Blei give a presentation on LDA at the 5th Annual Machine Learning Symposium. David Blei works at Princeton and just exudes brilliance. He gave a lucid, entertaining description of LDA with examples. It was really striking to see the LDA algorithm find scientific topics on its own with no human intervention.
I saw him give the same talk at the NYC Machine Learning Meetup, and luckily that talk was videotaped; here are part 1 and part 2. I watched these videos a few times, which gave me a good intuition for the algorithm.
I looked through his articles and found a good beginner article, BleiLafferty2009. I read through it several times, but I could not understand it.
I went out and bought the textbook that David Blei recommended: Pattern Recognition and Machine Learning by Christopher M. Bishop. After reading the introductory chapter, I read BleiLafferty2009 again and was able to understand it. On page 10 the essence of the algorithm is described in a small text box.
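The essence of that text box is LDA's generative story: each document gets its own topic mixture drawn from a Dirichlet, and each word is generated by first picking a topic from that mixture and then picking a word from the topic's word distribution. Here is a minimal sketch of that generative process in plain Python; the function names and the toy two-topic setup are my own illustration, not code from any of the implementations discussed below.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw a probability vector from Dirichlet(alphas) via normalized Gamma samples."""
    xs = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(xs)
    return [x / total for x in xs]

def sample_categorical(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_document(topics, alpha, n_words, rng):
    """LDA's generative process for one document: draw a per-document
    topic mixture theta, then for each word draw a topic z from theta
    and a word id from that topic's word distribution."""
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)
        words.append(sample_categorical(topics[z], rng))
    return words

# Toy example: 2 topics over a 4-word vocabulary (word ids 0..3).
rng = random.Random(0)
topics = [[0.5, 0.5, 0.0, 0.0],   # topic 0 favors words 0 and 1
          [0.0, 0.0, 0.5, 0.5]]   # topic 1 favors words 2 and 3
doc = generate_document(topics, alpha=0.1, n_words=20, rng=rng)
```

With a small alpha, each generated document tends to be dominated by one topic, which is exactly why inference can pull the topics back out of a corpus. Learning LDA is this process run in reverse: given only the documents, infer the topics and mixtures.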
Software implementation of LDA
There are plenty of open source implementations of LDA. Here are a few observations:
lda-c by David Blei is an implementation in old-school C. The code is readable, concise and clean.
lda, an R package by Jonathan Chang. It implements many models and has extensive documentation.
Online LDA in Python by Matt Hoffman. The code is short, but there is not much documentation.
LDA in Apache Mahout, in Java. It has an active development community and works with Hadoop / MapReduce.
Practical software considerations
All the implementations looked good. But if you want to use LDA software, then robustness, scalability and extensibility are big issues. At first you just want the algorithm to run on simple text input. The next day you want the following options:
- Better word tokenizer
- Bigrams and collocations
- Word stemming
- LDA on structured text
- Read from database
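To make the first three items on that list concrete, here is a minimal stdlib-only sketch of a tokenizer, a crude suffix stemmer, and bigram extraction. These helper names are my own, and the stemmer is deliberately naive; in practice you would use a library such as NLTK's Porter stemmer.

```python
import re

def tokenize(text):
    """Lowercase word tokenizer (a crude stand-in for a real NLP tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Very rough suffix stripping; a real system would use e.g. a Porter stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bigrams(tokens):
    """Adjacent word pairs, a first step toward collocation detection."""
    return list(zip(tokens, tokens[1:]))

tokens = [crude_stem(t) for t in tokenize("Topic models cluster the words of documents")]
pairs = bigrams(tokens)
```

The point is not this particular code but the shape of the problem: every one of these steps is something you will want to swap out or extend, which is why extensibility of the LDA package you choose matters so much.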
Programming language choice for LDA
Here is a little common-sense advice on the choice of programming language for LDA work.
C is an elegant, simple system programming language.
C is not my first choice of a language for text processing.
C++ is a very powerful but also complex language.
NLP lib: The Lemur Project. I would be happy to use C++ for text processing.
C# is a great language.
NLP lib: SharpNLP.
You will have to implement LDA yourself or port one of the other implementations. SciPy is being ported to C#, but C# does not have the best numeric libraries.
Clojure is a moderate-sized LISP dialect built on the Java JVM.
NLP lib: OpenNLP through clojure-opennlp.
LISP is the classic AI language, and you can use one of the Java LDA implementations.
Java is a modern object-oriented programming language with access to every imaginable library.
NLP lib: OpenNLP.
Python is an elegant language very well suited for NLP.
NLP lib: NLTK, which builds on NumPy and SciPy.
R is a fantastic language for statistics, but not so great for low level text processing.
The R implementation of LDA looks great. I think it is common to do all the preprocessing in another language, say Perl, and then do the rest of the work in R.
Different versions of LDA
There are now a lot of different LDA models geared towards different domains. Let me just mention a couple:
Online means that you learn the model in small batches instead of on all the documents at once. This is useful for a continuously running system.
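The mini-batch idea can be sketched in a few lines: keep global statistics, and after each batch blend them with the batch's estimate using a decaying step size. This is only a schematic of the update schedule used in stochastic approaches like Matt Hoffman's online LDA, not his algorithm; the function names, the blended statistic, and the default tau0/kappa values here are my own illustration.

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t) ** -kappa.
    A kappa in (0.5, 1] decays slowly enough for stochastic updates
    to converge while still forgetting early noisy batches."""
    return (tau0 + t) ** -kappa

def online_update(global_stats, batch_stats, rho):
    """Blend the old global statistics with the estimate from one mini-batch."""
    return [(1.0 - rho) * g + rho * b for g, b in zip(global_stats, batch_stats)]

# Toy illustration: a running statistic updated from three mini-batches.
stats = [0.0, 0.0]
for t, batch in enumerate([[4.0, 0.0], [2.0, 2.0], [0.0, 4.0]]):
    stats = online_update(stats, batch, learning_rate(t))
```

Because each step only touches one batch, the model can keep absorbing new documents forever, which is what makes the online variant a fit for a continuously running system.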
Good for handling text that stretches over a long time interval, say 100 years.
This handles topics that are organized in hierarchies.
Gray box approach to LDA
The math needed for LDA is advanced. If you do not succeed in understanding it, I still think you can learn to use the code, as long as you are willing to take a few things on faith and get your hands dirty.