Friday, April 1, 2011

Practical Probabilistic Topic Models for NLP

Latent Dirichlet Allocation (LDA) is a relatively new and very powerful technique for finding the topics in a collection of texts using unsupervised learning. LDA is a probabilistic topic model. It was developed in 2003 and relies on advanced math. This post is a practical guide to getting started with building LDA models and software.

LDA will have a substantial impact on corpus-based natural language processing, since it makes it easy to create semantic models with machine learning.

Motivation for topic models


With the Internet we have large amounts of text available. Having the text categorized into topics makes text search much more precise and makes it possible to find similar documents.

Text categorization is not an easy problem:
  • Texts usually deal with more than one topic
  • There is no clear standard for categorization
  • Doing it by hand is infeasible
Nuanced categorization is a hard problem with many moving parts, but in 2003 David M. Blei, Andrew Y. Ng and Michael I. Jordan published an article on a new approach called Latent Dirichlet Allocation. LDA can be implemented based on the research articles, but if you are not a machine learning academic the math is intimidating and the material is still new.

There is actually good material available, but finding all the pieces takes some work. Most of what you need is available online for free. Here is a chronological account of what I did to understand LDA and start implementing it.

Need for more sophisticated hierarchical topic models


In 2009 I needed a fine-grained classification of text, using unsupervised or semi-supervised training. I spent a little time thinking about it and had some ideas about doing bootstrapped training in a two-layered hierarchy. It was hackish and complex, and I was not sure how numerically stable it would be. I never got around to implementing it.


David Blei


I went to the 4th Annual Machine Learning Symposium in 2009 and asked around for solutions to my problem. Several attendees told me to look at David Blei's work. I did, but he has written a lot of math-heavy articles, so I did not know where to start.

I was lucky to see David Blei give a presentation on LDA, first at the 5th Annual Machine Learning Symposium. David Blei works at Princeton and just exudes brilliance. He gave a lucid, entertaining description of LDA with examples. It was really shocking to see the LDA algorithm find scientific topics on its own with no human intervention.

I saw him give the same talk at the NYC Machine Learning Meetup, and luckily that talk was videotaped (part 1 and part 2). I watched these videos a few times, which gave me a good intuition for the algorithm.

I looked through his articles and found a good beginner article, BleiLafferty2009. I read through it several times, but I could not understand it.

I went out and bought the textbook that David Blei recommended: Pattern Recognition and Machine Learning by Christopher M. Bishop. After reading the introductory chapter, I read BleiLafferty2009 again and was able to understand it. On page 10 the essence of the algorithm is described in a small text box.
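
For readers who want something concrete before opening the papers, here is a minimal sketch of the standard LDA generative story: each document draws its own topic proportions from a Dirichlet distribution, and each word is produced by first picking a topic and then picking a word from that topic. This is not a reproduction of the text box in the article. The vocabulary, the two hand-written topics and the helper names are made up for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index with probability proportional to the given weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topics, alpha, length, rng):
    """Generate one document (as word ids) from fixed topic-word distributions."""
    theta = sample_dirichlet(alpha, rng)     # per-document topic proportions
    words = []
    for _ in range(length):
        z = sample_index(theta, rng)         # choose a topic for this word
        w = sample_index(topics[z], rng)     # choose a word from that topic
        words.append(w)
    return theta, words

# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
vocab = ["gene", "dna", "neuron", "brain"]
topics = [
    [0.45, 0.45, 0.05, 0.05],   # a "genetics"-like topic
    [0.05, 0.05, 0.45, 0.45],   # a "neuroscience"-like topic
]
rng = random.Random(0)
theta, word_ids = generate_document(topics, alpha=[0.5, 0.5], length=10, rng=rng)
doc = [vocab[i] for i in word_ids]
```

Fitting an LDA model is this process run in reverse: given only the documents, estimate the topics and the per-document proportions that most plausibly produced them.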


Software implementation of LDA


There are plenty of open source implementations of LDA. Here are a few observations:

lda-c by David Blei is an implementation in old-school C. The code is readable, concise and clean.

lda, an R package by Jonathan Chang. It implements many models and has extensive documentation.

Online LDA in Python by Matt Hoffman. Short code, but not too much documentation.

LDA in Apache Mahout, in Java. It has an active development community and works with Hadoop / MapReduce.



No matter what language you prefer there should be a good implementation.


Practical software considerations


All the implementations looked good. But if you want to use LDA software, then robustness, scalability and extensibility are big issues. At first you just want the algorithm to run on simple text input. The next day you want the following options:
  • Better word tokenizer
  • Bigrams and collocation
  • Word stemmer
  • LDA on structured text
  • Read from database
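
To give a feel for the first three items on that list, here is a minimal preprocessing sketch in plain Python. It is only an illustration: the regular expression, the function names and the suffix list are my own assumptions, and crude_stem is a toy suffix stripper, not the Porter stemmer or any library stemmer.

```python
import re
from collections import Counter

# Assumption: lowercase ASCII words, optionally with one internal apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())

def crude_stem(word):
    """Toy suffix stripper; NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bigrams(tokens):
    """Return all adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

def frequent_bigrams(docs_tokens, min_count=2):
    """Bigrams seen at least min_count times; a very rough collocation filter."""
    counts = Counter(bg for toks in docs_tokens for bg in bigrams(toks))
    return {bg for bg, c in counts.items() if c >= min_count}

def preprocess(text):
    """Tokenize, then stem every token."""
    return [crude_stem(t) for t in tokenize(text)]
```

In a real system each of these functions is a natural extension point, so it pays to keep them separate from the LDA code itself.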


Programming language choice for LDA


Here is some common-sense advice on choosing a programming language for LDA work.

C
C is an elegant, simple system programming language.
C is not my first choice of a language for text processing.


C++
C++ is a very powerful but also complex language.
NLP lib: The Lemur Project
I would be happy to use C++ for text processing.


C#
C# is a great language.
NLP lib:  SharpNLP. 
You will have to implement LDA yourself or port one of the other implementations. SciPy is being ported to C#, but C# does not have the best numeric libraries.


Clojure
Clojure is a moderate-sized LISP dialect built on the JVM.
NLP lib: OpenNLP through clojure-opennlp.
LISP is a classic AI language, and you can use one of the Java LDA implementations.


Java
Java is a modern object-oriented programming language with access to every imaginable library.
NLP lib: OpenNLP.


Python
Python is an elegant language very well suited for NLP.
NLP lib: NLTK, using NumPy and SciPy


R
R is a fantastic language for statistics, but not so great for low level text processing.
NLP lib:
The R implementation of LDA looks great. I think it is common to do all the preprocessing in another language, say Perl, and then do the rest of the work in R.


Different versions of LDA


There are now a lot of different LDA models geared towards different domains. Let me just mention a couple:

Online LDA
Online means that you train the model in small batches instead of on all the documents at once. This is useful for a continuously running system.
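
The batch pattern can be sketched without any of the LDA math. The toy below is not Online LDA: it only tracks overall word frequencies, where the real algorithm would update the topics. What it shows is the loop structure: estimate something from one small batch, then blend it into the running estimate with a weight that shrinks over time. The decaying weight has the form (tau0 + t) ** (-kappa), which as far as I can tell is the kind of schedule the Online LDA paper uses; the default values and all function names here are my own assumptions.

```python
def mini_batches(documents, batch_size):
    """Yield successive batches of documents."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]

def step_size(t, tau0=1.0, kappa=0.7):
    """Weight given to batch number t (0-based); shrinks as t grows."""
    return (tau0 + t) ** (-kappa)

def blend(old, new, rho):
    """Move the running estimate a fraction rho of the way toward the batch estimate."""
    return [(1.0 - rho) * o + rho * n for o, n in zip(old, new)]

def word_frequencies(batch, vocab_size):
    """Toy batch statistic: relative word frequencies within the batch."""
    counts = [0] * vocab_size
    for doc in batch:
        for w in doc:
            counts[w] += 1
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical documents as lists of word ids, with a vocabulary of size 3.
docs = [[0, 1, 1], [1, 2], [2, 2, 0], [0, 0, 1]]
estimate = [1.0 / 3] * 3
for t, batch in enumerate(mini_batches(docs, batch_size=2)):
    batch_estimate = word_frequencies(batch, vocab_size=3)
    estimate = blend(estimate, batch_estimate, step_size(t))
```

Because each batch is processed once and then discarded, the same loop works on a stream of documents that never ends.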

Dynamic LDA
Good for handling text that stretches over a long time interval, say 100 years.

Hierarchical LDA
This handles topics that are organized in hierarchies.


Gray box approach to LDA


The math needed for LDA is advanced. Even if you do not succeed in understanding it, I still think that you can learn to use the code, if you are willing to take some things on faith and get your hands dirty.
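
As one way of getting your hands dirty, here is a minimal sketch of LDA trained with collapsed Gibbs sampling in plain Python. Note that this is a different inference method from the variational approach in the original LDA article; I use it here only because it fits in a few dozen lines. The function name, the default hyperparameters and the tiny corpus are assumptions for illustration, and the code is far too slow for real data.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=200, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                               # words assigned to topic k overall
    assignments = []

    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample: how much doc d likes topic k, times how much topic k likes word w.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed estimates of topic-word and document-topic distributions.
    phi = [[(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(num_topics)]
    theta = [[(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
              for k in range(num_topics)] for d, doc in enumerate(docs)]
    return theta, phi

# Hypothetical corpus: word ids index into vocab.
vocab = ["gene", "dna", "neuron", "brain"]
docs = [[0, 1, 0, 1, 1], [2, 3, 3, 2, 3], [0, 0, 1, 1], [3, 2, 2, 3]]
theta, phi = gibbs_lda(docs, num_topics=2, vocab_size=len(vocab))
```

Each row of phi is one topic's distribution over the vocabulary, and each row of theta is one document's mix of topics. Printing the highest-probability words in each row of phi is the usual way to inspect what the model found.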

Comments:

Frédéric Morain-Nicolier said...

Hello,

I'm Frédéric Morain-Nicolier, a french researcher on computer vision working on visual similarity. I also maintain a blog on image analysis and processing (Pixel shaker at http://pixel-shaker.fr, in french and english).

Yesterday, i created another blog (Pixel Shakers at http://shakers.pixel-shaker.fr ) that is an aggregator of the blogs on computer vision, machine vision and image processing. The current list includes the following blogs (including your blog) :

AI Computer Vision
Computer Vision Central’s blog
Computer Vision Software
Cris’s Image Analysis Blog
Helping The Blind
Learning in Vision
Pixel shaker
solem’s vision blog
Steve on Image Processing
tombone’s blog

The aim is to try to aggregates the contents of blogs (on vision). The advantages would be a better communications between our blogs, potentially expanding the audience of the blogs and allowing searches of blog posts on all the blogs on a given subject (e.g. image segmentation http://shakers.pixel-shaker.fr/?cat=348).

This is a first trial and a lot of points can be improved :
- the interface
- the DNS
- ...

If you have the time, just send me your opinion of this initiative. And if you are aware of other blogs (in any language), help me to complete the list. And i am very open to any  suggestion or help on this work.

regards,
Frédéric Morain-Nicolier
f.nicolier@gmail.com

Unknown said...

First, thank you for the simple explanation of the LDA method. I recently found another implementation of LDA, in C#, on this page: "https://gibbsldasharp.codeplex.com/". Which do you think is more powerful for processing text, C or C#?
Thank you.