Wednesday, March 17, 2010

SharpNLP vs NLTK called from C# review

C# and VB.net have fewer open source NLP libraries than languages like C++, Java, LISP and Perl. My previous blog post, "Open Source NLP in C# 3.5 using NLTK", covered calling NLTK, which is written in Python, from IronPython embedded under C# or VB.net.

An alternative is to use SharpNLP, which is the leading open source NLP project written in C# 2.0. SharpNLP is not as big as other Open Source NLP projects. This blog posting is a short comparison of SharpNLP and NLTK embedded in C#.

Documentation

NLTK has excellent documentation, including an introductory online book on NLP and Python programming.

For SharpNLP the source code is the documentation. There is also a short introductory article by SharpNLP's author Richard J. Northedge.

Ease of learning

NLTK is very easy to work with under Python, but integrating it as embedded IronPython under C# took me a few days. It is still a lot simpler to get Python and C# to work together than Python and C++.

SharpNLP's lack of documentation makes it harder to use, but it is very simple to install.

Ease of use

NLTK is great to work with in the Python interpreter.

SharpNLP simplifies life: there is no IronPython to embed under C#, and no mismatch between the two languages to deal with.

Machine learning and statistical models

NLTK comes with a variety of machine learning and statistical models: decision trees, naive Bayes, and maximum entropy. They are very easy to train and validate, but do not perform well on large data sets.
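To make "train and validate" concrete, here is a minimal sketch of a naive Bayes text classifier using only the Python standard library. It is not NLTK's implementation; the function names (train_naive_bayes, classify, accuracy) and the toy data are made up for illustration, and add-one smoothing is an assumption on my part.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count label and per-label word frequencies from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def classify(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, labeled_docs):
    """Fraction of held-out documents whose predicted label is correct."""
    correct = sum(classify(model, tokens) == label for tokens, label in labeled_docs)
    return correct / len(labeled_docs)


# Toy training and held-out data, invented for this example.
train = [
    ("great parser and fast tagger".split(), "pos"),
    ("excellent documentation".split(), "pos"),
    ("slow and buggy tokenizer".split(), "neg"),
    ("poor documentation and buggy models".split(), "neg"),
]
held_out = [
    ("fast excellent tagger".split(), "pos"),
    ("buggy slow".split(), "neg"),
]
model = train_naive_bayes(train)
```

Calling accuracy(model, held_out) gives the validation score. The whole model is a handful of in-memory counters, which is convenient for experiments but also hints at why this style of model gets slow on large data sets.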

SharpNLP is focused on maximum entropy modeling.

Tokenizer quality

NLTK has a very simple RegEx based tokenizer that works well in most cases.

SharpNLP has a more advanced maximum entropy based tokenizer that can split "don't" into "do | n't". On the other hand it sometimes makes errors and splits a normal word into 2 words.
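The difference between the two tokenizing styles can be sketched in plain Python. The first function is a simple regex tokenizer in the spirit of the one described above, not NLTK's actual pattern. The second only imitates the "do | n't" output with a hand-written rule; SharpNLP gets there with a trained maximum entropy model, which this sketch does not attempt to reproduce. Both function names are hypothetical.

```python
import re

# A word (optionally with one internal apostrophe part) or a single
# punctuation character. This pattern is an assumption for illustration.
SIMPLE_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

# Matches tokens ending in the negation clitic, e.g. "don't" -> "do" + "n't".
NEGATION = re.compile(r"^(\w+)(n't)$", re.IGNORECASE)


def regex_tokenize(text):
    """Split text into words and punctuation with a single regex."""
    return SIMPLE_PATTERN.findall(text)


def split_negation(tokens):
    """Rule-based stand-in that splits "n't" off as its own token."""
    result = []
    for token in tokens:
        match = NEGATION.match(token)
        if match:
            result.extend(match.groups())
        else:
            result.append(token)
    return result


tokens = regex_tokenize("I don't like it.")
split_tokens = split_negation(tokens)
```

The regex version keeps "don't" as one token, while the second pass produces "do" and "n't". A fixed rule like this never splits an ordinary word by mistake, but it also cannot learn new cases the way a statistical tokenizer can.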

Development community

NLTK has an active development community, with an active mailing list.

SharpNLP's last release was in December 2006. It is a port of the Java based OpenNLP, and can read models from OpenNLP. SharpNLP has a low volume mailing list.

Code quality

NLTK lets you write programs that read from web pages, clean HTML out of text and do machine learning in a few lines of code.
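As an illustration of the "clean HTML out of text" step, here is a short sketch that uses only the standard library's html.parser module rather than any NLTK helper. The class and function names (TextExtractor, strip_html) are hypothetical, and the input is a literal string so the example runs without fetching a web page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(markup):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Joining with a space keeps words from adjacent tags apart; it can
    # leave a stray space before punctuation that directly follows a tag.
    return " ".join(" ".join(parser.parts).split())


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>NLTK is <b>easy</b> to use.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = strip_html(page)
```

The resulting text can be passed straight to a tokenizer or classifier, which is the kind of short pipeline the paragraph above describes.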

SharpNLP is written in C# 2.0 using generics. It is a port from OpenNLP and maintains a Java flavor, but it is still very readable and pleasant to work with.

License

NLTK's license is Apache License, Version 2.0, which should fit most people's needs.

SharpNLP's license is LGPL 2.1. This is a versatile license, but maybe a little harder to work with when the project is not active.

Applications

NLTK comes with a theorem prover for reasoning about semantic content of text.

SharpNLP comes with a name, organization, time, date and percentage finder.
It is very simple to add an advanced GUI, using WPF or WinForms.

Conclusion

Both packages come with a lot of functionality. They both have weaknesses, but they are definitely usable. I have both SharpNLP and embedded NLTK in my NLP toolbox.

-Sami Badawi

2 comments:

Anonymous said...

Hi, I would be really interested to see how you call NLTK from C#

Would it be possible for you to share code/project on how you did this?

I attempted it but sadly failed!

Many thanks

Unknown said...

Nice comparison of SharpNLP and NLTK embedded in C#.

karthick