Thursday, March 11, 2010

Open Source NLP in C# 3.5 using NLTK

I am working on natural language processing algorithms in a C# 3.5 environment. I did not find any open source NLP packages for C# or VB.NET.
NLTK is a great open source NLP package written in Python. It comes with an online book. I decided to try to embed IronPython under C# and run NLTK from there. Here are a few thoughts about the experience.

Problems with embedding IronPython and NLTK

  • Some libraries that NLTK uses, e.g. zlib and numpy, are not installed in IronPython; you can mostly patch this up
  • You need a good understanding of how embedded IronPython works
  • The connection between Python and C# is not seamless
  • Sending data between Python and C# takes work
  • NLTK is pretty slow at starting up
  • Doing large scale machine learning in NLTK is slow
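
The post does not spell out how the missing libraries were patched. One generic Python technique, which is not necessarily what was done here, is to register a placeholder module before NLTK is imported, so that an "import zlib" no longer fails. The sketch below assumes that approach; the helper name install_stub is made up for illustration. Whether NLTK then runs depends on which parts of it actually call into the missing module.

```python
import sys
import types


def install_stub(name, **attrs):
    """Register a placeholder module called `name` if it cannot be imported.

    Hypothetical helper for illustration. Returns True when a stub was
    installed and False when the real module was already importable.
    """
    try:
        __import__(name)
        return False
    except ImportError:
        stub = types.ModuleType(name)
        # Optional fake attributes for code that touches the module at import time.
        for attr_name, value in attrs.items():
            setattr(stub, attr_name, value)
        sys.modules[name] = stub
        return True
```

It would be called once, before the first NLTK import, e.g. install_stub("zlib"). Any code path that really needs the library will still fail when it reaches for a function the stub does not have.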

C# and IronPython

IronPython is a very good implementation of Python, but in C# 3.5 there is still a mismatch between C# and Python; this becomes an issue when you are dealing with a library as big as NLTK.
The integration between IronPython and C# is going to improve with C# 4.0. How much remains to be seen.

To embed or not to embed

When is embedding IronPython and NLTK inside C# a good idea?

Separate processes for NLTK under CPython and C#

If your C# tasks and your NLP tasks do not interact much, it might be simpler to have a C# program call an NLP program running under CPython as an external process. For example, say you want to analyze the content of a Word document. You would open the Word document in C#, create a Python process, pipe the text into it, read the result back as JSON or XML, and display it in ASP, WPF or WinForms.
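
The Python side of such a pipe could look roughly like the sketch below. It is only an illustration under some assumptions: the function names analyze and serve are made up, the output fields are arbitrary, and the regular-expression tokenizer is a stand-in for whatever NLTK call the real program would make.

```python
import json
import re


def analyze(text):
    """Return a JSON-serializable summary of `text`.

    The regex tokenizer is a placeholder; a real worker would call NLTK here.
    """
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    return {"token_count": len(tokens), "tokens": tokens}


def serve(in_stream, out_stream):
    """Read all text from in_stream and write one JSON document to out_stream."""
    result = analyze(in_stream.read())
    json.dump(result, out_stream)
    out_stream.write("\n")
    out_stream.flush()
```

Ending the script with serve(sys.stdin, sys.stdout), after an import sys, would turn it into a worker that the C# program starts with redirected standard input and output. The C# side then only has to write the document text, close the input, and parse the single JSON line that comes back.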

Small NLP tasks

There is a learning curve for both NLTK and embedded IronPython that slows you down when you start work. For a small task, that setup cost can outweigh the benefit.

Medium sized NLP projects

The setup cost is not an issue, so embedding IronPython and NLTK could work very well here.

Big NLP projects

The setup cost is not an issue, but at some point the mismatch between Python and C# will start to outweigh the advantages you get.

Prototyping in NLTK

Start writing your application in NLTK either under CPython or IronPython. This should improve development time substantially. You might find that your prototype is good enough and you do not need to port it to C#; or you will have a working program that you can port to C#.

-Sami Badawi

7 comments:

Paulo Gomes said...

Have you checked this .Net NLP platform:

http://www.proxem.com/Default.aspx?tabid=119

It's still in beta, but you can give it a go anyway.

Unknown said...

Sami
I need step-by-step help with installing Python and integrating the NLTK library and IronPython.

I am working on a project to auto-summarize using key phrases, so I need a POS tagger for it. The platform I am working on is C# 4.0 on Visual Studio 2010.

Reply ASAP.

Su said...

Hi, can you please explain how you solved this problem? E.g. where to copy zlib, etc.
Thanks

Unknown said...

How did nlp practitioner the NLP Thorough educating guidebook? I experienced modifying a predicament just by shifting my response to it.

Anonymous said...

These libraries should also extend to Python and C++. Counseling perth uses Linux and Unix frameworks for NLP.

Unknown said...

It is summer, 2013. Have you done anything new with this using .NET 4.5? Is there an "integration cookbook" out there?

Sami Badawi said...

Hi Isamu,

No, I have not looked at this for a long time.