Tuesday, July 12, 2011

Natural language processing in F# and Scala

I do natural language processing in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I have some unmet needs. I am investigating if there are any new languages that would help.

In 2010 I tried out 3 new languages:
Natural language processing in Clojure, Go and Cython

Recently I have investigated F# and Scala. Both are hybrid functional / object-oriented languages, inspired by ML / OCaml / Haskell on one side and Java / C# on the other.

Python as the benchmark

Python is widely used in natural language processing. I am most productive in Python for NLP work. Here are a few reasons why:

  • NLTK is a great Python NLP library
  • Lots of open source math and science libraries, e.g. NumPy and SciPy
  • PyDev is a good development environment
  • Good integration with MongoDB library
  • Great for rapid development

Python shortcomings
  • Slow compared to compiled languages
  • GUI support is crude
  • Multi-threading is crude
  • No compile step; compilation does give more robustness

It should be possible to make a super language that has the elegance of Python, but without these shortcomings.


My first Scala experience

In 2006 I thought Scala was this super language. It is very advanced; you can call any Java libraries from Scala, including all the open source libraries. But I ran into a list of problems with Scala:

  • The Scala IDE was far behind Eclipse Java
  • Scala is a quite complex language
  • The Java libraries and the functional programming libraries were badly integrated
  • There was no Scala REPL or interpreter like in Python

Scala was stable enough for use, but it did not improve my productivity, so after some months I went back to using Python as my scripting language.


Python's weakness

Recently I had to make a small text processing application that end users could use directly. This was not the best fit for Python. Normally my Python programs have no GUI and are controlled by command line parameters.

I had 2 Python options:

Make a simple GUI using Tkinter
Tkinter is a Python wrapper of Tk, the cross-platform GUI toolkit. It is pretty crude by modern GUI standards, but would have been good enough. However, trying to install all the Python libraries that I needed on the end user's machine would be setting myself up for a maintenance nightmare.

Wrap code in web application
I could wrap a web interface around it. But the application uses a lot of memory, and I would have to maintain a web application.

I had a 1 week hard deadline for the task and both of these options looked unappealing. I needed something else...


My first F# application

I took a chance on F#, and managed to learn enough F# to finish the program by my 1 week deadline.

There is no GUI builder for F# in Visual Studio, but it was pretty easy to hand code a simple WinForms GUI to wrap around the core code. It was not pretty, but you could give it to an end user. The whole application ended up being one 40KB executable file, and it was very fast. F# had actually filled a niche that Python does not fill so well.

There were also problems: I wrote the whole application from scratch, whereas in Python I would have been able to use NLTK, write the code faster and get better results.

All in all this was a very good experience. I thought that F# would be a good supplement to my Python library. It would give me both raw speed when I need it and good connectivity with C#, ASP.NET, WPF and Microsoft Office.


Functional programming benefits

Functional programming is a great fit for my NLP work.

I have a lot of different text sources: database, flat file, directory, public RESTful web application services.

I have many word transformations: stop word filters, stemmers, custom filters.

I need many operations building on other operations: bigram finder, POS tagger, named entity recognizer.

I create different reports: database, CSV, Excel.

In functional languages you can just take any combination of these operations and easily pipe them together while getting good compiler support. This does not fit so well with object oriented programming, where you are more concerned with encapsulation.
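
A minimal sketch of this piping idea, written in Python since that is the benchmark language here. Everything in it is an assumption made for illustration: the stop word list, the helper names (pipe, tokenize, remove_stop_words, bigrams, to_csv_lines) and the sample sentence. Note that Python gives you the composition but not the compiler support mentioned above.

```python
from functools import reduce

# Hypothetical stop word list, for illustration only
STOP_WORDS = {"the", "a", "of", "is", "in"}


def pipe(*steps):
    """Compose steps left to right: pipe(f, g)(x) == g(f(x))."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


def tokenize(text):
    # Naive whitespace tokenizer; lowercases and strips simple punctuation
    return [w.strip(".,;:!?").lower() for w in text.split()]


def remove_stop_words(words):
    # Word transformation: drop empty tokens and stop words
    return [w for w in words if w and w not in STOP_WORDS]


def bigrams(words):
    # Operation building on the previous ones: pair each word with its successor
    return list(zip(words, words[1:]))


def to_csv_lines(pairs):
    # One simple "report" format: a CSV line per bigram
    return [f"{a},{b}" for a, b in pairs]


# Any combination of steps can be piped together in one line
bigram_report = pipe(tokenize, remove_stop_words, bigrams, to_csv_lines)

if __name__ == "__main__":
    for line in bigram_report("The cat sat in the hat."):
        print(line)
```

Swapping the text source, adding a stemmer or changing the report format then means changing one argument to pipe, not restructuring a class hierarchy.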


F# impression

F# is the first compiled language I tried that is comparable to Python in simplicity and elegance. It has a real Pythonic feel:

  • F# is fast
  • Simple and elegant
  • Good development environment in Visual Studio 2010
  • Best concurrency support of any language I have seen
  • Good database support
  • Good MongoDB library
  • Simple to combine F# with C# or VB.NET for ASP or WPF
  • Good REPL

Issues

  • Runs best under Windows
  • For an IDE you really need Visual Studio 2008 or 2010, and that costs at least $700
  • F# can be compiled, and the interactive shell run, from SharpDevelop 4.0 and 4.1, but you do not have the same productivity
  • The math libraries under .NET are not as good as NumPy and SciPy
  • The NLP libraries are better under Python


Scala revisited

After the success with F# I was very curious about why my F# experience had been so much more successful than my first experience with Scala.

I looked at an F# and a Scala cheat sheet and thought they looked remarkably similar. I watched a few screencasts and found no obvious problems. I bought the book Programming in Scala, Second Edition. It turned out to be a very interesting computer science book, and I read all 852 pages. Scala still looked good.

I installed the Scala Eclipse plugin and wrote some code. Both the language and the IDE have come a long way during the last 5 years:
  • 15 books about Scala
  • 2 great free books
  • Tooling is much better
  • IDE is much better with code completion
  • Native NLP libs: ScalaNLP and Kiama

Of all the issues I had when I first tried Scala, the only remaining one is:
Scala is a pretty complex language

It is incredible how Scala has taken a lot of messy features from Java and turned them into a clean modular system, at the cost of some complex abstractions.


F# vs. Scala

Despite many similarities, the languages have a different feel. F# is simpler to understand, while Scala is the more orthogonal language. I have been very impressed by both.

F# better

  • Simpler to understand
  • Fantastic concurrency
  • Tail recursion optimized
  • Works well with Windows Azure

Scala better 

  • More orthogonal, reusing the same constructs
  • Works with any Java library so more libraries
  • Better NLP libraries
  • Works well with Hadoop


Cloud computing

Functional programming works well with cloud computing. For me the availability of a good functional language is a substantial factor in selecting a cloud platform.

Google introduced MapReduce to handle massively parallel, multi-computer applications.

Hadoop is the Java-based open source version of MapReduce. To run Hadoop natively you have to use a JVM language like Java or Scala.

Hadoop Streaming extends a limited version of Hadoop to work with programs written in other programming languages, as long as they work like UNIX pipes that read from stdin and write to stdout.
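
To make the stdin/stdout contract concrete, here is a rough word count sketch in Python. It assumes the usual tab-separated key/value records and that the reducer's input arrives sorted by key. The function names and the local simulate helper are made up for illustration; in a real streaming job the mapper and reducer would be separate scripts wired to sys.stdin and sys.stdout.

```python
import io
import sys
from itertools import groupby


def run_mapper(stdin, stdout):
    """Write one 'word<TAB>1' record per input word."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word.lower()}\t1\n")


def run_reducer(stdin, stdout):
    """Sum the counts per word. Assumes the records arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        stdout.write(f"{word}\t{total}\n")


def simulate(text):
    """Local stand-in for: cat input | mapper | sort | reducer."""
    mapped = io.StringIO()
    run_mapper(io.StringIO(text), mapped)
    # The sort between the two phases is done by the framework in a real job
    records = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    run_reducer(io.StringIO("".join(records)), reduced)
    return reduced.getvalue()


if __name__ == "__main__":
    sys.stdout.write(simulate("to be or not to be\n"))
```

Because both functions only see a stream of lines, the same code can be tested locally with in-memory streams before it is run on a cluster.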

There is a Python wrapper for Hadoop Streaming called Dumbo. Python is around 10 times slower than Java and Dumbo is a limited version of Hadoop, so if you are trying to do NLP on massive amounts of data this might not solve your problems.

Scala is fast and will give you full access to run native Hadoop.

Microsoft's version of MapReduce is called Dryad or LINQ to HPC. It is not officially released yet, but F# works well with Windows Azure.


NLP and other languages

Let me finish by giving a few short comparisons of F# and Scala with other languages:


Clojure vs. Scala

Clojure is a LISP dialect that is also running on the JVM, and it is the other big functional language running there. Clojure has some distinct niches for NLP:

Clojure better
  • Language understanding
  • Formal semantics: taking text and translating it to first-order logic
  • Artificial intelligence tasks

Scala better
  • It is easy to write fast Scala code
  • Smaller learning curve coming from Java
I tried Clojure recently and was very impressed, but more of my work falls in the category that would benefit from Scala.


Java vs. Scala

Java better

  • Better IDE tools and support
  • Better GUI builders
  • Great refactoring support
  • Many more programmers that know Java

Scala better

  • Terser code
  • Closures
  • First-class functions
  • More expressive language


C# vs. F#


C# better

  • Better IDE tools and support
  • Better GUI builders
  • There are a lot more programmers that know C#
  • Better LINQ to SQL support

F# better

  • Terse code
  • Better support for concurrency, async, continuations
  • More productive for NLP


Conclusion

F# and Scala are similar hybrid functional object oriented languages.

For years I have periodically tried functional programming languages to see if they were ready for mainstream corporate computing; and they were not. With the recent spread of functional features into object oriented languages I thought that real functional programming languages would soon be forgotten.

I was pleasantly surprised by how well F# and Scala work now. Functional languages are finally coming of age and becoming useful in mainstream corporate computing. They are stable enough, and they have niches where they are more productive than object oriented languages like C# and Java.

I really enjoy programming in F# and Scala; they are a very good fit for natural language processing and cloud computing. For bigger NLP projects I now prefer to use F# or Scala over C# or Java.

For GUI and web programming the object oriented languages still rule. Stick with C# or Java if the NLP part is small or the GUI or web interface is the dominant part.

Java and C# are also improving, e.g. by adding more functional features. Many working programmers are well served by just waiting for Java 8 or C# 5. But functional programming is here to stay. Rejoice...


Newer follow-up article covering Scala, ScalaNLP and Spark MLlib

11 comments:

Onur Gümüş said...

Nemerle is better than all. It is the only one which has metaprogramming capabilities

Alex Peake said...

You do not need to pay for the Visual Studio IDE for F#. Just download the free Shell and install F#.

http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=115

Sami Badawi said...

Hi Alex Peake,

I did get the F# interactive shell running from within SharpDevelop 4.1, but I do not get the code completion to work. Could you?

Anonymous said...

@Sami - I think that Alex is talking about using the free Visual Studio Shell and not trying to shoe-horn F# into SharpDevelop.

http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=115

There is also an F# plug-in for MonoDevelop if you find yourself developing on Linux or Mac and want to use F#:

http://functional-variations.net/monodevelop/

Thanks for the article. This is a really great and informative comparison.

I should also point out that IKVM.NET makes it really easy to use Java libraries in your .NET projects. I have used big Java libraries a few times and been really happy with how well they integrate into my .NET apps. So, you could still use F# if you wanted without having to give up the Java libraries.

One of the reasons that I prefer the CLR is that I find it gives me access to all the great .NET libraries plus all the great Java ones. For example, I wrote a reporting module for a website a couple of years ago that used the HTML Agility Pack (a great .NET HTML parser) and Flying Saucer (a fantastic Java-based CSS parser/renderer) to create on-the-fly PDF reports from dynamically generated HTML pages and CSS templates. I even deployed the whole thing on Linux/Apache using Mono. It worked really well.

I am obviously more of a .NET guy but I have been thinking for a while that Scala would eventually win me over to the JVM.

F# scared me away at first but lately I have been liking it more and more. So maybe the CLR has me for a while longer.

Clojure really interests me as well but I never seem to get around to it. The JVM is obviously its primary home but the CLR version looks quite good (unlike the CLR version of Scala).

Danno said...

What about writing your UI in C# and doing the work part in F#? This could be easily done by creating class libraries for the different parts. One of the big advantages of the .NET platform is being able to match the language to the functionality desired.

Sami Badawi said...

Hi Danno,

I would normally write the GUI in C# and the logic in F#, but for my first F# program I wanted to write the whole thing in F#.

Anonymous said...

Hi Sami,
you wrote that in 2006 Scala had no REPL, but do not mention that there currently exists a very useful REPL.

Then, comparing Scala and F#, you credit F# with "fantastic concurrency".
I don't know F#, so I've no idea where F# is better than Scala's actor approach.
But you definitely should take it into account for Scala when comparing that with Java.

Then Scala's alleged "complexity" is widely discussed. Throwing that in without any further comment doesn't help that discussion, because OTOH you attest Scala being "More orthogonal, reusing the same constructs", which in some perspective reduces complexity e.g. compared with Java.

In the end, what I did not understand is: "For GUI and web programming the object oriented languages still rules."

For GUI I am with you. For Web I do not see the point, as there are currently so many approaches to functional web services: first of all Node.js, v8cgi and Rhino-for-webapps (all JavaScript), then regarding Scala the Unfiltered library or the http package in Scalaz, to just mention a few.
As web is always semantically response = webservice(request), and as webservices live from parallelism and other things where FP is strong (e.g. map/reduce algorithms), I can't see where OO shines in this area.

KR
Det

Sami Badawi said...

Hi Det,

You are making a lot of good detailed points; and I agree with what you say. I had to keep the blog post short and readable. My focus was the great progress of FP seen from an NLP viewpoint.

Yes, functional programming should work fine for Web applications. I worked with ASP.NET using Visual Studio 2008, and Microsoft has put a tremendous amount of work into making this easy to use. So my statement was more about the web tools than the merit of FP. I have not tried any of the Scala web libraries, so maybe they have caught up with ASP.NET on Visual Studio.

Win Myo Htet said...

Hi Sami

Good post. You have mentioned Azure for F# and Hadoop (via EC2 or one's own cloud) for Scala. Well, Google's cloud, GAE, is also on the Scala side. For web libraries, there is the Lift web framework for Scala. Here is the Scala Lift framework on Google App Engine: http://lift-example.appspot.com/
