I do natural language processing in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I have some unmet needs. I am investigating if there are any new languages that would help.
In 2010 I tried out 3 new languages:
Natural language processing in Clojure, Go and Cython
Recently I have investigated F# and Scala. They are both hybrid functional / object-oriented languages, inspired by ML / OCaml / Haskell and Java / C#.
Python as the benchmark
Python is widely used in natural language processing. I am most productive in Python for NLP work. Here are a few reasons why:
- NLTK is a great Python NLP library
- Lots of open source math and science libraries, e.g. NumPy and SciPy
- PyDev is a good development environment
- Good integration with MongoDB library
- Great for rapid development
Python shortcomings
- Slow compared to compiled languages
- GUI support is crude
- Multi-threading is crude
- No compile step, and compilation does give more robustness
It should be possible to make a super language that has the elegance of Python, but without these shortcomings.
My first Scala experience
In 2006 I thought Scala was this super language. It is very advanced; you can call any Java library from Scala, including all the open source libraries. But I ran into a list of problems with Scala:
- The Scala IDE was far behind Eclipse Java
- Scala is a quite complex language
- The Java libraries and the functional programming libraries were badly integrated
- There was no Scala REPL or interpreter like in Python
Scala was stable enough for use, but it did not improve my productivity, so after some months I went back to using Python as my scripting language.
Python's weakness
Recently I had to make a small text processing application that end users could use directly. This was not the best fit for Python. Normally my Python programs have no GUI and are controlled by command line parameters.
I had 2 Python options:
Make simple GUI using TkInter
TkInter is a Python wrapper of Tk, the cross-platform GUI toolkit. It is pretty crude by modern GUI standards, but would have been good enough. However, trying to install all the Python libraries that I needed on the end user's machine would be setting myself up for a maintenance nightmare.
Wrap code in web application
I could wrap a web interface around it. The application is using a lot of memory and I would have to maintain a web application.
I had a 1 week hard deadline for the task and both of these options looked unappealing. I needed something else...
My first F# application
I took a chance on F#, and managed to learn enough F# to finish the program by my 1 week deadline.
There is no GUI builder for F# in Visual Studio, but it was pretty easy to hand code a simple WinForms GUI to wrap around the core code. It was not pretty but you could give it to an end user. The whole application ended up being one 40KB executable file, and it was very fast. F# had actually filled a niche that Python does not fill so well.
There were also problems: I wrote the whole application from scratch, while in Python I would have been able to use NLTK, write the code faster and get better results.
All in all this was a very good experience. I thought that F# would be a good supplement to my Python library. It would give me both raw speed when I need it and good connectivity with C#, ASP.NET, WPF and Microsoft Office.
Functional programming benefits
Functional programming is a great fit for my NLP work.
I have a lot of different text sources: database, flat file, directory, public RESTful web application services.
I have many word transformations: stop word filters, stemmers, custom filters.
I need many operations building on other operations: Bigram finder, POS tagger, named entity recognizer.
I create different reports: database, CSV, Excel.
In functional languages you can just take any combination of these operations and easily pipe them together while getting good compiler support. This does not fit so well with object oriented programming, where you are more concerned with encapsulation.
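As a minimal sketch of what piping operations together can look like, here is a small pipeline in Python, my benchmark language. The stop word list and the function names (`tokenize`, `remove_stop_words`, `bigrams`, `pipe`) are made up for illustration and do not come from NLTK. Python gives no compile-time checking of the stages; in F# the same chain would typically be written with the `|>` operator and type checked by the compiler.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "of", "and"}


def tokenize(lines: Iterable[str]) -> Iterator[str]:
    """Text source stage: split lines into lower-case words."""
    for line in lines:
        for word in line.lower().split():
            yield word


def remove_stop_words(words: Iterable[str]) -> Iterator[str]:
    """Word transformation stage: drop words found in STOP_WORDS."""
    return (word for word in words if word not in STOP_WORDS)


def bigrams(words: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Operation built on other operations: pair each word with the next one."""
    previous = None
    for word in words:
        if previous is not None:
            yield (previous, word)
        previous = word


def pipe(*stages: Callable) -> Callable:
    """Compose stages left to right into a single function."""
    def run(source):
        return reduce(lambda data, stage: stage(data), stages, source)
    return run


# Any combination of stages can be plugged together in this way.
find_bigrams = pipe(tokenize, remove_stop_words, bigrams)
```

Calling `list(find_bigrams(["The cat is on the mat"]))` gives `[('cat', 'on'), ('on', 'mat')]`. Swapping the text source or adding a stemmer only means changing the arguments to `pipe`.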
F# is the first compiled language I tried that is comparable to Python in simplicity and elegance. It has a real Pythonic feel:
- F# is fast
- Simple and elegant
- Good development environment in Visual Studio 2010
- Best concurrency support of any language I have seen
- Good database support
- Good MongoDB library
- Simple to combine F# with C# or VB.NET for ASP or WPF
- Good REPL
Issues
- Runs best under Windows
- For an IDE you really need Visual Studio 2008 or 2010, and that costs at least $700
- F# can be compiled, and the F# shell can be run, from SharpDevelop 4.0 and 4.1, but you do not have the same productivity
- The math libraries under .NET are not as good as NumPy and SciPy
- The NLP libraries are better under Python
Scala revisited
After the success with F# I was very curious about why my F# experience had been so much more successful than my first experience with Scala.
I looked at an F# and Scala cheat sheet and thought they looked remarkably similar. I watched a few screencasts and found no obvious problems. I bought the book Programming in Scala, Second Edition. It turned out to be a very interesting computer science book and I read the whole 852 pages. Scala still looked good.
I installed the Scala Eclipse plugin and wrote some code. Both the language and the IDE have come a long way during the last 5 years:
- 15 books about Scala
- 2 great free books
- Tooling is much better
- IDE is much better with code completion
- Native NLP libs: ScalaNLP and Kiama
Of all the issues I had when I first tried Scala, the only remaining one is:
- Scala is a pretty complex language
It is incredible how Scala has taken a lot of messy features from Java and turned them into a clean modular system, at the cost of some complex abstractions.
F# vs. Scala
Despite many similarities, the languages have a different feel. F# is simpler to understand, while Scala is the more orthogonal language. I have been very impressed by both.
F# better
- Simpler to understand
- Fantastic concurrency
- Tail recursion optimized
- Works well with Windows Azure
Scala better
- More orthogonal, reusing the same constructs
- Works with any Java library so more libraries
- Better NLP libraries
- Works well with Hadoop
Cloud computing
Functional programming works well with cloud computing. For me the availability of a good functional language is a substantial factor in selecting a cloud platform.
Google introduced MapReduce to handle massively parallel, multi-computer applications.
Hadoop is the Java based open source version of MapReduce. To run natively on Hadoop, a program has to be written in a JVM language like Java or Scala.
Hadoop Streaming extends a limited version of Hadoop to work with programs written in other programming languages, as long as they work like UNIX pipes that read from stdin and write to stdout.
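To show the stdin / stdout contract, here is a minimal word count sketch in Python. It assumes the usual Hadoop Streaming convention of tab separated key / value lines, with the reducer receiving its input sorted by key. The function names and the `map` / `reduce` command line argument are my own choices for illustration, not part of Hadoop.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(records: Iterable[str]) -> Iterator[str]:
    """Sum the counts per word; assumes records arrive sorted by word."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Run as a pipe stage only when a role ("map" or "reduce") is given.
if __name__ == "__main__" and len(sys.argv) > 1:
    stage = mapper if sys.argv[1] == "map" else reducer
    for output in stage(sys.stdin):
        print(output)
```

Saved under a hypothetical name such as `wordcount.py`, this can be tried without Hadoop as an ordinary UNIX pipeline: `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`. The `sort` step stands in for the shuffle phase that Hadoop performs between the mapper and the reducer.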
There is a Python wrapper for Hadoop Streaming called Dumbo. Python is around 10 times slower than Java and Dumbo only gives you a limited version of Hadoop, so if you are trying to do NLP on massive amounts of data this might not solve your problems.
Scala is fast and will give you full access to run native Hadoop.
Microsoft's version of MapReduce is called Dryad or LINQ to HPC. It is not officially released yet, but F# works well with Windows Azure.
NLP and other languages
Let me finish by giving a few short comparisons of F# and Scala with other languages:
Clojure is a LISP dialect that also runs on the JVM, and it is the other big functional language running there. Clojure has some distinct niches for NLP:
Clojure better
- Language understanding
- Formal semantics: taking text and translating it to first-order logic
- Artificial intelligence tasks
Scala better
- It is easy to write fast Scala code
- Smaller learning curve coming from Java
I tried Clojure recently and was very impressed, but more of my work falls in the category that would benefit from Scala.
Java vs. Scala
Java better
- Better IDE tools and support
- Better GUI builders
- Great refactoring support
- Many more programmers that know Java
Scala better
- Terser code
- Closures
- First-class functions
- More expressive language
C# vs. F#
C# better
- Better IDE tools and support
- Better GUI builders
- There are a lot more programmers that know C#
- Better LINQ to SQL support
F# better
- Terse code
- Better support for concurrency, async, continuations
- More productive for NLP
Conclusion
F# and Scala are similar hybrid functional object oriented languages.
For years I have periodically tried functional programming languages to see if they were ready for mainstream corporate computing, and they were not. With the recent spread of functional features into object oriented languages I thought that real functional programming languages would soon be forgotten.
I was pleasantly surprised by how well F# and Scala work now. Functional languages are finally coming of age and becoming useful in mainstream corporate computing. They are stable enough, and they have niches where they are more productive than object oriented languages like C# and Java.
I really enjoy programming in F# and Scala; they are a very good fit for natural language processing and cloud computing. For bigger NLP projects I now prefer to use F# or Scala over C# or Java.
For GUI and web programming the object oriented languages still rule. Stick with C# or Java if the NLP part is small or the GUI or web interface is the dominant part.
Java and C# are also improving, e.g. by adding more functional features. Many working programmers are well served by just waiting for Java 8 or C# 5. But functional programming is here to stay. Rejoice...
Newer follow-up article covering Scala, ScalaNLP and Spark MLlib