Saturday, July 23, 2011

Scala, Eclipse and Maven integration tutorial

I have evaluated Scala as a language for cloud computing and Hadoop. One requirement was a robust development environment: a real build system and a good IDE with code completion and debugging.

The combination of Scala, Eclipse and Maven seemed like a fit for this requirement, but my initial experience was mixed.


Problems with Scala, Eclipse and Maven integration

It was easy to install Scala, Eclipse and Maven, but when I set up a project it had a persistent error in Eclipse:

object Predef does not have a member AnyRef

Other problems:

  • There were problems running the unit test.
  • I had to restart Eclipse a lot.
  • Eclipse had Scala set to version 2.9.0.1 while Maven had 2.8.0. When I tried to change Maven to use 2.9.0.1 the pom.xml file would be marked as having an error.

I searched the internet for help but could not find any. After a good deal of experimenting I sorted out the problems and found a good solution.


Software versions

My setup is:
  • Scala 2.9.0.1
  • Eclipse 3.7 Indigo
  • Scala-ide Eclipse plugin: scala nightly 29 - http://download.scala-ide.org/nightly-update-wip-experiment-2.9.0-1


Scala, Eclipse and Maven project setup tutorial

Here are the steps that I took to set up a new Scala, Eclipse and Maven project so that it works with unit testing.

Press menu item: File - New - Other...



Select Maven Project




Select the org.scala-tools.archetypes scala-archetype-simple




Enter a group id and an artifact id for the project, then click Finish.



This will create the project with an example program and unit tests, but it will leave Eclipse in an unstable state.







In the project's pom.xml file, change the properties and the specs dependency so that they look like this:


<properties>
 <maven.compiler.source>1.5</maven.compiler.source>
 <maven.compiler.target>1.5</maven.compiler.target>
 <encoding>UTF-8</encoding>
 <scala.version>2.9.0-1</scala.version>
</properties>


<dependency>
 <groupId>org.scala-tools.testing</groupId>
 <artifactId>specs_${scala.version}</artifactId>
 <version>1.6.8</version>
 <scope>test</scope>
</dependency>

Now Scala IDE and Maven are both using the same version of Scala: 2.9.0.1.
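One quick way to double check which Scala library version a build actually picks up is to print it from a small program. This is a minimal sketch: the object name `ScalaVersionCheck` is made up for illustration and is not part of the archetype, while `scala.util.Properties` comes from the Scala standard library.

```scala
// Hypothetical helper, not generated by the archetype: reports the version of
// the Scala library that is on the runtime classpath.
object ScalaVersionCheck {
  // Version string as reported by the Scala standard library itself.
  def libraryVersion: String = scala.util.Properties.versionNumberString

  def main(args: Array[String]): Unit =
    println("Scala library version: " + libraryVersion)
}
```

If the program is run once from Eclipse and once from a Maven build, the two printed versions should agree.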


Right click the whole project and select: Configure - Add Scala Nature




Now use the Maven build system to clean, build and run the unit tests. You can run it from either Eclipse or the command line.

From Eclipse, right click the whole project and select:
Maven clean
Maven install





From the command line:

C:\prog\apache-maven-2.2.1\bin\mvn clean
C:\prog\apache-maven-2.2.1\bin\mvn install

Note that you have to use Maven 2.2 and not Maven 3.



Now there should be no more errors.
The unit test "scalatest.scala" has some problems; delete it.

Run all unit tests from Eclipse by right clicking the whole project and selecting Run As - JUnit Test.



Now you can see the result in the JUnit runner.


Final impression of Scala, Eclipse and Maven integration

Once I had resolved the problems, the Scala, Eclipse and Maven combination was a great development environment that met my requirements.

One thing that is currently missing from the Scala Eclipse plugin is code refactoring. Refactoring works very well in both Eclipse for Java and Visual Studio for C#.



Tuesday, July 12, 2011

Natural language processing in F# and Scala

I do natural language processing in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I have some unmet needs. I am investigating if there are any new languages that would help.

In 2010 I tried out 3 new languages:
Natural language processing in Clojure, Go and Cython

Recently I have investigated F# and Scala. They are both hybrid functional and object oriented languages, inspired by ML / OCaml / Haskell and Java / C#.

Python as the benchmark

Python is widely used in natural language processing. I am most productive in Python for NLP work. Here are a few reasons why:

  • NLTK is a great Python NLP library
  • Lots of open source math and science libraries, e.g. NumPy and SciPy
  • PyDev is a good development environment
  • Good integration with MongoDB library
  • Great for rapid development

Python shortcomings
  • Slow compared to compiled languages
  • GUI support is crude
  • Multi-threading is crude
  • No compile step, and compilation does give more robustness

It should be possible to make a super language that has the elegance of Python, but without these shortcomings.


My first Scala experience

In 2006 I thought Scala was this super language. It is very advanced; you can call any Java libraries from Scala, including all the open source libraries. But I ran into a list of problems with Scala:

  • The Scala IDE was far behind Eclipse Java
  • Scala is a quite complex language
  • The Java libraries and the functional programming libraries were badly integrated
  • There was no Scala REPL or interpreter like in Python

Scala was stable enough for use, but it did not improve my productivity, so after some months I went back to using Python as my scripting language.


Python's weakness

Recently I had to make a small text processing application that end users could use directly. This was not the best fit for Python. Normally my Python programs have no GUI and are controlled by command line parameters.

I had 2 Python options:

Make simple GUI using TkInter
TkInter is a Python wrapper of Tk, the cross platform GUI toolkit. It is pretty crude by modern GUI standards, but would have been good enough. However, trying to install all the Python libraries that I needed on the end user's machine would be setting myself up for a maintenance nightmare.

Wrap code in web application
I could wrap a web interface around it, but the application uses a lot of memory and I would have to maintain a web application.

I had a 1 week hard deadline for the task and both of these options looked unappealing. I needed something else...


My first F# application

I took a chance on F#, and managed to learn enough F# to finish the program by my 1 week deadline.

There is no GUI builder for F# in Visual Studio, but it was pretty easy to hand code a simple WinForms GUI to wrap around the core code. It was not pretty but you could give it to an end user. The whole application ended up being one 40KB executable file, and it was very fast. F# had actually filled a niche that Python does not fill so well.

There were also problems: I wrote the whole application from scratch, while in Python I would have been able to use NLTK, write the code faster and get better results.

All in all this was a very good experience. I thought that F# would be a good supplement to my Python library. It would give me both raw speed when I need it and good connectivity with C#, ASP.NET, WPF and Microsoft Office.


Functional programming benefits

Functional programming is a great fit for my NLP work.

I have a lot of different text sources: database, flat file, directory, public RESTful web application services.

I have many word transformations: stop word filters, stemmers, custom filters.

I need many operations building on other operations: Bigram finder, POS tagger, named entity recognizer.

I create different reports: database, CSV, Excel.

In functional languages you can just take any combination of these operations and easily pipe them together while getting good compiler support. This does not fit so well with object oriented programming, where you are more concerned with encapsulation.
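As a rough illustration of piping such operations together, here is a minimal Scala sketch. All names (`NlpPipeline`, `tokenize`, `removeStopWords`, `stem`, `bigrams`) are made up for this example, the stop word list is tiny, and the "stemmer" only strips a trailing "s"; a real pipeline would plug in proper implementations.

```scala
object NlpPipeline {
  // Each stage is an ordinary function value; all names are illustrative.
  val tokenize: String => Seq[String] =
    text => text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Tiny stop word list, for demonstration only.
  val stopWords: Set[String] = Set("the", "a", "of", "and", "is")

  val removeStopWords: Seq[String] => Seq[String] =
    tokens => tokens.filterNot(stopWords)

  // Toy stemmer: strips a trailing "s" from longer words.
  val stem: Seq[String] => Seq[String] =
    tokens => tokens.map(t => if (t.length > 3 && t.endsWith("s")) t.dropRight(1) else t)

  // Pairs each token with its successor.
  val bigrams: Seq[String] => Seq[(String, String)] =
    tokens => tokens.zip(tokens.drop(1))

  // andThen composes the stages left to right; the compiler checks that the
  // output type of each stage matches the input type of the next.
  val pipeline: String => Seq[(String, String)] =
    tokenize andThen removeStopWords andThen stem andThen bigrams

  def main(args: Array[String]): Unit =
    pipeline("The cats chase the dogs").foreach(println)
}
```

Swapping a stage, for example a different stemmer, only means replacing one function with another of the same type. If the types do not line up, the composition fails at compile time.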


F# impression

F# is the first compiled language I tried that is comparable to Python in simplicity and elegance. It has a real Pythonic feel:

  • F# is fast
  • Simple and elegant
  • Good development environment in Visual Studio 2010
  • Best concurrency support of any language I have seen
  • Good database support
  • Good MongoDB library
  • Simple to combine F# with C# or VB.NET for ASP or WPF
  • Good REPL

Issues

  • Runs best under Windows
  • For an IDE you really need Visual Studio 2008 or 2010, and that costs at least $700
  • F# can be compiled and run in the shell from SharpDevelop 4.0 and 4.1, but you do not have the same productivity
  • The math libraries under .NET are not as good as NumPy and SciPy
  • The NLP libraries are better under Python


Scala revisited

After the success with F# I was very curious about why my F# experience had been so much more successful than my first experience with Scala.

I looked at an F# and Scala cheat sheet and thought they looked remarkably similar. I watched a few screencasts and found no obvious problems. I bought the book Programming in Scala, Second Edition. It turned out to be a very interesting computer science book and I read all 852 pages. Scala still looked good.

I installed the Scala Eclipse plugin and wrote some code. Both the language and the IDE have come a long way during the last 5 years:
  • 15 books about Scala
  • 2 great free books
  • Tooling is much better
  • IDE is much better with code completion
  • Native NLP libs: ScalaNLP and Kiama

Of all the issues I had when I first tried Scala, the only remaining one is:
Scala is a pretty complex language

It is incredible how Scala has taken a lot of messy features from Java and turned them into a clean modular system, at the cost of some complex abstractions.


F# vs. Scala

Despite many similarities, the languages have a different feel. F# is simpler to understand, while Scala is the more orthogonal language. I have been very impressed by both.

F# better

  • Simpler to understand
  • Fantastic concurrency
  • Tail recursion optimized
  • Works well with Windows Azure

Scala better 

  • More orthogonal, reusing the same constructs
  • Works with any Java library so more libraries
  • Better NLP libraries
  • Works well with Hadoop
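On the tail recursion point: the JVM has no general tail call support, so the Scala compiler only rewrites a method that calls itself in tail position into a loop. Below is a minimal sketch of that case. It uses the standard `@tailrec` annotation, which makes the compiler report an error if the call is not in tail position. The `countWord` helper and the object name are made up for illustration.

```scala
import scala.annotation.tailrec

object TailRecursionSketch {
  // Counts how often a word occurs in a token list (illustrative helper).
  def countWord(tokens: List[String], word: String): Int = {
    // The recursive call is the last action, so the compiler turns it into a
    // loop; @tailrec makes compilation fail if that is not possible.
    @tailrec
    def loop(rest: List[String], acc: Int): Int = rest match {
      case Nil          => acc
      case head :: tail => loop(tail, if (head == word) acc + 1 else acc)
    }
    loop(tokens, 0)
  }

  def main(args: Array[String]): Unit = {
    // A long input that could overflow the stack with non-tail recursion.
    val tokens = List.fill(1000000)("scala")
    println(countWord(tokens, "scala"))
  }
}
```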


Cloud computing

Functional programming works well with cloud computing. For me the availability of a good functional language is a substantial factor in selecting a cloud platform.

Google introduced MapReduce to handle massive parallel multi computer applications.
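The MapReduce model maps quite directly onto ordinary higher order functions, which is part of why functional languages fit it well. Here is a minimal single machine sketch in Scala. It is not Hadoop's API: `mapReduce`, `wordCount` and the object name are hypothetical, and the "shuffle" step is just an in-memory `groupBy`.

```scala
object MapReduceSketch {
  // In-memory model of the three phases: map, shuffle (group by key), reduce.
  def mapReduce[In, K, V](inputs: Seq[In])(mapper: In => Seq[(K, V)])(reducer: (V, V) => V): Map[K, V] =
    inputs
      .flatMap(mapper) // map phase: each input emits key/value pairs
      .groupBy(_._1)   // shuffle phase: collect all pairs with the same key
      .map { case (key, pairs) => key -> pairs.map(_._2).reduce(reducer) } // reduce phase

  // Word count expressed with the generic function above.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    mapReduce(lines)(line =>
      line.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq.map(word => (word, 1))
    )(_ + _)

  def main(args: Array[String]): Unit =
    wordCount(Seq("to be or", "not to be")).foreach(println)
}
```

On a real cluster the mapper and reducer run on many machines and the shuffle moves data over the network, but the two user supplied functions keep the same shape.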

Hadoop is the Java based open source version of MapReduce. To run Hadoop natively you have to use a JVM language like Java or Scala.

Hadoop Streaming extends a limited version of Hadoop to work with programs written in other programming languages, as long as they work like UNIX pipes that read from stdin and write to stdout.
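The stdin/stdout contract is language neutral. To keep the examples in one language, here is a minimal sketch of a streaming style mapper in Scala. The object and method names are made up. The output format assumes the Hadoop Streaming default, where the text before the first tab on a line is the key and the rest is the value.

```scala
import scala.io.Source

object StreamingWordCountMapper {
  // Pure function holding the mapper logic, so it can be tested without stdin.
  // Emits one "word<TAB>1" line per token.
  def mapLine(line: String): Seq[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq.map(word => word + "\t1")

  // Reads lines from stdin and writes key/value lines to stdout, like a UNIX filter.
  def main(args: Array[String]): Unit =
    Source.stdin.getLines().flatMap(mapLine).foreach(println)
}
```

A matching reducer would read the sorted `word<TAB>1` lines from stdin and sum the values for each run of identical keys.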

There is a Python wrapper for Hadoop Streaming called Dumbo. Python is around 10 times slower than Java and Dumbo is a limited version of Hadoop, so if you are trying to do NLP on massive amounts of data this might not solve your problems.

Scala is fast and will give you full access to run native Hadoop.

Microsoft's version of MapReduce is called Dryad or LINQ to HPC. It is not officially released yet, but F# works well with Windows Azure.


NLP and other languages

Let me finish by giving a few short comparisons of F# and Scala with other languages:


Clojure vs. Scala

Clojure is a Lisp dialect that also runs on the JVM, and it is the other big functional language running there. Clojure has some distinct niches for NLP:

Clojure better
  • Language understanding
  • Formal semantics: taking text and translating it to first order predicate logic
  • Artificial intelligence tasks

Scala better
  • It is easy to write fast Scala code
  • Smaller learning curve coming from Java

I tried Clojure recently and was very impressed, but more of my work falls in the category that would benefit from Scala.


Java vs. Scala

Java better

  • Better IDE tools and support
  • Better GUI builders
  • Great refactoring support
  • Many more programmers that know Java

Scala better

  • Terser code
  • Closures
  • First class functions
  • More expressive language
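A small Scala sketch of the closure and first class function points above. The names (`FirstClassFunctions`, `longerThan`, `normalizers`, `normalize`) are invented for this example.

```scala
object FirstClassFunctions {
  // A function that returns a function; the returned closure captures minLength.
  def longerThan(minLength: Int): String => Boolean =
    word => word.length > minLength

  // Functions are values, so they can be stored in a list like any other data.
  val normalizers: List[String => String] =
    List((s: String) => s.trim, (s: String) => s.toLowerCase)

  // Applies every normalizer in order.
  def normalize(word: String): String =
    normalizers.foldLeft(word)((current, f) => f(current))

  def main(args: Array[String]): Unit = {
    val words = List("  Scala ", "is", " TERSE ")
    // prints List(scala, terse)
    println(words.map(normalize).filter(longerThan(2)))
  }
}
```

The equivalent Java of the time would need an anonymous inner class for each of these function values.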


C# vs. F#


C# better

  • Better IDE tools and support
  • Better GUI builders
  • There are a lot more programmers that know C#
  • Better LINQ to SQL support

F# better

  • Terse code
  • Better support for concurrency, async, continuations
  • More productive for NLP


Conclusion

F# and Scala are similar hybrid functional object oriented languages.

For years I have periodically tried functional programming languages to see if they were ready for mainstream corporate computing; and they were not. With the recent spread of functional features into object oriented languages I thought that real functional programming languages would soon be forgotten.

I was pleasantly surprised by how well F# and Scala work now. Functional languages are finally coming of age and becoming useful in mainstream corporate computing. They are stable enough, and they have niches where they are more productive than object oriented languages like C# and Java.

I really enjoy programming in F# and Scala. They are a very good fit for natural language processing and cloud computing. For bigger NLP projects I now prefer to use F# or Scala over C# or Java.

For GUI and web programming the object oriented languages still rule. Stick with C# or Java if the NLP part is small or if the GUI or web interface is the dominant part.

Java and C# are also improving, e.g. by adding more functional features. Many working programmers are well served by just waiting for Java 8 or C# 5. But functional programming is here to stay. Rejoice...


Newer follow up article covering Scala, ScalaNLP and Spark MLlib