Showing posts with label cloud computing. Show all posts

Tuesday, July 12, 2011

Natural language processing in F# and Scala

I do natural language processing in C# (on .NET 3.5) and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I have some unmet needs. I am investigating whether there are any new languages that would help.

In 2010 I tried out 3 new languages:
Natural language processing in Clojure, Go and Cython

Recently I have investigated F# and Scala. They are both hybrid functional - object oriented languages; inspired by ML / OCaml / Haskell and Java / C#.

Python as the benchmark

Python is widely used in natural language processing. I am most productive in Python for NLP work. Here are a few reasons why:

  • NLTK is a great Python NLP library
  • Lots of open source math and science libraries, e.g. NumPy and SciPy
  • PyDev is a good development environment
  • Good integration with MongoDB library
  • Great for rapid development

Python shortcomings
  • Slow compared to compiled languages
  • GUI support is crude
  • Multi-threading is crude
  • No compile step, and compilation does give more robustness

It should be possible to make a super language that has the elegance of Python, but without these shortcomings.


My first Scala experience

In 2006 I thought Scala was this super language. It is very advanced; you can call any Java library from Scala, including all the open source libraries. But I ran into a list of problems with Scala:

  • The Scala IDE was far behind Eclipse Java
  • Scala is a quite complex language
  • The Java libraries and the functional programming libraries were badly integrated
  • There was no Scala REPL or interpreter like the one in Python
Scala was stable enough for use, but it did not improve my productivity, so after some months I went back to using Python as my scripting language.


Python's weakness

Recently I had to make a small text processing application that end users could use directly. This was not the best fit for Python. Normally my Python programs have no GUI and are controlled by command line parameters.

I had 2 Python options:

Make a simple GUI using Tkinter
Tkinter is a Python wrapper of Tk, the cross platform GUI toolkit. It is pretty crude by modern GUI standards, but would have been good enough. However, trying to install all the Python libraries that I needed on the end user's machine would be setting myself up for a maintenance nightmare.

Wrap code in a web application
I could wrap a web interface around it. But the application uses a lot of memory, and I would have to maintain a web application.

I had a 1 week hard deadline for the task and both of these options looked unappealing. I needed something else...


My first F# application

I took a chance on F#, and managed to learn enough F# to finish the program by my 1 week deadline.

There is no GUI builder for F# in Visual Studio, but it was pretty easy to hand code a simple WinForms GUI to wrap around the core code. It was not pretty but you could give it to an end user. The whole application ended up being one 40KB executable file, and it was very fast. F# had actually filled a niche that Python does not do so well.

There were also problems: I wrote the whole application from scratch, while in Python I would have been able to use NLTK, write the code faster and get better results.

All in all this was a very good experience. I thought that F# would be a good supplement to my Python library. It would give me both raw speed when I need it and good connectivity with C#, ASP.NET, WPF and Microsoft Office.


Functional programming benefits

Functional programming is a great fit for my NLP work.

I have a lot of different text sources: database, flat file, directory, public RESTful web application services.

I have many word transformations: stop word filters, stemmers, custom filters.

I need many operations building on other operations: Bigram finder, POS tagger, named entity recognizer.

I create different reports: database, CSV, Excel.

In functional languages you can just take any combination of these operations and easily pipe them together while getting good compiler support. This does not fit so well with object oriented programming, where you are more concerned with encapsulation.
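
To make the idea of piping operations together concrete, here is a minimal sketch in Python, my benchmark language. The stop word list, the toy stemmer and the helper names (compose, naive_stem, bigram_pipeline) are assumptions made up for this illustration; they do not come from NLTK or any other library.

```python
from functools import reduce

# Tiny illustrative stop word list (an assumption, not a real resource).
STOP_WORDS = {"the", "a", "of", "and", "is"}

def compose(*steps):
    """Chain steps left to right: the output of one feeds the next."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

def tokenize(text):
    """Text source -> list of lowercase tokens."""
    return text.lower().split()

def remove_stop_words(tokens):
    """Word transformation: drop tokens found in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(tokens):
    """Toy stemmer: strips a trailing 's' from words longer than 3 letters."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def bigrams(tokens):
    """Operation building on the previous ones: adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

# Any combination of steps can be piped together in any sensible order.
bigram_pipeline = compose(tokenize, remove_stop_words, naive_stem, bigrams)

print(bigram_pipeline("The cats of the house"))
```

Swapping in a different filter or adding a report step at the end is just another argument to compose. In F# or Scala the compiler would also check that each step's output type matches the next step's input, which Python does not do.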


F# impression

F# is the first compiled language I tried that is comparable to Python in simplicity and elegance. It has a real Pythonic feel:

  • F# is fast
  • Simple and elegant
  • Good development environment in Visual Studio 2010
  • Best concurrency support of any language I have seen
  • Good database support
  • Good MongoDB library
  • Simple to combine F# with C# or VB.NET for ASP or WPF
  • Good REPL

Issues

  • Runs best under Windows
  • For an IDE you really need Visual Studio 2008 or 2010, and that costs at least $700
  • F# can be compiled and run from the shell or from SharpDevelop 4.0 and 4.1, but you do not have the same productivity
  • The math libraries under .NET are not as good as NumPy and SciPy
  • The NLP libraries are better under Python


Scala revisited

After the success with F# I was very curious about why my F# experience had been so much more successful than my first experience with Scala.

I looked at an F# and Scala cheat sheet and thought they looked remarkably similar. I watched a few screencasts and found no obvious problems. I bought the book Programming in Scala, Second Edition. It turned out to be a very interesting computer science book and I read all 852 pages. Scala still looked good.

I installed the Scala Eclipse plugin and wrote some code. Both the language and the IDE have come a long way during the last 5 years:
  • 15 books about Scala
  • 2 great free books
  • Tooling is much better
  • IDE is much better with code completion
  • Native NLP libs: ScalaNLP and Kiama

Of all the issues I had when I first tried Scala, the only remaining one is:
Scala is a pretty complex language

It is incredible how Scala has taken a lot of messy features from Java and turned them into a clean modular system, at the cost of some complex abstractions.


F# vs. Scala

Despite many similarities, the languages have a different feel. F# is simpler to understand, while Scala is the more orthogonal language. I have been very impressed by both.

F# better

  • Simpler to understand
  • Fantastic concurrency
  • Tail recursion optimized
  • Works well with Windows Azure

Scala better 

  • More orthogonal, reusing the same constructs
  • Works with any Java library so more libraries
  • Better NLP libraries
  • Works well with Hadoop


Cloud computing

Functional programming works well with cloud computing. For me the availability of a good functional language is a substantial factor in selecting a cloud platform.

Google introduced MapReduce to handle massive parallel multi computer applications.

Hadoop is the Java based open source version of MapReduce. To run Hadoop natively you have to use a JVM language like Java or Scala.

Hadoop Streaming extends a limited version of Hadoop to work with programs written in other programming languages, as long as they work like UNIX pipes that read from stdin and write to stdout.

There is a Python wrapper for Hadoop Streaming called Dumbo. Python is around 10 times slower than Java and Dumbo is a limited version of Hadoop, so if you are trying to do NLP on massive amounts of data this might not solve your problems.
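
As a rough illustration of the stdin to stdout contract, here is a minimal word count sketch in plain Python. The function names (mapper, reducer, run_stage) are hypothetical and are not part of Hadoop or Dumbo. It assumes the usual streaming convention of one tab separated key and value per line, with the framework sorting the mapper output by key before the reducer sees it.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per token read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_stage(stage, stdin=sys.stdin, stdout=sys.stdout):
    """Wire one stage to a pair of streams, the way a streaming job would."""
    for out in stage(stdin):
        stdout.write(out + "\n")

# Local simulation of the whole job: sorted() stands in for Hadoop's shuffle.
mapped = sorted(mapper(["a b a\n", "B\n"]))
print(list(reducer(mapped)))
```

In a real job the mapper and the reducer would be two separate scripts, each calling something like run_stage on its own function, and Hadoop Streaming would run them on different machines with the sort in between.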

Scala is fast and will give you full access to run native Hadoop.

Microsoft's version of MapReduce is called Dryad or LINQ to HPC. It is not officially released yet, but F# works well with Windows Azure.


NLP and other languages

Let me finish by giving a few short comparisons of F# and Scala with other languages:


Clojure vs. Scala

Clojure is a LISP dialect that also runs on the JVM, and it is the other big functional language running there. Clojure has some distinct niches for NLP:

Clojure better
  • Language understanding
  • Formal semantics: taking text and translating it to first order logic
  • Artificial intelligence tasks

Scala better
  • It is easy to write fast Scala code
  • Smaller learning curve coming from Java
I tried Clojure recently and was very impressed; but more of my work falls in the category that would benefit from Scala.


Java vs. Scala

Java better

  • Better IDE tools and support
  • Better GUI builders
  • Great refactoring support
  • Many more programmers that know Java

Scala better

  • Terser code
  • Closures
  • First class functions
  • More expressive language


C# vs. F#


C# better

  • Better IDE tools and support
  • Better GUI builders
  • There are a lot more programmers that know C#
  • Better LINQ to SQL support

F# better

  • Terse code
  • Better support for concurrency, async, continuations
  • More productive for NLP


Conclusion

F# and Scala are similar hybrid functional object oriented languages.

For years I have periodically tried functional programming languages to see if they were ready for mainstream corporate computing; and they were not. With the recent spread of functional features into object oriented languages I thought that real functional programming languages would soon be forgotten.

I was pleasantly surprised by how well F# and Scala work now. Functional languages are finally coming of age and becoming useful in mainstream corporate computing. They are stable enough, and they have niches where they are more productive than object oriented languages like C# and Java.

I really enjoy programming in F# and Scala; they are a very good fit for natural language processing and cloud computing. For bigger NLP projects I now prefer to use F# or Scala over C# or Java.

For GUI and web programming the object oriented languages still rule. Stick with C# or Java if the NLP part is small or the GUI or web interface is the dominant part.

Java and C# are also improving e.g. by adding more functional features. Many working programmers are well served by just waiting for Java 8 or C# 5. But functional programming is here to stay. Rejoice...


Newer follow up article covering Scala, ScalaNLP and Spark MLlib

Friday, June 17, 2011

Cloud Computing For Data Mining Part 1

The first half of this blog post is about selecting a cloud provider for a data mining and natural language processing system. I will compare 3 leading cloud computing providers: Amazon Web Services, Windows Azure and OpenStack.
To help me choose a cloud provider I have been looking for users with experience running cloud computing for applications similar to data mining. I found them at CloudCamp New York June 2011. It was an unconference, so the attendees were split into user discussion groups. In the last half of the post I will mention the highlights from these discussions.


The Hype

"If you are not in the cloud you are not going to be in business!"

This is the message many programmers, software architects and project managers face today. You do not want to go out of business because you could not keep up with the latest technologies; but looking back, many companies have gone out of business because they invested in the latest must have technology that turned out to be expensive and over engineered.


Reason For Moving To Cloud

I have a good business case for using cloud computing: namely, scaling a data mining system to handle a lot of data. To begin with it could be a moderate amount of data, but it could change to Big Data on short notice.


Horror Cloud Scenario

I am trying to minimize the risk of this scenario:
  1. I port to a cloud solution that is tied closely to one cloud provider
  2. Move the applications over
  3. After a few months I find that there are unforeseen problems
  4. No easy path back
  5. Angry customers are calling


Goals

Here are my cloud computing goals in a little more detail:
  • Port data mining system and ASP.NET web applications to the cloud
  • Choose a cloud compatible with a code base in .NET and Python
  • Initially the data volume is moderate but it could possibly scale to Big Data
  • Keep cost and complexity under control
  • No downtime during transition
  • Minimize risk
  • Minimize vendor lock in
  • Run the same code in house and in the cloud
  • Make rollback to in house application possible


Amazon Web Services vs. Windows Azure vs. OpenStack

Choosing the right cloud computing provider has been time consuming, but also very important.

I took a quick stroll through Cloud Expo 2011, and most big computer companies were there presenting their cloud solutions.

Google App Engine is a big cloud service well suited for front end web applications, but not good for data mining, so I will not cover it here.

The other 3 providers that have generated the most momentum are: EC2, Azure and OpenStack.

Let me start by listing their similarities:
  • Virtual computers that can be started on short notice
  • Redundant robust storage
  • NoSQL structured data
  • Message queue for communication
  • Mountable hard disk
  • Local non persistent hard disk
Now I will write a little more about where they differ, and their good and bad parts:


Amazon Web Services, AWS, EC2, S3

Good:
  • This is the oldest cloud provider dating back to 2004
  • Very mature provider
  • Other providers are catching up with AWS's features
  • Well documented
  • Works well with open source, LAMP and Java
  • Integrated with Hadoop: Elastic MapReduce
  • A little cheaper than Windows Azure
  • Runs Linux, OpenSolaris and Windows servers
  • You can run code on your local machine and just save the result into S3 storage

Bad:
  • You cannot run the same code in house and in the cloud
  • Vendor lock in


Windows Azure

Good:
  • Works well with the .NET framework and all Microsoft's tools
  • It is very simple to port an ASP.NET application to Azure
  • You can run the same code on your development machine and in the cloud
  • Very good development and debugging tools
  • F# is a great language for data mining in cloud computing
  • Great series of video screen casts

Bad:
  • Only runs Windows
  • You need Windows 7, Windows Server 2008 or Windows Vista to develop
  • Preferably you should have Visual Studio 2010
  • Vendor lock in


OpenStack

OpenStack is a new open source collaboration that is making a software stack that can be run both in house and in the cloud.

Good:
  • Open source
  • Generating a lot of buzz
  • Main participants NASA and Rackspace
  • Backed by 70 companies
  • You can run your application either in house or in the cloud

Bad:
  • Not yet mature enough for production use
  • Windows support is immature


Java, .NET Or Mixed Platform

For data mining selecting the right platform is a hard choice. Both Java and .NET are very attractive options.

Java only
For data mining and NLP there are a lot of great open source projects written in Java. E.g. Mahout is a system for collaborative filtering and clustering of Big Data, with distributed machine learning. It is integrated with Hadoop.
There are many more open source projects: OpenNLP, Solr, ManifoldCF.

.NET only
The development tools in .NET are great. It works well with Microsoft Office.
Visual Studio 2010 comes with F#, which is a great language for writing worker roles. It is very well suited for light weight threads or async, for highly parallel reactive programs.

Mix Java and .NET
You can mix Java and .NET. Cloud computing makes it easier than ever to integrate different platforms. You already have abstract, language agnostic services for communication: message queues, blob storage and structured data. If you have an ASP.NET front end on top of collaborative filtering of Big Data this would be a very attractive option.

I still think that combining 2 big platforms like Java and .NET is introducing complexity, compared to staying within one platform. You need an organization with good resources and coordination to do this.


Choice Of Cloud Provider

I still have a lot of unanswered questions at this point.

At the time of writing, June 2011, OpenStack is not ready for production use. So that is out for now.

I have run some tests on AWS. It was very easy to deploy my Python code to EC2 under Linux. Programming C# that used AWS services was simple.

I am stuck waiting to get a Windows 7 machine so I can test Windows Azure.

Both EC2 and Azure seem like viable options for what I need. I will get back to this in part 2 of the blog post.


Highlights from Cloud Camp 2011

A lot of people are trying to sell you cloud computing solutions. I have heard plenty of cloud computing hype. I have been seeking advice from people who were not trying to sell me anything and had some real experience, and I have tried to find some of the failures and problems in cloud computing.

I went to Cloud Camp June 2011 during Cloud Expo 2011 in New York. Cloud computing users shared their experience. It was an unconference, meaning spontaneous user discussion breakout groups were formed. The rest of this post is highlights from these discussions.


Hadoop Is Great But Hard

Hadoop is a Java open source implementation of Google's MapReduce. You can set up a workflow of operations and Hadoop will distribute them over multiple computers, aggregate the result and rerun operations that fail. This sounds fantastic, but Hadoop is a pretty complex system, with a lot of new terminology and a steep learning curve.
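
The distribute, rerun and aggregate idea can be sketched on a single machine in a few lines of Python. This is only a toy model and not how Hadoop is implemented: threads stand in for computers, and the helper names (count_words, run_with_retry, map_reduce) and the limit of 3 attempts are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Map step: word counts for one chunk of text."""
    return Counter(chunk.lower().split())

def run_with_retry(task, chunk, max_attempts=3):
    """Rerun a failing task, giving up after max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the whole job fail

def map_reduce(task, chunks, workers=4):
    """Distribute chunks over worker threads, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda c: run_with_retry(task, c), chunks)
        total = Counter()
        for partial in partials:  # reduce step: aggregate the results
            total.update(partial)
    return total

print(map_reduce(count_words, ["a b", "b c", "A"]))
```

The real system adds everything this sketch leaves out: moving data between machines, scheduling, disk spills and monitoring, which is where much of the complexity and the new terminology comes from.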


Security Is Your Responsibility

Security is a big issue. You might assume that the cloud will take care of security, but you should not. E.g. you should clean up the hard disks that you have used, so the next user cannot see your data.


Cloud Does Not Automatically Scale To Big Data

The assumption is that you put massive amounts of data in the cloud, and the cloud takes care of the scaling problems.
If you have a lot of data that needs little processing, then cloud computing becomes expensive: you store all data in 3 different locations, and it is expensive and slow to take it down to different compute nodes. This was mentioned as the reason why NASA could not use S3, but built its own Nebula platform.


You Accumulate Cost During Development

An entrepreneur building a startup ended up paying $2000 / month for EC2. He used a lot of different servers and they had to be running with multiple instances, even though he was not using a lot of resources. This might be cheap compared to going out and buying your own servers, but it was more expensive than he expected.


Applications Written In .NET Run Fine Under EC2 Windows

An entrepreneur said that he was running his company's .NET code under EC2. He thought that Amazon was more mature than Azure, and Azure was catching up. He preferred to make his own framework.


Simpler To Run .NET Application On Azure Than On EC2

A cloud computing consultant with lots of experience in both Azure and EC2 said: EC2 gives you a raw machine, so you have to do more to get your application running than if you plop it into Windows Azure.
It is very easy to port an ASP.NET application to Windows Azure.


Cash Flow, Operational Expenses And Capital Expenses

An often cited reason why cloud computing is great is that a company can replace big upfront capital expenses with smaller operational expenses. A few people mentioned that companies live by their cash flow and do not like to have unpredictable operational expenses, but are more comfortable with predictable capital expenses.