
Tuesday, September 29, 2015

Practical Scala, Haskell and Category Theory

Functional programming has moved from academia to industry in the last few years. It is theoretical, with a steep learning curve. I have worked with strongly typed functional programming for 4 years. I took the normal progression: first Scala, then Haskell, ending with category theory.

What practical results does functional programming give me?

I typically do data mining, NLP and back end programming. How does functional programming help me with NLP, AI and math?

    Scala

    Scala is a complex language that can take quite some time to learn. For a couple of years I was unsure if it really improved my productivity compared to Java or Python.

    After 2 years my productivity in Scala went up. I find that Scala is an excellent choice for creating data mining pipelines because it is:
    • Fast
    • Stable
    • Has a lot of quality libraries
    • Has a very advanced type system
    • Good DSL for Hadoop (Scalding; see the sketch below)
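
    To make the Scalding point concrete, here is the canonical word count job in Scalding's field-based API; a minimal sketch, assuming the standard com.twitter.scalding imports and input/output paths passed as job arguments:

    import com.twitter.scalding._

    // Classic word count in Scalding, Twitter's Scala DSL for Hadoop.
    // 'line and 'word are Scalding field names; paths come from job args.
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
        .groupBy('word) { _.size }   // count occurrences of each word
        .write(Tsv(args("output")))
    }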

    Natural Language Processing in Scala

    Before Scala I did NLP in Python. I used NLTK, the Natural Language Toolkit, for 3 years.

    NLTK vs. ScalaNLP


    NLTK

    • Easy to learn and very flexible
    • Gives you a lot of functionality out of the box
    • Very adaptable, handles a lot of different structured file formats

    What I did not like about NLTK was:
    • It had a very inefficient representation of text features, as a dictionary
    • The file format readers did not produce exactly matching structures, and there was no type system to catch this
    • You have to jump between Python, NumPy and C or Fortran for low level work

    ScalaNLP

    ScalaNLP merged different Scala numeric and NLP libraries. It is a very active parent project of Breeze and Epic.

    ScalaNLP Breeze

    Breeze is a full featured, fast numeric library that uses the type system to great effect.
    • Linear algebra
    • Probability distributions
    • Regression algorithms
    • You can drop down to the bottom level without having to program in C or Fortran
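
    A minimal sketch of what this looks like, assuming Breeze's breeze.linalg and breeze.stats.distributions APIs:

    import breeze.linalg.{DenseMatrix, DenseVector}
    import breeze.stats.distributions.Gaussian

    // Typed linear algebra: the compiler knows v is a DenseVector[Double].
    val v = DenseVector(1.0, 2.0, 3.0)
    val m = DenseMatrix((1.0, 0.0, 0.0),
                        (0.0, 2.0, 0.0),
                        (0.0, 0.0, 3.0))
    val w: DenseVector[Double] = m * v   // matrix-vector product

    // A probability distribution you can sample from directly.
    val noise = Gaussian(0.0, 1.0).sample(5)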

    ScalaNLP Epic

    Epic is the natural language processing part of ScalaNLP. It has become a competitive NLP library with many algorithms for several human languages:
    • Reader for text corpora
    • Tokenizer
    • Sentence splitter
    • Part-of-speech tagger
    • Named entity recognition
    • Statistical parser


    Video lecture by David Hall, the Epic lead

    Machine Learning in Scala

    The most active open source Scala machine learning library is MLlib, which is part of the Spark project.
    Spark now has data frames like R and Pandas.
    It is easy to set up machine learning pipelines, do cross validation and optimize hyperparameters.

    I did text classification and set it up in Spark MLlib in only 100 lines of code. The result had satisfactory accuracy.
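
    The core of such a text classification setup might look like the following sketch, modeled on the standard Spark ML pipeline API; the DataFrame training with "text" and "label" columns is a hypothetical stand-in:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Text classification pipeline: tokenize, hash to feature vectors, classify.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Cross validation over a small hyperparameter grid.
    val grid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(1000, 10000))
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)
    val model = cv.fit(training)   // training: DataFrame with "text" and "label"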

    AI Search Problem in Scala vs. in Lisp

    I loved Lisp when I learned it at the university. You could do all these cool Artificial Intelligence tree search problems. For many years I suffered from Lisp envy.

    Tree search works a little differently in Scala; let me illustrate with 2 examples.

    Example 1: Simple Tree Search for Bird Flu

    You have an input HTML page that is parsed into a DOM tree. Look for the words bird and flu in a paragraph that is not part of the advertisement section.
    I can visualize what a search tree for this would look like.

    Example 2: Realistic Bird Flu Medication Search

    The problems I deal with at work are often more complex:
    Given a list of medical websites, search for HTML pages with bird flu and doctor recommendations for medications to take. Then do a secondary web search to see if the doctors are credible.

    Parts of Algorithm for Example 2

    This is a composite search problem:
    • Search HTML pages for the words bird and flu close to each other in DOM structure
    • Search individual match to ensure this is not in advertisement section
    • Search for Dr names
    • Find what Dr name candidates could be matched up with the section about bird flu
    • Web search for Dr to determine popularity and credentials
    Visualizing this as a tree search is hard for me.

    Lazy Streams to the Rescue

    Implementing solutions to the Example 2 bird flu medication problem takes:
    • Feature extractors
    • Machine learning on top of that
    • Correlation of a disease and a doctor
    This lends itself well to using Scala's lazy streams. Scala makes it easy to use the lazy streams and the type system gives a lot of support, especially when plugging together various streams.

    Outline of Lazy Streams Algorithm for Example 2

    1. Stream of all web pages
    2. Stream of tokenized trees
    3. Stream of potential text matches, e.g. avian influenza, H5N1
    4. Filter out matches from Stream 3 that sit in an advertisement part of the DOM tree (no Dr Mom)
    5. Stream of potential Dr text matches from Stream 2
    6. Stream of good Dr names, detected with machine learning
    7. Merge Stream 3 and Stream 6 to get bird flu and doctor name combination
    8. Web search stream for the doctor names from Stream 7 for ranking of result
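
    A sketch of how Streams 1 through 7 plug together in Scala; every type and helper here is a hypothetical stand-in, and the work happens lazily, element by element, as the result stream is consumed:

    // Hypothetical stand-in types and helpers.
    case class Page(url: String, text: String)
    case class DocMatch(page: Page, doctor: String)

    val siteUrls: Stream[String] =
      Stream("http://medsite1.example", "http://medsite2.example")

    def fetchPages(urls: Stream[String]): Stream[Page] =          // Stream 1
      urls.map(u => Page(u, scala.io.Source.fromURL(u).mkString))

    def birdFluPages(pages: Stream[Page]): Stream[Page] =         // Streams 3 and 4
      pages.filter(p => p.text.contains("bird") && p.text.contains("flu"))

    def doctorNames(page: Page): Stream[String] =                 // Streams 5 and 6
      """Dr\.? [A-Z][a-z]+""".r.findAllIn(page.text).toStream

    // Stream 7: lazily combine bird flu matches with doctor names.
    val results: Stream[DocMatch] =
      for {
        page   <- birdFluPages(fetchPages(siteUrls))
        doctor <- doctorNames(page)
      } yield DocMatch(page, doctor)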

    AI Search Problem in Lisp

    Tree search is Lisp's natural domain. Lisp could certainly handle Example 2, the more complex bird flu medication search, even using a similar lazy stream algorithm.

    Additionally, Lisp has the ability to do very advanced metaprogramming:
    rules that create other rules, or rules that work on multiple levels. These are things I do not know how to do in Scala.

    Lisp gives you a lot of power to handle open ended problems and it is great for knowledge representation. When you try to do the same in Scala you end up either writing Lisp or Prolog style code or using RDF or graph databases.

    Some Scala Technical Details

    Here are a few observations on working with Scala.

    Scala's Low Rent Monads

    Monads are a general way to compose functionality. They are a very important organizing principle in Scala. Except that these are not really monads; Scala's for-comprehensions are just syntactic sugar.

    You give us a map and a flatMap function and we don't ask any questions.

    Due to the organization of the standard library and subtyping you can even combine an Option and a List, which should strictly not be possible. Still, this gives you a lot of power.
    I do use Scala monads with no shame.
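
    This is all the compiler asks for; a minimal sketch with a hypothetical Box type:

    // Any type with map and flatMap works in a for-comprehension;
    // no monad laws are checked.
    case class Box[A](value: A) {
      def map[B](f: A => B): Box[B] = Box(f(value))
      def flatMap[B](f: A => Box[B]): Box[B] = f(value)
    }

    val sum = for {
      a <- Box(1)
      b <- Box(2)
    } yield a + b
    // Desugars to Box(1).flatMap(a => Box(2).map(b => a + b)); sum == Box(3)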

    Akka and Concurrency

    Scala's monads make it convenient to work with two concurrency constructs: Futures and Promises.
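
    For example, two Futures started in parallel compose in a for-comprehension, which is again just flatMap and map; a minimal sketch using the standard library, with hypothetical computations:

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.Future

    // Two independent computations start as soon as the Futures are created.
    val countDocs: Future[Int]     = Future { 42 }    // hypothetical document count
    val loadWeight: Future[Double] = Future { 1.5 }   // hypothetical model weight

    // Monadic composition: the result is ready when both inputs are.
    val score: Future[Double] = for {
      n <- countDocs
      w <- loadWeight
    } yield n * w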

    Akka is a library implementing an Erlang style actor model in Scala.
    I have used Akka for years and it is a good framework to organize a lot of concurrent computation that requires communication.
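
    A minimal sketch of an actor, assuming the classic untyped akka.actor API; the word counting is a hypothetical example:

    import akka.actor.{Actor, ActorSystem, Props}

    // An Erlang-style actor: state is private, mutated only by messages.
    class WordCounter extends Actor {
      private var counts = Map.empty[String, Int].withDefaultValue(0)
      def receive = {
        case word: String => counts += word -> (counts(word) + 1)
      }
    }

    val system = ActorSystem("nlp")
    val counter = system.actorOf(Props[WordCounter], "counter")
    counter ! "flu"   // fire-and-forget message sends
    counter ! "flu"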

    The type system gives you no help with the creation of parent actors, so you cannot be sure that they exist. This makes it hard to write unit tests for actors.

    Akka is good but the whole Erlang actor idea is rather low level.

    Scalaz and Cake Patterns

    Scalaz is a very impressive library that implements big parts of Haskell’s standard library in Scala.
    Scalaz’s monad typeclass is invariant, which fixes the violations allowed in the standard library.

    The cake pattern allows for recursive modules, which makes dependency injection easier. It is used in the Scala compiler.

    Both of these got me into trouble as a beginner Scala programmer. I would not recommend them for beginners.

    How do you determine if you should use this heavy artillery?
    Once you feel that you are spending a lot of time repeating code due to insufficient abstraction, you can consider it. Otherwise:

    Keep It Simple.

    Dependent Types and Category Theory in Scala

    There are many new theoretical developments in Scala:
    • Dotty, a new compiler built on DOT, a new type-theoretic foundation of Scala
    • The Cats library, a simplified version of Scalaz implementing concepts from category theory
    • The Shapeless library for dependent types. I am using this in my production code, since Shapeless is used in Slick and Parboiled2 (see the sketch below)
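
    As a small taste of Shapeless, an HList keeps a precise static type for every element, where a plain List would collapse to a common supertype; a minimal sketch:

    import shapeless._

    // String :: Double :: Boolean :: HNil, not List[Any]
    val record = "bird flu" :: 0.87 :: true :: HNil

    val text: String  = record.head        // statically a String
    val score: Double = record.tail.head   // statically a Double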


    Haskell

    Haskell is a research language from 1990. In 2008 its popularity started to rise. You can now find real jobs working in Haskell. Most publicized is that Facebook wrote their spam filter in Haskell.



    Why is Haskell so Hard to Learn?

    It took me around 2 years to learn to program in Haskell, which is exceptionally long. I have spoken to other people at Haskell meetups who have told me the same.

    Mathematical Precision

    Python effectively uses the Pareto principle: 20% of the features give you 80% of the functionality; Python has very few structures in the core language and reuses them.

    Haskell uses many more constructs. E.g. exception handling can be done in many different ways, each with small advantages. You can choose the optimal exception monad transformer with the fewest dependencies for your problem.

    Cabal Hell and Stack

    Haskell is a fast developing language with a very deep stack of interdependent libraries.
    When I started programming in it, it was hard to set up even a simple project since you could not get the libraries to compile with versions that were compatible with each other.
    The build system is called Cabal, and this phenomenon is called Cabal Hell.
    If you have been reading the mailing lists, you have seen a lot of references to Cabal Hell.

    The Haskell consulting company FP Complete first released Stackage, a curated set of libraries that work together. In 2015 they went further and released Stack, a build tool that installs different versions of Haskell to work with Stackage snapshots.

    This has really made Haskell development easier.

    Dependently Typed Constructs in Haskell

    Dependently typed languages are the next step after Haskell. In normal languages the type system and the objects of the language are different systems. In dependently typed languages the objects and the types inhabit the same space. This gives more safety and greater flexibility, but it also makes the language harder to program in.
    The type checker has to be replaced with a theorem prover.

    You have to prove that the program is correct, and the proofs are part of the program and first-class constructs.
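
    Scala, discussed above, gives a small taste of types depending on values through path-dependent types; a minimal sketch:

    // The type g.Node depends on the value g: nodes from
    // different graphs have different, incompatible types.
    class Graph {
      class Node
      def newNode: Node = new Node
    }

    val g1 = new Graph
    val g2 = new Graph
    val n: g1.Node = g1.newNode
    // val bad: g1.Node = g2.newNode   // does not compile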

    Haskell has a lot of activity towards emulating dependently typed languages.
    The next version of the Haskell compiler, GHC 8, is making a big push for more uniform handling of types and kinds.

    Practical Haskell

    Haskell is a pioneering language and still introducing new ideas. It has clearly shown that it is production ready by being able to handle Facebook's spam filter.

    Aesthetically I prefer terse programming and like to use Haskell for non work related programming.

    There is a great Haskell community in New York City. Haskell feels like a subculture, whereas Scala has now become the establishment. That said, I do not feel Haskell envy when I program in Scala on a daily basis.

    Learning Haskell is a little like running a marathon. You get in good mental shape.


    Category Theory

    Category theory is often called Abstract Nonsense both by practitioners and detractors.
    It is a very abstract field of mathematics and its utility is pretty controversial.

    It abstracts internal properties of objects away and instead looks at relations between objects.
    Categories require very little structure and so there are categories everywhere. Many mathematical objects can be turned into categories in many different ways. This high level of abstraction makes it hard to learn.

    There is a category Hask of Haskell types and functions.


    Steve Awodey lecture series on category theory

    Vector Spaces Described With a Few String Diagrams

    To give a glimpse of the power of category theory: In this video lecture John Baez shows how you can express the axioms of finite dimensional vector spaces with a few string diagrams.

    Video lecture by John Baez

    With 2 more simple operations you can extend it to control theory.

    Quest For a Solid Foundation of Mathematics

    At the university I embarked on a long quest for a solid scientific foundation. First I studied chemistry and physics. Quantum physics drove me to study mathematics for more clarity. For higher clarity and a solid foundation I studied mathematical logic.
    I did not find clarity in mathematical logic. Instead I found:

    The Dirty Secret About the Foundation of Mathematics

    My next stop was the normal foundation for modern mathematics: ZFC, Zermelo–Fraenkel set theory with the axiom of choice.

    This was even less intuitive than logic. There were more non-intuitive axioms. It was like learning computer science from an x86 assembly reference: a big random mess. There was also an uncertain connection between the axioms of logic and the axioms of set theory.

    ZFC and first order logic make 2 strong assumptions:
    1. Law of Excluded Middle
    2. Axiom of Choice
    The Law of Excluded Middle says that every mathematical sentence is either true or false. This is a very strong assumption that was not motivated at all. And it certainly does not extend to sentences outside mathematics.

    Constructive Mathematics / Intuitionistic Logic

    There was actually a debate about what should be the foundation of mathematics at the beginning of the 20th century.
    A competing foundation was Brouwer's constructive mathematics: in order to prove something about a mathematical object you need to be able to construct it, and via the Curry-Howard correspondence this is equivalent to writing a program that constructs a value of a particular type.
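
    Under the Curry-Howard correspondence a type is a proposition and a program is its proof; a minimal sketch in Scala, which this post also covers:

    // "A implies B" is the function type A => B; applying it is modus ponens.
    def modusPonens[A, B](a: A)(implies: A => B): B = implies(a)

    // Proof of "A and B implies A": project the pair.
    def andElimLeft[A, B](and: (A, B)): A = and._1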

    This was barely mentioned at the university. I had one professor who once briefly said that there was this other thing called intuitionistic logic, but it was so much harder to prove things in it that we should not bother.

    Recently constructive mathematics has had a revival with Homotopy Type Theory. HoTT is based on category theory, type theory, homotopy theory and intuitionistic logic.
    This holds a lot of promise and is another reason why category theory is practical for me.


    Robert Harper's lectures on type theory
    end with an introduction to HoTT

    Future of Intelligent Software

    There are roughly 2 main approaches to artificial intelligence:
    • Top down or symbolic techniques e.g. logic or Lisp
    • Bottom up or machine learning techniques e.g. neural networks
    The symbolic approach was favored for a long time but did not deliver on its promise. Now machine learning is everywhere and has created many advances in modern software.

    To me it seems obvious that more intelligent software needs both. But combining them has been an elusive goal since they are very different by nature.

    Databases created a revolution in data management. They reduce data retrieval to simplified first order logic: you just write a logic expression for what you want.

    Dependently typed languages are the level of abstraction where programs and logic merge.
    I think that intelligent software of the future will be a combination of dependently typed languages and machine learning.
    A promising approach is the discovery of Bayesian network models from data. This finds causality in a form that can be combined with logical reasoning.

    Conclusion

    I invested a lot of time in statically typed functional languages and was not sure how much this would help me in my daily work. It helped a lot, especially with reuse and stability.

    Scala has made it substantially easier to create production quality software.

    MLlib and ScalaNLP are 2 popular open source projects. They show me that Scala is a good environment for NLP and machine learning.

    I am only starting to see an outline of category theory, dependently typed languages and HoTT. It looks like computer science and mathematics are far from done; we still have some big changes ahead of us.

    Tuesday, July 12, 2011

    Natural language processing in F# and Scala

    I do natural language processing in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I have some unmet needs. I am investigating if there are any new languages that would help.

    In 2010 I tried out 3 new languages:
    Natural language processing in Clojure, Go and Cython

    Recently I have investigated F# and Scala. They are both hybrid functional / object oriented languages, inspired by ML / OCaml / Haskell and Java / C#.

    Python as the benchmark

    Python is widely used in natural language processing. I am most productive in Python for NLP work. Here are a few reasons why:

    • NLTK is a great Python NLP library
    • Lots of open source math and science libraries, e.g. NumPy and SciPy
    • PyDev is a good development environment
    • Good integration with MongoDB library
    • Great for rapid development

    Python shortcomings
    • Slow compared to compiled languages
    • GUI support is crude
    • Multi-threading is crude
    • No compilation step, which would otherwise catch errors and give more robustness

    It should be possible to make a super language that has the elegance of Python, but without these shortcomings.


    My first Scala experience

    In 2006 I thought Scala was this super language. It is very advanced; you can call any Java library from Scala, including all the open source libraries. But I ran into a list of problems with Scala:

    • The Scala IDE was far behind Eclipse Java
    • Scala is a quite complex language
    • The Java libraries and the functional programming libraries were badly integrated
    • There was no Scala REPL or interpreter like Python's
    Scala was stable enough for use, but it did not improve my productivity, so after some months I went back to using Python as my scripting language.


    Python's weakness

    Recently I had to make a small text processing application that end users could use directly. This was not the best fit for Python. Normally my Python programs have no GUI and are controlled by command line parameters.

    I had 2 Python options:

    Make simple GUI using TkInter
    TkInter is a Python wrapper of Tk, the cross platform GUI toolkit. It is pretty crude by modern GUI standards, but would have been good enough. However, trying to install all the Python libraries I needed on the end user's machine would be setting myself up for a maintenance nightmare.

    Wrap code in web application
    I could wrap a web interface around it, but the application uses a lot of memory and I would have to maintain a web application.

    I had a 1 week hard deadline for the task and both of these options looked unappealing. I needed something else...


    My first F# application

    I took a chance on F#, and managed to learn enough F# to finish the program by my 1 week deadline.

    There is no GUI builder for F# in Visual Studio, but it was pretty easy to hand code a simple WinForms GUI to wrap around the core code. It was not pretty, but you could give it to an end user. The whole application ended up being one 40KB executable file, and it was very fast. F# had actually filled a niche that Python does not fill so well.

    There were also problems: I wrote the whole application from scratch, while in Python I would have been able to use NLTK, write the code faster and get better results.

    All in all this was a very good experience. I thought that F# would be a good supplement to my Python library. It would both give me raw speed when I need it and good connectivity with C#, ASP.NET, WPF and Microsoft Office.


    Functional programming benefits

    Functional programming is a great fit for my NLP work.

    I have a lot of different text sources: databases, flat files, directories and public RESTful web services.

    I have many word transformations: stop word filters, stemmers, custom filters.

    I need many operations building on other operations: Bigram finder, POS tagger, named entity recognizer.

    I create different reports: database, CSV, Excel.

    In functional languages you can take any combination of these operations and easily pipe them together, while getting good compiler support. This does not fit so well with object oriented programming, where you are more concerned with encapsulation.
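
    In Scala, for example, each transformation is a function and the pipeline is just function composition; a minimal sketch with hypothetical filters:

    // Each NLP step is a plain function; andThen pipes them together.
    val tokenize: String => List[String] = _.split("""\s+""").toList
    val lowercase: List[String] => List[String] = _.map(_.toLowerCase)
    val stopWords = Set("the", "a", "of")
    val removeStopWords: List[String] => List[String] = _.filterNot(stopWords)

    val pipeline: String => List[String] =
      tokenize andThen lowercase andThen removeStopWords

    pipeline("The Director of the hospital")   // List(director, hospital)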


    F# impression

    F# is the first compiled language I tried that is comparable to Python in simplicity and elegance. It has a real Pythonic feel:

    • F# is fast
    • Simple and elegant
    • Good development environment in Visual Studio 2010
    • Best concurrency support of any language I have seen
    • Good database support
    • Good MongoDB library
    • Simple to combine F# with C# or VB.NET for ASP or WPF
    • Good REPL

    Issues

    • Runs best under Windows
    • For an IDE you really need Visual Studio 2008 or 2010, and that costs at least $700
    • F# can be compiled and run from a shell with SharpDevelop 4.0 and 4.1, but you do not get the same productivity
    • The math libraries under .NET are not as good as NumPy and SciPy
    • The NLP libraries are better under Python


    Scala revisited

    After the success with F# I was very curious why my F# experience had been so much better than my first experience with Scala.

    I looked at an F# and Scala cheat sheet and thought they looked remarkably similar. I watched a few screencasts and found no obvious problems. I bought the book Programming in Scala, Second Edition; it turned out to be a very interesting computer science book, and I read the whole 852 pages. Scala still looked good.

    I installed the Scala Eclipse plugin and wrote some code. Both the language and the IDE have come a long way during the last 5 years:
    • 15 books about Scala
    • 2 great free books
    • Tooling is much better
    • IDE is much better with code completion
    • Native NLP libs: ScalaNLP and Kiama

    Of all the issues I had when I first tried Scala, the only remaining one is:
    Scala is a pretty complex language

    It is incredible how Scala has taken a lot of messy features from Java and turned them into a clean modular system, at the cost of some complex abstractions.


    F# vs. Scala

    Despite many similarities, the languages have a different feel. F# is simpler to understand, while Scala is the more orthogonal language. I have been very impressed by both.

    F# better

    • Simpler to understand
    • Fantastic concurrency
    • Tail recursion optimized
    • Works well with Windows Azure

    Scala better 

    • More orthogonal, reusing the same constructs
    • Works with any Java library so more libraries
    • Better NLP libraries
    • Works well with Hadoop


    Cloud computing

    Functional programming works well with cloud computing. For me the availability of a good functional language is a substantial factor in selecting a cloud platform.

    Google introduced MapReduce to handle massive parallel multi computer applications.

    Hadoop is the Java based open source version of MapReduce. To run Hadoop natively you have to use a JVM language like Java or Scala.

    Hadoop Streaming extends a limited version of Hadoop to work with programs written in other programming languages, as long as they work like UNIX pipes that read from stdin and write to stdout.

    There is a Python wrapper for Hadoop Streaming called Dumbo. Python is around 10 times slower than Java, and Dumbo is a limited version of Hadoop, so if you are trying to do NLP on massive amounts of data this might not solve your problems.

    Scala is fast and will give you full access to run native Hadoop.

    Microsoft's version of MapReduce is called Dryad, or LINQ to HPC. It is not officially released yet, but F# works well with Windows Azure.


    NLP and other languages

    Let me finish by giving a few short comparisons of F# and Scala with other languages:


    Clojure vs. Scala

    Clojure is a LISP dialect that is also running on the JVM, and it is the other big functional language there. Clojure has some distinct niches for NLP:

    Clojure better
    • Language understanding
    • Formal semantics: taking text and translating it to first order logic
    • Artificial intelligence tasks

    Scala better
    • It is easy to write fast Scala code
    • Smaller learning curve coming from Java
    I tried Clojure recently and was very impressed; but more of my work falls in the category that would benefit from Scala.


    Java vs. Scala

    Java better

    • Better IDE tools and support
    • Better GUI builders
    • Great refactoring support
    • Many more programmers that know Java

    Scala better

    • Terser code
    • Closures
    • First-class functions
    • More expressive language


    C# vs. F#


    C# better

    • Better IDE tools and support
    • Better GUI builders
    • There are a lot more programmers that know C#
    • Better LINQ to SQL support

    F# better

    • Terse code
    • Better support for concurrency: async, continuations
    • More productive for NLP


    Conclusion

    F# and Scala are similar hybrid functional object oriented languages.

    For years I have periodically tried functional programming languages to see if they were ready for mainstream corporate computing, and they were not. With the recent spread of functional features into object oriented languages I thought that real functional programming languages would soon be forgotten.

    I was pleasantly surprised by how well F# and Scala work now. Functional languages are finally coming of age and becoming useful in mainstream corporate computing. They are stable enough, and they have niches where they are more productive than object oriented languages like C# and Java.

    I really enjoy programming in F# and Scala; they are a very good fit for natural language processing and cloud computing. For bigger NLP projects I now prefer to use F# or Scala over C# or Java.

    For GUI and web programming the object oriented languages still rule. Stick with C# or Java if the NLP part is small or a GUI or web interface is the dominant part.

    Java and C# are also improving e.g. by adding more functional features. Many working programmers are well served by just waiting for Java 8 or C# 5. But functional programming is here to stay. Rejoice...


    Newer follow up article covering Scala, ScalaNLP and Spark MLlib

    Friday, June 17, 2011

    Cloud Computing For Data Mining Part 1

    The first half of this blog post is about selecting a cloud provider for a data mining and natural language processing system. I will compare 3 leading cloud computing providers: Amazon Web Services, Windows Azure and OpenStack.
    To help me choose a cloud provider I have been looking for users with experience running cloud computing for applications similar to data mining. I found them at CloudCamp New York in June 2011. It was an unconference, so the attendees were split into user discussion groups. In the last half of the post I will mention the highlights from these discussions.


    The Hype

    "If you are not in the cloud you are not going to be in business!"

    This is the message many programmers, software architects and project managers face today. You do not want to go out of business because you could not keep up with the latest technologies; but looking back, many companies have gone out of business because they invested in the latest must-have technology, which turned out to be expensive and over engineered.


    Reason For Moving To Cloud

    I have a good business case for using cloud computing: namely, scaling a data mining system to handle a lot of data. To begin with it could be a moderate amount of data, but that could change to Big Data on short notice.


    Horror Cloud Scenario

    I am trying to minimize the risk of this scenario:
    1. I port to a cloud solution that is tied closely to one cloud provider
    2. Move the applications over
    3. After a few months I find that there are unforeseen problems
    4. No easy path back
    5. Angry customers are calling


    Goals

    Here are my cloud computing goals in a little more detail:
    • Port data mining system and ASP.NET web applications to the cloud
    • Choose a cloud compatible with a code base in .NET and Python
    • Initially the data volume is moderate but it could possibly scale to Big Data
    • Keep cost and complexity under control
    • No downtime during transition
    • Minimize risk
    • Minimize vendor lock in
    • Run the same code in house and in the cloud
    • Make rollback to in house application possible


    Amazon Web Services vs. Windows Azure vs. OpenStack

    Choosing the right cloud computing provider has been time consuming, but also very important.

    I took a quick stroll through Cloud Expo 2011, and most big computer companies were there presenting their cloud solutions.

    Google App Engine is a big cloud service well suited for front end web applications, but not good for data mining, so I will not cover it here.

    The other 3 providers that have generated most momentum are: EC2, Azure and OpenStack.

    Let me start by listing their similarities:
    • Virtual computers that can be started with short notice
    • Redundant robust storage
    • NoSQL structured data
    • Message queue for communication
    • Mountable hard disk
    • Local non persistent hard disk
    Now I will write a little more about where they differ, and their good and bad parts:


    Amazon Web Services, AWS, EC2, S3

    Good:
    • This is the oldest cloud provider dating back to 2004
    • Very mature provider
    • Other providers are catching up with AWS's features
    • Well documented
    • Works well with open source, LAMP and Java
    • Integrated with Hadoop: Elastic MapReduce
    • A little cheaper than Windows Azure
    • Runs Linux, Open Solaris and Windows servers
    • You can run code on your local machine and just save the result into S3 storage

    Bad:
    • You cannot run the same code in house and in the cloud
    • Vendor lock in


    Windows Azure

    Good:
    • Works well with the .NET framework and all Microsoft's tools
    • It is very simple to port an ASP.NET application to Azure
    • You can run the same code on your development machine and in the cloud
    • Very good development and debugging tools
    • F# is a great language for data mining in cloud computing
    • Great series of video screen casts

    Bad:
    • Only run Windows
    • You need Windows 7, Windows Server 2008 or Windows Vista to develop
    • Preferably you should have Visual Studio 2010
    • Vendor lock in


    OpenStack

    OpenStack is a new open source collaboration that is building a software stack that can run both in house and in the cloud.

    Good:
    • Open source
    • Generating a lot of buzz
    • Main participants are NASA and Rackspace
    • Backed by 70 companies
    • You can run your application either in house or in the cloud

    Bad:
    • Not yet mature enough for production use
    • Windows support is immature


    Java, .NET Or Mixed Platform

    For data mining selecting the right platform is a hard choice. Both Java and .NET are very attractive options.

    Java only
    For data mining and NLP there are a lot of great open source projects written in Java. E.g. Mahout is a system for collaborative filtering and clustering of Big Data, with distributed machine learning. It is integrated with Hadoop.
    There are many more open source projects: OpenNLP, Solr, ManifoldCF.

    .NET only
    The development tools in .NET are great. It works well with Microsoft Office.
    Visual Studio 2010 comes with F#, which is a great language for writing worker roles. It is very well suited for lightweight threads or async, for highly parallel reactive programs.

    Mix Java and .NET
    You can mix Java and .NET. Cloud computing makes it easier than ever to integrate different platforms. You already have abstract, language agnostic services for communication: message queues, blob storage and structured data. If you have an ASP.NET front end on top of collaborative filtering of Big Data, this would be a very attractive option.

    I still think that combining 2 big platforms like Java and .NET introduces complexity, compared to staying within one platform. You need an organization with good resources and coordination to do this.


    Choice Of Cloud Provider

    I still have a lot of unanswered questions at this point.

    At the time of writing, June 2011, OpenStack is not ready for production use. So that is out for now.

    I have run some tests on AWS. It was very easy to deploy my Python code to EC2 under Linux. Programming C# against AWS services was simple.

    I am stuck waiting to get a Windows 7 machine so I can test Windows Azure.

    Both EC2 and Azure seem like viable options for what I need. I will get back to this in part 2 of the blog post.


    Highlights from Cloud Camp 2011

    A lot of people are trying to sell you cloud computing solutions, and I have heard plenty of cloud computing hype. I have been seeking advice from people who were not trying to sell me anything and had some real experience, trying to find some of the failures and problems in cloud computing.

    I went to CloudCamp June 2011 during Cloud Expo 2011 in New York, where cloud computing users shared their experience. It was an unconference, meaning spontaneous user discussion breakout groups were formed. The rest of this post is highlights from these discussions.


    Hadoop Is Great But Hard

    Hadoop is a Java open source implementation of Google's MapReduce. You can set up a workflow of operations and Hadoop will distribute them over multiple computers, aggregate the results and rerun operations that fail. This sounds fantastic, but Hadoop is a pretty complex system, with a lot of new terminology and a steep learning curve.


    Security Is Your Responsibility

    Security is a big issue. You might assume that the cloud will take care of security, but you should not. E.g. you should wipe the hard disks that you have used, so the next user cannot see your data.


    Cloud Does Not Automatically Scale To Big Data

    The assumption is that you put massive amounts of data in the cloud and the cloud takes care of the scaling problems.
    If you have a lot of data that needs little processing, cloud computing becomes expensive: you store all data in 3 different locations, and it is expensive and slow to move it down to the compute nodes. This was mentioned as the reason why NASA could not use S3, but built its own Nebula platform.


    You Accumulate Cost During Development

    An entrepreneur building a startup ended up paying $2000/month for EC2. He used a lot of different servers, and they had to be running with multiple instances, even though he was not using a lot of resources. This might be cheap compared to going out and buying your own servers, but it was more expensive than he expected.


    Applications Written In .NET Run Fine Under EC2 Windows

    An entrepreneur said that he was running his company's .NET code under EC2. He thought that Amazon was more mature than Azure, with Azure catching up. He preferred to build his own framework.


    Simpler To Run .NET Application On Azure Than On EC2

    A cloud computing consultant with lots of experience in both Azure and EC2 said: EC2 gives you a raw machine; you have to do more to get your application running than if you plop it into Windows Azure.
    It is very easy to port an ASP.NET application to Windows Azure.


    Cash Flow, Operational Expenses And Capital Expenses

    An often cited reason why cloud computing is great is that a company can replace big upfront capital expenses with smaller operational expenses. A few people mentioned that companies live by their cash flow: they do not like unpredictable operational expenses, and are more comfortable with predictable capital expenses.


    Tuesday, February 15, 2011

    Is IBM Watson Beginning An AI Boom?

    Artificial intelligence fell out of favor in the 1970s, the start of the first artificial intelligence winter, and has mainly been out of favor since. In April 2010 I wrote a post about how you can now get a paying job doing AI, machine learning and natural language processing outside academia.

    Now barely one year later I have seen a few demonstrations that signal that artificial intelligence has taken another leap towards mainstream acceptance:
    • Yann LeCun demonstrated a computer vision system that could learn to recognize objects from his pocket after being shown a few examples, during a talk about learning feature hierarchies for computer vision
    • Andrew Hogue demonstrated Google Squared and Google Sentiment Analysis at a Google Tech Talk; those systems both show rudimentary understanding of web pages and use word association
    • The IBM Watson supercomputer is competing against the best human players on Jeopardy
      All 3 of these systems contain some real intelligence. Rudimentary by human standards, but AI has gone from very specialized systems to handling more general tasks. It feels like AI is picking up steam. I am seeing startups based on machine learning pop up. This reminds me of the Internet boom in the 1990s. I moved to New York in 1996, at the beginning of the Internet boom. I saw firsthand the crazy gold rush where fortunes were made and lost in a short time; Internet startups were everywhere and everybody was talking about IPOs. This got me thinking: are we headed towards an artificial intelligence boom, and what would it look like?

      IBM Watson

      IBM Watson is a well executed factoid extraction system, but it is also a brilliant marketing move, promoting IBM's new POWER7 system and their Smarter Planet consulting services. It gives some people the impression that we already have human-like AI, and in that sense it could serve as a catalyst for investments in AI. This post is not about human-like artificial intelligence, but about the spread of shallow artificial intelligence.

      Applications For Shallow Artificial Intelligence

      Both people and corporations would gain value from having AI systems that they could ask free form questions and get answers from on very diverse topics. In particular in these fields:
      • Medical science
      • Law
      • Surveillance
      • Military

      Many people, me included, are concerned about a big brother state and military use of AI, but I do not think that is going to stop adoption. These people play for keeps.

      There are signs that the financial service industry is starting to use sentiment analysis for their pricing and risk models. Shallow AI would be a good candidate for more advanced algorithmic trading.

      Bottom Up vs. Top Down Approaches

      Here is a very brief, simplified introduction to AI techniques and tools. AI is a loosely defined field with a loose collection of techniques. You can roughly categorize them into top down and bottom up approaches.

      Top down or symbolic techniques
      • Automated reasoning
      • Logic
      • Many forms of tree search
      • Semantic networks
      • Planning
      Bottom up or machine learning techniques
      • Neural networks, computers with a structure similar to the brain
      • Machine learning

      The top down systems are programmed by hand, while the bottom up systems learn from examples without human intervention, a bit like the brain.

      What Is Causing This Sudden Leap?

      Many top down techniques were developed by the 1960s. They were very good ideas, but they did not scale; they only worked for small toy problems.
      Neural networks are an important bottom up technique. They started in the 1950s but fell out of favor; they came roaring back in the 1980s. In the 1990s the machine learning / statistical approaches to natural language processing beat out Chomsky's generative grammar approach.

      The technology that is needed for what we are doing now has been around for a long time. Why are these systems popping up now?

      I think that we are seeing the beginning of a combination of machine learning with top down techniques. The reason why this has taken so long is that it is hard to combine top down and bottom up techniques. Let me elaborate a little bit:

      Bottom up AI / machine learning systems are black boxes: you give them input and expected output, and they adjust a lot of numeric parameters so they can mimic the result. Usually the numbers will not make much sense; they just work.

      In top down / symbolic AI you are creating detailed algorithms for working with concepts that make sense.

      Both top down and bottom up techniques are now well developed and better understood. This makes it easier to combine them.

      Other reasons for the leap are:
      • Cheap, powerful and highly parallel computers
      • Open source software, where programmers from around the world develop free software. This makes programming into more of an industrial assembly of parts.

      Who Will Benefit From An AI Boom?

      Here are some groups of companies that made a lot of money during the Internet boom:
      • Cisco and Oracle, the tool makers
      • Amazon and eBay, small companies that grew to become dominant in e-commerce
      • Google and Yahoo, advertisement driven information companies

      Initially big companies like IBM and Google that can create the technology should have an advantage, whether in the capacity of tool makers or dominant players.

      It is hard to predict how high the barrier to entry in AI will be. AI programs are just trained on regular text found on or off the Internet. And today's supercomputer is tomorrow's game console. The Internet has a few dominant players, but it is generally decentralized and anybody can have a web presence.

      New York is now filled with startups using machine learning as a central element. They are "funded", but it seems like they got some seed capital. So maybe there is room for smaller companies to compete in the AI space.

      Job Skills That Will Be Required In An AI Boom

      During the Internet boom I met people with a bit of technical flair and no education beyond high school who picked up HTML in a week and next thing they were making $60/hour doing plain HTML. I think that the jobs in artificial intelligence are going to be a little more complex than those 1990s web developer jobs.

      In my own work I have noticed a move from writing programs to teaching software based on examples. This is a dramatic change, and it requires a different skill set.

      I think that there will still be plenty of need for programmers, but cognitive science, mathematics, statistics and linguistics will be skills in demand.

      My work would benefit from me having better English language skills. The topic that I am dealing with is, after all, the English language. So maybe that English literature degree could come in handy.

      Currently I feel optimistic about the field of artificial intelligence; there is progress after years of stagnation. We are wrestling a few secrets away from Mother Nature, and are making progress in understanding how the brain works. Still, the introduction of such powerful technology as artificial intelligence is going to affect society for better and worse.

      Friday, October 29, 2010

      Natural language processing in Clojure, Go and Cython

      I work in natural language processing, programming in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I do have some unmet needs. I investigated whether there are any new languages that would help. I only looked at minimal languages that would be simple to learn. The 3 top contenders were Clojure, Go and Cython. Both Clojure and Go have innovative approaches to non locking concurrency. This is my first impression of working with these languages.

      For contrast let me start by listing the features of my current languages.

      C# 3.5

      C# is an advanced object oriented / functional hybrid language and programming platform:
      • It is fast
      • Great development environment
      • You can do almost any tasks in it
      • Great database support with LINQ to SQL
      • Advanced web development with ASP.NET
      • Advanced GUI toolkit with WPF
      • Good concurrency with threading library
      • Good MongoDB library
      Issues
      • Works best on Windows
      • Not well suited for rapid development
      While many features of C# are not directly related to NLP, they are very convenient. C# has some NLP libraries: SharpNLP is a port of OpenNLP from Java. Lucene has also been ported. The ports are behind the Java implementations, but still give a good foundation.

      Python

      Python is an elegant scripting language, with a strong focus on simplicity.
      • NLTK is a great NLP library
      • Lots of open source math and science libraries
      • PyDev is a good development environment
      • Good MongoDB library
      • Great for rapid development
      Issues
      • It is interpreted and not very fast
      • Problems with GIL based threading model

        C# vs. Python and unmet needs

        I was not sure which language I would prefer to work with. I suspected that C# would win out with all its advanced features. Due to the demand for fast turnaround, I ended up doing more work in Python, and I have been very happy with that choice. I have a lot of scripts that can be piped together to create new applications, with the help of the very fast and flexible MongoDB.

        I do have some concerns about Python moving forward:
        • Will it scale if I get a really large amount of text
        • Will speed improve on multi core processors
        • Will it work with cloud computing
        • Part of speech tagging is slow


        Java

        Java is a modern object oriented language. Like C# it is a programming platform:
        • Has most NLP libraries: OpenNLP, Mahout, Lucene, WEKA
        • It is fast
        • Great development environment: Eclipse and NetBeans
        • You can do almost any tasks in it
        • Great database support with JDBC and Hibernate
        • Many web development frameworks
        • Good GUI toolkit: Swing and JavaFX
        • Good concurrency with threading library
        Issues
        • Functional style programming is clumsy
        • Working with MongoDB is clumsy
        • Java code is verbose

        I would not hesitate using Java for NLP, but my company is not a Java shop.

        Clojure

        Clojure was released in 2007. It is a right sized LISP: not very big like Common LISP, nor very small like Scheme.
        • Gives easy access to Java libraries: OpenNLP, Mahout, Lucene, WEKA, OpinionFinder
        • Innovative non locking concurrency primitives
        • Good IDEs in Eclipse and NetBeans
        • Easy to work with
        • Code and data is unified
        • Interactive REPL
        • LISP is the classic artificial intelligence language
        • If you need speed you can write Java code
        • Good MongoDB library
         Issues
        • The IDE is not working as well as IDEs for Java or C#

          Clojure is minimal in the sense that it is built on around 10 primitive programming constructs. The rest of the language is constructed with macros written in Clojure.

          Once I got Clojure installed, it was easy to work with and program in. Most of the good features of Python also apply to Clojure: it is minimal and has batteries included. Still, I think that Python is a simpler language than Clojure.

          Use case
          Clojure is a good way to script the extensive Java libraries, for rapid development. It has more natural interaction with MongoDB than Java.

          Clojure OpenNLP

          The clojure-opennlp project is a thin Clojure wrapper around OpenNLP. It came with all the corpora used as training data for OpenNLP nicely packaged, and it works well. You can script OpenNLP approximately as tersely as NLTK, from an interactive REPL.

          I tried it in both Eclipse and NetBeans. They seem somewhat equal in number of features. I had a little better luck with the Eclipse version.

          clojure-opennlp uses a Maven build system, but has a nontraditional directory layout; this caused problems for both Eclipse and NetBeans, and both took some configuration.

          Eclipse Counterclockwise
          The Counterclockwise instructions for labrepl mainly worked for installing clojure-opennlp.
          When you were done you had to go in and add the example directory to the source directories under project properties.

          NetBeans Enclojure
          I imported the project. I had to move the Clojure files from the example directory to a different position to get it to work.

          Maven plugins for Clojure
          The standard Maven directory layout has several advantages, e.g. if you want to mix Java and Clojure code. I created my own Maven POM configuration file, based on examples from other Clojure Maven projects. They used Clojure plugins for Maven, which I could not get to work. Eventually I ripped these plugins out and was left with a very plain POM file that worked.

          Go / Golang

          Go was announced in November 2009. It was created by Google to deal with multicore and networked machines. It feels like a mixture of Python and C. It is a very clean and minimal language.
          • It is fast
          • Good standard library
          • Excellent support for concurrency
          • It is trivial to write your own load balancer
          Issues
          • The Eclipse IDE is in an early stage
          • Debugger is not working
          • The Windows port is unfinished and has only just been released
          It was hard to find the right Go Windows port; there are several Go Windows port projects with no code.

          Use cases
          I currently have a problem that involves downloading a lot of HTML pages and parsing them into a tree structure. This does not have the best support in C#. I found a library that translates HTML to XHTML, after which I can use LINQ to process it. The library is not documented, not very fast, and fails on some HTML files.

          Go comes with an HTML library that parses HTML5. It is simple to write a program with some goroutines that download pages and others that parse the files into a DOM tree structure.
          I would use Golang for loading large amounts of text in a cloud computing environment.

          Cython

          Cython was released in July 2007. It is a static compiler for writing Python extension modules in a mixture of Python and C.

          Process for using Cython
          • Start by writing normal Python code
          • Find modules that are too slow
          • Add static types
          • Compile it with Cython using the setup tool
          • This produces compiled modules that can be used with normal Python
          Issues
          • It is still more complex than normal Python code
          • You need to know C to use it
          I was surprised how simple it was to get it working under both Windows and Linux. I did not have to mess with make files or configure the compilers. Cython integrated well with NumPy and SciPy. This substantially expands the programming tasks you can do with Python.

          Use cases
          Speed up slow POS tagging.

            My previous language experience

            Over the years I have experimented with a long list of non mainstream languages: functional, declarative, parallel, array, dynamic and hybrid languages. Many of these were frustrating experiences. I would read about a new language and get very excited. However, this would often be the chain of events:
            • Download language
            • Installed Cygwin
            • Find out how the language's build system works
            • Try to find a version of the GCC compiler that will compile it
            • Get the right version of Emacs installed
            • Try to get the debugger working under Emacs
            • Start programming from scratch since the libraries were sparse
            • Burn out

            You only have so much mental capacity, and if you do not use a language you forget it. Only Scala made it into my toolbox.

            Do Clojure, Go or Cython belong in your programmer's toolbox

            Clojure, Go and Cython are all simple languages. They are easy to install and easy to learn, and they all have big standard libraries, so you can be productive in them right away. This is my first impression:
            • Clojure is a good way to script the extensive Java libraries, for rapid application development and for AI work.
            • Go is a great language but it is still rough around the edges. There are not any big NLP libraries written for Go yet. I would not try to use it for my main NLP tasks.
            • Cython was useful right away for my NLP work. It makes it possible to do fast numerical programming in Python without too much trouble.


            -Sami Badawi

            Wednesday, March 17, 2010

            SharpNLP vs NLTK called from C# review

            C# and VB.NET have fewer open source NLP libraries than languages like C++, Java, LISP and Perl. My last blog post, Open Source NLP in C# 3.5 using NLTK, is about calling NLTK, which is written in Python, from IronPython embedded under C# or VB.NET.

            An alternative is to use SharpNLP, the leading open source NLP project written in C# 2.0. SharpNLP is not as big as other open source NLP projects. This blog posting is a short comparison of SharpNLP and NLTK embedded in C#.

            Documentation

            NLTK has excellent documentation, including an introductory online book on NLP and Python programming.

            For SharpNLP the source code is the documentation. There is also a short introductory article by SharpNLP's author Richard J. Northedge.

            Ease of learning

            NLTK is very easy to work with under Python, but integrating it as embedded IronPython under C# took me a few days. It is still a lot simpler to get Python and C# to work together than Python and C++.

            SharpNLP's lack of documentation makes it harder to use; but it is very simple to install.

            Ease of use

            NLTK is great to work with in the Python interpreter.

            SharpNLP simplifies life by not having to deal with the embedding of IronPython under C# and the mismatches between the 2 languages.

            Machine learning and statistical models

            NLTK comes with a variety of machine learning and statistical models: decision trees, naive Bayes, and maximum entropy. They are very easy to train and validate, but do not perform well for large data sets.

            SharpNLP is focused on maximum entropy modeling.

            Tokenizer quality

            NLTK has a very simple RegEx based tokenizer that works well in most cases.

            SharpNLP has a more advanced maximum entropy based tokenizer that can split "don't" into "do | n't". On the other hand, it sometimes makes errors and splits a normal word into 2 words.

            Development community

            NLTK has an active development community, with an active mailing list.

            SharpNLP's last release was in December 2006. It is a port of the Java based OpenNLP and can read models from OpenNLP. SharpNLP has a low volume mailing list.

            Code quality

            NLTK lets you write programs that read from web pages, clean HTML out of text and do machine learning in a few lines of code.

            SharpNLP is written in C# 2.0 using generics. It is a port from OpenNLP and maintains a Java flavor, but it is still very readable and pleasant to work with.

            License

            NLTK's license is the Apache License, Version 2.0, which should fit most people's needs.

            SharpNLP's license is LGPL 2.1. This is a versatile license, but maybe a little harder to work with when the project is not active.

            Applications

            NLTK comes with a theorem prover for reasoning about semantic content of text.

            SharpNLP comes with a name, organization, time, date and percentage finder.
            It is very simple to add an advanced GUI, using WPF or WinForms.

            Conclusion

            Both packages come with a lot of functionality. They both have weaknesses, but they are definitely usable. I have both SharpNLP and embedded NLTK in my NLP toolbox.

            -Sami Badawi