Friday, May 17, 2013

LISP Prolog and Evolution

I just saw David Nolen give a talk at a LispNYC Meetup called:


LISP is Too Powerful

It was a provocative and humorous talk. David showed all the powerful features of LISP and said that the reason why LISP is not more adapted is that it is too powerful. Everybody laughed but it made me think. LISP was decades ahead of other languages, why did it not become a mainstream language?

David Nolen is a contributor to Clojure and ClojureScript.
He is the creator of Core Logic a port of miniKanren. Core Logic is a Prolog like system for doing logic programming.

When I went to university my two favorite languages were LISP and Prolog. There was a big debate weather LISP or Prolog would win dominance. LISP and Prolog were miles ahead of everything else back then. To my surprise they were both surpassed by imperative and object oriented languages, like: Visual Basic, C, C++ and Java.

What happened? What went wrong for LISP?

Prolog

Prolog is a declarative or logic language created in 1972.

It works a little like SQL: You give it some facts and ask a question, and, without specifying how, prolog will find the results for you. It can express a lot of things that you cannot express in SQL.

A relational database that can run SQL is a complicated program, but Prolog is very simple and works using 2 simple principles:

  • Unification
  • Backtracking

The Japanese Fifth Generation Program was built in Prolog. That was a big deal and scared many people in the West in the 1980s.

LISP

LISP was created by John McCarthy in 1958, only one year after Fortran, the first computer language. It introduced so many brilliant ideas:

  • Garbage collection
  • Functional programming
  • Homoiconicity code is just a form of data
  • REPL
  • Minimal syntax, you program in abstract syntax trees

It took other languages decades to catch up, partly by borrowing ideas from LISP.

Causes for LISP Losing Ground

I discussed this with friends. Their views varied, but here are some of the explanations that came up:

  • Better marketing budget for other languages
  • Start of the AI winter
  • DARPA stopped funding LISP projects in the 1990s
  • LISP was too big and too complicated and Scheme was too small
  • Too many fraction in the LISP world
  • LISP programmer are too elitist
  • LISP on early computers were too slow
  • An evolutionary accident
  • Lowest common denominator wins

LISP vs. Haskell

I felt it was a horrible loss that the great ideas of LISP and Prolog were lost. Recently I realized:

Haskell programs use many of the same functional programming techniques as LISP programs. If you ignore the parenthesis they are similar.

On top of the program Haskell has a very powerful type system. That is based on unification of types and backtracking, so Haskell's type system is basically Prolog.

You can argue that Haskell is the illegitimate child of LISP and Prolog.

Similarity between Haskell and LISP

Haskell and LISP both have minimal syntax compared to C++, C# and Java.
LISP is more minimal, you work directly in AST.
In Haskell you write small snippets of simple code that Haskell will combine.

A few Haskell and LISP differences

  • LISP is homoiconic, Haskell is not
  • LISP has a very advanced object system CLOS
  • Haskell uses monadic computations

Evolution and the Selfish Gene

In the book The Selfish Gene, evolutionary biologist Richard Dawkins makes an argument that genes are much more fundamental than humans. Humans have a short lifespan while genes live for 10,000s of years. Humans are vessels for powerful genes to propagate themselves, and combine with other powerful genes.

If you apply his ideas to computer science, languages, like humans, have a relatively short lifespan; ideas, on the other hand, live on and combine freely. LISP introduced more great ideas than any other language.

Open source software has sped up evolution in computer languages. Now languages can inherit from other languages at a much faster rate. A new language comes along and people start porting libraries.

John McCarthy's legacy is not LISP but: Garbage collection, functional programming, homoiconicity, REPL and programming in AST.

The Sudden Rise of Clojure

A few years back I had finally written LISP off as dead. Then out of nowhere Rich Hickey single-handed wrote Clojure.

Features of Clojure

  • Run on the JVM
  • Run under JavaScript
  • Used in industry
  • Strong thriving community
  • Immutable data structures
  • Lock free concurrency

Clojure proves that it does not take an Google, Microsoft or Oracle to create a language. It just takes a good programmer with a good idea.

Typed LISP

I have done a lot of work in both strongly typed and dynamic languages.

Dynamic languages give you speed of development and are better suited for loosely structured data.
After working with Scala and Haskell I realized that you can have a less obtrusive type system. This gives stability for large applications.

There is no reason why you cannot combine strong types or optional types with LISP, in fact, there are already LISP dialects out there that did this. Let me briefly mention a few typed LISPs that I find interesting:

Typed Racket and Typed Clojure do not have as powerful types systems as Haskell. None of these languages have the momentum of Haskell, but Clojure showed us how fast a language can grow.

LISP can learn a lesson from all the languages that borrowed ideas from LISP.
It is nature's way.

Wednesday, May 1, 2013

Collision with the Zeitgeist

It has been 5 years since I started my blog. Back then I was alone with my obscure computer interests: functional programming languages, machine learning and AI.

I felt lucky when I met a Python programmer that I could chat with once in a while. Occasionally I would boil over and just start a computer rant to people at social events, until they ran off.

When I met my first Haskell programmer at a machine learning Meetup, it was like seeing a unicorn. Now I use Scala for work and discuss Haskell, Idris, LISP, category theory or machine learning on a daily basis.

In April 2013 my blog had over 8000 page views, which is more than I ever had imagined.


I am still surprised that my once obscure interests have become part of the zeitgeist, and hope they will stay for a while. But with technology you never know what is coming next.

Monday, April 1, 2013

Akka vs. Finagle vs. Storm

Akka, Finagle and Storm are 3 new open source frameworks for distributed parallel and concurrent programming. They all run on the JVM and work well with Java and Scala.

They are very useful for many common problems:

  • Real-time analytics
  • Complex website with different input and outputs
  • Finance
  • Multiplayer games
  • Big data


Akka, Finagle and Storm are all very elegant solutions optimized for different problems. It is confusing what framework you should use for what problem. I hope that I can clarify this.


The 30 seconds history of parallel programming

Parallel / concurrent programming is hard. It had rudimentary support in C and C++ in the 1990s.

In 1995 Java made it much simpler to do simple concurrent programming on one machine by adopting the monitor primitive. Still if you had more than a few threads you could easily get deadlocks, and it did not solve the bigger problem of spreading a computation over many machines.

The growth of big data created a strong demand for better frameworks.


MapReduce and Hadoop 

Hadoop the open source version of Google's MapReduce is the most well known method for distributed parallel programming.
It does streaming batch computation, by translating an algorithm to a sequence of map and reduce steps.
Map is fully parallel.
Reduce collect the result from the mapping step.

Hadoop has a steep learning curve. You should only use it when you have no other choice. Hadoop has a heavy stack with a lot of dependencies.

My post Hive, Pig, Scalding, Scoobi, Scrunch and Spark describes simpler frameworks built on top of Hadoop, but they are still complex.

Hadoop has long response time, and it not suited for real-time responses.

The Akka, Finagle and Storm frameworks are all easier to use than Hadoop and are suited for real-time responses.


Storm

Storm is created by Twitter and open sourced in 2011. It is written in Clojure and Java, but it works well with Scala. It is well suited for doing statistics and analytics on massive streams of data.

Storm can describe streaming computation very simply: You make a graph of you computation with some input data source called spouts at the top, below that computation nodes called bolts that can depend on any spout or bolt that has been computed above it, but you cannot have cycles. The graph is called a topology.

Features


  • Storm will deal with communication between machines and bolts
  • Consistent hashing to spread computation to right instance of a bolt
  • Error recovery due to hardware or network failure
  • Storm does not handle computations that fail due to inherent errors well
  • Can do analytics on Twitter scale input, literally
  • You can create a bolt in a non JVM language as long at it talks Thrift
  • Build in support for Ruby, Python, and Fancy

Word counter example in Storm

The hello world example for Hadoop is to count word frequencies in a big expanding text corpus.

This is a word counting topology in Storm:

  • Input spout is a stream of texts
  • Tokenization blot, you tell Storm how many instance of this you want, they can run either in local mode on one machine or in distributed mode on several machines.
  • Counting blot. This too can be split over different machines.


Spam filter topology in Storm

Let me give an outline of how to make a statistical spam filter using simple natural language processing. This can be used on emails, postings or comments.

There are 2 data paths:

Training path

Containing messages that have been classified as spam or ham.

  1. Input spout: is tuple of the text and status: spam or ham
  2. Tokenizer bolt. Input 1
  3. Word frequency bolt. Input bolt: 2
  4. Bigram finder bolt. Input bolts: 2 and 3
  5. Most important word and bigram selector bolt. Input bolts: 2, 3 and 4
  6. Feature extractor bolt. Input bolt: 5
  7. Bayesian model bolt. Input 3, 4 and 6,

Categorizer path

The unprocessed messages would be sent to the Bayesian model bolt so it can be categorized as spam or ham.

  • Input spout: tuple of the text and an id for message
  • Tokenizer
  • The feature extractor. Extracting optimal feature set of words and bigram. Bolt 6 from above.
  • Bayesian model. Bolt 7 from above
  • Result bolt tuple message id and probability of spam


This spam filter is quite a complex system that fits well with Storm.


Akka

Akka v0.6 was released in 2010. Akka is now at v2.2 and is maintained by TypeSafe the company that maintain Scala and it part of their stack. It is written in Scala and Java.

Features


  • Akka is based on the Erlang language that was developed to handle telecom equipment and be highly parallel and fault tolerant.
  • Akka gives you very fine grained parallel programming. You split you tasks up into smaller sub tasks and have a lot of actors compute these tasks and communicate the result.
  • Works well for two way communication.
  • An actor that can communicate with mailboxes and they are single threaded.
  • The motto of Akka's team is: Let it fail early.
  • If an actor fails Akka has different strategies for recovery.
  • Actors can have share access to an object on one machine, e.g. a cache.
  • Communication between different actors is very simple, it is just a method call and it will give you a future to a value.
  • Actors are very lightweight, you can have 2.7 million actors per GB of heap. 
  • There is a executions context with shared thread pool.
  • This works well when there is IO bound computations.
  • Turns computation into a Future, a monadic composable asynchronous computation.
  • Incorporate dataflow variables from a research language called Oz.

Use cases


  • Agents
  • Simulations
  • Social media 
  • Multiplayer game 
  • Finance
  • Online betting


Finagle

Finagle is created by Twitter and open sourced in 2011. It is written in Scala and Java.

Finagle lets internal and external heterogeneous services communicate. It implement asynchronous Remote Procedure Call (RPC) clients and servers.

Features


  • It is simple to create servers and clients for your application
  • Remote services calling each other
  • Speaks different protocols: HTTP, Thrift, Memcashed
  • Failover between servers
  • Is good for having one service query other services 
  • Uniform error policy
  • Uniform limits on retry
  • Two way communication
  • Load balancing
  • Turns computation into a Future, a monadic composable asynchronous computation.

Use cases

  • Complex website using different services and supporting many different protocols
  • Web crawler


Side by side comparisons


Akka vs. Finagle

Akka and Finagle can both do two-way communication.

Akka is great as long as everything lives on one actor system or multiple remote actors systems. It is very simple to have them communicate.

Finagle is much more flexible if you have heterogeneous services and you need different protocols and fallbacks between servers.


Akka vs. Storm

Akka is better for actors that talk back and forth, but you have to keep track the actors, and make strategies for setting up different actor systems on different servers and make asynchronous request to those actor systems. Akka is more flexible than Storm but there is also more to keep track of.

Storm is for computations that move from upstream sources to different downstream sinks. It is very simple to set this up in Storm so it run computation over many distributed servers.

Finagle vs. Storm

Finagle and Storm both handle failover between machines well.

Finagle does heterogeneous two-way communication.

Storm does homogeneous one way communication, but does it in a simple way



Serialization of objects

Serialization of objects is a problem for all of these, since you have to send object between different machine, and the default Java serialization has problems:

  • It is very verbose
  • It does not work well with Scala's singleton objects
Here are a few Scala open source libraries:





My experience

It has been hard for me to decide what framework is the better fit in a given situation, despite a lot of reading and experimenting.

I started working on an analytics program and Storm was a great fit and much simpler than Hadoop.

I moved to Akka since I needed to:
  • Incorporate more independent sub systems all written in Scala
  • Make asynchronous call to external services that could fail
  • Setup up on-demand services

Now I have to integrate this analytics program with other internal and external services some in Scala some in other languages I am now considering if Finagle might be a better fit for this. Despite Akka being easier in a homogeneous environment.


Afterthought

When I went to school parallel programming was a buzz word. The reasoning was: 

Parallel programming works like the brain and it will solve our computational problems

We did not really know how you would program it. Now it is finally coming of age.

Akka, Finagle, Storm and MapReduce are different elegant solutions to distributed parallel programming. They all use ideas from functional programming.

Tuesday, February 5, 2013

Scala vs. Haskell vs. Python

Functional programming is on the upswing, but should you bet your career on it, or is it a short-lived technology fad?

I have long wanted to use functional programming professionally and for the last year I have. Mainly Scala, written in Haskell style, plus some real Haskell programming.

Here is my impression of Scala and Haskell compared to my benchmark language, Python.


Scala

Scala is a functional object oriented hybrid language running on the JVM. It was created by Martin Odersky in 2003. Scala took Java / JVM and organized it nicely according to a few orthogonal principles.
Working in Scala has been a pleasure, there is a lot to like:
  • You have easy access to the giant world of Java libraries
  • Lot of libraries written for Scala
  • Very fast only 2 to 3 time slower than C
  • Big ecosystem
  • Easy to define a DSL in Scala so you can do everything in Scala
  • Very advanced type system
  • Adapted in the industry by: Twitter, LinkedIn, Foursquare, ...
  • Scala is the most adapted functional language
  • Web frameworks: Play, Scalatra, any kind of Java Servlets
  • Scalding: a very nice framework for Hadoop programming
  • Akka: an Erlang style actor system
  • Mixin composition
  • Good GUI with Swing and JavaFX
  • SBT the best build tool I have used
  • Scala is a full stack multi purpose language

Issues

  • It is very complex
  • It is a kitchen sink language
  • Confusing to keep Scala collections and parallel Java collections apart

Eclipse Plugin Scala IDE for Eclipse




The Scala Eclipse plugin is very solid, but not quite as good as the fantastic Java support.

  • Syntax highlighting
  • Code completion
  • Debugger
  • Shows compile errors with explanation
  • Rudimentary refactoring
  • Jump to definition


Monad and Applicative Functor

Two very important concepts in functional programming are monad and applicative functor.

The best reference I found was: Learn You a Haskell for a Great Good!.

A monad gives you simple ways of composing different operations. First it seems like an odd principle. Understanding monad took me several months.

In UNIX and OS X you can create complex program by piping simple commands together. A monad generalizes this a lot.

Once you understand the monad you will see monads pop up so many places. The monad is an amazingly powerful construct.

The last place I found monads unexpectedly showed up was in asynchronous programming, e.g. used in AJAX.
You send an external request and you do not block but you have a callback for when the result comes back. This is efficient but messy to program especially if you have a chain of requests to process and you have to have a lot of callbacks floating around. You can do this type of calculations using a future / promise, and luckily a future is a monad so you string a long list of operations after each other in a very simple way.


Scalaz

Scalaz is a Scala library that replicates a lot Haskell constructs, at the cost of being similarly hard to understand.

You can work with monads in Scala without using Scalaz since the "for-statement" in Scala is syntactic sugar for monadic "for-comprehension".

I programed Java in a functional style both professionally and for my open source project. It is possible but it is rather verbose and clunky. Scala is much more powerful, simpler and cleaner than both Java approaches, and Scalaz is a big step up from Scala.

When I started programming in Scala I read a really funny blog post called Truth about Scala that describes how a team starts to use Scala and first they are excited, but quickly it decent into a death spiral of complexity. I was concerned with this and tried to keep my code as simple as possible for and avoid Scalaz for a long time. I would advice other to become very comfortable with Scala before starting to work with Scalaz.


Haskell

Haskell is a strongly typed, lazy, pure functional programming language. It is an academic research language created by a committee in 1987.
One reason that I got into Haskell was in order to understand monads and applicative functors, they are important constructs in Haskell and category theory.

There is a steep learning curve for Haskell. Maybe it is more like a hump you have to get over. Just getting to basic proficiency is hard. It took me around one year of low intensity studying, but one day it just made sense.

  • Haskell now has a lot of libraries
  • Libraries and dependencies are handled by Cabal
  • It is fast only 2 - 3 times slower than C
  • Great concurrency
  • Repa native parallel numerical array processing
  • Very small language
  • Very pure
  • Very terse code
  • Very advanced type system
  • Hoogle a Haskell search engine
  • Great web frameworks Happstack, Snap and Yesod


Issues

  • Bad GUI support
  • Module system is crude


    Hoogle, a Search Engine for Haskell

    A colleague told me that when he needed a function he would write it out its signature and put it into Hoogle and often it will take him to the function that he needed. First time I tried it and it actually took me to a function solved a bigger part of the problem than what I was looking for.

    When I searched Hoogle for this function signature:

    (a -> Bool) -> [a] -> [Int]

    I got these results in EclipseFP:


    Eclipse Plugin EclipseFP

    EclipseFP with Hoogle


    The Haskell Eclipse plugin is quite good:
    • Syntax highlighting
    • Cabal integration
    • Hoogle integration
    • Code completion
    • Debugger
    • GHCi integration with automatic reload


    Python

    Python is a high-level language built on ideas from functional, imperative and object oriented programming. It was created by Guido van Rossum in 1989.

    For many years Python was my favorite language. It is a language for kids and also for scientists and a lot of people in between.
    • Python is probably the easiest language to learn
    • It took me a day to learn well enough to use
    • Very minimal language
    • Very terse code
    • Excellent wrapper language
    • Many implementations: CPython, Jython (JVM), IronPython (CLR), PyPy
    • Good bindings to numerical packages: NumPy, SciPy
    • Used in computer vision since OpenCV choosing Python to be its scripting language
    • Used in natural language processing due to the NLTK
    • Great web frameworks: Django, TurboGear, CherryPy

    Issues

    Python is not quite a full stack language there are a few missing pieces:
    • Bad GUI support
    • Low-level numerical programming had to be done in external packages
    • Concurrency
    • Speed around 50 times slower compared to C

    Eclipse Plugin PyDev




    I like PyDev it has:
    • Syntax highlighting
    • Code completion
    • Debugger


    Best Programming Language for Kids

    If a kid can understand a technology it is well designed. My daughter is turning 5 and I am thinking about what language I should introduce her to first.

    Python

    My first inclination was to teach her Python since it is the simplest, but it needs to give immediate visual feedback. Python's lack of a good GUI is a problem.

    Haskell

    I have also been tempted to show her some Haskell to teach her good habits in a pure and minimal language. But if I tell her that:

    "A monad is just a monoid in the category of endofunctors"

    she will walk away or scream.

    Scala

    Kojo is a LOGO like graphical turtle programming environment written in Scala. Scala's type inference makes it simpler for kids who will not have a good concept of types.

    My daughter plays with Kojo and she likes it. She comes and asks me if we can do the turtle?


    Kojo notice green drawing turtle in the middle


    So unexpectedly, Scala the biggest language was the most kid friendly language. Based on a very small sample size.


    Category Theory

    Haskell is using plenty of concepts from category theory. E.g. the monad. In my quest to understand it I started to study category theory.

    Category theory has been called: "Abstract nonsense", both by its practitioners and critics. And for very good reasons. It can suck you into a black hole of abstraction.

    Category Theory Introductions

    You do not need to understand category theory to program in Haskell or Scalaz, but if it helps you here are a few introduction videos.

    Dominic Verity presents a gentle introduction to Category Theory:

    http://vimeo.com/17207564


    Dominic Verity on Category Theory (Part 2)



    Error792's category theory class, currently there are 5 parts



    Math and Programming

    I have often said that there is no connection between math and programming. The only math you need to program is counting, and occasionally, addition. I felt:

    Programmers are the grease monkeys of today

    We move some data around and throw it on webpages

    After working in Scala and Haskell I have changed my tune:

    When you program in Scala you feel like an engineer

    When you program in Haskell you feel like a mathematician


    Adapting Haskell and Scalaz for a Team

    Using Haskell and Scalaz takes a special mindset and a lot of dedication. I have been very lucky to work at a place that has attracted physicists, mathematicians and theoretical CS people.

    If a big part of your team do not have these qualities you risk wasting time and chasing developers away.

    On the other hand if your team is using Haskell or Scalaz you will attract this brand of developers.



    Conclusion

    I had high expectation when I started using functional programming full time, but I have been disappointed by new technology many times before. Functional programming met my high expectations. It has been challenging and very enjoyable.

    I was a C++ programmed for 8 years, and considered C++ the one true way for high speed, high level programming.
    Recently I looked at a code sample written in C++ and it hurts my eyes: Filled with boilerplate and state.

    Functional programming is addictive and will make you spoiled


    Functional programming is here to stay. It has been an important part of C# since v3.0. It is finally getting added to Java in Java 8 coming out soon. The classic functional languages LISP or ML are are the basis of: Clojure and F# that have thriving community and are used in industry. The time has come to invest some time in understanding functional programming.


    Python

    I enjoy Scala and Haskell more than Python, but Python seem to be the language that I always go back to. It is a power tool that adds very little weight to your programmer's toolbox. You get high return on investment with Python, while with Scala and especially Haskell you have to invest a lot and for a long time before you break even.

    Scala

    Scala is now popular enough that you can get a job doing it. Moving from Java or C# to Scala is pretty easy. Since you can start programming Scala like Java.  Scala is a big and complex language with a big ecosystem and it takes months to get a deeper understanding. Scala is substantially more powerful than Java 7, but Java 8 has supposedly taken a lot of ideas from Scala.

    Haskell

    Haskell is definitely the road less traveled, but it is a road, not a trail. It is an academic research language created in 1987. Recently it has started to break into the mainstream. There are a few jobs in Haskell. Gaining basic proficiency in Haskell is quite hard, but afterwards other languages look a little clunky. Writing Haskell feels like doing math.

    Scala vs. Haskell

    Scala is a safer bet for most programmers, since it is better adapted to more tasks, and you can approximate Haskell pretty well with Scalaz. Scala has a very advanced type system to handle its object oriented features.

    Haskell appeals to functional language purists, mathematicians and category theorists. Esthetically I prefer Haskell. It is terser and the type inference is better.

    In most cases external factors would dictate whether Scala or Haskell would be a better fit for your project.


    Haskell vs. Python

    Haskell and Python have a lot in common:
    • Minimalistic languages
    • White space delimited
    • Very terse
    • List comprehension
    • Important tuple type
    • GUI binding to wxWidget, GTK
    Haskell is statically typed and optimized towards purity and speed.

    Python is dynamically typed and optimized towards pragmatism and simplicity.


    Monday, March 26, 2012

    Hive, Pig, Scalding, Scoobi, Scrunch and Spark

    Comparison of Hadoop Frameworks


    I had to do simple processing of log files in a Hadoop cluster. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data. There are several high level Hadoop frameworks that make Hadoop programming easier. Here is the list of Hadoop frameworks I tried:

    • Pig
    • Scalding
    • Scoobi
    • Hive
    • Spark
    • Scrunch
    • Cascalog

    The task was to read log files join with other data do some statistics on arrays of doubles. Programming this without Hadoop is simple, but caused me some grief with Hadoop.

    This blog post is not a full review, but my first impression of these Hadoop frameworks.


    Pig


    http://pig.apache.org/

    Created by Yahoo!
    Language Pig Latin.

    Pig is a data flow language / ETL system. It work at a much higher level than direct Hadoop in Java.
    You are working with named tuples. It is mildly typed, meaning you can define a type for each field in a tuple or it will default to byte array.

    • Pig is well documented
    • Pig scripts are concise
    • You can use it for both script and interactive sessions
    • You can make Pig scripts embedded in Python run as Jython
    • Pig is a popular part of the Hadoop ecosystem

    Issues

    • Pig Latin is a new language to learn
    • Pig Latin is not a full programming language, only a data flow language
    • You have to write User Defined Function / UDF in Java if you want to do something that the language does not support directly
    • Pig Latin does not feel uniform


    Scalding


    https://github.com/twitter/scalding

    Created by Twitter.
    Language Scala.

    Scalding looks very promising. It had just been open sourced when I looked at it.

    What sets Scalding apart from other Scala based frameworks is that you work with tuples with named fields.

    There is a blog with code example:
    http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
    It contains numerical code for Hadoop that did not look much harder than the equivalent non Hadoop code.

    Issues

    You call Scalding by running a Ruby build script, scald.rb.

    This did not work on my Mac, OS X Lion, but it ran under Ubuntu with no problems.

    Scalding has little documentation, but it is built on Cascading that does have good documentation.

    Note on aggregate functions in Scalding

    Scalding has some predefined aggregate function such as Sum and Count. Unfortunately Sum only works numerical types and I needed it to work on arrays of doubles.

    You can build your own aggregator function using Group Builder followed by a scanLeft or foldLeft operation.

    Workaround to get Scalding to run on OS X Lion


    After spending some time I found a workaround for the problems with running Scalding under OS X Lion:


    In the build scripts/scald.rb set:

    COMPILE_CMD="scalac"

    In build.sbt set scalaVersion to the version of Scala you are using:

    scalaVersion := "2.9.1"

    After that I was able to run the 5 tutorials, coming with Scalding.


    Scoobi


    https://github.com/NICTA/scoobi
    http://www.opennicta.com/home/scoobi

    Created by OpenNICTA.
    Language Scala.

    Scoobi looked powerful and simple.


    Scoobi was easy to build with the Scala build system SBT.

    Issues

    I was having problem running examples. Turned out that Scoobi had a dependency of the Cloudera Hadoop distributions version 0.20.2. This is a popular Hadoop distribution.

    I could not get it to run on my Mac, which has the Apache Hadoop distribution so I gave up. I have not revisited it yet.


    Hive


    http://hive.apache.org/

    Created by Facebook.
    Language SQL.

    Hive works on tables made of named tuples with types. It does not check the type at write time, you just copy files into the directory that represent a table. Writing to Hive is very fast, but it does check types at read time.

    You can run JDBC against Hive.

    It was easy to get Hive running and I really liked it.

    Issues

    A problem was that Hive only understands a few file format:

    • Text format
    • Tab delimited format
    • Hadoop SequenceFile format

    Starting from Hive version 0.9, is has support for Avro file format that can be used from different languages.

    In order to do Sum by group I would have to create User Defined Aggregation Function. Turns out that UDF and UDAF is badly documented. I did not find any examples about how to write them for arrays of doubles.


    Spark


    http://www.spark-project.org/

    Created by Matei Zaharia from UC Berkeley AMP Lab.
    Language Scala.

    I was very impressed by Spark. It was easy to build with SBT. It was very simple to write my program. It was trivial to define group by sum for double array, just by defining a vector class with addition.

    Issues

    I tested my program in local mode. I was very happy that I had a workable solution. Then I investigated moving it to a Hadoop cluster. For this Spark had dependency on Apache Mesos cluster manager. Mesos is a thin virtual layer that Hadoop is running on top of.

    It turned out that Spark is not actually running on Hadoop. It is running on HDFS and is an alternative to Hadoop. Spark can run side by side with Hadoop if you have Apache Mesos installed.

    Spark is an interesting framework that can outperform Hadoop for certain calculation. It uses the same code from running in memory calculation and code on a big HDSF cluster.

    If you have full control over your cluster Spark could be a good option, but if you have to run on an established Hadoop cluster it is very invasive.


    Scrunch


    https://github.com/cloudera/crunch/tree/master/scrunch

    Created by Cloudera.
    Language Scala.


    Scrunch looked powerful and simple. It is easy to build with SBT.

    Issues

    Dependent on Cloudera's Hadoop 0.20.2 distribution.

    The web site describes it as alpha quality, so I did not do much with Scrunch.


    Cascalog


    https://github.com/nathanmarz/cascalog

    Created by Nathan Marz from Twitter.
    Language Clojure, a modern Lisp dialect.

    As Scalding Cascalog is built on Cascading.

    Easy to build with the Clojure build system Leiningen.

    It is used as a replacement for Pig. You can run it from the Clojure REPL or run scripts, and get a full and consistent language.

    I tried Cascalog and was impressed. It is a good option if you are working in Clojure


    Hadoop vs. Storm


    I had to solve the same log file statistics problem in real-time using the Storm framework and this was much simpler.

    Why is Hadoop so hard

    • Hadoop is solving a hard problem
    • Hadoop is a big software stack with a lot of dependencies
    • Libraries only work with specific versions of Hadoop
    • Serialization is adding complexity, see next section
    • The technology is still not mature
    Looks like some of these problems are getting addressed. Hadoop should be more stable now that Hadoop 1.0 has been released.


    Serialization in Hadoop


    Java have a built in serialization format, but it is not memory efficient. Serialization in Hadoop has to be: 
    • Memory efficient
    • Use compression
    • Support self-healing
    • Support splitting a file in several parts

    Hadoop SequenceFile format has these properties, but unfortunately it does not speak so well with the rest of the world.

    Serialization is adding complexity to Hadoop. One reason Storm is simpler is that it just uses Kryo Java serialization to send objects over the wire.

    Apache Avro is a newer file format that does everything that is needed by Hadoop but is speaks well with other languages as well. Avro is supported in Pig 0.9 and should be in Hive 0.9.


    High level Hadoop frameworks in Java


    You do not have to use Scala or Clojure to do high level Hadoop in Java. Cascading and Crunch are two Java based high level Hadoop frameworks. They are both based on the idea is that you set up a Hadoop data flow with pipes.

    Functional constructs are clumpy in Java. It is a nuisance but doable. When you deploy code to a Hadoop cluster you have to pack up all your library dependencies into on super jar. When you are using Scala or Clojure you need to also package the whole language into this super jar. This also adds complexity. So using Java is a perfectly reasonable choice.

    Here are two high level Java Hadoop libraries:


    Cascading


    Both Scalding and Cascalog is built on top of Cascading.

    Cascading is well documented. You can write concise Java code for Hadoop.


    Crunch


    Scrunch is built on top of Crunch.


    Conclusion


    I liked all of the Hadoop frameworks I tried, but there is a learning curve and I found problems with all of them.

    Extract Transform Load

    For ETL Hive and Pig are my top picks. They are easy to use, well supported, and part of the Hadoop ecosystem. It is simple to integrate a prebuilt Map Reduce classes in data flow in both. It is trivial to join data source. This is hard to do in plain Hadoop.

    Cascalog is serious contender for ETL if you like Lisp / Clojure.

    Hive vs. Pig

    I prefer Hive. It is based on SQL. You can use your database intuition and you can access it though JDBC.

    Scala based Hadoop frameworks

    They all made Hadoop programming look remarkable close to normal Scala programming.

    For programming Hadoop Scalding is my top pick since I like the named fields. 

    Both Scrunch and Scoobi are simple and powerful Scala based Hadoop frameworks. They require Cloudera's Hadoop distribution, which is a very popular distribution.


    Saturday, July 23, 2011

    Scala, Eclipse and Maven integration tutorial

    I have evaluated Scala as a language for cloud computing and Hadoop. One requirement was a robust development environment, with a real build system, a good IDE with code completion and debugging.

    The combination of ScalaEclipse and Maven seemed like a fit for this requirement, but my initial experience was mixed.


    Problems with Scala, Eclipse and Maven integration

    It was easy to install Scala, Eclipse and Maven, but when I set up a project it had a persistent error in Eclipse:

    object Predef does not have a member AnyRef

    Other problems:

    • There were problems running the unit test.
    • I had to restart Eclipse a lot.
    • Eclipse had Scala set to version 2.9.0.1 while Maven had 2.8.0. When I tried to change Maven to use 2.9.0.1 the pom.xml file would be marked as having an error.
    I searched internet for help but could not find it. After a good deal of experimenting I sorted out the problems and found a good solution.


    Software versions

    My setup is:
    • Scala 2.9.0.1.
    • Eclipse 3.7 Indigo
    • Scala-ide Eclipse plugin: scala nightly 29 - http://download.scala-ide.org/nightly-update-wip-experiment-2.9.0-1


    Scala, Eclipse Maven project setup tutorial

    Here are that steps that I took to set up at new Scala, Eclipse and Maven project so it works with unit testing.

    Press menu item: File - New - Other...



    Select Maven Project




    Select the org.scala-tools.archetypes scala-archetype-simple




    Add group id and artifact id to project. Click Finish



    This will create the project with example program and unit tests, but it will leave Eclipse in an unstable state







    In the project's pom.xml file make the changes that I have marked in red:


    <properties>
     <maven.compiler.source>1.5</maven.compiler.source>
     <maven.compiler.target>1.5</maven.compiler.target>
     <encoding>UTF-8</encoding>
     <scala.version>2.9.0-1</scala.version>
    </properties>
    
    
    <dependency>
     <groupId>org.scala-tools.testing</groupId>
     <artifactId>specs_${scala.version}</artifactId>
     <version>1.6.8</version>
     <scope>test</scope>
    </dependency>
    

    Now both Scala IDE and Maven are both using the same version of Scala. Scala 2.9.0.1


    Right click the whole project and select: Configure - Add Scala Nature




    Now use the Maven build system to clean, build and run unit tests. Run from either Eclipse or command line.

    From Eclipse, right click the whole project and selecting:
    Maven clean
    Maven install





    From command line:

    C:\prog\apache-maven-2.2.1\bin\mvn clean
    C:\prog\apache-maven-2.2.1\bin\mvn install

    Note that you have to use Maven 2.2 and not Maven 3.



    Now there should be no more errors.
    The unit test: "scalatest.scala" has some problems, delete it.

    Run all unit tests from Eclipse. By right clicking the whole project and select Run As JUnit Test



    Now you can see the result in the JUnit runner.


    Final impression of Scala, Eclipse and Maven integration

    Once I had resolved the problems the Scala, Eclipse and Maven combination was a great development environment meeting my requirements.

    One thing that is currently missing from the Scala Eclipse plugin is code refactoring. Refactoring works very well in both Eclipse for Java and Visual Studio for C#.



    Tuesday, July 12, 2011

    Natural language processing in F# and Scala

    I do natural language processing in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I have some unmet needs. I am investigating if there are any new languages that would help.

    I 2010 I tried out 3 new languages:
    Natural language processing in Clojure, Go and Cython

    Recently I have investigated F# and Scala. They are both hybrid functional - object oriented languages; inspired by ML / OCaml / Haskell and Java / C#.

    Python as the benchmark

    Python is widely used in natural language processing. I am most productive in Python for NLP work. Here are a few reasons why:

    • NLTK is a great Python NLP library
    • Lot of open source math and science libraries e.g. NumPy and SciPy
    • PyDev is a good development environment
    • Good integration with MongoDB library
    • Great for rapid development

    Python shortcomings
    • Slow compared to compiled language
    • GUI support is crude
    • Multi-threading is crude
    • Compilation does give more robustness

    It should be possible to make a super language that has the elegance of Python, but without these shortcomings.


    My first Scala experience

    In 2006 I thought Scala was this super language. It is very advanced; you can call any Java libraries from Scala, including all the open source libraries. But I ran into a list of problems with Scala:

    • The Scala IDE was far behind Eclipse Java
    • Scala is a quite complex language
    • The Java libraries and the functional programming libraries were badly integrated
    • There were no Scala REPL or interpreter like in Python
    Scala was stable enough for use, but it did not improve my productivity so after some months I went back to using Python as my scripting language.


    Python's weakness

    Recently I had to make a small text processing application that end users could use directly. This was not the best fit for Python. Normally my Python programs have no GUI and are controlled by command line parameters.

    I had 2 Python options:

    Make simple GUI using TkInter
    TkInter is a Python wrapper of TK, the cross platform GUI toolkit. It is pretty crude by modern GUI standards, but would have been good enough. However trying to install all the Python libraries that I needed on the end users machine would be setting myself of for a maintains nightmare.

    Wrap code in web application
    I could wrap a web interface around it. The application is using a lot of memory and I would have to maintain a web application.

    I had a 1 week hard deadline for the task and both of these options looked unappealing. I needed something else...


    My first F# application

    I took a chance on F#, and managed to learn enough F# to finish the program by my 1 week deadline.

    There is no GUI builder for F# in Visual Studio, but it was pretty easy to hand code a simple WinForms GUI to wrap around the core code. It was not pretty but you could give it to an end user. The whole application ended up being one 40KB executable file, and it was very fast. F# had actually filled a niche that Python does not do so well.

    There were also problems, I wrote the whole application from scratch, while in Python I would have been able to use NLTK, write the code faster and get better results.

    All in all this was very good experience. I thought that F# would be a good supplement to my Python library. It would both give me raw speed when I need it and good connectivity with C#, ASP.NET, WPF and Microsoft Office.


    Functional programming benefits

    Functional programming is a great fit for my NLP work.

    I have a lot of different text sources: database, flat file, directory, public RESTful web application services.

    I have many word transformations: stop word filters, stemmers, custom filters.

    I need many operations building on other operations: Bigram finder, POS tagger, named entity recognizer.

    Created different reports: database, csv, Excel.

    In functional languages you can just take any combinations of these operations and easily pipe them together while getting good compiler support. This does not fit so well with object oriented programming were you are more concerned with encapsulation.


    F# impression

    F# is the first compiled language I tried that is comparable to Python in simplicity and elegance. It has a real Pythonic feel:

    • F# is fast
    • Simple and elegant
    • Good development environment in Visual Studio 2010
    • Best concurrency support of any language I have seen
    • Good database support
    • Good MongoDB library
    • Simple to combine F# with C# or VB.NET for ASP or WPF
    • Good REPL

    Issues

    • Runs best under Windows
    • For an IDE you really need Visual Studio 2008 or 2010, and that cost at least $700
    • F# can be compiled and run the shell from SharpDevelop 4.0 and 4.1, but you do not have the same productivity
    • The math libraries under .NET are not as good as NumPy and SciPy
    • The NLP libraries are better under Python


    Scala revisited

    After the success with F# I was very curious about why F# has been so much more successful than my first experience with Scala.

    I looked at an F# and Scala cheat sheet and thought they look remarkably similar. I watched a few screen casts and found no obvious problems. I bought the book: Programming in Scala, Second Edition, it turned out to be a very interesting computer science book and I read the whole 852 pages. Scala still looked good.

    I installed the Scala Eclipse plugin and wrote some code. Both the language and the IDE have come a long way during the last 5 years:
    • 15 books about Scala
    • 2 great free books
    • Tooling is much better
    • IDE is much better with code completion
    • Native NLP libs: ScalaNLP and Kiama

    Of all the issues I had when I first tried Scala. The only remaining one is:
    Scala is a pretty complex language

    It is incredible how Scala has taken a lot of messy features from Java and turned it into a clean modular system, at the cost of some complex abstractions.


    F# vs. Scala

    Despite many similarities, the languages have a different feel. F# is simpler to understand, while Scala is the more orthogonal language. I have been very impressed by both.

    F# better

    • Simpler to understand
    • Fantastic concurrency
    • Tail recursion optimized
    • Works well with Windows Azure

    Scala better 

    • More orthogonal, reusing the same constructs
    • Works with any Java library so more libraries
    • Better NLP libraries
    • Works well with Hadoop


    Cloud computing

    Functional programming works well with cloud computing. For me the availability of a good functional language is a substantial factor in selecting a cloud platform.

    Google introduced MapReduce to handle massive parallel multi computer applications.

    Hadoop is the Java based open source version of MapReduce. To run Hadoop natively it has to run a JVM language like Java or Scala.

    Hadoop Streaming extends a limited version of Hadoop to work with programs written in other programming languages as long as they work like a UNIX pipes that read from stdin and write to stdout.

    There is a Python wrapper for Hadoop Streaming called Dumbo. Python is around 10 times slower than Java and Dumbo is a limited version of the Hadoop, so if you are trying to do NLP on massive amount of data this might not solve your problems.

    Scala is fast and will give you full access to run native Hadoop.

    Microsoft's version or MapReduce is called: Dryad or LINQ to HPC. It is not officially released yet, but F# works well with Windows Azure.


    NLP and other languages

    Let me finish by giving a few short comparisons of F# and Scala with other languages:


    Clojure vs. Scala

    Clojure is a LISP dialect that it also running on the JVM, and it the other big functional language running there. Clojure has some distinct niches for NLP:

    Clojure better
    • Language understanding
    • Formal semantic: taking text and translating it to first order propositional logic
    • Artificial intelligence tasks

    Scala better
    • It is easy to write fast Scala code
    • Smaller learning curve coming from Java
    I tried Clojure recently and was very impressed; but more of my work falls in the category that would benefit from Scala.


    Java vs. Scala

    Java better

    • Better IDE tools and support
    • Better GUI builders
    • Great refactoring support
    • Many more programmers that know Java

    Scala better

    • Terser code
    • Closures
    • First class function
    • More expressive language


    C# vs. F#


    C# better

    • Better IDE tools and support
    • Better GUI builders
    • There are a lot more programmers that know C#
    • Better LINQ to SQL support

    F# better

    • Terse code
    • Better support for concurrency, Synch, continuations
    • More productive for NLP


    Conclusion

    F# and Scala are similar hybrid functional object oriented languages.

    For years I have periodically tried functional programming languages to see if they were ready for mainstream corporate computing; and they were not. With the recent spread of functional features into object oriented languages I thought that real functional programming languages would soon be forgotten.

    I was pleasantly surprised by how well F# and Scala work now. Functional languages are finally coming of age and becoming useful in mainstream corporate computing. They are stable enough, and they have niches were they are more productive than object oriented languages like C# and Java.

    I really enjoy programming in F# and Scala, they are a very good fit for natural language processing and cloud computing. For bigger NLP projects I now prefer to use F# or Scala over C# or Java.

    For GUI and web programming the object oriented languages still rules. Stick with C# or Java if the NLP part is small or GUI or web interface is the domineering part.

    Java and C# are also improving e.g. by adding more functional features. Many working programmers are well served by just waiting for Java 8 or C# 5. But functional programming is here to stay. Rejoice...