Saturday, February 22, 2020

Haskell and Hadoop the Aftermath

In 2012 Haskell and Hadoop were the hottest technologies. They had a lot of hype and I loved them. Both were based on functional programming and built on towering abstractions.

Elite functional programmers used Haskell. Serious tech startups had to use big data, meaning Hadoop. Three years later I had learned Haskell and Hadoop and my top advice to startups was:

Don't use Haskell or Hadoop!

They won't you give you a competitive advantage they will just slow you down.

That was my personal experience. For years after that I avoided jobs involving Hadoop, but for the last couple of years I have mainly been working in Hadoop with Spark. It's now solid and very productive.

I found the productivity increase quite remarkable. Some of it is a textbook example of technology life-cycle, but a some of it comes down to understanding the power and limitation of functional programming.

Modern Programming Paradigms

There are three main modern programming paradigms:
  • Object oriented
  • Functional
  • Declarative

Object oriented programming gives you fine-grained control. Functional programming uses transformations with less control. In declarative programming you just write queries and you have little control. The higher the abstraction the less control.

Essential Hadoop

The breakthrough that Hadoop / MapReduce made was that by using functional programming transformation you could distribute a computation over thousands of computers in a fault tolerant way. This was a monumental achievement, but what made Hadoop the dominant data platform it is today was that it later combined functional with the declarative programming available in Spark SQL, HIVE or PIG.

Combined functional and declarative programming was once the holy grail in computing, but nobody knew how to do it. Today it is ubiquitous and it is free, until you get the bill from your cloud provider.

Essential Haskell

I expected Haskell to be a mathematical version of Python. It was not. If you are trying to do object oriented programming in Haskell it will cause you a lot of pain. Unlike doing OOP in hybrid languages like F#, OCaml or Scala.

The power of Haskell is that it limits you to a small set of basic operations that compose. This allows you to build a big machine out of simple parts. The lazy evaluation makes it natural to work on infinite streams of data. The powerful type system makes it possible to connect small pieces of code in many different dimensions. My metaphor is:

Haskell is an extra dimensional Lego set

Haskell started as a playground for language researchers experimenting. I wanted to play with all these shiny theoretical toys. That was a big time sink and a part of the reason it took me a long time to learn.

Common Problems

One reason I gave up on both Haskell and Hadoop was that it was hard to get things done. Both were beautiful abstractions built on a tower of unstable software libraries. Everything was evolving quickly. This made it hard to keep the libraries underneath on compatible version. Every time your Hadoop distribution was updated your code would break.

In Haskell this problem was called Cabal Hell after the build system Cabal. There were simple solutions. Haskell now has stable versions of libraries that work with each other. It has a modern build system called Stack. Now tooling in both Haskell and Hadoop is quite good.

The Aftermath

I spent more time and effort learning Haskell and Hadoop than any other technologies. With that much effort I expected them to give me superpowers. Instead they slowed me down. This caused a backlash. I felt naive for jumping on the Haskell and Hadoop bandwagon and wasting so much time.

Now 8 years later the dust has settled and part of my problem was that I was an early adapter of immature technologies. Haskell and Hadoop are now mature but inherently complex technologies. They draw their power from giving up fine control. Instead they let you build machines that you can pipe data through.

Big data in the case of Hadoop. Infinite data in the case of Haskell.

Hadoop is highly successful, and is now a cornerstone of data engineering. Even though it is currently standing in the shadow of Spark that was built to run on top of Hadoop infrastructure.

Haskell is a practical programming language well suited for constructive mathematics and category theory, but it is not a better version of Scala. It is pretty successful at number 19 on RedMonk programming language ranking and is used in industry.

No comments: