Thursday, April 25, 2019

Benefits of Different Python Distributions on Mac

There are at least 5 popular ways to install Python on OS X / Mac.

  • OS X default Python installation, currently Python 2.7.10
  • Use brew install python
  • Use brew install pyenv
  • Anaconda
  • Python pkg installer from python.org

I have used all of these distributions. They are all high quality and easy to install, but you run into conflicts later. You think that you are installing a library into one Python distribution, but it gets installed into another distribution, so you cannot use it. This causes many frustrating errors.

Every time I set up a Mac I have to decide which Python distribution is best for my use case, and there is no simple choice. It has been hard to find good documentation on the trade-offs between the Python distributions. I have compiled a short list of benefits and issues and where I think the different distributions make sense.
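
Whichever distribution you pick, most of the conflicts come down to which interpreter a script actually runs under. A quick sanity check you can run with any of the distributions below, using only the standard library:

import sys

# Show which interpreter is running and where its libraries live
print(sys.version)
print(sys.executable)
print(sys.prefix)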


OS X Default Python Installation


  • You don't have to install anything
  • If you only want to have one Python distribution this will be the one
  • It is a pretty recent version of Python 2.7 currently 2.7.10

Issue

  • It does not support Python 3, which is now in common use

If you are only doing light Python 2 scripting this is probably the easiest choice.


brew install python


  • Brew is the de facto package manager on OS X so most software is installed with brew
  • Very up to date versions of Python 2 and Python 3
  • Works well when you want to install many Python libraries
  • Python 3 is the default, but brew install python@2 will install Python 2
  • It takes precedence over the OS X default Python by being earlier on the PATH environment variable
  • Brew will probably install Python as a requirement for other packages so you get it whether you want it or not

Good for more demanding programming and installing libraries.


brew install pyenv 


  • pyenv is a tool that gives you different versions of Python to choose from
  • It has no dependencies on either Python 2 or 3 but manipulates the PATH environment variable
  • It can co-exist with brew install python
  • It can also work with virtual environments

Issues

  • You have to install other libraries, say gzip, before you can install this
  • Python is compiled from scratch and you easily run into compile problems

Good if you are a serious programmer who needs many different versions of Python, possibly with conflicting versions of libraries.


Use Anaconda


  • Anaconda installs different versions of Python with high quality, curated packages specialized for data science
  • It can be hard to get data science libraries working with manual installs
  • It is a whole ecosystem of software 
  • Includes a good Python GUI called Spyder
  • Great support for Jupyter notebooks
  • Has good built in support for Python's virtual environments 

Issue

  • It is a pretty heavy distribution taking up around 3GB

I usually need the data science libraries so I install Anaconda but also end up with the brew version of Python.


Python pkg Installer From python.org


  • It is the official Python distribution
  • You can always get the newest version of Python
  • Self contained installer

It is an easy way to get the latest version of Python installed.

Thursday, February 7, 2019

ML and Data in AWS, Azure and GCP

Machine learning and data technology are changing fast, and the big cloud providers compete with new offerings. This blog post is a short introduction to what that looks like in 2019. It is focused on the cloud providers Amazon Web Services, Microsoft Azure and Google Cloud Platform.

A few things I will discuss -
  • Most data in an organization can be put into a data lake to query and combine
  • We now have very powerful, user friendly open source ML libraries
  • We have NLP and computer vision REST APIs from cloud providers
Let me start with a little history of both ML and data.


History of Machine Learning Libraries


Simplified timeline for languages, libs and APIs

  • 1960 Lisp since ML was a small part of A.I.
  • 1986 C++ neural network software on a floppy disk in the back of book
  • 1997 Open source Java ML like WEKA, good but hard to integrate with your data and code
  • 2010 Modern Python open source libs NumPy, Pandas, Scikit-learn easy to use and integrate
  • 2015 Spark ML, attempts to make a fast ML pipeline as easy to use as Scikit-learn
  • 2017 Deep learning open source libraries TensorFlow, Keras and PyTorch
  • 2017 Cloud Vision API and Natural Language API
We now have several strong contenders to build or buy production quality ML functionality.


Convergence of Data


Recently I talked with a DBA and was surprised how much the DBA profession has changed. He told me big organizations used to have a big database such as Oracle, SQL Server, Sybase or DB2 and a lot of data stored in different files.

Now maintaining the data lake is one of his main responsibilities. The data lake is a system that allows you to store log files and structured, semi structured and unstructured data files in cheap cloud blob storage and still query and join them with SQL.
He was also in charge of an Oracle database and a few open source databases: MySQL, Postgres and MongoDB.


Data Lake Fundamentals


Uniform data that can be joined is very powerful. Here are a few underlying technologies that make this possible.

In 2004 Google released the famous MapReduce paper, describing how you can do distributed computation using functional programming operations. The idea is that you send your computation to where your data is.

In 2010 Hadoop was released. Hadoop is an open source Java implementation of MapReduce. It turned out to be very hard to program in. Two new technologies made it easier to program MapReduce: Hive and Spark.

Hive

A lot of MapReduce jobs were just queries on data. Hive is a tool that lets you write these queries as simple SQL. Hive translates the SQL to a MapReduce job; all you have to do is add a schema definition describing the files with your data.

Spark

With Spark you can write more complicated MapReduce jobs. Spark is written in Scala, which is a natural language to write MapReduce in. Spark is often used to ingest data into the data lake.
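
As a rough sketch of what such an ingestion job can look like from Python, assuming a Spark installation with the PySpark bindings and hypothetical bucket paths:

from pyspark.sql import SparkSession

# Start a Spark session and read semi structured JSON logs from blob storage
spark = SparkSession.builder.appName("ingest-logs").getOrCreate()
logs = spark.read.json("s3a://my-bucket/logs/2019/02/*.json")

# Write the data back out as Parquet so the data lake query engines can use it
logs.write.mode("overwrite").parquet("s3a://my-bucket/lake/logs/")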

All the cloud providers have great support for Spark, AWS has EMR, Azure has HDInsight and GCP has Dataproc.


Combining Data Lake and Normal Database


Combining a data lake with an RDBMS is not easy. There are several approaches.

You can copy over all your relational data to your data lake every day. It takes work to build and operate, but when it is working everything is unified and it is easy to do any kind of analytic queries. Some data lake products have specialized functionality to do this in an easier way, see below.


Data Lake on AWS, Azure and GCP


AWS, Azure and GCP have different data lake solutions.

AWS Redshift and Redshift Spectrum


AWS Redshift is a proprietary columnar database built on Postgres 8.
Redshift Spectrum is a query engine that can read files from S3 in these formats: Avro, CSV, JSON, Parquet, ORC and text, and treat them as database tables. First you have to make a Hive table definition in the Glue Data Catalog.

Azure Data Lake Store


Microsoft's data lake is called Azure Data Lake Storage. It works with blob storage and is compatible with HDFS, the Hadoop distributed file system.

U-SQL is a query tool to combine Azure SQL DB and your data lake.

Google BigQuery


GCP's data lake is called BigQuery. It works with blob storage and stores native data in a proprietary columnar format called Capacitor.
BigQuery is very fast and has a nice web GUI for SQL queries. It is very easy to get started with, since it can do schema auto-detection on your blob data, unlike Hive, which needs a table definition before it can process the data.
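
A minimal sketch of that auto-detection with the google-cloud-bigquery Python client; the bucket, dataset and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Let BigQuery infer the schema from the CSV files sitting in blob storage
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.csv",       # placeholder bucket
    "my_project.my_dataset.events",      # placeholder table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish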

New Cloud ML APIs


In 2017 Google released their Cloud Vision API and Natural Language API. I heard from several data scientists that instead of building their own computer vision system, named entity or sentiment analysis system, they just use APIs.

It feels like cheating, but ML APIs are here to stay.
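
To give a sense of how little code such a call takes, here is a sketch against the google-cloud-vision Python client. The image file is a placeholder, and in older versions of the client the Image type lives under vision.types:

from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Read a local image and ask the API for labels
with open("street_scene.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, label.score)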

When should you build your own ML models and when should you use the APIs?
If you have a hard problem in computer vision or NLP that is not essential to your goal, then using an API seems like a good idea. Here are a few reasons why it can be problematic:

  • It is not free
  • Sometimes it works badly
  • There are privacy and compliance issues
  • Are you helping train a model that your competitor is going to use next
  • Speed e.g. if you are doing live computer vision


Working with ML APIs


If you decide to use the ML APIs your job will be quite different than if you choose to build and train your own models. Your challenges will be:

  • Transparency of data
  • Evolution of your data sources
  • Transparency of ML models
  • ML model evolution
  • QA of ML models
  • Interaction between ML models

The 2014 book Linked Data is a great source of techniques for data transparency and evolution. It describes linked data as transparent data with enough metadata that it can be linked from other data sources. It advocates using self describing data technologies like RDF and SPARQL.

The response to a Cloud Vision query is nested and complex. I think that schemas or a gradual type system, similar to TypeScript's, could give stability when working with semi structured, evolving data. Some of Google's Node API wrappers are already written in TypeScript, so they already have the type definitions.


Cloud ML Developments


There are a few minor cloud ML developments that deserve a mention.


Cloud Jupyter Notebooks


Amazon SageMaker, Microsoft Azure Notebooks and Google Cloud Datalab are Jupyter notebooks directly integrated into the cloud offerings.

I find Jupyter notebooks a natural place to combine code, data and presentation. One problem I have had when programming in the cloud is that there are so many places where you can put programming logic.

Model Deployment


Model deployment has traditionally received less attention than other parts of the ML pipeline. Azure and GCP have done a great job of optimizing model deployment into something that can be done in a few lines of code. It will train a model, save it in a bucket and spin up a serverless function that serves up the model as a REST call.
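
The handwritten version of that pattern is not much longer. Here is a sketch of a serverless prediction endpoint in Python, assuming a scikit-learn model saved with joblib in a bucket; the bucket, file and function names are hypothetical:

import joblib
from google.cloud import storage

# Download the trained model from the bucket once, when the function instance starts
storage.Client().bucket("my-models").blob("classifier.joblib").download_to_filename("/tmp/classifier.joblib")
model = joblib.load("/tmp/classifier.joblib")

def predict(request):
    # HTTP-triggered cloud function: JSON features in, prediction out
    features = request.get_json()["features"]
    return {"prediction": model.predict([features]).tolist()}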

Auto ML


ML tools that help find the best ML model are now available: AutoML on GCP, Amazon SageMaker and Automated Machine Learning on Azure. These will help you choose the best model and tune hyperparameters. This seems like a natural expansion of current ML techniques. It does involve using cloud specific libraries.

Transfer Learning


If you have an image categorization task, you could build a classifier from scratch by training a deep convolutional neural network. This can take a long time. With transfer learning you start with a trained CNN, for example an Inception or ResNet network. It should be trained on data that is similar to the data that you will be processing.
You train your classifier model by taking the second to last layer of the trained CNN as input. This is much less work than starting to build a 100-layer CNN from scratch. While transfer learning is not specific to the cloud, it is easy to do on the cloud where you have easy access to the pre-trained models.
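
A sketch of that recipe with Keras, assuming a pre-trained ResNet50 and a hypothetical 10 target classes:

from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

# Use the pre-trained convolutional layers as a fixed feature extractor
base = ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False

# Train only a small classification head on top of the extracted features
model = models.Sequential([
    base,
    layers.Dense(10, activation="softmax"),  # 10 target classes, hypothetical
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # with your own labeled images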


AWS vs Azure vs GCP


The cloud service market is projected to be worth $200 billion in 2019. There is healthy competition despite AWS's head start. Let me end with a very brief general comparison.

AWS was the first cloud service. It started in 2006 and has the biggest market share. It is very mature, offering both Linux and Windows VMs. They continue to innovate, but the number of services they offer is a little overwhelming.

Azure is a very slick experience. Microsoft has embraced open source, offering both Linux and Windows VMs. It has great integration with the Microsoft and Windows ecosystem: SQL Server, .net, C#, F#, Office 365 and SharePoint.

Google Cloud Platform is polished. It is easy to get started with BigQuery and do data exploration in it. GCP has a hosted Apache Airflow workflow system. GCP shines in machine learning, offering great ML, vision and NLP APIs.

Monday, February 4, 2019

VM, Lambda, Kubernetes & Terraform Best Practice

I work with these popular cloud technologies.
  • VMs, virtual machines like EC2 or GCE
  • Docker
  • Kubernetes
  • Terraform
  • Lambda / serverless functions
This post contains a short introduction to these technologies and my best practices for which cloud technology to use in different situations.


Virtualization Technologies


Here is a quick history and brief summary of difference.

A Highly Abbreviated Virtualization History

  • 2006 Amazon released EC2 a cloud VM you could spin up fast on demand.
  • 2013 Docker. Describes everything a VM needs in a small file, used to build a lightweight image.
  • 2014 Google open sourced Kubernetes, a system to run Docker images together.
  • 2015 Serverless functions / lambdas. Code independent of VM.
  • 2018 Firecracker. A microVM with 125ms start time used for AWS lambda and Fargate.

VM vs Containers vs Lambdas


Main differences
  • A VM has a full operating system that runs on a hypervisor.
  • Docker / Kubernetes runs as layers on top of a guest Linux OS.
  • A lambda is a serverless function running in a minimal VM with good sandbox separation.
There has been a development from heavyweight VMs to super lightweight VMs.

Recently AWS lambdas started running in a microVM called Firecracker that can spin up in around 125ms with only 5MB memory overhead.


Best Practices for Virtualization


When should you use full VMs, Docker, Kubernetes or lambdas?

When Should You Use Serverless / Lambdas

There are many names for the same concept: AWS Lambdas, Azure Functions and Cloud Functions on GCP.

Good use cases for serverless functions
  • RESTful call with no state.
  • RESTful call that only interacts with a database.
  • Database maintenance tasks.
  • Logging operations.
  • On Azure and GCP they are used to serve up machine learning models once they are trained.
Lambdas / serverless functions don't need to have a VM running and they scale from no use to massive use. They are very cheap and flexible.

Serverless functions have been marketed as the future of cloud computing and are clearly going to play a big role.
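
A stateless RESTful call really can be just one handler function. A minimal AWS Lambda sketch in Python, assuming an API gateway that passes a JSON body in the event:

import json

def handler(event, context):
    # Stateless: everything the function needs arrives in the event
    body = json.loads(event.get("body") or "{}")
    name = body.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "hello " + name}),
    }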

When Should You Use VMs or Kubernetes


Good use cases for VM or Kubernetes
  • Your program has to load a lot of data on startup.
  • Web application with a lot of functionality that is naturally grouped together.
  • Your program has to do a long sequence of operations.
You could use lambdas for a long sequence of operations. You would just push messages along from one lambda to the next. This is similar to the Erlang or Akka actor model. I find that this gives you little control and makes error handling hard.

When Should You Use Kubernetes


Good use cases for Kubernetes
  • If you are running a lot of daily tasks from some scheduling system, say Airflow or Luigi, it is faster to start them in Kubernetes than to spin up a new full VM instance for each.
  • You find a Docker image with a program that does what you need.
  • If you have several programs that need to run together: one program might need to be installed on Debian, another on Ubuntu and one on CentOS. Kubernetes handles this very well. You can actually deploy all 3 containers to the same Kubernetes pod, where they share a disk.

When Should You Use a Full VM

There is overhead in setting up Kubernetes. You also need to have a Kubernetes master node running, which costs money. So sometimes the simplest solution is to use a full VM.

Should You Run Docker Inside a VM?

The advantage of Docker is that you package up the Docker image and you can test it locally running in the same way as it will run on the VM.

The disadvantages are that you still have the extra steps of creating the Dockerfile and building and deploying the Docker image to Docker Hub or some other repository. You have to install Docker on your VM. There can be some performance hit from an extra level of virtualization.

I use Docker on my laptop and on Kubernetes, but I usually do not use Docker in a full VM.


Terraform


Terraform is a new tool for infrastructure as code, released by HashiCorp in 2014. It is a small functional programming language focused on configuration.

In your Terraform program you define the state you want to put your cloud system in. You run these commands from the command line in the directory where you have your program:

terraform init
terraform plan
terraform apply

This will start a VM or create your infrastructure for you, and Terraform stores the state of your system in what is called a Terraform state file. This state file can be stored locally or shared in a cloud bucket.

When you want to make changes to your cloud infrastructure, you change your Terraform program and run another round of:

terraform plan
terraform apply

Terraform is declarative: it will compare the state of your system with the state you want it to be in and work out what changes it needs to make.

I have used Terraform a lot with AWS to spin up EC2 and EMR clusters, but also to create IAM roles, policies, VPNs and security groups.

The documentation is good but there is a steep learning curve for Terraform. I found a class Learn DevOps: Infrastructure Automation With Terraform that helped me.

Terraform Modules

Terraform has a concept called a module. It enables code reuse. It is an advanced topic, but I find it absolutely essential for writing maintainable code, especially if you have multiple environments, say dev, staging and prod.

Terraform Version Problem

A problem that I have experienced several times is that one team member accidentally updates Terraform to the newest version, and the next time somebody else runs an update script they get this message:

Terraform doesn't allow running any operations against a state
that was written by a future Terraform version. The state is
reporting it is written by Terraform '0.11.8'.

The good news is that the Terraform state file is written in JSON and is somewhat robust. So you can download the state file and change the version number back to the old version, and there is a good chance that it will work. Still, this is not the kind of error message that you want to see when you are doing a prod release.
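
A sketch of that manual fix in Python: take a backup, download the state file, dial the recorded version back to the one your team actually runs and upload it again. The field name is taken from the state files I have seen, so treat it as an assumption:

import json

with open("terraform.tfstate") as f:
    state = json.load(f)

# Set the recorded Terraform version back to the version the team runs (assumed field name)
state["terraform_version"] = "0.11.7"

with open("terraform.tfstate", "w") as f:
    json.dump(state, f, indent=2)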

Issues with Terraform

Terraform is a nice declarative framework, but the Terraform state file is stored either locally or in a cloud bucket.
  • A local state file makes it hard for a team to collaborate: each member ends up with a different state file.
  • Cloud storage lets you collaborate, but you are still dealing with shared mutable state that is susceptible to the version problem mentioned above.
I used Terraform to create a lambda function with IAM roles, policies and code. When I tried to update the lambda to a newer version, Terraform did not detect the changed program files, so I had to destroy everything and recreate it.

Using Terraform is often safer than making manual changes in a web console, but I would hesitate to update a database using Terraform.

There is an enterprise version of Terraform that might alleviate some of these problems, but I have only used the open source version.


Kubernetes


Kubernetes is a container orchestration framework. It was open sourced by Google in 2014 and works very well on GCP, Google Cloud Platform. Many cloud providers have Kubernetes offerings, e.g. AWS, Azure and DigitalOcean.

Kubernetes uses declarative cloud definitions. In a YAML file you define how many instances of a web server you want to run. If a web server crashes, Kubernetes will start a new one without intervention.

Kubernetes was one of the most actively developed open source frameworks in 2018. It feels mature.
The state is part of the Kubernetes system, not a file living locally or in an S3 bucket.

Issues with Kubernetes

It is quite complicated to set Kubernetes up in a private cloud. You need highly dedicated DevOps staff to do this. A lot of things can and do go bad. I have many memories of the DNS server going missing and the block storage / hard disks disappearing after running programs for hours.


Terraform or Kubernetes


When should you use Terraform and when should you use Kubernetes?

They are both declarative tools that you can use to start programs and define things like security groups in your cloud environment.

Terraform is a good option if you want to define your infrastructure and spin up VMs, EMR clusters etc. It is not AWS specific but works very well with AWS.

Kubernetes is a good option if you choose to use containers and you are working on a cloud that has good Kubernetes support. AWS has a competing technology, Fargate, and AWS's integration with Kubernetes is less mature.

Tuesday, March 22, 2016

Static vs. Dynamic Functional Languages

You can divide functional programming languages into 2 groups: Static and dynamic.

Dynamic functional languages: Clojure, Common Lisp, Racket and Scheme. They have few types often only known at run time.

Statically typed functional languages: F#, ML, Haskell, Idris and Scala. They have advanced types that are known at compile time.

Functional programming languages have similarities but are very different from one another. Some are quite hard to learn. What should you pick?

Slogans

Here are two extreme positions in functional programming reduced to slogans:

Lisp: Everything is Data

Lisp has a great story: Everything is data.

Lisp is homoiconic. There is one datatype: The S-expression in Lisp or an EDN in Clojure. This encodes:
  • Records
  • List
  • Map
  • Stream
  • Programs
Everything is unified and first class. This makes Lisp very elastic and adaptable to handle open ended problems like AI. It also leaves a lot of room for mistakes when dealing with complex data structures since everything sticks to everything.

Haskell: Everything is a Computation

Computation sounds like an equally strong unifying foundation. This is a strong counter argument to Lisp.

Haskell turns the world into mathematics by giving strong guarantees and the ability to reason about programs. It is fast, elegant and remarkably safe.

Most of the world is messy so programming in Haskell is both an art and a science.

A Type System is a Must For Production Code

In my experience:

A complex production server application demands a static type system for stability

The type system is doing at least half of my work when I work alone and prevents total anarchy when working in teams.

History of Static and Dynamic Type System

Like many other programmers, I have gone back and forth between preferring static and dynamic languages several times.

C++ and Java

I started using and loving the sophisticated statically typed object oriented languages: C++ and Java.
Why would anybody want to program in Basic?

Perl, Python and Ruby

At some point I had to make a small script for text processing. I realized that dynamically typed languages Perl, Python and Ruby are much simpler and faster to work with.
They borrow a lot of ideas from functional languages and save you a lot of boilerplate. Programming became fun again.
I never wanted to go back.

F#, Scala and Haskell

Then came the rise of F#, Scala and Haskell.
I thought that you got the best of both worlds:
  • There are few visible types due to type inference
  • They look like dynamic languages
  • Still you get strong safety from the invisible type system
  • They are fast
My stability concerns for production applications ruled out dynamic languages. The future belongs to F#, Scala and Haskell.

Living with Static Types

For the last 4 years I have been happy programming in Scala. It has really improved my productivity.
I mainly deal with stable data types. Each data structure gets an immutable case class, and they flow beautifully; it even works well in a concurrent system.

I am a little concerned about the amount of black magic going on at the type level in Scala and Haskell.

Web, Scripting and Data Exploration

Some fields continue to be dominated by dynamic languages:
  • Data exploration
  • Data science
  • Scripting
  • Web front end work in JavaScript, PHP and Ruby
I do data mining in Scala and can quickly add a new data source with unit tests to a stable functional reactive ingestion pipeline, but during a hackathon I had to explore a lot of different data sources and my normal startup time was too slow for the deadline.

Dynamic languages have an edge for small systems.

Problems Using Scala for NLP

Idiomatic Scala has generally been great for NLP.
But then I had to extract all the hidden and visible information on an HTML page and had to parse the DOM tree for everything: elements, attributes, code and JSON data.
The DOM tree is similar to an S-expression.
The best idiomatic Scala representation I could find was Play JSON. The DOM tree and Play JSON are not that similar, and processing JSON in dynamic languages is more natural than in strongly typed languages.

Dynamic languages have an edge for some complex systems.

Lisp Revisited

I used Lisp in school. It was the cool AI language and my first functional language. I loved it, it blew my mind but I had a very shallow understanding.

Impressions from revisiting Lisp after using statically typed functional languages:
  • Lisp is small and elegant
  • Easy learning curve
  • Great at traversing dynamic data
  • Well suited for exploration
  • A lot of the principles of statically typed functional programming translate directly
  • I still think in Scala-like types, making my Clojure code better organized
  • Macros feel natural, unlike in C++ and Scala
  • Lisp is really fluid, combining in so many crazy ways
  • You lose a lot of safety

Going from Haskell to Clojure left me with the feeling I had when moving from C++ to Python. You get a lot of value for less effort.

Rise of the Gradual Type Systems

There has been a slow movement towards gradual types. Here are a few places where they have popped up:

Ambrose Bonnaire-Sergeant on gradual typing in Clojure

The type systems in Typed Clojure and Typed Racket are pretty different from those in Scala and Haskell. They are generally weaker, but Typed Clojure and Typed Racket have union types that are only now being investigated in Scala's experimental new type system, Dotty.
These advances in gradual types make it possible to harden Lisp code to improve stability.

Data or Calculation

I was puzzled by the Lisp and Haskell slogans:
  • Everything is data
  • Everything is a calculation
It was a paradox. Which is a better foundation for computer science?
I could not easily dismiss either. For now I have accepted that we are stuck with both.

For a long time I suffered from the misunderstanding that F#, Scala and Haskell are like dynamic languages, with the addition of speed and safety. But they are fundamentally different.

Tuesday, September 29, 2015

Practical Scala, Haskell and Category Theory

Functional programming has moved from academia to industry in the last few years. It is theoretical with a steep learning curve. I have worked with strongly typed functional programming for 4 years. I took the normal progression, first Scala then Haskell and ended with category theory.

What practical results does functional programming give me?

I typically do data mining, NLP and back end programming. How does functional programming help me with NLP, AI and math?

    Scala

    Scala is a complex language that can take quite some time to learn. For a couple of years I was unsure if it really improved my productivity compared to Java or Python.

    After 2 years my productivity in Scala went up. I find that Scala is an excellent choice for creating data mining pipelines because it is:
    • Fast
    • Stable
    • Has a lot of quality libraries
    • Has a very advanced type system
    • Good DSL for Hadoop (Scalding)

    Natural Language Processing in Scala

    Before Scala I did NLP in Python. I used NLTK, the Natural Language Toolkit, for 3 years.

    NLTK vs. ScalaNLP


    NLTK

    • Easy to learn and very flexible
    • Gives you a lot of functionality out of the box
    • Very adaptable, handles a lot of different structured file formats

    What I did not like about NLTK was:
    • It had a very inefficient representation of text features as a dictionary
    • The file format readers were not producing exactly matching structures, and this did not get caught by the type system
    • You have to jump between Python, NumPy and C or Fortran for low level work

    ScalaNLP

    ScalaNLP merged different Scala numeric and NLP libraries. It is a very active parent project of Breeze and Epic.

    ScalaNLP Breeze

    Breeze is a full featured, fast numeric library that uses the type system to great effect.
    • Linear algebra
    • Probability Distribution
    • Regression algorithms
    • You can drop down to the bottom level without having to program in C or Fortran

    ScalaNLP Epic

    Epic is the natural language processing part of ScalaNLP. It has become a competitive NLP library with many algorithms for several human languages:
    • Reader for text corpora
    • Tokenizer
    • Sentence splitter
    • Part-of-speech tagger
    • Named entity recognition
    • Statistical parser


    Video lecture by David Hall, the Epic lead

    Machine Learning in Scala

    The most active open source Scala machine learning library is MLlib, which is part of the Spark project.
    Spark now has data frames like R and Pandas.
    It is easy to set up machine learning pipelines, do cross validation and optimization of hyper parameters.

    I did text classification and set it up in Spark MLlib in only 100 lines of code. The result had satisfactory accuracy.

    AI Search Problem in Scala vs. in Lisp

    I loved Lisp when I learned it at the university. You could do all these cool Artificial Intelligence tree search problems. For many years I suffered from Lisp envy.

    Tree search works a little differently in Scala, let me illustrate by 2 examples.

    Example 1: Simple Tree Search for Bird Flu

    You have an input HTML page that is parsed into a DOM tree. Look for the words bird and flu in a paragraph that is not part of the advertisement section.
    I can visualize what a search tree for this would look like.

    Example 2: Realistic Bird Flu Medication Search

    The problems I deal with at work are often more complex:
    Given a list of medical websites, search for HTML pages with bird flu and doctor recommendations for medications to take. Then do a secondary web search to see if the doctors are credible.

    Parts of Algorithm for Example 2

    This is a composite search problem:
    • Search HTML pages for the words bird and flu close to each other in DOM structure
    • Search individual match to ensure this is not in advertisement section
    • Search for Dr names
    • Find what Dr name candidates could be matched up with the section about bird flu
    • Web search for Dr to determine popularity and credentials
    Visualizing this as a tree search is hard for me.

    Lazy Streams to the Rescue

    Implementing solutions to the Example 2 bird flu medication problem takes:
    • Feature extractors
    • Machine learning on top of that
    • Correlation of a disease and a doctor
    This lends itself well to using Scala's lazy streams. Scala makes it easy to use the lazy streams and the type system gives a lot of support, especially when plugging together various streams.

    Outline of Lazy Streams Algorithm for Example 2

    1. Stream of all web pages
    2. Stream of tokenized trees
    3. Stream of potential text matches, e.g. avian influenza, H5N1
    4. Filter Stream 3 if it is an advertisement part of the DOM tree, (no Dr Mom)
    5. Stream of potential Dr text matches from Stream 2
    6. Stream of good Dr names. Detected with machine learning
    7. Merge Stream 3 and Stream 6 to get bird flu and doctor name combination
    8. Web search stream for the doctor names from Stream 7 for ranking of result

    AI Search Problem in Lisp

    Tree search is Lisp's natural domain. Lisp could certainly handle Example 2, the more complex bird flu medication search, even using a similar lazy stream algorithm.

    Additionally, Lisp has the ability to do very advanced meta programming:
    Rules that create other rules or work on multiple levels. Things I do not know how to do in Scala.

    Lisp gives you a lot of power to handle open ended problems and it is great for knowledge representation. When you try to do the same in Scala you end up either writing Lisp or Prolog style code or using RDF or graph databases.

    Some Scala Technical Details

    Here are a few observations on working with Scala.

    Scala's Low Rent Monads

    Monads are a general way to compose functionality. They are a very important organizing principle in Scala. Except it is not really monads, it is just syntactic sugar.

    You give us a map and a flatMap function and we don't ask any questions.

    Due to the organization of the standard library and subtyping you can even combine an Option and a List, which should strictly not be possible. Still this gives you a lot of power.
    I do use Scala monads with no shame.

    Akka and Concurrency

    Scala's monads make it convenient to work with two concurrency constructs: Futures and Promises.

    Akka is a library implementing an Erlang style actor model in Scala.
    I have used Akka for years and it is a good framework to organize a lot of concurrent computation that requires communication.

    The type system does not help you with the creation of parent actors so you are not sure that they exist. This makes it hard to write unit tests for actors.

    Akka is good but the whole Erlang actor idea is rather low level.

    Scalaz and Cake Patterns

    Scalaz is a very impressive library that implements big parts of Haskell’s standard library in Scala.
    Scalaz’s monad typeclass is invariant, which fixes the violations allowed in the standard library.

    The Cake Pattern allows for recursive modules, which makes dependency injection easier. This is used in the Scala compiler.

    Both of these got me into trouble as a beginner Scala programmer. I would not recommend them for beginners.

    How do you determine if you should use this heavy artillery?
    Once you feel that you are spending a lot of time repeating code due to insufficient abstraction you can consider it. Otherwise:

    Keep It Simple.

    Dependent Types and Category Theory in Scala

    There are many new theoretical developments in Scala:
    • Dotty - a new compiler built on DOT a new type-theoretic foundation of Scala
    • Cats library - a simplified version of Scalaz implementing concepts from category theory
    • Shapeless library for dependent types. I am using this in my production code since Shapeless is used in Slick and Parboiled2


    Haskell

    Haskell is a research language from 1990. In 2008 its popularity started to rise. You can now find real jobs working in Haskell. Most publicized is that Facebook wrote their spam filter in Haskell.



    Why is Haskell so Hard to Learn?

    It took me around 2 years to learn to program in Haskell, which is exceptionally long. I have spoken to other people at Haskell meetups who have told me the same.

    Mathematical Precision

    Python effectively uses the Pareto principle: 20% of the features give you 80% of the functionality; Python has very few structures in the core language and reuses them.

    Haskell uses many more constructs. E.g. exception handling can be done in many different ways, each with small advantages. You can choose the optimal exception monad transformer that has the fewest dependencies for your problem.

    Cabal Hell and Stack

    Haskell is a fast developing language with a very deep stack of interdependent libraries.
    When I started programming in it, it was hard to set up even a simple project since you could not get the libraries to compile with versions that were compatible with each other.
    The build system is called Cabal, and this phenomenon is called Cabal Hell.
    If you have been reading the mailing lists there are a lot of references to Cabal Hell.

    The Haskell consulting company FP Complete first released Stackage, a curated list of libraries that work together. In 2015 they went further and released Stack, a system that installs different versions of Haskell to work with Stackage versions.

    This has really made Haskell development easier.

    Dependently Typed Constructs in Haskell

    Dependently typed languages are the next step after Haskell. In normal languages the type system and the objects of the language are different systems. In dependently typed languages the objects and the types inhabit the same space. This gives more safety and greater flexibility but also makes it harder to program in.
    The type checker has to be replaced with a theorem prover.

    You have to prove that the program is correct, and the proofs are part of the program and first order constructs.

    Haskell has a lot of activities towards emulating dependently typed languages.
    The next version of the Haskell compiler GHC 8 is making a big push for more uniform handling of types and kinds.

    Practical Haskell

    Haskell is a pioneering language and still introducing new ideas. It has clearly shown that it is production ready by being able to handle Facebook's spam filter.

    Aesthetically I prefer terse programming and like to use Haskell for non work related programming.

    There is a great Haskell community in New York City. Haskell feels like a subculture, whereas Scala has now become the establishment. That said, I do not feel Haskell envy when I program in Scala on a daily basis.

    Learning Haskell is a little like running a marathon. You get in good mental shape.


    Category Theory

    Category theory is often called Abstract Nonsense both by practitioners and detractors.
    It is a very abstract field of mathematics and its utility is pretty controversial.

    It abstracts internal properties of objects away and instead looks at relations between objects.
    Categories require very little structure and so there are categories everywhere.  Many mathematical objects can be turned into categories in many different ways. This high level of abstraction makes it hard to learn.

    There is a category Hask of Haskell types and functions.


    Steve Awodey lecture series on category theory

    Vector Spaces Described With a Few String Diagrams

    To give a glimpse of the power of category theory: In this video lecture John Baez shows how you can express the axioms of finite dimensional vector spaces with a few string diagrams.

    Video lecture by John Baez

    With 2 more simple operations you can extend it to control theory.

    Quest For a Solid Foundation of Mathematics

    At the university I embarked on a long quest for a solid scientific foundation. First I studied chemistry and physics. Quantum physics drove me to studying mathematics for more clarity. For higher clarity and a solid foundation I studied mathematical logic.
    I did not find clarity in mathematical logic. Instead I found:

    The Dirty Secret About the Foundation of Mathematics

    My next stop was the normal foundation for modern mathematics: ZFC, Zermelo–Fraenkel set theory with the axiom of choice.

    This was even less intuitive than logic. There were more non intuitive axioms. This was like learning computer science from a reference on x86 assembly: a big random mess. There was also an uncertain connection between the axioms of logic and the axioms of set theory.

    ZFC and first order logic makes 2 strong assumptions:
    1. Law of Excluded Middle
    2. Axiom of Choice
    The Law of Excluded Middle says that every mathematical sentence is either true or false. This is a very strong assumption that was not motivated at all, and it certainly does not extend to sentences outside mathematics.

    Constructive Mathematics / Intuitionistic Logic

    There was actually a debate about what should be a foundation for mathematics at the beginning of the 20th century.
    A competing foundation of mathematics was Brouwer's constructive mathematics. In order to prove something about a mathematical object you need to be able to construct it, and via the Curry-Howard correspondence this is equivalent to writing a program that constructs a value of a particular type.

    This was barely mentioned at the university. I had one professor who once briefly said that there was this other thing called intuitionistic logic, but it was so much harder to prove things in it, why should we bother.

    Recently constructive mathematics has had a revival with Homotopy Type Theory. HoTT is based on category theory, type theory, homotopy theory and intuitionistic logic.
    This holds a lot of promise and is another reason why category theory is practical for me.


    Robert Harper's lectures on type theory
    end with an introduction to HoTT

    Future of Intelligent Software

    There are roughly 2 main approaches to artificial intelligence
    • Top down or symbolic techniques e.g. logic or Lisp
    • Bottom up or machine learning techniques e.g. neural networks
    The symbolic approach was favored for a long time but did not deliver on its promise. Now machine learning is everywhere and has created many advances in modern software.

    To me it seems obvious that more intelligent software needs both. But combining them has been an elusive goal since they are very different by nature.

    Databases created a revolution in data management. They reduce data retrieval to simplified first order logic, you just write a logic expression for what you want.

    Dependently typed languages are the level of abstraction where programs and logic merge.
    I think that intelligent software of the future will be a combination of dependently typed languages and machine learning.
    A promising approach is the discovery of Bayesian network models from data. This finds causality in a form that can be combined with logical reasoning.

    Conclusion

    I invested a lot of time in statically typed functional languages and was not sure how much this would help me in my daily work. It helped a lot, especially with reuse and stability.

    Scala has made it substantially easier to create production quality software.

    MLlib and ScalaNLP are 2 popular open source projects. They show me that Scala is a good environment for NLP and machine learning.

    I am only starting to see an outline of category theory, dependently typed languages and HoTT. It looks like computer science and mathematics are far from done; we still have some big changes ahead of us.

    Monday, September 23, 2013

    Big Data: What Worked?

    "Big data" created an explosion of new technologies and hype: NoSQL, Hadoop, cloud computing, highly parallel systems and analytics.

    I have worked with big data technologies for several years. It has been a steep learning curve, but lately I had more success stories.

    This post is about the big data technologies I like and continue to use. Big data is a big topic. These are some highlights from my experience.

    I will relate big data technologies to modern web architecture with predictive analytics and raise the question:

    What big data technologies should I use for my web startup?

    Classic Three Tier Architecture

    For a long time software development was dominated by the three tier architecture / client server architecture. It is well described and conceptually simple:
    • Client
    • Server for business logic
    • Database
    It is straightforward to figure out what computations should go where.

    Modern Web Architecture With Analytics

    The modern web architecture is not nearly as well established. It is more like an 8 tiered architecture with the following components:
    • Web client
    • Caching
    • Stateless web server
    • Real-time services
    • Database
    • Hadoop for log file processing
    • Text search
    • Ad hoc analytics system
    I was hoping that I could do something closer to the 3-tier architecture, but the components have very different features. Kicking off a Hadoop job from a web request could adversely affect your time to first byte.

    A problem with the modern web architecture is that a given calculation can be done in many of those components.

    Architecture for Predictive Analytics

    It is not at all clear what component predictive analytics should be done in.

    First you need to collect user metrics. In which components can you do this?
    • Web servers, store the metric in Redis / caching
    • Web servers, store the metric in the database
    • Real-time services aggregates the user metric
    • Hadoop run on the log files

    User metrics are needed by predictive analytics and machine learning. Here are some scenarios:
    • If you are doing simple popularity based predictive analytics this can be done in the web server or a real-time service.
    • If you use a Bayesian bandit algorithm you will need to use a real-time service for that (see the sketch after this list).
    • If you recommend based on user similarity or item similarity you will need to use Hadoop.
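
    For the Bayesian bandit case, the core of Thompson sampling is only a few lines. A hedged sketch in Python, with hypothetical click and impression counters kept by the real-time service:

    import random

    # Thompson sampling: sample a Beta(clicks + 1, misses + 1) score per ad and pick the best
    def pick_ad(stats):
        def sample(ad):
            clicks, impressions = stats[ad]
            return random.betavariate(clicks + 1, impressions - clicks + 1)
        return max(stats, key=sample)

    stats = {"ad_a": (12, 400), "ad_b": (9, 150)}  # hypothetical counters
    print(pick_ad(stats))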

    Hadoop

    Hadoop is a very complex piece of software for handling very large amounts of data that cannot be handled by conventional software because the data is too big to fit on one computer.

    I compared different Hadoop libraries in my post: Hive, Pig, Scalding, Scoobi, Scrunch and Spark.

    Most developers that have used Hadoop complain about it. I am no exception. I still have problems with Hadoop jobs failing due to errors that are hard to diagnose. Generally I have been a lot happier with Hadoop lately. I am only using Hadoop for big custom extractions or calculations from log files stored in HDFS. I do my Hadoop work in Scalding or HIVE.

    The Hadoop library Mahout can calculate user recommendations based on user similarity or item similarity.

    Scalding

    Hadoop code in Scalding looks a lot like normal Scala code. The scripts I am writing are often just 10 lines of code and look a lot like my other Scala code. The catch is that you need to be able to write idiomatic functional Scala code.

    HIVE

    HIVE makes it easy to extract and combine data from HDFS. You just write SQL after some setup of a directory with table structure in HDFS.

    Real-time Services

    Libraries like Akka, Finagle and Storm are good for having long running stateful computations.
    It is hard to write correct highly parallel code that scales to multiple machines using normal multithreaded programming. For more details see my blog post: Akka vs. Finagle vs. Storm.

    Akka and Spray

    Akka is a simple actor model taken from the Erlang language. In Akka you have a lot of very lightweight actors, they can share a thread pool. They do not block on shared state but communicate by sending immutable messages.

    One reason that Akka is a good fit for real-time services is that you can do varying degrees of loose coupling and all services can talk with each other.

    It is hard to change from traditional multithreaded programming to using the actor model. There are just a lot of new actor idioms and design patterns that you have to learn. At first the actor model seems like working with a sack of fleas. You have much less control over the flow due to the distributed computation.

    Spray makes it easy to put a web or RESTful interface to your service. This makes it easy to connect your service with the rest of the world. Spray also has the best Scala serialization system I have found.

    Akka is well suited for: E-commerce, high frequency trading in finance, online advertising and simulations.

    Akka in Online Advertising

    Millions of users are interacting with fast changing ad campaigns. You could have actors for:
    • Each available user
    • Each ad campaign
    • Each ad
    • Clustering algorithms
    • Each cluster of users
    • Each cluster of ads
    Each actor is developing in time and can notify and query all other actors.

    NoSQL

    There are a lot of options, with no query standard:
    • Cassandra
    • CouchDB
    • HBase
    • Memcached
    • MongoDB
    • Redis
    • SOLR
    I will describe my initial excitement about NoSQL, the comeback of SQL databases and my current view on where to use NoSQL and where to use SQL.

    MongoDB

    MongoDB was my first NoSQL technology. I used it to store structured medical documents.

    Creating a normalized SQL database that represents a structured data format is a sizable task, and you easily end up with 20 tables. It is hard to insert a structured document into the database in the right sequence so that foreign key constraints are satisfied. LINQ to SQL helped with this but it was slow.

    I was amazed by MongoDB's simplicity:
    • It was trivial to install
    • It could insert 1 million documents very fast
    • I could use the same Python NLP tools for many different types of documents

    I felt that SQL databases were so 20th century.

    After some use I realized that interacting with MongoDB was not as easy from Scala. I tried different libraries, Subset and Casbah.
    I also realized that it is a lot harder to query data from MongoDB than from a SQL database, both in syntax and expressiveness.
    Recently SQL databases have added JSON as a data type, taking away some of MongoDB's advantage.

    Today I use SQL databases for curated data. But MongoDB for ad hoc structured document data.
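
    For comparison, the whole MongoDB round trip for a structured document is a few lines with the pymongo driver; the server address and fields are made up:

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["medical"]

    # Insert a nested document as-is, no table design or foreign keys needed
    db.reports.insert_one({
        "patient_id": 42,
        "diagnoses": [{"code": "J10", "description": "influenza"}],
    })

    # Query into the nested structure
    print(db.reports.find_one({"diagnoses.code": "J10"}))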

    Redis

    Redis is an advanced key value store that mainly lives in memory but backs up to disk. Redis is a good fit for caching. It has some specialized operations:
    • Simple to age out data
    • Simulates pub sub
    • Atomic update increments
    • Atomic list append
    • Set operations

    Redis also supports sharding well: in the driver you just give a list of Redis servers and it will send the data to the right server. Redistributing data after adding more sharded servers to Redis is cumbersome.

    I first thought that Redis had an odd array of features but it fits the niche of real-time caching.
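
    A few of those specialized operations through the redis-py client; the key names are just examples:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    r.setex("session:42", 3600, "cached page")  # ages out after an hour
    r.incr("page:home:views")                   # atomic update increment
    r.rpush("recent:searches", "bird flu")      # atomic list append
    r.sadd("users:online", "alice", "bob")      # set operations
    print(r.sinter("users:online", "users:premium"))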

    SOLR

    SOLR is the most used enterprise text search technology. It is built on top of Lucene.
    It can store and search documents with many fields using an advanced query language.
    It has an ecosystem of plugins doing a lot of the things that you would want. It is also very useful for natural language processing. You can even use SOLR as a presentation system for your NLP algorithms.

    To Cloud or not to Cloud

    A few years back I thought that I would soon be doing all my work using cloud computing services like Amazon's AWS. This did not happen, but virtualization did. When I request a new server the OPS team usually spins up a virtual machine.
    A problem with cloud services is that storage is expensive. Especially Hadoop sized storage.

    If I were in a startup I would probably consider the cloud.

    Big and Simple

    My first rule for software engineering is: Keep it simple.

    This is particularly important in big data since size creates inherent complexity.
    I made the mistake of being too ambitious too early and thinking out too many scenarios.

    Startup Software Stack

    Back to the question:

    What big data technologies should I use for my web startup?

    A common startup technology stack is:
    Ruby on Rails for your web server and Python for your analytics and hope that a lot of beefy Amazon EC2 servers will scale your application when your product takes off.
    It is fast to get started and the cloud will save you. What could possibly go wrong?

    The big data approach I am describing here is more stable and scalable, but before you learn all these technologies you might run out of money.

    My answer is: It depends on how much data and how much money you have.

    Big Data Not Just Hype

    "Big data" is misused and hyped. Still there is a real problem, we are generating an astounding amount of data and sometimes you have to work with it. You need new technologies to wrangle this data.

    Whenever I see a reference to Hadoop in a library I get very uneasy. These complex big data technologies are often used where much simpler technologies would have sufficed. Make sure you really need them before you start. This could be the difference between your project succeeding or failing.

    It has been humbling to learn these technologies but after much despair I now enjoy working with them and find them essential for those truly big problems.

    Friday, May 17, 2013

    LISP Prolog and Evolution

    I just saw David Nolen give a talk at a LispNYC Meetup called:


    LISP is Too Powerful

    It was a provocative and humorous talk. David showed all the powerful features of LISP and said that the reason why LISP is not more widely adopted is that it is too powerful. Everybody laughed but it made me think. LISP was decades ahead of other languages; why did it not become a mainstream language?

    David Nolen is a contributor to Clojure and ClojureScript.
    He is the creator of Core Logic, a port of miniKanren. Core Logic is a Prolog-like system for doing logic programming.

    When I went to university my two favorite languages were LISP and Prolog. There was a big debate whether LISP or Prolog would win dominance. LISP and Prolog were miles ahead of everything else back then. To my surprise they were both surpassed by imperative and object oriented languages like Visual Basic, C, C++ and Java.

    What happened? What went wrong for LISP?

    Prolog

    Prolog is a declarative or logic language created in 1972.

    It works a little like SQL: you give it some facts and ask a question, and, without specifying how, Prolog will find the results for you. It can express a lot of things that you cannot express in SQL.

    A relational database that can run SQL is a complicated program, but Prolog is very simple and works using 2 simple principles:

    • Unification
    • Backtracking

    The Japanese Fifth Generation Program was built in Prolog. That was a big deal and scared many people in the West in the 1980s.

    LISP

    LISP was created by John McCarthy in 1958, only one year after Fortran, the first computer language. It introduced so many brilliant ideas:

    • Garbage collection
    • Functional programming
    • Homoiconicity: code is just a form of data
    • REPL
    • Minimal syntax, you program in abstract syntax trees

    It took other languages decades to catch up, partly by borrowing ideas from LISP.

    Causes for LISP Losing Ground

    I discussed this with friends. Their views varied, but here are some of the explanations that came up:

    • Better marketing budget for other languages
    • Start of the AI winter
    • DARPA stopped funding LISP projects in the 1990s
    • LISP was too big and too complicated and Scheme was too small
    • Too many factions in the LISP world
    • LISP programmers are too elitist
    • LISP on early computers was too slow
    • An evolutionary accident
    • Lowest common denominator wins

    LISP vs. Haskell

    I felt it was a horrible loss that the great ideas of LISP and Prolog were lost. Recently I realized:

    Haskell programs use many of the same functional programming techniques as LISP programs. If you ignore the parentheses they are similar.

    On top of the program Haskell has a very powerful type system. It is based on unification of types and backtracking, so Haskell's type system is basically Prolog.

    You can argue that Haskell is the illegitimate child of LISP and Prolog.

    Similarity between Haskell and LISP

    Haskell and LISP both have minimal syntax compared to C++, C# and Java.
    LISP is more minimal; you work directly in the AST.
    In Haskell you write small snippets of simple code that Haskell will combine.

    A few Haskell and LISP differences

    • LISP is homoiconic, Haskell is not
    • LISP has a very advanced object system CLOS
    • Haskell uses monadic computations

    Evolution and the Selfish Gene

    In the book The Selfish Gene, evolutionary biologist Richard Dawkins makes an argument that genes are much more fundamental than humans. Humans have a short lifespan while genes live for 10,000s of years. Humans are vessels for powerful genes to propagate themselves, and combine with other powerful genes.

    If you apply his ideas to computer science, languages, like humans, have a relatively short lifespan; ideas, on the other hand, live on and combine freely. LISP introduced more great ideas than any other language.

    Open source software has sped up evolution in computer languages. Now languages can inherit from other languages at a much faster rate. A new language comes along and people start porting libraries.

    John McCarthy's legacy is not LISP but: Garbage collection, functional programming, homoiconicity, REPL and programming in AST.

    The Sudden Rise of Clojure

    A few years back I had finally written LISP off as dead. Then out of nowhere Rich Hickey single-handedly wrote Clojure.

    Features of Clojure

    • Run on the JVM
    • Run under JavaScript
    • Used in industry
    • Strong thriving community
    • Immutable data structures
    • Lock free concurrency

    Clojure proves that it does not take a Google, Microsoft or Oracle to create a language. It just takes a good programmer with a good idea.

    Typed LISP

    I have done a lot of work in both strongly typed and dynamic languages.

    Dynamic languages give you speed of development and are better suited for loosely structured data.
    After working with Scala and Haskell I realized that you can have a less obtrusive type system. This gives stability for large applications.

    There is no reason why you cannot combine strong types or optional types with LISP, in fact, there are already LISP dialects out there that did this. Let me briefly mention a few typed LISPs that I find interesting:

    Typed Racket and Typed Clojure do not have type systems as powerful as Haskell's. None of these languages have the momentum of Haskell, but Clojure showed us how fast a language can grow.

    LISP can learn a lesson from all the languages that borrowed ideas from LISP.
    It is nature's way.