Sunday, October 20, 2019

F# vs Scala

F# and Scala are both hybrid functional object-oriented languages created for popular virtual machines.

  • F# for CLR / .NET
  • Scala for JVM / Java


F# and Scala are now in more direct competition after Microsoft open sourced F# and .NET Core. They have many similarities but a distinctly different feel. It was hard for me to put my finger on the difference. This blog post investigates their design decisions and use cases, starting with a brief overview of F# and Scala.


F# (F Sharp)




F# is a mature, open source, cross-platform, functional-first programming language. It was created by Don Syme in 2005 as a port of the OCaml language to .NET.
  • Core: Strict, strong, inferred, hybrid
  • Popularity: Some use in industry and backed by Microsoft
  • Complexity: Easy to learn, but part of a big ecosystem
  • Maturity: It is 14 years old and part of .NET, so quite mature
  • Tooling: Very good, both .NET based and F# specialized
  • Cross platform: Mono, .NET Core and JavaScript
  • IDE: Visual Studio, VS Code


Scala



Scala combines object-oriented and functional programming in one concise, high-level language. It was created by Martin Odersky in 2004.

  • Core: Strict and lazy, nominal and structural, hybrid, implicits for IoC
  • Popularity: Very popular. No. 13 on the RedMonk June 2019 list. Spark is written in Scala
  • Complexity: It is a quite complex language, but it is easy to get started with
  • Maturity: Very stable. Runs on the JVM, well integrated with the JVM ecosystem
  • Tooling: Great build tool and package managers
  • Cross platform: JVM and JS. Also early work on native / LLVM version
  • IDE: IntelliJ, VS Code, Eclipse


Microsoft Open Source Bet


Choosing between F# and Scala used to be pretty easy. If you were doing Windows development you would use F#; if you were on an open source stack you would choose Scala.

In 2012 Microsoft open sourced F# and started porting it to Mono, a cross-platform implementation of the CLR. That was cool, but not something I would run production code on.

However, in September 2019 Microsoft released .NET Core 3, an open source, cross-platform version of a big part of their SDK, along with a first release of Apache Spark for .NET.

After this, .NET and F# are serious contenders for a place in an open source stack.


Relation to Java and C#


You might think that F# is just the .NET version of Scala and moving from Java to Scala is similar to moving from C# to F#. This is not the case.

Java was a small and simple language with a lot of innovations but some annoying problems. A big part of Scala's appeal was that it was a better Java with more features.

C# was also made to be a better Java. It fixed some of the flaws in the original Java by adding, for example, auto-boxing of integers, generics and lambdas. C# is a great but also very big language. F# is more like a leaner version of C#, with fewer features.


Collection Libraries


Scala has put a big effort into providing a full set of immutable and mutable Scala collections and into making the different Java collections look like native Scala collections.

When I tried Scala in 2007 it had generics and could use Java generics, but you were either programming against Java collections or against Scala collections. It took a long time to get this right, and the cost was that the standard library code became very complicated. This is not really a problem for the user, who won't see it.
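
Here is a minimal sketch of that interop in Scala 2.13 (earlier versions use scala.collection.JavaConverters instead of scala.jdk.CollectionConverters); the values are just illustrative:

import scala.jdk.CollectionConverters._

object CollectionsDemo extends App {
  // A native, immutable Scala collection
  val nums = List(1, 2, 3)
  println(nums.map(_ * 2))                       // List(2, 4, 6)

  // A Java collection, as it might be returned from a Java library
  val javaList: java.util.List[String] = java.util.Arrays.asList("a", "b", "c")

  // asScala wraps it so the normal Scala collection operations work on it
  println(javaList.asScala.map(_.toUpperCase))   // e.g. ArrayBuffer(A, B, C)
}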

Generally, F# uses a few collections: arrays, lists, seq, set and map.
It is a bit messy to bridge the OCaml and the C# heritage; especially maps / dictionaries are clumsy.


Monads


Monads are an important part of functional programming. A monad is a general principle for expressing a sequence of operations, and it works across a lot of different data types:
List, Seq, Future, Option.

Scala monadic for comprehension



Scala's version is syntactic sugar over flatMap(). It is more flexible: it can mix two types of monads, say List and Option. A Scala for comprehension also returns the same type as the input type.
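
As a minimal sketch (the values and the lookup function are made up), here is a Scala for comprehension that desugars to flatMap and map, including one that mixes a List with an Option:

object ForComprehensionDemo extends App {
  val xs = List(1, 2, 3)
  val ys = List(10, 20)

  // Desugars to: xs.flatMap(x => ys.map(y => x * y))
  val products = for {
    x <- xs
    y <- ys
  } yield x * y
  println(products)   // List(10, 20, 20, 40, 30, 60)

  // Mixing monads: an Option inside a List comprehension is adapted to a sequence,
  // and the result has the same type as the input, a List
  def lookup(x: Int): Option[Int] = if (x % 2 == 1) Some(x * 100) else None

  val found = for {
    x <- xs
    v <- lookup(x)
  } yield v
  println(found)      // List(100, 300)
}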

F# monadic for comprehension



F#'s version of monadic composition is called a computation expression. It has more features than Scala's for comprehension.



Classes


Classes are considered an anti-pattern by functional programming purists. Some problems with classes are:

  • A class maintains state
  • A class creates a custom language instead of reusing common operations
  • Inheritance creates tight coupling

Scala has a very sophisticated type and class system, and classes are a central part of Scala.

F# has support for classes, but it bills itself as doing object programming rather than object-oriented programming. It is made to use classes defined in C#, but F# code will often define objects with methods without a full class definition.

I like that F# is exploring more lightweight alternatives, but classes are easy to create and feel natural to use.
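
As a small sketch of the two styles in Scala (Counter, Point and scale are made-up names): a classic class that encapsulates mutable state, next to a more lightweight alternative of an immutable case class plus free-standing functions:

// A classic class: mutable state hidden behind methods
class Counter {
  private var count = 0
  def increment(): Unit = count += 1
  def current: Int = count
}

// A lighter-weight alternative: an immutable case class plus plain functions
case class Point(x: Double, y: Double)

object Geometry {
  def scale(p: Point, factor: Double): Point = Point(p.x * factor, p.y * factor)
}

object ClassDemo extends App {
  val c = new Counter
  c.increment()
  println(c.current)                        // 1
  println(Geometry.scale(Point(1, 2), 3))   // Point(3.0, 6.0)
}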


Type Classes, IoC, DI, Type Providers


A type class is a powerful abstraction that can make a third party class implement an interface. Type classes play an important role in Scala and are implemented with helper classes provided by implicits.

Inversion of control and dependency injection are first class in Scala via implicits. This is an advanced but very useful feature of Scala.

Scala has developed these ideas to the point where you can do logic-style programming with implicits. A lot of the more sophisticated, category-theory-like programming is based on this.
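
Here is a minimal sketch of a Scala (2.12+) type class; the Show trait and its instances are made up for illustration. The implicit instance is what lets a third party type like java.util.UUID satisfy the interface after the fact:

// The type class: an interface defined separately from the types that implement it
trait Show[A] {
  def show(a: A): String
}

object Show {
  // Instances are implicit values; the compiler finds and passes them for us
  implicit val intShow: Show[Int] = (a: Int) => s"Int($a)"

  // A third party type we do not control can still be given an instance
  implicit val uuidShow: Show[java.util.UUID] = (u: java.util.UUID) => s"UUID(${u.toString.take(8)}...)"
}

object TypeClassDemo extends App {
  // A generic function that works for any A with a Show instance in scope
  def describe[A](a: A)(implicit s: Show[A]): String = s.show(a)

  println(describe(42))
  println(describe(java.util.UUID.randomUUID()))
}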


You can do inversion of control and dependency injection in F# using libraries.

F# has type providers that generate typed access to a lot of different data sources on the fly, e.g. a table on a webpage.


Design Decisions


F# is a whitespace / indentation-based language. Scala is a curly bracket language.

F# is a lightweight language with strong composability.

Scala has a sophisticated type system, including type classes; this unifies a lot of different classes and facilitates reuse.

An F# program feels a little more like a loose collection of definitions, while a Scala program feels more like a carefully packaged system.


Conclusion


F# and Scala have a lot in common. To a large extent you would still choose F# or Scala based on your platform choice.

Both languages are very well suited for building back end programs that can interact with a universe of libraries written in C# or Java.

Scala has more momentum and a better niche. It still has the status of a better Java, even after Java added some of the best constructs from Scala. Spark has made Scala a cornerstone of data engineering.

F# is more lightweight than Scala. This makes it great for data exploration and for building small scripts. It remains to be seen how well supported Spark is going to be on .NET.

From a language evolution perspective, the object functional hybrid has been very successful. F#'s and Scala's different emphases have produced different languages from similar goals. I am very happy that we can now compare their design decisions on merit, not just compare the .NET and Java ecosystems.

This article is an elaboration on my last blog post, Typed Functional Languages 2019.
Disclaimer: I have been a happy Scala user for years, and only occasionally use F#.

Tuesday, September 10, 2019

Typed Functional Languages 2019

This post is a brief status of the state of typed functional languages in late 2019.

Typed functional languages like Clean, Haskell and OCaml were developed within academia in the 1990s. Around 2010, languages like F# and Scala were gaining some acceptance in industry. Today there are many great typed functional languages, several used in industry. I will give a brief side by side introduction to the following languages:

  • F#
  • F*
  • Haskell
  • OCaml
  • Rust
  • Scala
  • TypeScript
Concepts from typed functional languages have also spread into object oriented languages like C++, C# and Java. The distinction between OOP and typed functional is fluid, so that list might seem a little arbitrary.

These languages are best of breed, so the point of this article is not to compare them by merit, but to explore which language to use for which purpose. A follow up post covers F# vs Scala.


F# (F Sharp)




F# is a mature, open source, cross-platform, functional-first programming language. 
  • Core: Strict, strong, inferred, hybrid
  • Popularity: Some use in industry and backed by Microsoft
  • Complexity: Easy to learn, but part of a big ecosystem
  • Maturity: It is 14 years old and part of .NET, so quite mature
  • Tooling: Very good
  • Cross platform: Mono, .NET Core and JavaScript
  • IDE: Visual Studio, VS Code

Strengths

Simple, open source, cross platform with good integration with the whole .NET universe.
Well suited for backend programming, Azure, web-serving and finance.
Type providers give easy typed access to a lot of different data sources.

Issues

IDE support, GUI programming and LINQ are not as well developed as for C#.


F* (F Star)



F* is a general-purpose functional programming language with effects aimed at program verification.

  • Core: Strict, dependently typed, tactical theorem prover, constraint solver, refinement type, algebraic effect tracking
  • Popularity: Research language with very few users
  • Complexity: Quite complex
  • Maturity: Several researchers are working on it, but it is not used a lot
  • Tooling: Not super polished, but built on top of good tooling in OCaml and F#
  • Cross platform: OCaml, F#, C, WASM and ASM
  • IDE: Support for Emacs

Strengths

F* has implemented a lot of powerful and interesting ideas that you can try and actually use. It is a very well developed dependently typed language.
It is good for validating highly sensitive security programs and encryption protocols.

Issues

There is little adoption and it has not stood the test of time yet.


Haskell



Haskell is an advanced purely-functional programming language.
  • Core: Lazy, pure, effect tracking using effect monads
  • Popularity: Prestigious research language with some industry adoption. No. 19 on the RedMonk June 2019 list
  • Complexity: Very complex language
  • Maturity: It has been around for 30 years, used in industry, used for research
  • Tooling: New build tool Stack is quite nice
  • Cross platform: Runs on OS X, Linux and Windows
  • IDE: Several decent plugins for VS Code, Emacs, Spacemacs, SpaceVim and IntelliJ

Strengths

Very influential research language, test bed for a lot of language research and development.
It has been optimized for years and has some use in industry.
Type classes are built into the language so you can reuse code very broadly.
Aesthetically pleasing if you love math or category theory.

Issues

It is a very complex language, and tracking effects in non-pure computations is quite hard.
It has some use in industry, but is still very much a research language.


OCaml



OCaml is a strictly evaluated functional language with some imperative features.
  • Core: Strict, strong, inferred, hybrid
  • Popularity: Used as a teaching language and by a few big companies
  • Complexity: It is a simple language to learn
  • Maturity: It has been around for 20 years and is used in industry, so quite mature
  • Tooling: Recently it got a good build tool and package manager
  • Cross platform: Runs on a lot of different operating systems and hardware
  • IDE: Language server with good integration with Eclipse, VS Code, Emacs and Vim

Strengths

Great REPL and a very fast compiler, which makes it well suited for tooling. Facebook is using it for web tooling.
Popular in theorem provers.

Issues

Concurrency is not great.


Rust



Rust is a multi-paradigm system programming language focused on safety, especially safe concurrency.
  • Core: Inferred, linear types, nominal, static, strict, strong, built around concurrency
  • Popularity: Quite popular and rising. No. 21 on the RedMonk June 2019 list
  • Complexity: Somewhat complex language
  • Maturity: Pretty new language, but used in Firefox and by AWS Firecracker
  • Tooling: Excellent build tool and package manager
  • Cross platform: Works on many different OSs
  • IDE: Good VS Code support

Strengths

Rust is a combination of ideas from OCaml, Haskell, C++, linear types and low level imperative control. It is very fast and well suited for systems programming and secure programming. There is no garbage collector and no runtime, which makes Rust great for writing libraries and WebAssembly. Rust has started to make inroads in cloud infrastructure.

Issues

Getting rid of the garbage collector makes the language harder to understand and program in.
It is a pretty new language, still developing, and there are fewer libraries.


Scala



Scala combines object-oriented and functional programming in one concise, high-level language.

  • Core: Strict and lazy, nominal and structural, hybrid, implicits for IoC
  • Popularity: Very popular. No. 13 on the RedMonk June 2019 list. Spark is written in Scala
  • Complexity: It is a quite complex language, but it is easy to get started with
  • Maturity: Very stable. Runs on the JVM, well integrated with the JVM ecosystem
  • Tooling: Great build tool and package managers
  • Cross platform: JVM and JS. Also early work on native / LLVM version
  • IDE: IntelliJ, VS Code, Eclipse

Strengths

Back-end programming, data engineering, web serving.
It is a great all around language. A lot of work has gone into creating language constructs that make Scala work well with Java libraries. In Scala 2.0 this was not the case.
Spark is a cornerstone in data engineering.

Issues

There is quite a lot of complexity: implicits, macros. Type classes / ad hoc polymorphism are possible but take some work.
Not super easy to set up a small project.
GUI programming support is not that great.


TypeScript



TypeScript brings you optional static type-checking along with the latest ECMAScript features.
  • Core: Gradually typed, structural, many new sophisticated type constructs, data language
  • Popularity: Very popular. No. 10 on the RedMonk June 2019 list
  • Complexity: Pretty complex
  • Maturity: A lot of money has gone into JavaScript; it is improving, but it still feels wonky
  • Tooling: NPM. There are a lot of tools in the Node ecosystem, too many
  • Cross platform: Runs in every browser and on Node.js
  • IDE: Amazing support in VS Code

Strengths

TypeScript makes big JavaScript codebases a lot more robust.
It is really easy to process semi-structured data in JSON.
Starting to see some use of TS in machine learning e.g. with TensorFlow.js.

Issues

JavaScript modules seem simple, like in Python or Java, but there are many different module systems and it is pretty complicated. There are a lot of NPM packages, but the ecosystem still feels less mature. Getting set up with a small project with unit tests is more work than it should be.
Concurrency: async / await dramatically simplified callback-style programming, but concurrency is still not great.


Golden Age Programming Languages


For many years I was puzzled about why language evolution seemed to favor bloated and hacky development, while ignoring more principled computer science ideas. Twenty years ago I got very excited to read about these new functional languages with strong types. Unfortunately they were only popular in academia.

We are finally living in the golden age of programming languages. It just took some time. Development is moving quickly now and not slowing down.

Apologies in advance for omissions, outdated information and other mistakes.

Thursday, April 25, 2019

Benefits of Different Python Distributions on Mac

There are at least 5 popular ways to install Python on OS X / Mac.

  • OS X default Python installation, currently Python 2.7.10
  • Use brew install python
  • Use brew install pyenv
  • Anaconda
  • Python pkg installer from python.org

I have used all of these distributions. They are all high quality and easy to install, but you run into conflicts later. You think that you are installing a library into one Python distribution, but it gets installed into another distribution, so you cannot use it. This causes many frustrating errors.

Every time I set up a Mac I have to decide what the best Python distribution is for my use case, and there is no simple choice. It has been hard to find good documentation on the trade-offs between the Python distributions. I have compiled a short list of benefits and issues, and where I think the different distributions make sense.


OS X Default Python Installation


  • You don't have to install anything
  • If you only want to have one Python distribution this will be the one
  • It is a pretty recent version of Python 2.7, currently 2.7.10

Issue

  • It does not support Python 3, which is now in common use

If you are only doing light Python 2 scripting, this is probably the easiest choice.


brew install python


  • Brew is the de facto package manager on OS X so most software is installed with brew
  • Very up to date versions of Python 2 and Python 3
  • Works well when you want to install many Python libraries
  • Python 3 is the default, but brew install python@2 will install Python 2
  • It takes precedence over the OS X default Python by being earlier on the PATH
  • Brew will probably install Python as a requirement for other packages so you get it whether you want it or not

Good for more demanding programming and installing libraries.


brew install pyenv 


  • pyenv is a tool for having different versions of Python to choose from
  • It has no dependencies on either Python 2 or 3 but manipulates the PATH
  • It can coexist with brew install python
  • It can also work with virtual environments

Issues

  • You have to install other libraries, say gzip, before you can install this
  • Python is compiled from scratch and you easily run into compile problems

Good if you are a serious programmer who needs many different versions of Python, possibly with conflicting versions of libraries.


Use Anaconda


  • Anaconda installs different versions of Python with high quality, curated packages specialized for data science
  • It can be hard to get data science libraries working with manual installs
  • It is a whole ecosystem of software 
  • Includes a good Python IDE called Spyder
  • Great support for Jupyter notebook
  • Has good built in support for Python's virtual environments 

Issue

  • It is a pretty heavy distribution taking up around 3GB

I usually need the data science libraries so I install Anaconda but also end up with the brew version of Python.


Python pkg Installer From python.org


  • It is the official Python distribution
  • You can always get the newest version of Python
  • Self contained installer

It is an easy way to get the latest version of Python installed.

Thursday, February 7, 2019

ML and Data in AWS, Azure and GCP

Machine learning and data technology are changing fast, and the big cloud providers compete with new offerings. This blog is a short introduction to what this looks like in 2019. It is focused on the cloud providers Amazon Web Services, Microsoft Azure and Google Cloud Platform.

A few things I will discuss -
  • Most data in an organization can be put into a data lake to query and combine
  • We now have very powerful, user friendly open source ML libraries
  • We have NLP and computer vision REST APIs from cloud providers
Let me start with a little history of both ML and data.


History of Machine Learning Libraries


Simplified timeline for languages, libs and APIs

  • 1960 Lisp since ML was a small part of A.I.
  • 1986 C++ neural network software on a floppy disk in the back of a book
  • 1997 Open source Java ML like WEKA, good but hard to integrate with your data and code
  • 2010 Modern Python open source libs NumPy, Pandas, Scikit-learn easy to use and integrate
  • 2015 Spark ML, attempts to make a fast ML pipeline as easy to use as Scikit-learn
  • 2017 Deep learning open source libraries TensorFlow, Keras and PyTorch
  • 2017 Cloud Vision API and Natural Language API
We now have several strong contenders to build or buy production quality ML functionality.


Convergence of Data


Recently I talked with a DBA and was surprised how much the DBA profession has changed. He told me big organizations used to have a big database such as Oracle, SQL Server, Sybase or DB2 and a lot of data stored in different files.

Now maintaining the data lake is one of his main responsibilities. The data lake is a system that lets you store log files and structured, semi-structured and unstructured data files in cheap cloud blob storage and still query and join them with SQL.
He was also in charge of an Oracle database and a few open source databases: MySQL, Postgres and MongoDB.


Data Lake Fundamentals


Uniform data that can be joined is very powerful. Here are a few underlying technologies that make this possible.

In 2004 Google released the famous MapReduce paper, describing how you can do distributed computation using functional programming operations. The idea is that you send your computation to where your data is.

In 2010 Hadoop was released. Hadoop is an open source Java implementation of MapReduce. It turned out to be very hard to program in. Two new technologies made it easier to program MapReduce: Hive and Spark.

Hive

A lot of MapReduce jobs were just queries on data. Hive is a tool that lets you write these queries as simple SQL. Hive will translate the SQL into a MapReduce job; all you have to do is add schema definitions describing the files with your data.

Spark

With Spark you can write more complicated MapReduce jobs. Spark is written in Scala, which is a natural language to write MapReduce-style code in. Spark is often used to ingest data into the data lake.
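
As a minimal sketch of what that looks like (the bucket path and column names are made up), a Spark job in Scala can read files straight from blob storage, transform them with functional operations and query them with SQL, Hive style:

import org.apache.spark.sql.SparkSession

object DataLakeQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("data-lake-query").getOrCreate()

    // Read semi-structured files directly from blob storage
    val events = spark.read.json("s3a://my-data-lake/events/2019/*.json")

    // Functional, MapReduce-style operations ...
    val bigEvents = events.filter(events("amount") > 100)
    bigEvents.show()

    // ... or register the data and query it with plain SQL
    events.createOrReplaceTempView("events")
    spark.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id").show()

    spark.stop()
  }
}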

All the cloud providers have great support for Spark, AWS has EMR, Azure has HDInsight and GCP has Dataproc.


Combining Data Lake and Normal Database


Combining a data lake with an RDBMS is not easy. There are several approaches.

You can copy over all your relational data to your data lake every day. It takes work to build and operate, but when it is working everything is unified and it is easy to do any kind of analytic queries. Some data lake products have specialized functionality to do this in an easier way, see below.


Data Lake on AWS, Azure and GCP


AWS, Azure and GCP have different data lake solutions.

AWS Redshift and Redshift Spectrum


AWS Redshift is a proprietary columnar database built on Postgres 8.
Redshift Spectrum is a query engine that can read files from S3 in these formats: Avro, CSV, JSON, Parquet, ORC and TXT, and treat them as database tables. First you have to make a Hive table definition in the Glue Data Catalog.

Azure Data Lake Store


Microsoft's data lake is called Azure Data Lake Storage. It works with blob storage and is compatible with HDFS, the Hadoop distributed file system.

U-SQL is a query tool to combine Azure SQL DB and your data lake.

Google BigQuery


GCP's data lake is called BigQuery. It works with blob storage and stores native data in a proprietary columnar format called Capacitor.
BigQuery is very fast and has a nice web GUI for SQL queries. It is very easy to get started with, since it can do schema auto-detection of your blob data, unlike Hive, which needs a table definition before it can process the data.

New Cloud ML APIs


In 2017 Google released their Cloud Vision API and Natural Language API. I heard from several data scientists that instead of building their own computer vision, named entity recognition or sentiment analysis systems, they just use the APIs.

It feels like cheating, but ML APIs are here to stay.

When should you build your own ML models and when should you use the APIs?
If you have a hard problem in computer vision or NLP that is not essential to your goal, then using an API seems like a good idea. Here are a few reasons why it can be problematic:

  • It is not free
  • Sometimes it works badly
  • There are privacy and compliance issues
  • Are you helping train a model that your competitor is going to use next?
  • Speed, e.g. if you are doing live computer vision


Working with ML APIs


If you decide to use the ML APIs, your job will be quite different than if you choose to build and train your own models. Your challenges will be:

  • Transparency of data
  • Evolution of your data sources
  • Transparency of ML models
  • ML model evolution
  • QA of ML models
  • Interaction between ML models

The 2014 book Linked Data is a great source of techniques for data transparency and evolution. It describes linked data as transparent data with enough metadata that it can be linked from other data sources. It advocates using self-describing data technologies like RDF and SPARQL.

The response to a Cloud Vision query is nested and complex. I think that schemas or a gradual type system, similar to TypeScript's, could give stability when working with semi-structured, evolving data. Some of Google's Node API wrappers are already written in TypeScript, so they already have the type definitions.


Cloud ML Developments


There are a few minor cloud ML developments that deserve a mention.


Cloud Jupyter Notebooks


Amazon SageMaker, Microsoft Azure Notebooks and Google Cloud Datalab are Jupyter notebooks directly integrated into the cloud offerings.

I find Jupyter notebooks a natural place to combine code, data and presentation. One problem I have had when programming on the cloud is that there are so many places where you can put programming logic.

Model Deployment


Model deployment has traditionally received less attention than other parts of the ML pipeline. Azure and GCP have done a great job of optimizing model deployment into something that can be done in a few lines of code. They will train a model, save it in a bucket and spin up a serverless function that serves up the model as a REST call.

Auto ML


ML tools that help find the best ML models are now available: AutoML for GCP, Amazon SageMaker for AWS and Automated Machine Learning for Azure. These will help you choose the best model and tune hyperparameters. This seems like a natural extension of current ML techniques. It does involve using cloud-specific libraries.

Transfer Learning


If you have an image categorization task, you could build a classifier from scratch by training a deep convolutional neural network. This can take a long time. With transfer learning you start with a trained CNN, for example an Inception or ResNet network. It should be trained on data that is similar to the data that you will be processing.
You train your classifier model by taking the second to last layer of the trained CNN as input. This is much less work than starting to build a 100-layer CNN from scratch. While transfer learning is not specific to the cloud, it is easy to do on the cloud where you have easy access to the pre-trained models.


AWS vs Azure vs GCP


The cloud service market is projected to be worth $200 billion in 2019. There is healthy competition despite AWS's head start. Let me end with a very brief general comparison.

AWS was the first cloud service. It started in 2006 and has the biggest market share. It is very mature, offering both Linux and Windows VMs. They continue to innovate, but the number of services they offer is a little overwhelming.

Azure is a very slick experience. Microsoft has embraced open source, offering both Linux and Windows VMs. It has great integration with the Microsoft and Windows ecosystem: SQL Server, .NET, C#, F#, Office 365 and SharePoint.

Google Cloud Platform is polished. It is easy to get started with BigQuery and do data exploration in it. GCP has a hosted Apache Airflow workflow system. GCP shines in machine learning, offering great ML, vision and NLP APIs.

Monday, February 4, 2019

VM, Lambda, Kubernetes & Terraform Best Practice

I work with these popular cloud technologies.
  • VMs, virtual machines like EC2 or GCE
  • Docker
  • Kubernetes
  • Terraform
  • Lambda / serverless functions
This post contains a short introduction to these technologies and my best practices for which cloud technology to use in different situations.


Virtualization Technologies


Here is a quick history and a brief summary of the differences.

A Highly Abbreviated Virtualization History

  • 2006 Amazon released EC2, a cloud VM you could spin up fast on demand.
  • 2013 Docker. Describes everything a VM needs in a small file, used to build a lightweight image.
  • 2014 Google open sourced Kubernetes, a system to run Docker images together.
  • 2015 Serverless functions / lambdas. Code independent of a VM.
  • 2018 Firecracker. A microVM with a 125ms start time, used for AWS Lambda and Fargate.

VM vs Containers vs Lambdas


Main differences
  • A VM has a full operating system that runs on a hypervisor.
  • Docker / Kubernetes runs as layers on top of a guest Linux OS.
  • Lambdas are serverless functions running in a minimal VM with good sandbox separation.
There has been a development from heavyweight VMs to super lightweight VMs.

Recently AWS lambdas started running in a microVM called Firecracker that can spin up in around 125ms with only 5MB memory overhead.


Best Practices for Virtualization


When should you use full VMs, Docker, Kubernetes or lambdas?

When Should You Use Serverless / Lambdas

There are many names for the same concept: AWS Lambdas, Azure Functions and Cloud Functions on GCP.

Good use cases for serverless functions
  • RESTful call with no state.
  • RESTful call that only interacts with a database.
  • Database maintenance tasks.
  • Logging operation.
  • On Azure and GCP they are used to serve up machine learning models once they are trained.
Lambdas / serverless functions don't need to have a VM running, and they scale from no use to massive use. They are very cheap and flexible.

Serverless functions have been marketed as the future of cloud computing and are clearly going to play a big role.

When Should You Use VMs or Kubernetes


Good use cases for VM or Kubernetes
  • Your program has to load a lot of data on startup.
  • Web application with a lot of functionality that is naturally grouped together.
  • Your program has to do a long sequence of operations.
You could use lambdas for a long sequence of operations. You would just push messages along from one lambda to the next. This is similar to the Erlang or Akka actor model. I find that this gives you little control and it makes error handling hard.
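
For comparison, here is a minimal sketch of that actor style in Scala with Akka classic actors (the Order / Validated messages and the actor names are made up); each actor does its step and pushes a message on to the next, much like chained lambdas:

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

case class Order(id: Int)
case class Validated(id: Int)

// First step: validate and pass the message along to the next actor
class Validator(next: ActorRef) extends Actor {
  def receive: Receive = {
    case Order(id) => next ! Validated(id)
  }
}

// Last step: store the result
class Persister extends Actor {
  def receive: Receive = {
    case Validated(id) => println(s"storing order $id")
  }
}

object Pipeline extends App {
  val system = ActorSystem("pipeline")
  val persister = system.actorOf(Props(new Persister), "persister")
  val validator = system.actorOf(Props(new Validator(persister)), "validator")

  validator ! Order(42)

  Thread.sleep(500)   // let the message flow through before shutting down
  system.terminate()
}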

When Should You Use Kubernetes


Good use cases for Kubernetes
  • If you are running a lot of daily tasks from some scheduling system, say Airflow or Luigi, it is faster to start them in Kubernetes than to spin up a new full VM instance for each.
  • You find a Docker image with a program that does what you need.
  • If you have several programs that need to run together: one program might need to be installed on Debian, another on Ubuntu and one on CentOS. Kubernetes handles this very well. You can actually deploy all 3 containers to the same Kubernetes pod, where they share a disk.

When Should You Use a Full VM

There is overhead in setting up Kubernetes. You also need to have a Kubernetes master node running, which costs money. So sometimes the simplest solution is to use a full VM.

Should You Run Docker Inside a VM?

The advantage of Docker is that you package up the Docker image and can test it locally, running the same way as it will run on the VM.

The disadvantages are that you still have the extra steps of creating the Dockerfile and building and deploying the Docker image to Docker Hub or some other repository. You have to install Docker on your VM. And there can be some performance hit from an extra level of virtualization.

I use Docker on my laptop and on Kubernetes, but I usually do not use Docker in a full VM.


Terraform


Terraform is a new tool for infrastructure as code, released by HashiCorp in 2014. It is a small functional programming language focused on configuration.

In your Terraform program you define the state you want to put your cloud system in. You run these commands from the command line in the directory where you have your program:

terraform init
terraform plan
terraform apply

This will start a VM or create your infrastructure for you, and Terraform stores the state of your system in what is called a Terraform state file. This state file can be stored locally or shared in a cloud bucket.

When you want to make changes to your cloud infrastructure, you change your Terraform program and run again:

terraform plan
terraform apply

Terraform is declarative: it will compare the state of your system with the state you want it to be in and find out what changes it needs to make.

I have used Terraform a lot with AWS to spin up EC2 instances and EMR clusters, but also to create IAM roles, policies, VPNs and security groups.

The documentation is good but there is a steep learning curve for Terraform. I found a class Learn DevOps: Infrastructure Automation With Terraform that helped me.

Terraform Modules

Terraform has a concept called a module. It enables code reuse. It is an advanced topic, but I find it absolutely essential for writing maintainable code, especially if you have multiple environments, say dev, staging and prod.

Terraform Version Problem

A problem that I have experienced several times is that one team member accidentally updates Terraform to the newest version; the next time somebody runs an update script they get this message:

Terraform doesn't allow running any operations against a state
that was written by a future Terraform version. The state is
reporting it is written by Terraform '0.11.8'.

The good news is that the Terraform state file is written in JSON and is somewhat robust. So you can download the state file and change the version number back to the old version, and there is a good chance that it will work. Still, this is not the kind of error message that you want to see when you are doing a prod release.

Issues with Terraform

Terraform is a nice declarative framework, but the Terraform state file is stored either locally or in a cloud bucket.
  • A local state file makes it hard for a team to collaborate; everybody ends up with a different state file.
  • Cloud storage allows you to collaborate, but you are still dealing with a shared mutable state that is susceptible to the version problem mentioned above.
I used Terraform to create a lambda function with IAM roles, policies and code. When I tried to update the lambda to a newer version, Terraform did not detect the changed program files, so I had to destroy everything and recreate it.

Using Terraform is often safer than making manual changes in a web console, but I would hesitate to update a database using Terraform.

There is an enterprise version of Terraform that might alleviate some of these problems, but I have only used the open source version.


Kubernetes


Kubernetes is a container orchestration framework. It was open sourced by Google in 2014, and it works very well on GCP, Google Cloud Platform. Many cloud providers have Kubernetes offerings, e.g. AWS, Azure and DigitalOcean.

Kubernetes uses declarative cloud definitions. In a YAML file you define, for example, how many instances of a web server you want to run. If a web server crashes, Kubernetes will start a new one without intervention.

Kubernetes was one of the most actively developed open source frameworks in 2018. It feels mature.
The state is part of the Kubernetes system, not a file living locally or in an S3 bucket.

Issues with Kubernetes

It is quite complicated to set Kubernetes up in a private cloud. You need highly dedicated DevOps staff to do this. A lot of things can and do go bad. I have many memories of the DNS server going missing and the block storage / hard disks disappearing after running programs for hours.


Terraform or Kubernetes


When should you use Terraform and when should you use Kubernetes?

They are both declarative tools that you can use to start programs and define things like security groups in your cloud environment.

Terraform is a good option if you want to define your infrastructure and spin up VMs, EMR clusters etc. It is not AWS specific but works very well with AWS.

Kubernetes is a good option if you choose to use containers and you are working on a cloud that has good Kubernetes support. AWS has a competing technology, Fargate, and AWS's integration with Kubernetes is less mature.