A few things I will discuss:
- Most data in an organization can be put into a data lake to query and combine
- We now have very powerful, user friendly open source ML libraries
- We have NLP and computer vision REST APIs from cloud providers
History of Machine Learning Libraries
Simplified timeline for languages, libs and APIs
- 1960 Lisp, since ML was a small part of A.I.
- 1986 C++ neural network software on a floppy disk in the back of a book
- 1997 Open source Java ML libraries like WEKA, good but hard to integrate with your data and code
- 2010 Modern Python open source libs NumPy, Pandas, Scikit-learn easy to use and integrate
- 2015 Spark ML, attempts to make a fast ML pipeline as easy to use as Scikit-learn
- 2017 Deep learning open source libraries TensorFlow, Keras and PyTorch
- 2017 Cloud Vision API and Natural Language API
Convergence of Data
Recently I talked with a DBA and was surprised by how much the DBA profession has changed. He told me that big organizations used to have one big database, such as Oracle, SQL Server, Sybase or DB2, and a lot of data stored in different files.
Now maintaining the data lake is one of his main responsibilities. The data lake is a system that lets you store log files and structured, semi-structured and unstructured data files in cheap cloud blob storage, and still query and join them with SQL.
He was also in charge of an Oracle database and a few open source databases: MySQL, Postgres and MongoDB.
Data Lake Fundamentals
Uniform data that can be joined is very powerful. Here are a few underlying technologies that make this possible.
In 2004 Google released the famous MapReduce paper, describing how you can do distributed computation using functional programming operations. The idea is that you send your computation to where your data is.
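The functional operations the paper describes can be sketched in a few lines of Python. This is a toy single-machine illustration of the word count example, not a distributed implementation:

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["the cat sat", "the dog sat"]
# In a real cluster the map phase runs where the data lives, and the
# pairs are shuffled to the reducers grouped by key.
all_pairs = reduce(lambda a, b: a + b, (map_phase(d) for d in documents))
word_counts = reduce_phase(all_pairs)
```

In a real MapReduce system the mappers run on the machines holding the data blocks, which is the "send your computation to where your data is" idea.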
In 2010 Hadoop was released. Hadoop is an open source Java implementation of MapReduce. It turned out to be very hard to program in. Two new technologies made it easier to program MapReduce: Hive and Spark.
Hive
A lot of MapReduce jobs were just queries on data. Hive is a tool that lets you write these queries as simple SQL. Hive translates the SQL into a MapReduce job; all you have to do is add schema definitions describing the files that hold your data.
Spark
With Spark you can write more complicated MapReduce jobs. Spark is written in Scala, which is a natural language to write MapReduce in. Spark is often used to ingest data into the data lake.
All the cloud providers have great support for Spark: AWS has EMR, Azure has HDInsight and GCP has Dataproc.
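The pattern Hive and Spark SQL provide, plain SQL over data files plus a schema definition, can be sketched with the standard library. Here SQLite stands in for the distributed query engine, and the file contents and table name are made up for the illustration:

```python
import csv
import io
import sqlite3

# A data file as it might sit in blob storage (made-up sample data).
raw = "user,action\nalice,login\nbob,login\nalice,logout\n"

# The schema definition, analogous to a Hive table definition,
# then ordinary SQL over the file's rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT)")
rows = list(csv.DictReader(io.StringIO(raw)))
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(r["user"], r["action"]) for r in rows])

logins = conn.execute(
    "SELECT user, COUNT(*) FROM events WHERE action = 'login' GROUP BY user"
).fetchall()
```

Hive and Spark SQL do the same thing at data lake scale: the table definition maps files to columns, and the engine compiles the query into distributed jobs.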
Combining Data Lake and Normal Database
Combining a data lake with an RDBMS is not easy. There are several approaches.
You can copy over all your relational data to your data lake every day. It takes work to build and operate, but when it is working everything is unified and it is easy to do any kind of analytic queries. Some data lake products have specialized functionality to do this in an easier way, see below.
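A minimal sketch of the copy-over approach, assuming SQLite stands in for the RDBMS and a local temp directory stands in for the blob storage bucket (the table and file names are made up):

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Source RDBMS (SQLite stands in for Oracle or Postgres here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)",
               [(1, "Ada"), (2, "Grace")])

# The "bucket" is just a local directory in this sketch; a real job
# would write Parquet or CSV to S3, ADLS or GCS on a daily schedule.
bucket = Path(tempfile.mkdtemp())
export_path = bucket / "customers.csv"

with open(export_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])
    writer.writerows(db.execute("SELECT id, name FROM customers"))
```

Once the exported files land in blob storage, the data lake's SQL engine can join them with everything else, which is the unification payoff described above.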
Data Lake on AWS, Azure and GCP
AWS, Azure and GCP have different data lake solutions.
AWS Redshift and Redshift Spectrum
AWS Redshift is a proprietary columnar database built on Postgres 8.
Redshift Spectrum is a query engine that can read files from S3 in the Avro, CSV, JSON, Parquet, ORC and text formats and treat them as database tables. First you have to make a Hive table definition in the Glue Data Catalog.
Azure Data Lake Store
Microsoft's data lake, called Azure Data Lake Storage, works with blob storage and is compatible with HDFS, the Hadoop Distributed File System.
U-SQL is a query tool to combine Azure SQL DB and your data lake.
Google BigQuery
GCP's data lake is called BigQuery. It works with blob storage and stores native data in a proprietary columnar format called Capacitor.
BigQuery is very fast and has a nice web GUI for SQL queries. It is very easy to get started with, since it can do schema auto-detection on your blob data, unlike Hive, which needs a table definition before it can process the data.
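The idea behind schema auto-detection can be sketched in a few lines: sample some rows and guess a type per column. This is a toy version of what BigQuery does; the real implementation handles many more types and edge cases:

```python
import csv
import io

def detect_schema(csv_text, sample_size=100):
    """Guess a column type (INTEGER, FLOAT or STRING) from sample rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Start with the narrowest guess and widen it when a value disagrees.
    guesses = {name: "INTEGER" for name in reader.fieldnames}
    for i, row in enumerate(reader):
        if i >= sample_size:
            break
        for name, value in row.items():
            if guesses[name] == "INTEGER":
                try:
                    int(value)
                    continue
                except ValueError:
                    guesses[name] = "FLOAT"
            if guesses[name] == "FLOAT":
                try:
                    float(value)
                    continue
                except ValueError:
                    guesses[name] = "STRING"
    return guesses

schema = detect_schema("name,age,score\nalice,31,9.5\nbob,28,7.0\n")
```

Sampling keeps detection cheap, at the cost of occasionally guessing too narrow a type when later rows disagree with the sample.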
New Cloud ML APIs
In 2017 Google released their Cloud Vision API and Natural Language API. I heard from several data scientists that instead of building their own computer vision, named entity or sentiment analysis systems, they just use the APIs.
It feels like cheating, but ML APIs are here to stay.
When should you build your own ML models and when should you use the APIs?
If you have a hard problem in computer vision or NLP that is not essential to your goal, then using an API seems like a good idea. Still, here are a few reasons why it can be problematic:
- It is not free
- Sometimes it works badly
- There are privacy and compliance issues
- Are you helping train a model that your competitor is going to use next?
- Speed, e.g. if you are doing live computer vision
Working with ML APIs
If you decide to use the ML APIs, your job will be quite different from building and training your own models. Your challenges will be:
- Transparency of data
- Evolution of your data sources
- Transparency of ML models
- ML model evolution
- QA of ML models
- Interaction between ML models
The 2014 book Linked Data is a great source of techniques for data transparency and evolution. It describes linked data as transparent data with enough metadata that it can be linked from other data sources. It advocates using self-describing data technologies like RDF and SPARQL.
The response to a Cloud Vision query is nested and complex. I think that schemas or a gradual type system similar to TypeScript's could give stability when working with semi-structured, evolving data. Some of Google's Node API wrappers are already written in TypeScript, so they already have the type definitions.
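One lightweight way to get that stability in Python is to parse the nested response into typed objects at the boundary of your system. The response shape below is a simplified, made-up stand-in for a real Cloud Vision reply:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LabelAnnotation:
    description: str
    score: float

@dataclass
class VisionResponse:
    labels: List[LabelAnnotation]

def parse_response(raw: dict) -> VisionResponse:
    # Fail fast here if the provider evolves the response shape,
    # instead of deep inside downstream code.
    return VisionResponse(
        labels=[LabelAnnotation(description=l["description"],
                                score=float(l["score"]))
                for l in raw.get("labelAnnotations", [])]
    )

# Simplified example payload (made up, not a verbatim API response).
raw = {"labelAnnotations": [{"description": "cat", "score": 0.98}]}
response = parse_response(raw)
```

When the API evolves, only the parsing function needs to change, and the typed objects document which fields the rest of your code actually depends on.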
Cloud ML Developments
There are a few minor cloud ML developments that deserve a mention.
Cloud Jupyter Notebooks
Amazon SageMaker, Microsoft Azure Notebooks and Google Cloud Datalab are Jupyter notebooks directly integrated into the cloud offerings.
I find Jupyter notebooks a natural place to combine code, data and presentation. One problem I have had when programming in the cloud is that there are so many places where you can put programming logic.
Model Deployment
Model deployment has traditionally received less attention than other parts of the ML pipeline. Azure and GCP have done a great job of streamlining model deployment into something that can be done in a few lines of code: train a model, save it in a bucket, and spin up a serverless function that serves the model as a REST call.
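The train, save, serve pattern can be sketched with the standard library. The model here is a made-up toy (it just predicts the training mean), a temp directory stands in for the cloud bucket, and the handler stands in for the function a serverless platform would invoke:

```python
import json
import pickle
import tempfile
from pathlib import Path
from statistics import mean

# "Train" a toy model: predict the mean of the training targets.
model = {"prediction": mean([2.0, 4.0, 6.0])}

# Save it to a "bucket" (a temp directory stands in for S3/GCS).
bucket = Path(tempfile.mkdtemp())
model_path = bucket / "model.pkl"
model_path.write_bytes(pickle.dumps(model))

def handler(request_body: str) -> str:
    """What a serverless function wrapping the model might look like."""
    loaded = pickle.loads(model_path.read_bytes())
    _features = json.loads(request_body)  # parsed but unused by the toy model
    return json.dumps({"prediction": loaded["prediction"]})

result = json.loads(handler('{"x": 1.0}'))
```

The cloud platforms automate exactly these steps, plus the HTTP endpoint, scaling and authentication around the handler.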
Auto ML
ML tools that help find the best model are now available on all three clouds: GCP AutoML, Amazon SageMaker and Azure Automated Machine Learning. They help you choose the best model and tune hyperparameters. This seems like a natural extension of current ML techniques, but it does involve using cloud-specific libraries.
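At its simplest, hyperparameter tuning is a search loop over candidate settings. A toy grid search over the threshold of a one-parameter classifier (pure Python, made-up data, no cloud service) shows the idea:

```python
# Toy labeled data: (score, label) pairs (made up).
data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

def accuracy(threshold):
    # Classify as 1 when the score exceeds the threshold,
    # then measure how often that matches the label.
    correct = sum((score > threshold) == bool(label)
                  for score, label in data)
    return correct / len(data)

# Grid search: try each candidate hyperparameter, keep the best.
grid = [0.2, 0.5, 0.8]
best_threshold = max(grid, key=accuracy)
```

The cloud AutoML services run far more sophisticated versions of this loop, searching over model families and many hyperparameters at once, with smarter strategies than an exhaustive grid.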
Transfer Learning
If you have an image categorization task, you could build a classifier from scratch by training a deep convolutional neural network. This can take a long time. With transfer learning you start with a trained CNN, for example an Inception or ResNet network. It should have been trained on data that is similar to the data you will be processing.
You train your classifier by taking the second-to-last layer of the trained CNN as input. This is much less work than building a 100-layer CNN from scratch. While transfer learning is not specific to the cloud, it is easy to do in the cloud, where you have easy access to pre-trained models.
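A minimal sketch of the idea in plain Python: a made-up "pre-trained network" produces feature vectors, standing in for the second-to-last layer of an Inception or ResNet network, and a small classifier is trained on top of those features:

```python
from math import dist

def pretrained_features(image):
    # Stand-in for the second-to-last layer of a pre-trained CNN:
    # here it just returns a made-up 2-d feature vector
    # (mean brightness, brightness range).
    return (sum(image) / len(image), max(image) - min(image))

# Tiny labeled "image" dataset (made-up pixel lists).
train = [([0.1, 0.2, 0.1], "dark"), ([0.9, 0.8, 0.9], "bright")]

# Train a nearest-centroid classifier on the extracted features --
# far less work than training a deep network from scratch.
grouped = {}
for image, label in train:
    grouped.setdefault(label, []).append(pretrained_features(image))
centroids = {label: tuple(sum(c) / len(c) for c in zip(*feats))
             for label, feats in grouped.items()}

def classify(image):
    f = pretrained_features(image)
    return min(centroids, key=lambda label: dist(f, centroids[label]))
```

The real workflow is the same shape: freeze the pre-trained layers, run your images through them once to get features, and fit only a small model on top.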
AWS vs Azure vs GCP
The cloud service market is projected to be worth $200 billion in 2019. There is healthy competition despite AWS's head start. Let me end with a very brief general comparison.
AWS was the first cloud service. It started in 2006 and has the biggest market share. It is very mature, offering both Linux and Windows VMs. They continue to innovate, but the number of services they offer is a little overwhelming.
Azure is a very slick experience. Microsoft has embraced open source, offering both Linux and Windows VMs. It has great integration with the Microsoft and Windows ecosystem: SQL Server, .NET, C#, F#, Office 365 and SharePoint.
Google Cloud Platform is polished. It is easy to get started with BigQuery and do data exploration in it. GCP has a hosted Apache Airflow workflow system. GCP shines in machine learning, with great ML, vision and NLP APIs.