Data science assistant for Data Engineers

by Boyan Stoyanov February 9, 2021

In my previous post I told the story of Datasy, a cloud data assistant developed by us at Data Cloud Solutions. There I explained the core features a cloud data platform should have and how Datasy builds such a platform in a fully automated way. This article covers some more advanced capabilities, such as creating forecasts and real-time predictive models. Datasy was inspired by the emerging need for cloud technologies, and we want to make these technologies easily available to our customers.

After all, when you get an assistant you want it not only to handle the tasks you have today but also to take on different tasks in the future. This is why Datasy is not just an ETL tool for processing data but also a tool for training and using machine learning models inside the data platform that it builds for you. On top of that, you don't need a degree in data science to train, tune, deploy and use the models with Datasy. You just need to start experimenting with your data, and Datasy will guide you through the steps of getting the data from your data warehouse and using it to create the forecast or prediction you want.

Since data warehouses mostly hold scalar types (number, Boolean, date-time/timestamp and text), the first version of Datasy focuses on supporting these data types, so machine learning models in Datasy do not process pictures or video yet. Restricting the input data types in this way also narrows the range of problems we can address with ML.

Predictions and Forecasts

When using historical data we need to separate the results we get from an ML model into two types: predictions and forecasts.

Predictions

Let's define a prediction as a single result that a predictive model produces from a given row of input. For example, we give it a row of data (1,2,0,no,yes,false,2020-01-01) and get back a single predicted value, such as Y or N. If the only possible outcomes are Y and N, the problem we are trying to solve is called Binary Classification. If we need to predict more than two labels (for example, the result of a football match could be H/D/A: home/draw/away), the problem is called Multi-class Classification. And if we need to estimate the value of a numeric field (for example, the number of goals in the next round of a given football championship), the problem is called Regression. These are the three types of prediction problems Datasy can currently help you with. There are also some features worth noting here:

Automated ML: Datasy generates a number of experiments and picks the candidate that performs best according to the chosen metric.
Automated Feature Engineering: once you choose the relevant fields (this is why we say you need to know what your data means), Datasy encodes and standardizes your features for model training, and creates several candidates to find the best feature preprocessing options.
Automated Model Tuning: hyper-parameter tuning is applied to every preprocessing candidate to find the best set of parameters.
Generation of preprocessing and training Jupyter Notebooks: full transparency on the steps taken to produce the model, through generated notebooks that explain what was done.
Real-time predictors accessible through a REST API and the AWS SDK: Datasy creates real-time prediction endpoints that serve prediction requests coming from other applications, thus creating a prediction and data backbone for the other applications a customer uses.
Batch predictions for large volumes of data: just put the data in an S3 folder and configure a batch inference pipeline, which checks for data at a chosen interval and writes the predicted values for each file to an output S3 folder.
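
To give a feel for what the feature encoding and standardization step in the list above involves, here is a minimal sketch in plain Python. It is not Datasy's actual implementation: the choice of one-hot encoding for text columns, the function names and the sample columns are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def standardize(values: Sequence[float]) -> List[float]:
    """Rescale a numeric column to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column carries no signal; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Turn a text column into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical sample columns, not taken from a real Datasy dataset.
goals = [1.0, 2.0, 3.0]
home_win = ["no", "yes", "no"]

scaled_goals = standardize(goals)
categories, encoded_home_win = one_hot(home_win)
```

After this step every column is numeric and on a comparable scale, which is the form most training algorithms expect. A real pipeline would also store the mean, the standard deviation and the category list so that the same transformation can be applied at prediction time.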

Forecasts

A forecast is a prediction as well, but with a key difference: you don't predict a single value but a series of values, and you feed the model not a single row but multiple rows, each of which must have a timestamp.

This type of data is usually referred to as time series data, and in the general case forecasts are based on it. Time series forecasting has been around for a long time, and there are two ways of producing such a forecast: with a machine learning model and with a statistical model. Datasy supports both by covering a few algorithms for each approach: DeepAR Plus and FBProphet as ML algorithms, and ETS, NPTS and ARIMA as statistical algorithms.
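
As an illustration of the statistical approach, the sketch below implements simple exponential smoothing, the most basic member of the ETS family. It is only meant to show the idea of a statistical forecast and is not the implementation Datasy uses; the function name, the `alpha` value and the sample data are assumptions.

```python
from typing import List, Sequence


def simple_exponential_smoothing(
    series: Sequence[float], alpha: float, horizon: int
) -> List[float]:
    """Forecast `horizon` future values from a series of past observations.

    The smoothed level is updated as
        level = alpha * observation + (1 - alpha) * previous_level
    and, because this model has no trend or seasonality, every future
    point is forecast as the final level.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in the interval (0, 1]")

    level = series[0]  # initialize the level with the first observation
    for observation in series[1:]:
        level = alpha * observation + (1.0 - alpha) * level
    return [level] * horizon


# Hypothetical daily values, oldest first.
daily_sales = [10.0, 12.0, 14.0]
forecast = simple_exponential_smoothing(daily_sales, alpha=0.5, horizon=3)
```

A higher `alpha` makes the forecast follow recent observations more closely, while a lower one smooths out short-term noise. Fuller ETS models add trend and seasonal components on top of this level.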

All these algorithms share some common forecast parameters, which are worth listing:

Frequency: the interval at which the training data is sampled. For example, if you want to predict a value for each day of the next 30 days, your frequency will be 1 day.
Horizon: the number of data points to predict into the future. For example, if you want to predict a value for each day of the next 30 days, your horizon will be 30 (days).
Context: the amount of past data relevant to the current forecast. For example, if you think only the last 3 months of data are relevant for a prediction, you will train on the whole dataset but the context for the prediction will be 3 months.
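
The three parameters above can be tied together in a small sketch. It uses a deliberately naive forecast (the average of the values inside the context window) and involves no training at all, so it only shows the role each parameter plays. The function name, the `Point` type and the sample data are assumptions and not part of Datasy.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Point = Tuple[datetime, float]  # (timestamp, value)


def naive_forecast(
    history: List[Point],
    frequency: timedelta,
    horizon: int,
    context: timedelta,
) -> List[Point]:
    """Forecast `horizon` points spaced `frequency` apart.

    Only observations inside the `context` window that ends at the last
    timestamp are used. `history` is assumed to be sorted by timestamp.
    """
    if not history:
        raise ValueError("history must not be empty")
    if horizon < 1 or context <= timedelta(0):
        raise ValueError("horizon and context must be positive")

    last_timestamp = history[-1][0]
    window_start = last_timestamp - context
    recent = [value for timestamp, value in history if timestamp > window_start]
    average = sum(recent) / len(recent)

    return [
        (last_timestamp + frequency * step, average)
        for step in range(1, horizon + 1)
    ]


# Hypothetical daily history: values 1..10 for 1-10 January 2021.
start = datetime(2021, 1, 1)
history = [(start + timedelta(days=i), float(i + 1)) for i in range(10)]

forecast = naive_forecast(
    history,
    frequency=timedelta(days=1),  # one data point per day
    horizon=3,                    # predict the next 3 days
    context=timedelta(days=4),    # only the last 4 days are relevant
)
```

Here the frequency sets the spacing of the output timestamps, the horizon sets how many of them are produced, and the context decides how much of the history influences the result.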

Once you have trained a model and created your forecasts, you can easily visualize them inside Datasy or export them to S3, where they are ready for ingestion into the DWH or for direct use by Superset Analytics.

In the next posts in this series we will go into more detail on using Datasy to build the Datasy environment, ingest and transform data, and create predictions and forecasts.