Machine learning automation with Datasy
When we started building Datasy, we wanted to automate using the data, not only storing and transforming it with ETL operations. That is why we integrated Apache Superset into the Datasy environment for standard reporting and dashboards. But while Superset lets you build quite cool visualizations of the data, it does not support predictive analytics, where a model needs to be trained and tuned before it can be used. In this story I will walk through the use cases Datasy can help with when it comes to machine learning automation. Our main goal was to cover the types of ML that a company can apply to the data in its data warehouse. For most legacy data warehouses these are mainly the scalar types — numbers, text, Booleans, dates and timestamps. As of now Datasy does not support ML on media files and documents, but the future is in front of us.
Datasy Classification and Regression
Datasy can automate binary and multiclass classification as well as regression problems using the XGBoost algorithm, which stands for eXtreme Gradient Boosting. An example of binary classification is predicting whether a football match will have more than 2.5 goals, based on a given set of related features/columns. It is binary because we only have 2 labels to guess — more than 2.5 goals or fewer than 2.5 goals. A multiclass classification example could be guessing the winner of the football match. In this case we have 3 labels/classes to predict — home win, away win or draw. A regression example would be predicting the exact number of goals in the match, or the number of red/yellow cards. The nice thing is that you don't need to know any of this — you just specify the input data table, the features to use, and the target column you want predicted.
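To make the three problem types concrete, here is a minimal sketch in plain Python of how the same raw match rows yield a binary, a multiclass and a regression target. The column names and data are made up for illustration; they are not Datasy's actual schema.

```python
# Illustrative only: derive the three target types from raw match rows.
# The column names and sample data are invented for this example.
matches = [
    {"home_goals": 2, "away_goals": 1},
    {"home_goals": 0, "away_goals": 0},
    {"home_goals": 1, "away_goals": 3},
]

def targets(match):
    total = match["home_goals"] + match["away_goals"]
    # Binary classification target: over or under 2.5 goals.
    binary = "over_2.5" if total > 2.5 else "under_2.5"
    # Multiclass classification target: home win, away win or draw.
    if match["home_goals"] > match["away_goals"]:
        outcome = "home_win"
    elif match["home_goals"] < match["away_goals"]:
        outcome = "away_win"
    else:
        outcome = "draw"
    # Regression target: the exact number of goals.
    return binary, outcome, total

for m in matches:
    print(targets(m))
```

The point is just that one input table can feed all three problem types; which one Datasy trains depends on the target column you pick.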
Datasy will then automatically create a DAG in the Datasy Airflow cluster to train and tune the model. Once the best candidate model is chosen, Datasy customers can use it in two ways:
- Real Time Predictions — with this option Datasy creates a real-time prediction endpoint which can be called via a REST API or via the AWS SageMaker SDK. Once the endpoint is created, we can make a single prediction from the Datasy UI by simply providing the feature values in the same order as they were provided for training. Another way to use the endpoint is to invoke any of the available APIs directly from downstream applications.
- Batch Predictions — with this option we can read an S3 folder which is expected to contain CSV files with the features used for training, and for every input CSV produce an output CSV containing all features plus the predicted value as the last column. Datasy will automatically create a DAG to execute this regularly on the user's schedule.
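As a rough sketch of the batch output format described above, the following plain-Python snippet reads feature rows and appends the prediction as the last column. The `predict` function here is a stand-in for the real model call (in Datasy's case a SageMaker batch job), not an actual API:

```python
import csv
import io

def predict(features):
    """Placeholder for the real model call — a dummy over-2.5-goals rule."""
    return sum(float(f) for f in features) > 2.5

def process_csv(input_text):
    """Read feature rows and append the prediction as the last column."""
    reader = csv.reader(io.StringIO(input_text))
    out = io.StringIO()
    writer = csv.writer(out)
    for row in reader:
        writer.writerow(row + [predict(row)])
    return out.getvalue()

print(process_csv("1.2,0.4\n2.0,1.5\n"))
```

Every output row keeps the original features, so downstream consumers can join predictions back to their source records without an extra lookup.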
It is worth mentioning here that Datasy also exports 2 Jupyter notebooks, one for data processing and one for model training, and stores them on S3. These notebooks are a good starting point for future improvements and experimentation, as well as a good explanation of how the models and input data were created.
To summarize, the problems we can use classification and regression for:
1. Predict a label from a known set of labels
2. Predict the actual value of some metric
Datasy Timeseries Forecasting
As we learned above, Datasy can help with predicting a given metric value, but what if we want to forecast that value for, let's say, 30 days ahead and not just for one day? This is where timeseries forecasting comes into play, and Datasy allows you to use it quite easily. The dataset here is more complex to create, but Datasy will help you all the way.
For timeseries forecasting we use the AWS Forecast service, which allows you to choose from several forecasting algorithms such as Prophet, DeepAR+, ARIMA, ETS and NPTS.
Example use cases:
1. Product Demand Planning
2. Financial planning
3. Resource planning
With Datasy you can easily train and select the best performing algorithm; the dataset needed for this is prepared in S3 automatically from the input data in the Datasy Redshift DWH. Since the dataset creation is quite complex, a few more input fields are needed in the UI, all explained in the Datasy documentation.
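For a sense of what that dataset preparation targets, a target time series dataset in AWS Forecast is described by a schema along these lines (this is a minimal example of the standard AWS Forecast TARGET_TIME_SERIES schema; the attribute names you map from your DWH tables will vary):

```python
# Minimal example of the schema AWS Forecast expects for a target
# time series dataset. Attribute names depend on your input tables.
target_schema = {
    "Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "target_value", "AttributeType": "float"},
        {"AttributeName": "item_id", "AttributeType": "string"},
    ]
}

# Each CSV row staged in S3 then follows this column order, e.g.:
# 2024-01-01 00:00:00,42.0,product_123
print([a["AttributeName"] for a in target_schema["Attributes"]])
```

Datasy's extra UI fields essentially collect the information needed to map DWH columns onto a schema like this.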
Once the input is provided, Datasy will automatically create a DAG for the training and train the model with the best algorithm — usually the one with the lowest RMSE. From the Datasy UI you can then display the forecast for the trained period, as well as make a forecast for a future period once more data has arrived in the input tables in the DWH. After a forecast is created it can either be queried for a given item or exported to S3 to be used further downstream.
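Since the winning algorithm is usually the one with the lowest RMSE, here is that metric in a minimal plain-Python sketch; the backtest numbers are made up for illustration:

```python
import math

def rmse(actuals, forecasts):
    """Root mean squared error: lower means a better fit."""
    n = len(actuals)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / n)

# Made-up backtest results for two candidate algorithms:
actuals = [10.0, 12.0, 9.0, 14.0]
candidates = {
    "DeepAR+": [11.0, 12.5, 8.0, 13.0],
    "ETS": [9.0, 15.0, 10.0, 17.0],
}

# Pick the candidate whose forecast deviates least from the actuals.
best = min(candidates, key=lambda name: rmse(actuals, candidates[name]))
print(best)
```

RMSE penalizes large misses more heavily than small ones, which is why it is a common default for choosing between forecasting algorithms.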
Datasy can easily automate machine learning on data in the Datasy Redshift DWH, allowing you to:
- Make batch and real-time predictions for classification and regression tasks
- Create forecasts for a broad range of use cases
- Easily use the models and predictions
- Export and visualize the forecasts