Data Cloud Solutions - Technologies

Machine learning and computer vision problems

2021-04-26T14:26:46+03:00

Machine learning is definitely among the hot topics in computer science nowadays. It has found diverse applications in the real world and its impact is expanding even further.

Recently I was approached by a client who had lost some money while playing blackjack online. He claimed that the online casino was cheating and wanted to prove his hypothesis. For this purpose, he wanted an automatic way to record and review some of the games that have been played. We decided to create a computer vision system to be able to track the last games

After some research for available object detection algorithms, we stopped at the Darknet Yolo v3/v4. With a C++ backend and stable results among benchmarks, it looked like a solid choice.

To train the algorithm, we had to gather 1000 samples per class. For example a set of thousand As, Qs … yellow or red cars. Gathering this dataset and labeling is basically 90% of the work required for building a solid model. So we did quite a bit of coding so that the data collection is automated and fed straight into the system.

All looked good and easy. Until we moved into production and actual deployment of the model. We realized that on a pc with a decent GPU a prediction with Yolo v4 could take about 0.1 seconds. However, a prediction on a CPU computer deployed in the cloud would be much slower – between 1 and 3 seconds depending on the number of cores and their frequency. The problem here was that for a system processing live data of a few tables with a lag of few seconds created imperfections and errors while processing. At the same time paying for a few computers with a strong GPU on the cloud is rather expensive. We had to solve this problem in a way that is both affordable and meets the requirements for processing a live feed instead of storing data into queues. And there came the clever solution. Instead of feeding a single image of a great resolution to the model, we could only crop the area of the interest (the table with cars ) and stitch together multiple pictures of tables in 1 image. Thus we could fit around 64 inference images into a single picture. Finally, we could interpret results from which original image they came and therefore map the results to the input. This would require only 1 computer with a GPU or a single fast CPU. Therefore the proposed solution fit into all requirements and was applicable.

We managed to record play data for 3 weeks period for random players. It turned out that the casino was receiving a 1.79% profit for that time which was close to the expected 1.5% average edge and our client just turned to be really unlucky for some of his play. However, we learned a very useful trick for computer vision inference.

How my Oracle 19C Real Application Cluster crashed?

2021-04-06T13:17:48+03:00

Hi all,

Wanted to share an interesting situation where one of our clusters crashed in a quite strange manner.

We have an application doing various things on a Oracle 19c RAC environment and almost all of them pass through a global sequence.

As you would be expected if for some reason this sequence stopped working - then the whole application will die in a matter of minutes due to connection pool exhaustion.

And now on the interesting part - we were waked up during the night with an application down high priority call.

When checking the sessions - it was apparent that we had a latch/mutex situation as all sessions ( ~200 ) were blocked on this particular sequence with event "enq: SV - contention", and additionally the SQL_EXEC_START times were not moving - which indicated a locking situation, not just a high volume of transactions and concurrency.

When I tried to manually select the next value from the sequence the session hanged with the same wait event which confirmed the situation.

Then we continued to troubleshoot this as a normal locking issue by ordering the active sessions based on the SQL_EXEC_START , with the difference that there was no easy way to find which is the root of the locks as there were no BLOCKING_SESSIONS populated.

So we started to check session by session and the connection with the longest wait time (oldest SQL_EXEC_START) was actually blocked on another event - > "gc current request".

Now this event normally indicates that the current instance is waiting for a block to be sent from another node, which should be in a matter of milliseconds wait time. The strange thing was that the WAIT_TIME was ~40minutes - approximately as much the time since the application was unavailable.

Now all this started to smell as a bug as it's unexpected this wait event to take more than a second.

We did a full hang analyze dump and killed this particular session with "gc current request" - > which unlocked the contention and made the application available again.

Subsequently, Oracle confirmed that we are hitting bug 32245850 "TXTSDAN : DML OPERATIONS HUNG ON "GC CURRENT REQUEST" WAITS”

Currently, our DB's were patched for the bug above and everything looks good so far, but we will keep you in touch if issues arise from this.

As a final note - be careful of the GC-related waits in your application as 19C seems prone to bugs related with cluster wait events. During the investigation with oracle, there were around 5 similar bugs that could be related with the behavior which we experienced.

Additionally never forget to get a hang to analyze the dump in such a situation.

Regards

7 ways to accelerate becoming a data driven enterprise with Datasy

2021-03-02T09:06:00+02:00

Becoming a data driven enterprise has become a goal for many companies once they realized that data is one of the most important assets they have. This made them want to make better use of their data and provide better quality products and services using the insights their data provides. In this article I’ll present 7 ways how Datasy can speed up this process or reduce the costs massively.

Automatically and easily create a cloud data platform inside your secured AWS account.

Datasy uses Terraform to get a complete data platform up and running for you in less than 1 hour in the AWS cloud. While most data platforms are separate and ask you to load your data in Datasy takes a different approach. We believe that data is one of the most important assets in a company and having your own private data platform brings every company huge competitive advantage. This is why with Datasy data will never leave yours’ AWS Account, while you still have the options to share data with the world and third party applications. Not only share but also power these applications with advanced ML capabilities but more about this in the next points. With the automated Datasy deployment the DevOps involvement is minimal as just to provide a VPC details where the data platform will securely live and serve you. The data platform which you’ll get in 1 hour consists of a centralized configuration webserver, metadata instance, Data Lake, processing cluster and analytics server.
The configuration webserver is the central place to perform all kind of operations like create and scale your environment, configure and manage the ETL and machine learning processes. It has a UI interface that allows you to configure all this. SQL knowledge is needed to configure complex ingestions and transformations.
The metadata instance holds centralized metadata for the Datasy data platform that keeps all components connected together. This includes Airflow and Superset metadata, ETL metadata and ML metadata.
The processing cluster is a customized Apache Airflow cluster fully integrated into the data platform that has templated ETL and ML operators allowing you to perform various operations like ingesting data to the Data Lake, data transformation, data extraction, model training and performing batch predictions. Datasy allows you to do real time predictions as well but more on this in the next chapters.

Take advantage of latest open source developments to reduce costs for ETL and BI tools.

Datasy uses several open source tools that became quite popular in recent years.

The first and maybe most important is Apache Airflow. This tool initially developed and open sourced by Airbnb has grown a huge community and customer base. It allows for very complex and customizable pipeline definitions called DAGs and the custom Datasy operators allow it to become an ETL tool covering most of the ETL operations. Datasy however takes some of the complexity of writing your own DAGs away by introducing a DAG generation functionality and allows DAGs to be created from the central configuration web UI.
The second is the analytics server based on Apache Superset — also open sourced by Airbnb and having a big community. It is a tool for creating rich visualizations of the data and has the capability to be connected to many data sources. Datasy provides and manages Apache Superset for you while allowing the BI developers to focus on the actual report and dashboard development to provide your business with rich and insightful data visualizations. One huge benefit to other BI tools is that Superset is free. Not only you don’t pay per user or per server but also it is not part of the Datasy price calculation. This means you are only paying only to AWS for the EC2 instances it it running on. This can provide massive cost savings for any company ready to adopt it.

Reduce ETL development time

Datasy removes a lot of the efforts for ETL development by providing advanced automation and templated metadata configurable operators together with advanced pipeline auto generation. It suggest bests candidates for primary keys, CDC columns and even some performance optimization parameters for each table but more on this in the next chapters. With most ETL tools development boils down to producing metadata and writing business rules as SQL statements. Datasy takes the metadata definition away so development can be focused in implementing the business rules thus producing more data flows for the same time.

Enable data analysts to apply machine learning without data science degree

The templated operators and advanced automation in Datasy is not only applied to the ETL processes but to the ML flows as well. In Datasy there are few ways to use ML on the data from your data lake. For all of them the training and inference processes are automated and the data analysts don’t even need to know what ML algorithm they are using. Datasy implements AutoML for the training process in order to produce the best model. This means Datasy takes care of automatically preprocessing data, selecting the best algorithm and tuning the algorithm parameters so you can get the best predictions and forecasts. This all happens from the central Datasy UI. Datasy currently supports ML which works on scalar data types mainly Classification, Regression and Timeseries Forecasting. It allows predictions to be made in batches or by directly calling a prediction endpoint with REST API or with AWS SDK. Such endpoints can power your other applications by providing predictions as they need them. There is also the batch option where data is dropped in bigger amounts in S3 and a ML pipeline provides predictions for all data entries in the batch.

Optimize performance from day one

Already mentioned the performance suggestions provided while configuring the data flows for making the data access in the data lake fast. In Datasy this is the requirement for every pipeline and every statement generated. The general assumption is that if we use the default suggested parameters we will get a maximum performance for our data ingestion pipelines, which take a lot of the IO intensive time in the DWH. Of course performance of the data platform in general and performance of accessing data depend on many more factors but Datasy takes some of that optimization and automates it allowing developers to focus on more complex performance problems during transformation for example.

Easily scale the environment to cover the expected workloads

This is one of the other reasons why migrating data warehouses to the cloud with Datasy is so easy and fast. You scale your environment up for the initial data migration and once your data is in you scale down to cover the daily workload thus reducing the infrastructure cost and making the migration fast enough to meet the deadlines. Another option is to scale up the processing cluster at night when most of the ETL runs and scale down during day so analytical functions are not impacted while running in AWS Redshift.

Summary

Datasy assists you with many of the most important aspects in your journey to a data driven company in the cloud. It covers a broad range of uses cases like enterprise data warehouse features, easy data migration, automatic performance optimization and also allows you to use machine learning services without highly trained data scientists. All this on top of the massive cost saving it can bring by removing license costs for databases, ETL tools and BI tools makes it a perfect candidate for mid and big size companies who want to take the data driven enterprise approach and bring themselves a competitive advantage in the market.

Machine learning automation with Datasy

2021-02-22T08:51:46+02:00

When we started making Datasy we wanted to automate using the data rather than only storing and transforming it with different ETL operations. This is why we integrated Apache Superset in the Datasy environment to allow standard reporting and dashboards. Even though you can make quite cool visualizations of the data, Superset does not support predictive analytics where a model needs to be trained and tuned before it can be used. In this story I will tell you what use cases can Datasy help with when it comes to machine learning automation. Our main goal was to cover the types of ML that a company can apply the data in their data warehouse. For most legacy data warehouses these are mainly the scalar types — numbers, text, Boolean, dates and timestamps. As of now Datasy does not support ML on media files and documents but the future is in front of us.

Datasy Classification and Regression

Datasy can automate Binary and Multiclass Classification as well as Regression problems using the XGBoost algorithm which stands for eXtreme Gradient Boosting. Example for binary classification is a prediction if a football match will have more than 2.5 goals based on given set of related features/columns. It is binary as we only have 2 labels to guess — more that 2.5 and less than 2.5 goals. A multiclass classification example could be to guess the winner of the football match. In this case we will have 3 labels/classes to predict — Home win, Away win or Draw. A regression example will be to predict the exact amount of goals for the match or the number of red/yellow cards. The thing is you don’t need to know that but you just need to specify the input data table and features to use as well as the target column which you want to get predicted.
Datasy will then automatically create a DAG in the Datasy Airflow cluster to train and tune the model. Once the best candidate model is chosen Datasy customers can then use it in two ways:

Real Time Predictions — with this option Datasy will create a real-time prediction endpoint which can be called with REST API and with AWS Sagemaker SDK. Once the endpoint is created we can make a single prediction from the Datasy UI by just providing the feature values in same order as they were provided for the training. Another way to use this endpoint is to directly invoke any of the available APIs from downstream applications.
Batch Predictions — with this option we can read an S3 folder which is expected to contain CSV files with the features used for training and for every CSV input to produce a CSV output containing all features plus the predicted value as last column. Datasy will automatically create the DAG to execute this regularly per user schedule.

It is worth mentioning here that Datasy will also export 2 Jupyter notebooks — one for data processing and one for model training and store them on S3. These notebooks can be a good starting point for future improvements and experimentation as well as good explanation of what and how was done to create the models and input data.

To summarize the problems we can use the classification and regression for
1. Predict a label from a known set of labels
2. Predict the actual value of some metric

Datasy Forecasting

As we learned above Datasy can help with prediction of given metric value, but what if we want to make a forecast for this value for lets say 30 days ahead and not just for one day. This is where timeseries forecasting comes into play and Datasy allows you to use them quite easily. The basic idea here is that the dataset is more complex to create, but Datasy will help you all the way.
For the timeseries forecasting we use AWS forecast service which allows you to choose from several forecasting algorithms such as FBProphet, DeepAr+ ARIMA, ETS, NTPS.
Example use cases:
1. Product Demand Planning
2. Financial planning
3. Resource planning

With Datasy you can easily train and select the best performing algorithm as well as prepare the dataset needed for that in S3 automatically by providing input data in the Datasy Redshift DWH. Since the dataset creation is quite complex there are a bit more input fields needed in the UI. All explained in the Datasy documentation.

Once input is provided Datasy will automatically create a DAG for the training and train the model with the best algorithm — usually the one with lowest RMSE. From the Datasy UI then you will be able to display the forecast for the given trained period as well as make a forecast for another period in the future once more data arrived in the input tables in the DWH with time. After a forecast is created it can either be queried for given item or exported to S3 to be used further downstream.
Summary
Datasy can easily automate machine learning on data in the Datasy Redshift DWH allowing you to:
- Make batch and real time predictions for Classification and Regression tasks
- Create forecasts with a broad range of use cases.
- Easily use the models and predictions
- Export and visualization of the forecasts.

Data science assistant for Data Engineers

2021-02-09T22:02:05+02:00

In my previous post I told the story of Datasy — a cloud data assistant developed by us at Data Cloud Solutions where I explained the core features a cloud data platform should have and how Datasy builds such platform in fully automated way. This article will cover some more advanced capabilities like creating forecasts and real time predictive models. Datasy was inspired by the emerging cloud technology needs and we want to make these technologies easily available to our customers through Datasy. After all when you get an assistant you want him not just to do all tasks you need till now but be able to do different tasks in the future. This is why Datasy is not just an ETL tool to process data but also a tool for training and using machine learning models inside the data platform that Datasy builds for you. Not only that but with Datasy you don’t really need a degree in Data Science to train, tune, deploy and use the models you just need to start experimenting with your data — Datasy will help you and even guide you throughout the steps to get the data from your data warehouse and use it to create the forecast or prediction you want.

Since in most data warehouses mostly scalar types re used (number, Boolean, date-time/timestamp and text) we have focused in the first version of Datasy to provide support for such data types and therefore we do not process pictures or video when using machine learning models yet. Having defined what types of data we will initially want to process it does reduce the types of problems we can address with ML.

Predictions and Forecasts

When using historic data for predictions we need to separate the results we get from an ML model into 2 different types — predictions and forecasts

Predictions

Let's define prediction as a single result of a predictive model from given row of input. For example we give a row of data (1,2,0,no,yes,false,2020–01–01) and we get a single predicted value Y/N for example. Now if we do have only these Y/N then the problem we try to solve is called Binary Classification. If we need to predict more than 2 labels (for example result of football match could be H/D/A- home/draw/away) then the problem we need to solve is called Multi-class Classification. And if we need to guess the value of some numeric field (for example predict the number of goals for the next round in given football championship) then the problem is called Regression. These are the 3 types of problems which Datasy can currently help you with regarding predictions. However there are some cool features worth noting here:

Automated ML — generating a number of experiments and choosing the best candidate with best performance according to chosen metric
Automated Feature Engineering — once you choose the relevant fields (this is why we say you need to know what your data means) Datasy will do the feature encoding to process you features and standardize them to use by model training, also create several candidates to see for best feature preprocessing options.
Automated Model Tuning — Hyper-parameter tuning is applied on every preprocessing candidate to find most best set of parameters
Generation of preprocessing and training Jupyter Notebooks — Full transparency on the steps taken to produce the model by generation the notebooks with explanation what was done
Ability to create Real-time predictors which can be accessed with REST API and AWS SDK — Datasy will create real-time prediction endpoints to serve predictions requests which can come from other application thus creation a prediction and data backbone for other applications used by a customer.
Ability to predict in huge batches if needed — just put the data in an S3 folder and configure batch inference pipeline which will check the data at chosen interval and provide the predicted values for each file in an output S3 folder

Forecasts

A forecast is a prediction as well but the key difference is that you don’t predict a single value but a series of values and you feed it not with a single row but with multiple rows which should all have a timestamp.

This type of data is usually referred to as timeseries data. Forecast are made based on timeseries data in the general case. Since timeseries forecasting is here for a long time there are two ways of producing such forecast — with a machine learning model and with statistical model. Datasy supports both by covering few algorithms for both approaches — DeepAR Plus and FBProphet as ML algorithms and ETS, NTPS and ARIMA as statistical algorithms.

Forecasts have some common parameters for all these algorithms which are worth listing

Frequency— the interval at which the training data is sampled. For example if you want to predict data for each day for next 30 days your frequency will be 1 day.
Horizon — the amount of data points to predict in the future. For example if you want to predict data for each day for next 30 days your horizon will be 30 (days).
Context — the amount of past data relevant for the current forecast. For example if you think only last 3 months of data are relevant for a prediction you will train on the whole dataset but context for prediction will be 3 months

Once you train and create your forecast you can easily visualize them inside Datasy or export them to S3 where they will be ready for ingestion in the DWH or to be used by the Superset Analytics directly.

In the next series we will get more details on using Datasy for building the Datasy environment, ingesting and transforming data and creating prediction and forecast.

Managing Cloud Data. Datasy a tale of Automation

2021-02-09T18:15:58+02:00

I want to tell you our story of a small cloud and data consulting company in Bulgaria. We started working together back in 2012 and we started Data Cloud Solutions in 2017. Since then we had a lot of great projects related to managing data in enterprise environments as well as migrating to the AWS Cloud.

Many customers had the same problem of both moving to the cloud and keeping up with newest developments. It was just taking lots of time configuring these servers in the cloud, configuring privileges and security policies and roles and no customer had huge DevOps team as AWS started not so long ago and there weren’t and still aren’t enough AWS experts to cover such needs. On top of that in the data warehousing field such migrations were the opportunity to move to more cost effective analytics, databases and tools. Perfect from business perspective, right? First all infrastructure is in the cloud, second our data warehouse and analytics are cheaper, third latest cloud technologies (AWS Services in this case) are available to our developers. Easily said than done but this is what we were doing for few years now so here we are.

As you can imagine such combined migration is very complex task. Customers wanted it and not only, they needed it to keep their business up to speed even though no one had the experts to do all that. So what do you do when you don’t have all the people to do a job? Automate it!

You will now say “yes but there are so many tools already for automating the AWS infrastructure — CloudFormation, Jenkins, Ansible, Terraform”. True, but how about automating data warehouse migration from Oracle or MS SQL to AWS Redshift, how about building a data lake to keep any type of data. This involves creating a whole new separate architecture for the data platforms in most of the cases. So we did redesign and redo all the ETL and Reporting to fit the new cloud architecture and we noticed that the end result is kind of the similar for most and it boils doing to having several key features.

Data Lake/Data Warehouse where all data will be stored. Customers usually want to have control over their data and with recent regulations like GDPR the data protection bar has risen for everyone. This is why usually for keeping data SaaS solutions and not very popular as they keep the data not on customer owned environment.
ETL cluster to manage all data pipelines and schedules. Everyone sees the increase of data generation in the world and realizes how important scalability is for their data platform to keep up with the data volumes increase even in the next 5 years. Also a cluster to take the heavy calculations and processing for training Machine Learning models.
Analytics cluster to pull data from the data lake and data warehouse and provide customers with valuable rich visualizations of the data to take more accurate and informed decisions. In our experience we’ve seen the major tools used for analytics and they are great but they come at a price. Either price as real money or price as you need to use given company technical stack.
Automation and Metadata — No customer wants to deal with writing code for data pipelines, creating infrastructure or migrating data. In most cases we were asked to automate these tasks too. Automation helps customers not to need experts they cannot find when their business needs it. For example they need just a few AWS Devops to manage the accounts, networking and IAM and not the whole data platform with all its complexities.
Last but not least all customers wanted to reduce costs. They looked at many option like changing the analytics tool so they don’t need to pay per user or using an opensource software for ETL and scheduling and here we are lucky for the times we live it as we have huge arsenal of advanced open source products at out disposal like Docker, Hashicorp’s Terraform, Apache Airflow, Apache Superset, Tensorflow, sklearn and so many more.

Datasy

Having these 5 common features of an advanced cloud platform we decided to make an automated software assistant that helps each customer implement this in their own AWS account. This is how we came up with Datasy — the cloud data assistant I want to tell you a little bit more about and also show you an introduction video. I will not be too long and will just highlight how Datasy implements the key features I listed above:
1. Datalake/DWH — Datasy builds automatically both the data lake and data warehouse using AWS S3 and AWS Redshift secures it in a virtual private cloud(VPC) predefined by customer in their own AWS account. Not only that but Datasy suggest you better ingestion options for your data as you go based on the source data. For example CDC column for faster ingestion , Distkey and Sortkey for improving AWS Redshift performance, detecting deletion on source, batching of data by row count and size as well as massively parallel query execution.

2. ETL Cluster — Datasy automatically builds and manages a customized Apache Airflow Cluster with custom Datasy Operators for specific ETL(ingest, copy, transform, extract) or ML operation (training, inference). All these fully automated with a configuration UI also allowing complex SQL transformation and queries if user want to go there. Datasy Airflow cluster can scale easily horizontally and vertically and all this scaling up and down is managed by Datasy.

3. Analytics — Datasy builds an Apache Superset Analytics which is open source and free of charge and automatically connects it to the data lake. We leave to the customers to redefine the reports for their needs but this solution comes ready and configured for them to use. But of course in our experience we have seen many good tools and will understand if anyone wants to keep them for their cool features even if they cost a bit more. You can plug any Analytics tool to Datasy additionally if needed

4. Automation and Metadata — Datasy uses AWS RDS to store metadata for all its components and scale together with them. With Datasy the Cloud data environment is created for less than with more than 60 AWS resources configured to run together and in sync. Each data pipeline is defined and configured with a easy to use UI and once it enters the system metadata it is automatically generated and put into the system. The only think the user of Datasy does is entering a bit of metadata through the UI while system responds and manages itself. We are fans of extreme automation and we have made it look like so.

5. Cost saving — Well, yes, back to the money talk, so the only think that you will pay is the infrastructure cost and the Datasy cost. If Superset is used no cost for analytics (apart from infrastructure costs). ETL nodes are charged as Datasy nodes so you can use them only when needed for most effective cost saving.

And now the video I promised that briefly shows the main capabilities of Datasy and in the next article I will explain how you can use Datasy not only to build your cloud data platform and ETL pipelines but also to create real time predictions and forecasts easily using the data inside your own Datasy Environment.