Managing Cloud Data. Datasy: A Tale of Automation
I want to tell you the story of our small cloud and data consulting company in Bulgaria. We started working together back in 2012 and founded Data Cloud Solutions in 2017. Since then we have had a lot of great projects related to managing data in enterprise environments as well as migrating to the AWS Cloud.
Many customers had the same problem: moving to the cloud while also keeping up with the newest developments. Configuring servers in the cloud and setting up privileges, security policies and roles was taking a lot of time. No customer had a huge DevOps team, because AWS started not so long ago and there weren't, and still aren't, enough AWS experts to cover such needs. On top of that, in the data warehousing field such migrations were an opportunity to move to more cost-effective analytics, databases and tools. Perfect from a business perspective, right? First, all infrastructure is in the cloud; second, our data warehouse and analytics are cheaper; third, the latest cloud technologies (AWS services in this case) are available to our developers. Easier said than done, but this is what we have been doing for a few years now, so here we are.
As you can imagine, such a combined migration is a very complex task. Customers not only wanted it, they needed it to keep their business up to speed, even though no one had the experts to do all that. So what do you do when you don't have all the people to do a job? Automate it!
You will now say: "Yes, but there are already so many tools for automating AWS infrastructure: CloudFormation, Jenkins, Ansible, Terraform." True, but how about automating a data warehouse migration from Oracle or MS SQL to AWS Redshift? How about building a data lake that can keep any type of data? In most cases this involves creating a whole new, separate architecture for the data platform. So we redesigned and rebuilt all the ETL and reporting to fit the new cloud architecture, and we noticed that the end result is quite similar for most customers and boils down to several key features.
- Data Lake/Data Warehouse where all data will be stored. Customers usually want to have control over their data, and with recent regulations like GDPR the data protection bar has risen for everyone. This is why SaaS solutions are usually not very popular for storing data: they keep the data outside the customer-owned environment.
- ETL cluster to manage all data pipelines and schedules. Everyone sees the growth of data generation in the world and realizes how important scalability is for their data platform to keep up with increasing data volumes, even over the next 5 years. The same cluster also takes on the heavy calculations and processing for training Machine Learning models.
- Analytics cluster to pull data from the data lake and data warehouse and provide customers with rich, valuable visualizations of the data so they can make more accurate and informed decisions. In our experience the major analytics tools are great, but they come at a price: either real money, or being tied to a given company's technical stack.
- Automation and Metadata: no customer wants to deal with writing code for data pipelines, creating infrastructure or migrating data. In most cases we were asked to automate these tasks too. Automation means customers do not need experts they cannot find when their business needs them. For example, they need just a few AWS DevOps engineers to manage the accounts, networking and IAM, and not the whole data platform with all its complexities.
- Last but not least, all customers wanted to reduce costs. They looked at many options, like changing the analytics tool so they don't need to pay per user, or using open source software for ETL and scheduling. Here we are lucky for the times we live in, as we have a huge arsenal of advanced open source products at our disposal, like Docker, HashiCorp's Terraform, Apache Airflow, Apache Superset, TensorFlow, sklearn and so many more.
Datasy
Having identified these 5 common features of an advanced cloud data platform, we decided to build an automated software assistant that helps each customer implement them in their own AWS account. This is how we came up with Datasy, the cloud data assistant. I want to tell you a little more about it and also show you an introduction video. I will keep it short and just highlight how Datasy implements the key features listed above:
1. Datalake/DWH: Datasy automatically builds both the data lake and the data warehouse using AWS S3 and AWS Redshift, and secures them in a virtual private cloud (VPC) predefined by the customer in their own AWS account. On top of that, Datasy suggests better ingestion options for your data as you go, based on the source data. Examples include a CDC (change data capture) column for faster ingestion, Distkey and Sortkey for improving AWS Redshift performance, detection of deletions on the source, batching of data by row count and size, and massively parallel query execution.
2. ETL Cluster: Datasy automatically builds and manages a customized Apache Airflow cluster with custom Datasy operators for specific ETL operations (ingest, copy, transform, extract) or ML operations (training, inference). All of this is fully automated through a configuration UI, which also allows complex SQL transformations and queries if the user wants to go there. The Datasy Airflow cluster can easily scale horizontally and vertically, and all this scaling up and down is managed by Datasy.
3. Analytics: Datasy sets up Apache Superset, which is open source and free of charge, and automatically connects it to the data lake. We leave it to the customers to redefine the reports for their needs, but the solution comes ready and configured for them to use. Of course, in our experience we have seen many good tools, and we understand if anyone wants to keep them for their cool features even if they cost a bit more. You can additionally plug any analytics tool into Datasy if needed.
4. Automation and Metadata: Datasy uses AWS RDS to store metadata for all its components and scales together with them. With Datasy the Cloud data environment is created for less than with more than 60 AWS resources configured to run together and in sync. Each data pipeline is defined and configured through an easy-to-use UI, and once its metadata enters the system the pipeline is automatically generated and deployed. The only thing the Datasy user does is enter a bit of metadata through the UI, while the system responds and manages itself. We are fans of extreme automation and we have built Datasy accordingly.
5. Cost saving: Well, yes, back to the money talk. The only things you pay for are the infrastructure cost and the Datasy cost. If Superset is used, there is no cost for analytics (apart from infrastructure costs). ETL nodes are charged as Datasy nodes, so you can run them only when needed for the most effective cost saving.
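To make points 1 and 4 a bit more concrete, here is a minimal Python sketch of what metadata-driven generation could look like. It is not Datasy's actual code: the `TableMetadata` record and the functions `redshift_ddl`, `incremental_query` and `batches` are hypothetical names invented for this illustration, and the `%s` placeholder assumes a psycopg2-style database driver. The idea it shows is that a small metadata record (table, columns, CDC column, Distkey, Sortkey) is enough to derive the Redshift table definition, an incremental extract query and row-count batching.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass
class TableMetadata:
    """Hypothetical metadata record for one ingested table (illustration only)."""
    name: str
    columns: Dict[str, str]            # column name -> Redshift data type
    cdc_column: Optional[str] = None   # column used to detect changed rows
    distkey: Optional[str] = None      # Redshift distribution key
    sortkeys: List[str] = field(default_factory=list)


def redshift_ddl(meta: TableMetadata) -> str:
    """Build a Redshift CREATE TABLE statement from the metadata."""
    cols = ",\n".join(f"    {col} {typ}" for col, typ in meta.columns.items())
    ddl = f"CREATE TABLE {meta.name} (\n{cols}\n)"
    if meta.distkey:
        ddl += f"\nDISTSTYLE KEY DISTKEY ({meta.distkey})"
    if meta.sortkeys:
        ddl += f"\nSORTKEY ({', '.join(meta.sortkeys)})"
    return ddl + ";"


def incremental_query(meta: TableMetadata, last_value: Optional[str]) -> Tuple[str, tuple]:
    """Return (sql, params) that reads only rows changed since last_value.

    Falls back to a full extract when no CDC column or no previous
    value is known. The value is passed as a bind parameter.
    """
    if meta.cdc_column is None or last_value is None:
        return f"SELECT * FROM {meta.name}", ()
    sql = (
        f"SELECT * FROM {meta.name} "
        f"WHERE {meta.cdc_column} > %s ORDER BY {meta.cdc_column}"
    )
    return sql, (last_value,)


def batches(rows: Iterable, max_rows: int) -> Iterator[list]:
    """Split a stream of rows into lists of at most max_rows items."""
    batch: list = []
    for row in rows:
        batch.append(row)
        if len(batch) >= max_rows:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    orders = TableMetadata(
        name="orders",
        columns={"order_id": "BIGINT", "customer_id": "BIGINT", "updated_at": "TIMESTAMP"},
        cdc_column="updated_at",
        distkey="customer_id",
        sortkeys=["updated_at"],
    )
    print(redshift_ddl(orders))
    print(incremental_query(orders, "2020-01-01 00:00:00"))
```

In this sketch the user only supplies the metadata record; everything else is derived from it, which is the same division of labor described above, where the user enters metadata and the system generates the pipeline.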
And now the video I promised, which briefly shows the main capabilities of Datasy. In the next article I will explain how you can use Datasy not only to build your cloud data platform and ETL pipelines, but also to easily create real-time predictions and forecasts using the data inside your own Datasy environment.