6 ways to accelerate becoming a data-driven enterprise with Datasy

Becoming a data-driven enterprise has become a goal for many companies once they realized that data is one of the most important assets they have. They want to make better use of their data and to provide better products and services using the insights it holds. In this article I’ll present six ways Datasy can speed up this process and massively reduce its cost.

  1. Automatically and easily create a cloud data platform inside your secured AWS account.

Datasy uses Terraform to get a complete data platform up and running for you in the AWS cloud in less than an hour. While most data platforms are separate services that ask you to load your data into them, Datasy takes a different approach. We believe that data is one of the most important assets in a company, and that having your own private data platform gives every company a huge competitive advantage. This is why with Datasy your data never leaves your AWS account, while you still have the option to share data with the world and with third-party applications. And not only share: you can also power these applications with advanced ML capabilities, but more about this in the next points. With the automated Datasy deployment, DevOps involvement is minimal: they only need to provide the details of the VPC where the data platform will securely live and serve you. The data platform you get in that hour consists of a centralized configuration webserver, a metadata instance, a Data Lake, a processing cluster and an analytics server.
The configuration webserver is the central place to perform all kinds of operations: creating and scaling your environment, and configuring and managing the ETL and machine learning processes. Its UI lets you configure all of this; SQL knowledge is needed only for complex ingestions and transformations.
The metadata instance holds the centralized metadata for the Datasy data platform, keeping all components connected. This includes Airflow and Superset metadata, ETL metadata and ML metadata.
The processing cluster is a customized Apache Airflow cluster, fully integrated into the data platform, with templated ETL and ML operators that let you perform operations such as ingesting data into the Data Lake, data transformation, data extraction, model training and batch predictions. Datasy supports real-time predictions as well, but more on this in the points below.
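As a rough illustration of how such a pipeline is wired together, here is a minimal plain-Python sketch of Airflow-style task chaining. The task names are hypothetical; the article does not show Datasy's actual operators.

```python
class Task:
    """Minimal stand-in for a pipeline operator (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.downstream = []

    def __rshift__(self, other):
        # Mimic Airflow's `a >> b` dependency syntax: b runs after a.
        self.downstream.append(other)
        return other

# The kinds of operations the processing cluster chains together
ingest = Task("ingest_to_data_lake")
transform = Task("transform")
train = Task("train_model")
predict = Task("batch_predict")

ingest >> transform >> train >> predict
```

In real Airflow the same `>>` syntax declares dependencies between operator instances inside a DAG; the templated Datasy operators would fill these slots with metadata-driven implementations.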

  2. Take advantage of the latest open source developments to reduce costs for ETL and BI tools.

Datasy builds on several open source tools that have become quite popular in recent years.

The first, and maybe the most important, is Apache Airflow. This tool, initially developed and open sourced by Airbnb, has grown a huge community and user base. It allows for very complex and customizable pipeline definitions called DAGs, and the custom Datasy operators turn it into an ETL tool covering most ETL operations. Datasy, however, takes away some of the complexity of writing your own DAGs by introducing DAG generation functionality, allowing DAGs to be created from the central configuration web UI.
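The DAG generation idea can be sketched as expanding a declarative config into concrete tasks. The config schema below is invented for illustration; Datasy's real schema is not shown in the article.

```python
def generate_ingestion_dag(config):
    """Expand a table list into per-table ingestion tasks plus a final transform.

    Hypothetical config shape: {"dag_id": ..., "tables": [...], "schedule": ...}
    """
    tasks = [{"task_id": f"ingest_{table}", "upstream": []}
             for table in config["tables"]]
    # One transform step that waits for every ingestion task
    tasks.append({
        "task_id": "transform_all",
        "upstream": [f"ingest_{t}" for t in config["tables"]],
    })
    return {
        "dag_id": config["dag_id"],
        "schedule": config.get("schedule", "@daily"),
        "tasks": tasks,
    }

dag = generate_ingestion_dag({"dag_id": "sales", "tables": ["orders", "customers"]})
```

A generator like this is what lets the web UI produce DAGs without anyone hand-writing Python pipeline code.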
The second is the analytics server, based on Apache Superset, also open sourced by Airbnb and backed by a big community. It is a tool for creating rich visualizations of your data and can be connected to many data sources. Datasy provides and manages Apache Superset for you, allowing BI developers to focus on the actual report and dashboard development and provide your business with rich, insightful data visualizations. One huge advantage over other BI tools is that Superset is free: not only do you not pay per user or per server, it is also not part of the Datasy price calculation. You pay only AWS, for the EC2 instances it runs on. This can mean massive cost savings for any company ready to adopt it.
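To make the cost argument concrete, here is a back-of-the-envelope comparison. The per-user license price and the EC2 hourly rate are assumed, illustrative numbers, not quotes.

```python
def annual_cost_per_user_bi(users, usd_per_user_per_month):
    # Typical commercial BI pricing: a monthly fee per named user
    return users * usd_per_user_per_month * 12

def annual_cost_superset(ec2_usd_per_hour, hours_per_year=8760):
    # Superset itself is free; you pay only for the EC2 instance it runs on
    return ec2_usd_per_hour * hours_per_year

licensed = annual_cost_per_user_bi(50, 20)   # 50 analysts at an assumed $20/user/month
superset = annual_cost_superset(0.10)        # one mid-size instance at an assumed $0.10/hour
```

With these assumptions the licensed tool costs $12,000 per year while the Superset instance costs under $900, and the gap only widens as the number of users grows, since the EC2 cost does not scale per user.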

  3. Reduce ETL development time

Datasy removes much of the effort of ETL development by providing advanced automation and templated, metadata-configurable operators, together with advanced pipeline auto-generation. It suggests the best candidates for primary keys and CDC columns, and even some performance optimization parameters for each table (more on this in the points below). With most ETL tools, development boils down to producing metadata and writing business rules as SQL statements. Datasy takes the metadata definition away, so development can focus on implementing the business rules, producing more data flows in the same time.
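The kind of suggestion logic described above can be sketched in a few lines: scan a sample of rows and flag columns whose values are non-null and unique as primary-key candidates. This is a simplified guess at the approach; the article does not describe Datasy's real heuristics.

```python
def suggest_primary_key_candidates(rows):
    """Return columns whose sampled values are non-null and unique."""
    if not rows:
        return []
    candidates = []
    for col in rows[0]:
        values = [row[col] for row in rows]
        if all(v is not None for v in values) and len(set(values)) == len(values):
            candidates.append(col)
    return candidates

sample = [
    {"id": 1, "email": "a@x.com", "country": "US"},
    {"id": 2, "email": "b@x.com", "country": "US"},
    {"id": 3, "email": "c@x.com", "country": "DE"},
]
```

On this sample both `id` and `email` qualify, while `country` is rejected for repeated values; a developer then just confirms the suggestion instead of profiling the table by hand.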

  4. Enable data analysts to apply machine learning without a data science degree

The templated operators and advanced automation in Datasy apply not only to the ETL processes but to the ML flows as well. In Datasy there are a few ways to use ML on the data in your data lake. In all of them the training and inference processes are automated, and the data analysts don’t even need to know which ML algorithm they are using. Datasy implements AutoML for the training process in order to produce the best model. This means Datasy takes care of automatically preprocessing the data, selecting the best algorithm and tuning its parameters, so you get the best predictions and forecasts. All of this happens from the central Datasy UI. Datasy currently supports ML on scalar data types, mainly classification, regression and time-series forecasting. Predictions can be made in batches or by calling a prediction endpoint directly via a REST API or the AWS SDK. Such endpoints can power your other applications by providing predictions as they need them. There is also the batch option, where larger amounts of data are dropped into S3 and an ML pipeline provides predictions for every data entry in the batch.
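Conceptually, the AutoML selection step tries several candidate models and keeps the one with the lowest validation error. Here is a toy time-series version; the three forecasters are standard textbook baselines, not Datasy's actual algorithms.

```python
def naive(history):
    # Forecast: the last observed value
    return history[-1]

def mean(history):
    # Forecast: the average of all observed values
    return sum(history) / len(history)

def drift(history):
    # Forecast: last value plus the average historical step
    return history[-1] + (history[-1] - history[0]) / (len(history) - 1)

def select_best_forecaster(series, n_val=3):
    """One-step-ahead evaluation on the last n_val points; pick lowest MSE."""
    models = {"naive": naive, "mean": mean, "drift": drift}
    scores = {}
    for name, forecast in models.items():
        errors = []
        for i in range(len(series) - n_val, len(series)):
            prediction = forecast(series[:i])
            errors.append((prediction - series[i]) ** 2)
        scores[name] = sum(errors) / len(errors)
    return min(scores, key=scores.get), scores

best, scores = select_best_forecaster([1, 2, 3, 4, 5, 6, 7, 8])
```

A real AutoML system also tunes each candidate's hyperparameters and preprocesses the data, but the selection loop has the same shape: evaluate, score, keep the winner.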

  5. Optimize performance from day one

I already mentioned the performance suggestions Datasy provides while you configure the data flows, which keep data access in the data lake fast. In Datasy this is a requirement for every pipeline and every generated statement. The general assumption is that using the default suggested parameters gives maximum performance for the data ingestion pipelines, which take up much of the IO-intensive time in the DWH. Of course, the performance of the data platform in general, and of data access in particular, depends on many more factors, but Datasy automates some of that optimization, allowing developers to focus on more complex performance problems, during transformation for example.
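As a sketch of what such automated suggestions might look like for a Redshift-style target, one could pick the highest-cardinality column as a distribution key and a monotonically increasing column (typically a load timestamp) as a sort key. These heuristics are my own simplification, not Datasy's documented logic.

```python
def suggest_dist_key(rows):
    # Highest distinct-value count → most even spread across slices
    counts = {col: len({row[col] for row in rows}) for col in rows[0]}
    return max(counts, key=counts.get)

def suggest_sort_key(rows):
    # Prefer a column whose sampled values already arrive in increasing order
    for col in rows[0]:
        values = [row[col] for row in rows]
        try:
            if all(a <= b for a, b in zip(values, values[1:])):
                return col
        except TypeError:
            continue  # mixed, incomparable types: not a sort-key candidate
    return None

sample = [
    {"country": "US", "order_id": "c", "loaded_at": "2023-01-01"},
    {"country": "DE", "order_id": "a", "loaded_at": "2023-01-02"},
    {"country": "US", "order_id": "d", "loaded_at": "2023-01-03"},
    {"country": "DE", "order_id": "b", "loaded_at": "2023-01-04"},
]
```

In Redshift these two choices (DISTKEY and SORTKEY) drive how rows are placed on disk, which is exactly the IO-side optimization the paragraph above refers to.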

  6. Easily scale the environment to cover the expected workloads

This is another reason why migrating data warehouses to the cloud with Datasy is so easy and fast. You scale your environment up for the initial data migration, and once your data is in, you scale down to cover the daily workload, reducing the infrastructure cost while keeping the migration fast enough to meet your deadlines. Another option is to scale the processing cluster up at night, when most of the ETL runs, and down during the day, so the analytical functions running in AWS Redshift are not impacted.
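A day/night scaling policy like the one just described can be as simple as a clock-based rule. The cluster sizes and the night window below are placeholder values; in practice the result would feed an auto-scaling API call rather than be used directly.

```python
def desired_workers(hour, night_size=8, day_size=2, night_start=22, night_end=6):
    """Scale the processing cluster up for the nightly ETL window, down for daytime analytics.

    `hour` is the hour of day (0-23); the night window wraps around midnight.
    """
    in_night_window = hour >= night_start or hour < night_end
    return night_size if in_night_window else day_size
```

A scheduler (or an Airflow DAG itself) would evaluate this each hour and resize the worker fleet accordingly.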


Datasy assists you with many of the most important aspects of your journey to a data-driven company in the cloud. It covers a broad range of use cases, such as enterprise data warehouse features, easy data migration and automatic performance optimization, and it lets you use machine learning without highly trained data scientists. All this, on top of the massive cost savings it can bring by removing license costs for databases, ETL tools and BI tools, makes it a perfect candidate for mid-size and large companies that want to take the data-driven enterprise approach and gain a competitive advantage in the market.