AWS Glue: Simple, Flexible, and Cost-effective ETL For Your Enterprise

Written by Gaurav Mishra | Oct 31, 2019 7:00:00 AM

AWS Glue is a fully managed extract, transform, and load (ETL) service from Amazon that prepares your data for analytics. Its Data Catalog gives you a unified view of your data, so that you can clean, enrich, and catalog it properly. Once cataloged, your data is immediately searchable, queryable, and available for ETL.

It offers the following benefits:

Less Hassle: AWS Glue is integrated across a wide range of AWS services. It natively supports data stored in Amazon Aurora, Amazon RDS engines, Amazon Redshift, and Amazon S3, as well as common database engines running in your Amazon VPC. This reduces the hassle of onboarding.
Cost Effectiveness: AWS Glue is serverless, so there are no compute resources to configure and manage. Additionally, it handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. This is quite cost effective as you pay only for the resources used while your jobs are running.
More Power: AWS Glue automates much of the effort spent in building, maintaining, and running ETL jobs. It crawls your data sources, identifies data formats, and suggests schemas and transformations. It even automatically generates the code to execute your data transformations and loading processes.

AWS Glue helps enterprises significantly reduce the cost, complexity, and time spent creating ETL jobs. Here’s a detailed look at why you should use AWS Glue:

Why Should You Use AWS Glue?

AWS Glue offers the following features for your enterprise:

Integrated Data Catalog

AWS Glue includes an integrated Data Catalog, a central metadata repository for all your data assets, irrespective of where they are located. It contains table definitions, job definitions, and other control information that can help you manage your AWS Glue environment.

Using the Data Catalog can help you automate much of the undifferentiated heavy lifting involved in cleaning, categorizing or enriching the data, so you can spend more time analyzing the data. It computes statistics and registers partitions automatically so as to make queries against your data both efficient and cost-effective.

Clean and Deduplicate Data

You can clean and prepare your data for analysis by using an AWS Glue Machine Learning Transform called FindMatches, which enables deduplication and finding matching records. And you don’t need to know machine learning to be able to do this. FindMatches will just ask you to label sets of records as either “matching” or “not matching”. Then the system will learn your criteria for calling a pair of records a “match” and will accordingly build an ML Transform. You can then use it to find duplicate records or matching records across databases.

Automatic Schema Discovery

AWS Glue crawlers connect to your source or target data store and progress through a prioritized list of classifiers to determine the schema for your data. They then create metadata and store it in tables in your AWS Glue Data Catalog. This metadata is used when you author your ETL jobs. To keep your metadata up to date, you can run crawlers on a schedule, on demand, or in response to an event.
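As a rough illustration of a scheduled crawler, the sketch below builds the arguments for the boto3 Glue client's `create_crawler` call. The crawler name, IAM role ARN, database name, S3 path, and cron expression are all placeholder assumptions, and the AWS call itself is kept in a separate function so that the request can be inspected without credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for the boto3 Glue `create_crawler` call."""
    request = {
        "Name": name,
        "Role": role_arn,                      # IAM role the crawler assumes
        "DatabaseName": database,              # Data Catalog database for the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        request["Schedule"] = schedule         # AWS cron syntax, e.g. "cron(0 2 * * ? *)"
    return request


def create_crawler(request):
    """Create the crawler in AWS. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)


# Hypothetical values, for illustration only.
example = build_crawler_request(
    name="sales-data-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_catalog",
    s3_path="s3://example-bucket/sales/",
    schedule="cron(0 2 * * ? *)",              # every day at 02:00 UTC
)
print(example)
```

Leaving out `schedule` gives a crawler that only runs on demand, for example through the client's `start_crawler(Name=...)` call.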

Code Generation

AWS Glue can automatically generate code to extract, transform, and load your data. You simply point AWS Glue to your data source and target, and it will create ETL scripts to transform, flatten, and enrich your data. The code is generated in Scala or Python and written for Apache Spark.

Developer Endpoints

AWS Glue development endpoints enable you to edit, debug, and test the code that it generates for you. You can use your favorite IDE (integrated development environment) or notebook. You can also write custom readers, writers, or transformations and import them into your AWS Glue ETL jobs as custom libraries, and use and share code with other developers through the AWS Glue GitHub repository.

Flexible Job Scheduler

You can easily invoke AWS Glue jobs on a schedule, on demand, or based on an event. You can also start multiple jobs in parallel and specify dependencies among them to build complex ETL pipelines. AWS Glue can handle all inter-job dependencies, filter bad data, and retry jobs if they fail. All logs and notifications are pushed to Amazon CloudWatch, so you can monitor your jobs and get alerts from a central service.
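To make the scheduling options more concrete, here is a minimal sketch of two trigger definitions shaped for the boto3 Glue client's `create_trigger` call: one that fires on a schedule, and one that starts a downstream job only after an upstream job succeeds. The trigger and job names are hypothetical, and the helper functions are illustrative rather than part of any AWS library.

```python
def scheduled_trigger(name, job_name, cron_expression):
    """Arguments for a trigger that starts `job_name` on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,           # AWS cron syntax
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,               # activate the trigger immediately
    }


def dependent_trigger(name, upstream_job, downstream_job):
    """Arguments for a trigger that runs `downstream_job` after `upstream_job` succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


# Hypothetical pipeline: clean the data nightly, then load the warehouse.
nightly = scheduled_trigger("nightly-clean", "clean-orders", "cron(0 1 * * ? *)")
chained = dependent_trigger("load-after-clean", "clean-orders", "load-warehouse")
print(nightly)
print(chained)
```

With valid credentials, each dictionary could be passed as `boto3.client("glue").create_trigger(**nightly)`.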

How Does It Work?

You are now familiar with the features of AWS Glue and the benefits it brings to your enterprise. But how should you use it? Creating and running an ETL job is just a matter of a few clicks in the AWS Management Console.

All you need to do is point AWS Glue to your data stored on AWS, and AWS Glue will discover your data and store the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.

Here’s how it works:

Define crawlers to scan data coming into S3 and populate the metadata catalog. You can schedule this scanning at a set frequency or trigger it on an event.
Define the ETL pipeline, and AWS Glue will generate the ETL code in Python or Scala.
Once the ETL job is set up, AWS Glue manages its execution on Spark cluster infrastructure, and you are charged only when the job runs.
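The same job setup can also be scripted instead of clicked through. The sketch below assumes the ETL script has already been uploaded to S3 and uses the boto3 Glue client's `create_job` and `start_job_run` calls; the job name, role ARN, and script location are placeholders, and the client is passed in as a parameter so the logic can be exercised without an AWS account.

```python
def build_job_request(name, role_arn, script_location, max_retries=1):
    """Build keyword arguments for the boto3 Glue `create_job` call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                 # Spark ETL job type
            "ScriptLocation": script_location, # S3 path of the ETL script
            "PythonVersion": "3",
        },
        "MaxRetries": max_retries,             # automatic retries on failure
    }


def create_and_run_job(glue, request, arguments=None):
    """Register the job, start one run, and return the run ID.

    `glue` is expected to behave like boto3.client("glue").
    """
    glue.create_job(**request)
    run_kwargs = {"JobName": request["Name"]}
    if arguments:
        run_kwargs["Arguments"] = arguments    # job parameters, keys start with "--"
    response = glue.start_job_run(**run_kwargs)
    return response["JobRunId"]


# Hypothetical values, for illustration only.
job = build_job_request(
    name="orders-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/orders_etl.py",
)
print(job)
```

In a real environment you would call `create_and_run_job(boto3.client("glue"), job)`; billing then applies only while the run is active.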

The AWS Glue Data Catalog lives outside your data processing engines and keeps the metadata decoupled from them, so different processing engines can query the metadata simultaneously for their individual use cases. You can also expose the metadata through an API layer built on Amazon API Gateway and route all catalog queries through it.
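Because the catalog is reachable through its own API, any client can read table metadata directly. As a small sketch, the helper below pulls column names and types out of a response shaped like the one returned by the boto3 Glue client's `get_table` call. The database and table names are hypothetical, and the sample response is hand-written for illustration rather than captured from a real catalog.

```python
def table_columns(get_table_response):
    """Return (name, type) pairs for data columns, followed by partition keys."""
    table = get_table_response["Table"]
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    partition_keys = table.get("PartitionKeys", [])
    return [(col["Name"], col["Type"]) for col in columns + partition_keys]


def fetch_table_columns(database, table_name, glue=None):
    """Look up a table in the Data Catalog. Requires AWS credentials."""
    if glue is None:
        import boto3  # imported lazily so table_columns works without boto3

        glue = boto3.client("glue")
    response = glue.get_table(DatabaseName=database, Name=table_name)
    return table_columns(response)


# Hand-written sample in the shape of a get_table response.
sample_response = {
    "Table": {
        "Name": "orders",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://example-bucket/orders/",
        },
        "PartitionKeys": [{"Name": "order_date", "Type": "string"}],
    }
}
print(table_columns(sample_response))
```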

When Should You Use It?

All this information about AWS Glue is of little use if you do not know where to apply it. Here’s a look at some use cases and how AWS Glue can make your work easier:

1 Queries Against an Amazon S3 Data Lake

Looking to build your own custom Amazon S3 data lake architecture? AWS Glue can help you get there quickly by making all your data available for analytics without moving it.

2 Analyze Log Data in Your Data Warehouse

Using AWS Glue, you can easily process all the semi-structured data in your data warehouse for analytics. It generates the schema for your data sets, creates ETL code to transform, flatten, and enrich your data, and loads your data warehouse on a recurring basis.
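To show what "flattening" semi-structured data means in practice, here is a plain-Python illustration. It is not the code AWS Glue generates (Glue's scripts operate on Spark DynamicFrames); it only demonstrates the idea of turning a nested log record into flat, column-like keys that suit a warehouse table. The sample record and the dot separator are assumptions made for the example.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical log record, for illustration only.
log_record = {
    "timestamp": "2019-10-31T07:00:00Z",
    "request": {"method": "GET", "path": "/orders"},
    "client": {"ip": "203.0.113.10", "geo": {"country": "IN"}},
}
print(flatten(log_record))
```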

3 Unified View of Your Data Across Multiple Data Stores

AWS Glue Data Catalog allows you to quickly discover and search across multiple AWS data sets without moving the data. It gives a unified view of your data, and makes cataloged data easily available for search and query using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

4 Event-driven ETL Pipelines

AWS Glue can run your ETL jobs based on an event, such as getting a new data set. For example, you can use an AWS Lambda function to trigger your ETL jobs to run as soon as new data becomes available in Amazon S3. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.
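A Lambda-based trigger of this kind might look like the sketch below. It assumes the function is subscribed to S3 object-created notifications, that the Glue job name is supplied through a `GLUE_JOB_NAME` environment variable, and that the job reads a `--source_path` argument; those last two names are hypothetical choices for this example. The Glue client is passed in as a parameter so the event handling can be tested without AWS.

```python
import os
from urllib.parse import unquote_plus

# Hypothetical job name; override it with an environment variable.
JOB_NAME = os.environ.get("GLUE_JOB_NAME", "example-etl-job")


def start_jobs_for_event(event, glue, job_name=JOB_NAME):
    """Start one Glue job run per S3 object in the notification event."""
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue.start_job_run(
            JobName=job_name,
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids


def lambda_handler(event, context):
    """AWS Lambda entry point."""
    import boto3  # available by default in the Lambda Python runtime

    return {"job_run_ids": start_jobs_for_event(event, boto3.client("glue"))}
```

Starting one run per object keeps the example simple; for high-volume buckets you would more likely batch several new files into a single job run.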

So there you have it: a look at how AWS Glue can help you manage your data cataloging process and automate your ETL pipeline.

Srijan is an advanced AWS Consulting Partner, and can help you utilize AWS solutions at your enterprise. To know more, drop us a line outlining your business requirements and our expert AWS team will get in touch.
