List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.
- This week explores advanced data transformation frameworks beyond Pandas to address scalability and performance needs.
- This week's plan:
- Transformation use cases.
- Distributed processing frameworks:
- Hadoop MapReduce: disk-based storage and processing ← many consider it a legacy technology because of its complexity, high-cost scaling, and significant maintenance requirements ← still worth understanding because its concepts appear in many distributed systems today.
- Spark: in-memory processing framework.
- Compare SQL-based transformation vs Python-based transformation.
- Lab: in the previous course we used dbt to run transformations inside the data warehouse; this time, we implement the transformations outside the data warehouse using Spark.
- Transformation use cases.
- Micro-batch vs true streaming processing tools.
- Lab: implement CDC pipeline (change data capture) using Kafka and Flink.
Batch Transformation
Streaming transformation
- A lot of DE’s work is batch processing.
- Some transformation patterns:
- We’ve already considered ETL and ELT with Spark and dbt
- Data wrangling → should use tools like AWS Glue DataBrew.
- Transform data for updating
- Approach 1: Truncate and Reload ← only suitable for small datasets.
- Approach 2: CDC (Change Data Capture): identify the changes → insert / update / delete
- Insert only → append new records next to the old ones
- Upsert/merge: based on IDs → update if the record exists, insert if it doesn't (see the sketch after this list)
- Capture deletes → hard delete (remove the record completely) or soft delete (mark as deleted and filter it out later)
- Single-row inserts → OK for row-oriented databases but not for OLAP systems!
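A minimal PySpark sketch of the upsert/merge pattern, assuming each record carries an `id` key and an `updated_at` timestamp (the tiny in-memory DataFrames and column names are hypothetical):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("cdc-upsert-sketch").getOrCreate()

# Existing target table and a batch of captured changes (tiny in-memory stand-ins).
target = spark.createDataFrame(
    [(1, "pending", "2024-01-01"), (2, "shipped", "2024-01-02")],
    ["id", "status", "updated_at"],
)
changes = spark.createDataFrame(
    [(2, "delivered", "2024-01-05"), (3, "pending", "2024-01-05")],
    ["id", "status", "updated_at"],
)

# Upsert/merge: union the old rows with the changes, then keep only the latest row per id.
latest_first = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
merged = (
    target.unionByName(changes)
    .withColumn("rn", F.row_number().over(latest_first))
    .filter("rn = 1")
    .drop("rn")
)
merged.show()

# For soft deletes, the change rows would carry an is_deleted flag that
# downstream queries filter out instead of physically removing the rows.
```

Table formats such as Delta Lake or Apache Iceberg expose a native MERGE statement for this, which is usually preferable when available.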
- Google developed GFS (2003) and MapReduce (2004) for distributed data processing, leading to Yahoo's creation of Hadoop in 2006.
- HDFS powers modern big data engines like EMR and Spark
- MapReduce's concepts remain influential in distributed systems
- HDFS (combines compute and storage on the same nodes) vs Object Storage (limited compute support for internal processing)
- HDFS
- MapReduce:
→ Weakness: writes intermediate results to disk between phases → Spark keeps them in RAM! (see the word-count sketch below)
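To make the map and reduce phases concrete, here is a minimal word-count sketch using PySpark's low-level RDD API (a tiny in-memory list stands in for a real input file); classic Hadoop MapReduce would spill each phase to disk, while Spark keeps the intermediate data in memory:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["the quick brown fox", "the lazy dog"])

counts = (
    lines.flatMap(lambda line: line.split())   # map phase: one record per word
         .map(lambda word: (word, 1))          # map phase: key-value pairs (word, 1)
         .reduceByKey(lambda a, b: a + b)      # reduce phase: sum the counts per key
)
print(counts.collect())   # e.g. [('the', 2), ('quick', 1), ...]
```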
- Spark DataFrames let you handle large distributed datasets while interacting with them like a single table, with the complex distribution details abstracted away.
- We work with a high-level API
- 2 types of operations: transformations (lazy, only build up the execution plan) and actions (trigger the actual computation); see the sketch below.
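A quick sketch of the two operation types (a generated DataFrame stands in for real data):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-eval-sketch").getOrCreate()

df = spark.range(1_000_000)  # single `id` column

# Transformations: lazy, nothing runs yet, Spark only builds the execution plan.
evens = df.filter(F.col("id") % 2 == 0).withColumn("squared", F.col("id") ** 2)

# Actions: trigger the actual distributed computation.
print(evens.count())
evens.show(5)
```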
- Using Python
- Since Spark runs natively on JVM but Python UDFs require a separate Python process, data conversion between the two is expensive. For better performance, consider writing UDFs in Scala or Java to run directly on JVM.
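A small sketch of that trade-off: the Python UDF below ships every row to a Python worker and back, while the equivalent built-in expression stays on the JVM (the DataFrame and the 1.1 "tax" factor are made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "amount"])

# Python UDF: rows are serialized to a separate Python process and back.
add_tax = F.udf(lambda amount: amount * 1.1, DoubleType())
df.withColumn("with_tax", add_tax("amount")).show()

# Built-in column expression: runs natively on the JVM, no Python round trip.
df.withColumn("with_tax", F.col("amount") * 1.1).show()
```

If the logic has to stay in Python, vectorized pandas UDFs (which exchange data via Apache Arrow) are a common middle ground.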
- Spark SQL
- When working with PySpark, you can choose to manipulate your data using SQL code, Python code, or a mix of both.
- Temporary view: virtual table, persists as long as the Spark session is running, provides an interface for you to work with SQL.
transactions_df.createOrReplaceTempView('orders')
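Once the view exists, you can query it with `spark.sql()` and get a DataFrame back; a minimal sketch, assuming the transactions data has hypothetical `order_id` and `amount` columns:

```python
order_totals = spark.sql("""
    SELECT order_id, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_id
""")
order_totals.show()   # the result is a regular DataFrame, so SQL and Python steps can be mixed
```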
- AWS Console → EMR → EMR Serverless → Get Started → Create and launch EMR Studio → leave the type as “Spark” → “Use default settings for interactive workloads” (enables the endpoint for EMR Studio) → Create & start application
- In EMR Studio → Dashboard → Create workspace → (if you hit an error) → Edit Studio → choose the service role “emr” → Browse S3: “emr-workspaces-…” → Save changes → View Workspaces → Create Workspace → name it… → Create Workspace → open the Jupyter notebook
- In the Jupyter notebook, click the Compute tab (left sidebar) → attach the EMR Serverless application and interactive runtime role → Attach → when selecting a kernel, choose PySpark
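Once the PySpark kernel is attached, a quick sanity check confirms the session is live (the `spark` object is created for you by the kernel):

```python
print(spark.version)    # Spark version of the attached EMR Serverless application
spark.range(5).show()   # trivial job to verify the interactive runtime works
```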
There are 3 options to create and run a Glue job:
- AWS Glue DataBrew (no code/low code)
It's like working with Excel but with the power of Spark. No understanding of Spark or coding is needed at all.
- Glue Studio → SQL code (for more advanced users) + drag-and-drop interface
- Jupyter notebook (Spark code from scratch)
Refer back to the C1W2 lab.
The purpose of this Glue job is: load normalized data from RDS → apply transformations (ETL) into a star schema → store in S3. ← build
glue_job.py
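For orientation, a rough sketch of how such a script is typically structured (not the lab's actual code; the catalog database, table, column, and bucket names are hypothetical):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME", "target_bucket"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read a normalized table from the RDS source through the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="rds_catalog_db", table_name="orders"
).toDF()

# Transform: placeholder step toward a star-schema fact table.
fact_orders = orders.withColumn("order_date", F.to_date("order_timestamp"))

# Load: write the result to S3 as partitioned Parquet.
(
    fact_orders.write.mode("overwrite")
    .partitionBy("order_date")
    .parquet(f"s3://{args['target_bucket']}/star_schema/fact_orders/")
)

job.commit()
```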