List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.
- This week explores advanced data transformation frameworks beyond Pandas to address scalability and performance needs.
- This week's plan:
- Transformation use cases.
- Distributed processing frameworks:
- Hadoop MapReduce: disk-based storage and processing ← many consider it a legacy technology because of its complexity, high-cost scaling, and significant maintenance requirements ← still worth understanding because its concepts appear in many distributed systems today.
- Spark: in-memory processing framework.
- Compare SQL-based transformation vs Python-based transformation.
- Lab: in the previous course we used dbt to run transformations inside the data warehouse; this time, we implement the transformations outside the data warehouse using Spark.
- Transformation use cases.
- Micro-batch vs true streaming processing tools.
- Lab: implement CDC pipeline (change data capture) using Kafka and Flink.
Batch Transformation
Streaming transformation
- A lot of DE’s work is batch processing.
- Some transformation patterns:
- We’ve already considered ETL and ELT with Spark and dbt
- Data wrangling → should use tools like AWS Glue DataBrew.
- Transform data for updating
- Approach 1: Truncate and Reload ← only suitable for small datasets.
- Approach 2: CDC (Change Data Capture): identify the changes → insert / update / delete
- Insert only → append new records next to the old ones
- Upsert/merge: based on IDs → update if the record exists, insert if it doesn't (see the sketch after this list)
- Capture deletes → hard delete (remove the record completely) or soft delete (mark as deleted and filter it out later)
- Single-row inserts → OK for row-oriented databases but not for OLAP systems!
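A minimal PySpark sketch of the upsert/merge pattern, assuming each record carries an `id` key and an `updated_at` timestamp (the tiny in-memory DataFrames and column names are hypothetical):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("cdc-upsert-sketch").getOrCreate()

# Existing target table and a batch of captured changes (tiny in-memory stand-ins).
target = spark.createDataFrame(
    [(1, "pending", "2024-01-01"), (2, "shipped", "2024-01-02")],
    ["id", "status", "updated_at"],
)
changes = spark.createDataFrame(
    [(2, "delivered", "2024-01-05"), (3, "pending", "2024-01-05")],
    ["id", "status", "updated_at"],
)

# Upsert/merge: union the old rows with the changes, then keep only the latest row per id.
latest_first = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
merged = (
    target.unionByName(changes)
    .withColumn("rn", F.row_number().over(latest_first))
    .filter("rn = 1")
    .drop("rn")
)
merged.show()

# For soft deletes, the change rows would carry an is_deleted flag that
# downstream queries filter out instead of physically removing the rows.
```

Table formats such as Delta Lake or Apache Iceberg expose a native MERGE statement for this, which is usually preferable when available.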
- Google developed GFS (2003) and MapReduce (2004) for distributed data processing, leading to Yahoo's creation of Hadoop in 2006.
- HDFS powers modern big data engines like EMR and Spark
- MapReduce's concepts remain influential in distributed systems
- HDFS (combines compute and storage on the same nodes) vs Object Storage (limited compute support for internal processing)
- HDFS
- MapReduce:
→ Weakness: writes intermediate results to disk between phases → Spark keeps them in RAM! (see the word-count sketch below)
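To make the map and reduce phases concrete, here is a minimal word-count sketch using PySpark's low-level RDD API (a tiny in-memory list stands in for a real input file); classic Hadoop MapReduce would spill each phase to disk, while Spark keeps the intermediate data in memory:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["the quick brown fox", "the lazy dog"])

counts = (
    lines.flatMap(lambda line: line.split())   # map phase: one record per word
         .map(lambda word: (word, 1))          # map phase: key-value pairs (word, 1)
         .reduceByKey(lambda a, b: a + b)      # reduce phase: sum the counts per key
)
print(counts.collect())   # e.g. [('the', 2), ('quick', 1), ...]
```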
- Spark DataFrames let you handle large distributed datasets while interacting with them like a single table, with the complex distribution details abstracted away.
- We work with a high-level API
- 2 types of operations: transformations (lazy, only build up the execution plan) and actions (trigger the actual computation); see the sketch below.
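A quick sketch of the two operation types (a generated DataFrame stands in for real data):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-eval-sketch").getOrCreate()

df = spark.range(1_000_000)  # single `id` column

# Transformations: lazy, nothing runs yet, Spark only builds the execution plan.
evens = df.filter(F.col("id") % 2 == 0).withColumn("squared", F.col("id") ** 2)

# Actions: trigger the actual distributed computation.
print(evens.count())
evens.show(5)
```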
- Using Python
- Since Spark runs natively on JVM but Python UDFs require a separate Python process, data conversion between the two is expensive. For better performance, consider writing UDFs in Scala or Java to run directly on JVM.
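A small sketch of that trade-off: the Python UDF below ships every row to a Python worker and back, while the equivalent built-in expression stays on the JVM (the DataFrame and the 1.1 "tax" factor are made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "amount"])

# Python UDF: rows are serialized to a separate Python process and back.
add_tax = F.udf(lambda amount: amount * 1.1, DoubleType())
df.withColumn("with_tax", add_tax("amount")).show()

# Built-in column expression: runs natively on the JVM, no Python round trip.
df.withColumn("with_tax", F.col("amount") * 1.1).show()
```

If the logic has to stay in Python, vectorized pandas UDFs (which exchange data via Apache Arrow) are a common middle ground.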
- Spark SQL
- When working with PySpark, you can choose to manipulate your data using SQL code, Python code, or a mix of both.
- Temporary view: virtual table, persists as long as the Spark session is running, provides an interface for you to work with SQL.
transactions_df.createOrReplaceTempView('orders')
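Once the view exists, you can query it with `spark.sql()` and get a DataFrame back; a minimal sketch, assuming the transactions data has hypothetical `order_id` and `amount` columns:

```python
order_totals = spark.sql("""
    SELECT order_id, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_id
""")
order_totals.show()   # the result is a regular DataFrame, so SQL and Python steps can be mixed
```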
- AWS Console → EMR → EMR Serverless → Get Started → Create and launch EMR Studio → leave the type as “Spark” → “Use default settings for interactive workloads” (enables the endpoint for EMR Studio) → Create & start application
- In EMR Studio → Dashboard → Create workspace → (if you hit an error) → Edit Studio → choose the service role “emr” → Browse S3: “emr-workspaces-…” → Save changes → View Workspaces → Create Workspace → name it… → Create Workspace → open the Jupyter notebook
- In the Jupyter notebook, click the Compute tab (left sidebar) → attach the EMR Serverless application and interactive runtime role → Attach → when selecting a kernel, choose PySpark
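Once the PySpark kernel is attached, a quick sanity check confirms the session is live (the `spark` object is created for you by the kernel):

```python
print(spark.version)    # Spark version of the attached EMR Serverless application
spark.range(5).show()   # trivial job to verify the interactive runtime works
```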
There are 3 options to create and run a Glue job:
- AWS Glue DataBrew (no code/low code)
It's like working with Excel but with the power of Spark. No understanding of Spark or coding is needed at all.
- Glue Studio → SQL code (for more advanced users) + drag-and-drop interface
- Jupyter notebook (Spark code from scratch)
Refer back to the C1W2 lab.
The purpose of this Glue job is: load normalized data from RDS → apply transformations (ETL) into a star schema → store in S3. ← build
glue_job.py
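For orientation, a rough sketch of how such a script is typically structured (not the lab's actual code; the catalog database, table, column, and bucket names are hypothetical):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME", "target_bucket"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read a normalized table from the RDS source through the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="rds_catalog_db", table_name="orders"
).toDF()

# Transform: placeholder step toward a star-schema fact table.
fact_orders = orders.withColumn("order_date", F.to_date("order_timestamp"))

# Load: write the result to S3 as partitioned Parquet.
(
    fact_orders.write.mode("overwrite")
    .partitionBy("order_date")
    .parquet(f"s3://{args['target_bucket']}/star_schema/fact_orders/")
)

job.commit()
```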