DE by DL.AI - C4 W1 - Data Modeling & Transformations for Analytics

Anh-Thi Dinh
List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.

Introduction to Data Modeling for Analytics

Course 4 overview

  • The focus of this course is really on turning data into something useful and serving it in a way that creates business value.
  • Some real scenarios:
    • Modeling diverse user interactions from clickstream, purchase, and app data.
    • Handling API calls triggered by user actions in data models.
    • Incorporating survey data into existing models.
    • Merging databases with overlapping users and different attributes.
  • Deploying AI systems faces challenges beyond software engineering, including data-driven concept drift. This occurs when training data no longer reflects current realities. Systems must detect shifts, gather new data, and update models, adding complexity to real-world deployment.
  • Data Modeling: Choosing a coherent data structure that aligns with the business goals and logic.
  • Huge mistake: DEs start building a data system without thinking about how they will organize their data to make it useful for the business (assuming data modeling is only for big companies).
  • Plan of the course 4:

Conceptual, logical and physical data modeling

  • When you model your data, you go from abstract concepts to concrete implementation.
  • Conceptual → Entity-Relationship (ER) Diagram
    • one-to-one relationship
    • zero-or-one-to-many relationship (or just one-to-many)
  • Logical: details about the implementation of the conceptual model ← add data types, primary keys, and foreign keys
    • PK (primary key), FK (foreign key)
  • Physical: details about the implementation of the logical model in a specific DBMS (see the sketch below).
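  • A minimal sketch of what a physical model could look like in PostgreSQL, assuming a hypothetical Customers–Orders ER example (the table and column names are illustrative, not from the lecture):
      CREATE TABLE customers (
          customer_id   INT PRIMARY KEY,          -- PK from the logical model
          customer_name VARCHAR(100) NOT NULL,
          address       VARCHAR(200)
      );

      CREATE TABLE orders (
          order_id    INT PRIMARY KEY,            -- PK
          customer_id INT NOT NULL REFERENCES customers (customer_id),  -- FK (one-to-many: a customer has many orders)
          order_date  DATE NOT NULL
      );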

Normalization

  • Normalization is typically applied to relational databases.
  • An example of non-normalized vs normalized data
    • First normal form (not normalized) vs Third normal form (normalized)
    • If we want to change an address → change all rows in the non-normalized table but only change one row in the normalized table.
    • If we want to add one more column → we have to change the structure of the non-normalized table, but we only need to add one more table containing the new column in the normalized model.
  • Steps from denormalized form → first normal form → third normal form:
    • Denormalized form → First normal form (1NF)
      • The requirements of 1NF: each column name must be unique, each value must be a single value (no nested values), and the table must have a unique primary key.
        Split OrderItems into 4 columns & add a new ItemNumber column that pairs with OrderID to form a composite primary key.
    • 1NF → 2NF
      • The requirements of 1NF must be met
      • Partial dependencies must be removed. A partial dependency = a subset of non-key columns that depend on only some of the columns in the composite key.
        • Eg: the columns from CustomerID to OrderDate depend only on OrderID.
          Split all columns from CustomerID to OrderDate into another table whose primary key is OrderID.
      • Transitive dependency: a non-key column depends on another non-key column.
        • Eg: “name” and “price” depend on “sku”; “CustomerName” and “address” depend on “CustomerID”.
    • 2NF → 3NF
      • The requirements of 2NF must be met.
      • NO transitive dependencies!
      • Split price and name into another table with primary key sku. Split CustomerName and address into another table with primary key CustomerID. (A SQL sketch of the resulting 3NF tables follows this list.)
  • Convention: a normalized database means it’s in the third normal form (3NF).
  • There's no one-size-fits-all solution, and you might encounter cases where denormalization actually has performance advantages.
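  • A minimal SQL sketch of the 3NF split described above, assuming PostgreSQL; the column names follow the example (OrderID, ItemNumber, CustomerID, sku, price, etc.), while the exact types are assumptions:
      CREATE TABLE Customers (
          CustomerID   INT PRIMARY KEY,
          CustomerName VARCHAR(100),
          Address      VARCHAR(200)
      );

      CREATE TABLE Items (
          sku   VARCHAR(20) PRIMARY KEY,
          name  VARCHAR(100),
          price NUMERIC(10, 2)
      );

      CREATE TABLE Orders (
          OrderID    INT PRIMARY KEY,
          CustomerID INT REFERENCES Customers (CustomerID),
          OrderDate  DATE
      );

      CREATE TABLE OrderItems (
          OrderID    INT REFERENCES Orders (OrderID),
          ItemNumber INT,
          sku        VARCHAR(20) REFERENCES Items (sku),
          quantity   INT,                        -- assumed measure column
          PRIMARY KEY (OrderID, ItemNumber)      -- composite key from the 1NF step
      );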

Lab 1 — Data normalization

Dimensional modeling - Star schema

  • The star schema ~ a dimensional data model.
    • Fact table = contains quantitative business measurements that result from a business event or process.
      • Each row contains the facts of a particular business event.
      • Immutable (append-only)
      • Typically narrow and long
    • Dimension tables = provide the reference data, attributes and relational context for the events in the fact table.
      • Describe the what, who, where, and when of the events
      • Typically wide and short
    • Example
  • Sometimes, we stack multiple star schemas with conformed dimensions
  • Example
  • SQL: Find the total sales amount for each product line within the USA
      1. Step 1 → “the total sales amount”
         SELECT
             SUM(fact_orders.orderAmount) AS total_sales
         FROM fact_orders
      2. Step 2 → “for each product line”
         SELECT
             dim_products.productLine, -- step 2
             SUM(fact_orders.orderAmount) AS total_sales
         FROM fact_orders
         -- 👇 step 2
         JOIN dim_products ON
             fact_orders.productCode = dim_products.productCode
         GROUP BY dim_products.productLine
      3. Step 3 → “within the USA”
         SELECT
             dim_products.productLine,
             SUM(fact_orders.orderAmount) AS total_sales
         FROM fact_orders
         JOIN dim_products ON
             fact_orders.productCode = dim_products.productCode
         JOIN dim_locations ON -- step 3
             fact_orders.postalCode = dim_locations.postalCode -- step 3
         WHERE dim_locations.country = 'USA' -- step 3
         GROUP BY dim_products.productLine
  • If we make the same request as in the previous example but with a normalized model (3NF) → we need to join more tables! (A DDL sketch of the star-schema tables used above follows this list.)
  • Star schema vs Normalized Model (3NF)
    • Normalized forms ensure data integrity and avoid data redundancy.
    • Star schemas facilitate analytical workloads.
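  • A minimal sketch of the star-schema tables the query above assumes; the column names come from the query itself, while the types and extra descriptive columns are assumptions:
      CREATE TABLE dim_products (
          productCode VARCHAR(20) PRIMARY KEY,
          productLine VARCHAR(50),
          productName VARCHAR(100)               -- assumed descriptive attribute
      );

      CREATE TABLE dim_locations (
          postalCode VARCHAR(15) PRIMARY KEY,
          city       VARCHAR(50),
          country    VARCHAR(50)
      );

      CREATE TABLE fact_orders (
          productCode VARCHAR(20) REFERENCES dim_products (productCode),
          postalCode  VARCHAR(15) REFERENCES dim_locations (postalCode),
          orderAmount NUMERIC(10, 2)             -- the quantitative business measurement
      );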

Data Modeling Techniques

Inmon vs Kimball Data modeling approaches for Data Warehouses

  • Inmon Data Modeling Approach ← Check Conversation with Bill Inmon in C3W2
    • With this approach, Data Warehouse represents a single source of truth
      The key idea is to first strictly normalize the data in the Data Warehouse, then build star schemas on top for each group of stakeholders.
  • Kimball Data Modeling Approach ← no need to normalize data in DW
  • Choose which one?

Exercise — from normalized model to star schema

  • You might need to extract normalized data that's stored in a relational database and model the data into a star schema so it's easier to query before loading it into department-specific data marts.
  • 4 key steps to convert!
    • Step 1: business process → “company’s sales transactions”
    • Step 2: declare the atomic grain. Candidate grains:
      • Total sales transactions on a particular day (one row per day in the fact table)
      • A single sales transaction
      • An individual product item in a sales transaction ← the atomic grain used here
    • Step 3: select dimensions → “which stores”, “given day”, “product brands”
    • Step 4: identify the facts → the numeric measures of each order item (e.g., quantity, price)
  • What we will do
    • Based on the needs of DA:
  • (Wiki) A surrogate key (or synthetic key, pseudokey, entity identifier, factless key, or technical key) in a database is a unique identifier for either an entity in the modeled world or an object in the database.
    • Can use MD5 to generate hash ids as surrogate keys.
      • SELECT MD5(store_id) AS store_key,
               store_id,
               store_name,
               store_city,
               store_zipcode
        FROM stores;
  • The primary keys of tables in a star schema should be surrogate keys.
  • One way to generate dim_date
    • SELECT date_key,
             EXTRACT(DOW FROM date_key) AS day_of_week,
             EXTRACT(MONTH FROM date_key) AS month,
             EXTRACT(QUARTER FROM date_key) AS quarter,
             EXTRACT(YEAR FROM date_key) AS year
      FROM generate_series('2020-01-01'::date,
                           '2025-01-01'::date,
                           '1 day'::interval) AS date_key;
  • To build fact_order_items table
    • SELECT
          MD5(CONCAT(OrderItems.order_id,
                     OrderItems.item_line_number)) AS fact_order_key,
          OrderItems.order_id,
          OrderItems.item_line_number,
          MD5(Orders.store_id) AS store_key,
          MD5(OrderItems.item_sku) AS item_key,
          Orders.order_date AS date_key,
          OrderItems.item_quantity,
          Items.price AS item_price
      FROM OrderItems
      JOIN Orders ON Orders.order_id = OrderItems.order_id
      JOIN Items ON Items.sku = OrderItems.item_sku;
  • Use dbt, which helps you model your data by abstracting away a lot of the heavy lifting involved in writing pure SQL code.
  • If you need to perform transformations on data that will move around → choose AWS Glue.

Another modeling example

  • A car rental company wants to determine peak booking times, identify the most popular cars being rented, and adjust car rental rates based on demand.
  • Identify the business process and the grain: Car rental transactions form the business process, with each fact table row representing one rental booking (identified by booking ID) for a specific car and customer.
  • Identify the dimension tables:
    • dim_customers that contains the customers' details (name, address, phone number, driver's license number);
    • dim_cars that contains the cars' information (VIN - Vehicle Identification Number, model, brand, make, color, purchase date);
    • dim_dates that contains the information of a given date (year, month, time, day, quarter, day of the week).
    • dim_stores that contains the information of the rental store (zip code, state, city, address)
  • Identify the fact table: each row represents one rental booking, identified by the booking ID. It also contains the dates that describe the rental period (rental start date, rental end date, and return date) and the foreign keys (customer key, car key, and store key). The business measures are the booking fee, insurance fee, fuel charge, extra rental days, fuel level, and total cost. (A DDL sketch of this fact table follows this list.)
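  • A minimal DDL sketch of the fact table just described, assuming PostgreSQL; the name fact_bookings and the types are assumptions, while the columns follow the description above:
      CREATE TABLE fact_bookings (
          booking_id        VARCHAR(32) PRIMARY KEY,  -- one row per rental booking
          customer_key      VARCHAR(32),              -- FK to dim_customers
          car_key           VARCHAR(32),              -- FK to dim_cars
          store_key         VARCHAR(32),              -- FK to dim_stores
          rental_start_date DATE,
          rental_end_date   DATE,
          return_date       DATE,
          booking_fee       NUMERIC(10, 2),
          insurance_fee     NUMERIC(10, 2),
          fuel_charge       NUMERIC(10, 2),
          extra_rental_days INT,
          fuel_level        NUMERIC(4, 2),
          total_cost        NUMERIC(10, 2)
      );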

Conversation about dbt with Drew Banin

Drew Banin is a co-founder of dbt Labs, the company behind the dbt tool.
  • What is dbt:
    • Transforms data within data warehouses using SQL and Python, without moving the data.
    • Provides secure workflows with version control and business logic management.
  • Before dbt:
    • Data processes were ad hoc and unstructured, lacking version control and proper tracking. ← bunch of SQL scripts.
  • After dbt:
    • Introduced standardized workflows, version control, and established the "Analytics Engineer" role.
  • For Learners:
    • Offers an easy starting point with scalable testing and documentation features.
    • Provides access to a supportive dbt community for resources and networking.
  • Best Practices:
    • Implement SQL style guides and write modular code.
    • Incorporate testing, code reviews, and continuous integration pipelines.
  • Software Skills:
    • Basic knowledge of version control (like Git) and testing principles is beneficial.
  • Community:
    • Participate in global meetups, Slack discussions, and the annual Coalesce conference.

Data Vault (DV) approach

  • Another data modeling approach besides Inmon and Kimball.
    • Inmon and Kimball focus on the structure of business logic in the data warehouse.
    • Data Vault focuses on separating the structural aspects of the data (the business entities and how they're related) from the descriptive attributes of the data itself.
  • DV allows for a more flexible, agile, and scalable data warehouse structure by keeping the data as closely aligned to the business as possible.
  • Build a Data Vault (a DDL sketch of hub, link, and satellite tables follows this list):
    • Step 1: Model the Hubs
      • The business keys (customer_id, store_id, sku, order_id) are generated by the business rather than being tied to the source system!
      • Each hub contains a PK (a hash of the business key), the business key itself, a load_date, and a record_source.
    • Step 2: Model the Links
      • Each link table must contain the business keys and primary (hash) keys from its parent hubs + load_date + record_source + its own PK (a hash of the parent hubs' business keys).
      • With link tables, you can easily add new relationships or update a current relationship without having to re-engineer the Data Vault.
        For example, if one day the company wants to offer Services instead of Items → don’t modify the Items hub; just create a Services hub and link it in.
    • Step 3: Satellites
      • Each satellite table must contain record_source + a PK (the hash key of its parent hub) + load_date, plus the descriptive attributes.
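  • A minimal sketch of one hub, one link, and one satellite for the orders/customers example, assuming PostgreSQL; the table and column names are illustrative, not taken from the lab:
      CREATE TABLE hub_orders (
          hub_order_key VARCHAR(32) PRIMARY KEY,   -- hash of the business key
          order_id      VARCHAR(20),               -- business key
          load_date     TIMESTAMP,
          record_source VARCHAR(50)
      );

      CREATE TABLE hub_customers (
          hub_customer_key VARCHAR(32) PRIMARY KEY,
          customer_id      VARCHAR(20),
          load_date        TIMESTAMP,
          record_source    VARCHAR(50)
      );

      CREATE TABLE link_customer_orders (
          link_customer_order_key VARCHAR(32) PRIMARY KEY,  -- hash of the parent business keys
          hub_customer_key        VARCHAR(32) REFERENCES hub_customers (hub_customer_key),
          hub_order_key           VARCHAR(32) REFERENCES hub_orders (hub_order_key),
          customer_id             VARCHAR(20),
          order_id                VARCHAR(20),
          load_date               TIMESTAMP,
          record_source           VARCHAR(50)
      );

      CREATE TABLE sat_orders (
          hub_order_key VARCHAR(32) REFERENCES hub_orders (hub_order_key),
          load_date     TIMESTAMP,
          record_source VARCHAR(50),
          order_status  VARCHAR(20),               -- descriptive attributes (assumed)
          order_date    DATE,
          PRIMARY KEY (hub_order_key, load_date)
      );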

One big table (OBT) approach

  • Kimball and Inmon emerged when data warehouses were expensive, on-premises, and heavily resource-constrained, with tightly coupled compute and storage.
  • One big table → a more relaxed approach → a wide table = many columns (thousands)
  • Highly denormalized and flexible
  • The wide table simply contains all of the data
    • An example of a wide table. You can think of one big table as the denormalized extension of Kimball's approach, where you have facts and dimensions represented in the same table.
  • Pros vs Cons
  • I suggest using a wide table when you have a lot of data that needs more flexibility than a traditional data modeling approach might provide. (A SQL sketch of building one big table from the earlier star schema follows.)
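  • A minimal sketch of building one big table by denormalizing the star-schema tables sketched earlier (fact_orders, dim_products, dim_locations); the name obt_orders and the column list are assumptions:
      CREATE TABLE obt_orders AS
      SELECT
          fact_orders.orderAmount,
          dim_products.productCode,
          dim_products.productLine,
          dim_locations.postalCode,
          dim_locations.city,
          dim_locations.country
      FROM fact_orders
      JOIN dim_products ON fact_orders.productCode = dim_products.productCode
      JOIN dim_locations ON fact_orders.postalCode = dim_locations.postalCode;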

Transforming data with dbt

  • We use dbt to model normalized data into a star schema.

Exercise — Transforming a normalized model into a star schema

  • DA requirement: Suppose you want to serve the modeled data to a data analyst who is interested in comparing sales across different locations and assessing employee sales performance by analyzing the customer purchases they facilitated.
  • Normalized model
  • Star schema
  • Use the macro below to generate surrogate keys (a small dbt model sketch follows):
    • {{ dbt_utils.generate_surrogate_key(['customers.customerNumber']) }}
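  • A minimal sketch of a dbt dimension model using that macro, assuming a models/dim_customers.sql file and that the customers relation is available via ref() (in the lab it may instead be a source()); the descriptive columns besides customerNumber are assumptions:
      -- models/dim_customers.sql (hypothetical)
      SELECT
          {{ dbt_utils.generate_surrogate_key(['customers.customerNumber']) }} AS customer_key,
          customers.customerNumber,
          customers.customerName,   -- assumed descriptive attributes
          customers.city,
          customers.country
      FROM {{ ref('customers') }} AS customers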

Lab 2 — Data modeling with dbt

Check the code of this lab.

Summary of Data Modeling Approaches