DE by DL.AI - C4 W1 - Data Modeling & Transformations for Analytics

Anh-Thi Dinh
List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.

Introduction to Data Modeling for Analytics

Course 4 overview

  • The focus of this course is really on turning data into something useful and serving it in a way that creates business value.
  • Some real scenarios:
    • Modeling diverse user interactions from clickstream, purchase, and app data.
    • Handling API calls triggered by user actions in data models.
    • Incorporating survey data into existing models.
    • Merging databases with overlapping users and different attributes.
  • Deploying AI systems faces challenges beyond software engineering, including data-driven concept drift. This occurs when training data no longer reflects current realities. Systems must detect shifts, gather new data, and update models, adding complexity to real-world deployment.
  • Data Modeling: Choosing a coherent data structure that aligns with the business goals and logic.
  • Huge mistake: DEs start building a data system without thinking about how they will organize their data to make it useful for the business (assuming data modeling is only for big companies).
  • Plan of the course 4:

Conceptual, logical and physical data modeling

  • When you model your data, you go from abstract concepts to concrete implementation.
  • Conceptual → Entity-Relationship (ER) Diagram
    • one-to-one relationship
    • zero-or-one-to-many relationship (or just one-to-many)
  • Logical: details about the implementation of the conceptual model ← add data types, primary keys, and foreign keys
    • PK (primary key), FK (foreign key)
  • Physical: details about the implementation of the logical model in a specific DBMS (see the sketch below).
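  • A minimal sketch of what a physical model could look like in PostgreSQL, assuming a hypothetical Customers–Orders ER example (the table and column names are illustrative, not from the lecture):
      CREATE TABLE customers (
          customer_id   INT PRIMARY KEY,          -- PK from the logical model
          customer_name VARCHAR(100) NOT NULL,
          address       VARCHAR(200)
      );

      CREATE TABLE orders (
          order_id    INT PRIMARY KEY,            -- PK
          customer_id INT NOT NULL REFERENCES customers (customer_id),  -- FK (one-to-many: a customer has many orders)
          order_date  DATE NOT NULL
      );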

Normalization

  • Normalization is typically applied to relational databases.
  • An example of non-normalized vs normalized data
    • First normal form (not normalized) vs Third normal form (normalized)
    • If we want to change an address → change all rows in the non-normalized table but only change one row in the normalized table.
    • If we want to add one more column → we have to change the structure of the non-normalized table, but we only need to add one more table containing the new column in the normalized model.
  • Steps from denormalized form → first normal form → third normal form:
    • Denormalized form → First normal form (1NF)
      • The requirements of 1NF: each column name must be unique, each value must be a single value (no nested values), and the table must have a unique primary key.
        Split OrderItems into 4 columns & add a new ItemNumber column that pairs with OrderID to form a composite primary key.
    • 1NF → 2NF
      • The requirements of 1NF must be met
      • Partial dependencies must be removed. A partial dependency = a subset of non-key columns that depend on only some of the columns in the composite key.
        • Eg: the columns from CustomerID to OrderDate depend only on OrderID.
          Split all columns from CustomerID to OrderDate into another table whose primary key is OrderID.
      • Transitive dependency: a non-key column depends on another non-key column.
        • Eg: “name” and “price” depend on “sku”; “CustomerName” and “address” depend on “CustomerID”.
    • 2NF → 3NF
      • The requirements of 2NF must be met.
      • NO transitive dependencies!
      • Split price and name into another table with primary key sku. Split CustomerName and address into another table with primary key CustomerID. (A SQL sketch of the resulting 3NF tables follows this list.)
  • Convention: a normalized database means it’s in the third normal form (3NF).
  • There's no one-size-fits-all solution, and you might encounter cases where denormalization actually has performance advantages.
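  • A minimal SQL sketch of the 3NF split described above, assuming PostgreSQL; the column names follow the example (OrderID, ItemNumber, CustomerID, sku, price, etc.), while the exact types are assumptions:
      CREATE TABLE Customers (
          CustomerID   INT PRIMARY KEY,
          CustomerName VARCHAR(100),
          Address      VARCHAR(200)
      );

      CREATE TABLE Items (
          sku   VARCHAR(20) PRIMARY KEY,
          name  VARCHAR(100),
          price NUMERIC(10, 2)
      );

      CREATE TABLE Orders (
          OrderID    INT PRIMARY KEY,
          CustomerID INT REFERENCES Customers (CustomerID),
          OrderDate  DATE
      );

      CREATE TABLE OrderItems (
          OrderID    INT REFERENCES Orders (OrderID),
          ItemNumber INT,
          sku        VARCHAR(20) REFERENCES Items (sku),
          quantity   INT,                        -- assumed measure column
          PRIMARY KEY (OrderID, ItemNumber)      -- composite key from the 1NF step
      );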

Lab 1 — Data normalization

Dimensional modeling - Star schema

  • The star schema ~ a dimensional data model.
    • Fact table = contains quantitative business measurements that result from a business event or process.
      • Each row contains the facts of a particular business event.
      • Immutable (append-only)
      • Typically narrow and long
    • Dimension tables = provide the reference data, attributes and relational context for the events in the fact table.
      • Describe the what, who, where, and when of the events
      • Typically wide and short
    • Example
  • Sometimes, we stack multiple star schemas with conformed dimensions
  • Example
  • SQL: Find the total sales amount for each product line within the USA
      1. Step 1 → “the total sales amount”
         SELECT
             SUM(fact_orders.orderAmount) AS total_sales
         FROM fact_orders
      2. Step 2 → “for each product line”
         SELECT
             dim_products.productLine, -- step 2
             SUM(fact_orders.orderAmount) AS total_sales
         FROM fact_orders
         -- 👇 step 2
         JOIN dim_products ON
             fact_orders.productCode = dim_products.productCode
         GROUP BY dim_products.productLine
      3. Step 3 → “within the USA”
         SELECT
             dim_products.productLine,
             SUM(fact_orders.orderAmount) AS total_sales
         FROM fact_orders
         JOIN dim_products ON
             fact_orders.productCode = dim_products.productCode
         JOIN dim_locations ON -- step 3
             fact_orders.postalCode = dim_locations.postalCode -- step 3
         WHERE dim_locations.country = 'USA' -- step 3
         GROUP BY dim_products.productLine
  • If we make the same request as in the previous example but with a normalized model (3NF) → we need to join more tables! (A DDL sketch of the star-schema tables used above follows this list.)
  • Star schema vs Normalized Model (3NF)
    • Normalized forms ensure data integrity and avoid data redundancy.
    • Star schemas facilitate analytical workloads.
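  • A minimal sketch of the star-schema tables the query above assumes; the column names come from the query itself, while the types and extra descriptive columns are assumptions:
      CREATE TABLE dim_products (
          productCode VARCHAR(20) PRIMARY KEY,
          productLine VARCHAR(50),
          productName VARCHAR(100)               -- assumed descriptive attribute
      );

      CREATE TABLE dim_locations (
          postalCode VARCHAR(15) PRIMARY KEY,
          city       VARCHAR(50),
          country    VARCHAR(50)
      );

      CREATE TABLE fact_orders (
          productCode VARCHAR(20) REFERENCES dim_products (productCode),
          postalCode  VARCHAR(15) REFERENCES dim_locations (postalCode),
          orderAmount NUMERIC(10, 2)             -- the quantitative business measurement
      );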

Data Modeling Techniques

Inmon vs Kimball Data modeling approaches for Data Warehouses

  • Inmon Data Modeling Approach ← Check Conversation with Bill Inmon in C3W2
    • With this approach, Data Warehouse represents a single source of truth
      The key idea is to first strictly normalize the data in the Data Warehouse, then build star schemas on top for each group of stakeholders.
  • Kimball Data Modeling Approach ← no need to normalize data in DW
  • Choose which one?

Exercise — from normalized model to star schema

  • You might need to extract normalized data that's stored in a relational database and model the data into a star schema so it's easier to query before loading it into department-specific data marts.
  • 4 key steps to convert!
    • Step 1: business process → “company’s sales transactions”
    • Step 2: declare the atomic grain. Candidate grains:
      • Total sales transactions on a particular day (one row per day in the fact table)
      • A single sales transaction
      • An individual product item in a sales transaction ← the atomic grain used here
    • Step 3: select dimensions → “which stores”, “given day”, “product brands”
    • Step 4: identify the facts → the numeric measures of each order item (e.g., quantity, price)
  • What we will do
    • Based on the needs of DA:
  • (Wiki) A surrogate key (or synthetic key, pseudokey, entity identifier, factless key, or technical key) in a database is a unique identifier for either an entity in the modeled world or an object in the database.
    • Can use MD5 to generate hash ids as surrogate keys.
      • SELECT MD5(store_id) AS store_key,
               store_id,
               store_name,
               store_city,
               store_zipcode
        FROM stores;
  • The primary keys of tables in a star schema should be surrogate keys.
  • One way to generate dim_date
    • SELECT date_key,
             EXTRACT(DOW FROM date_key) AS day_of_week,
             EXTRACT(MONTH FROM date_key) AS month,
             EXTRACT(QUARTER FROM date_key) AS quarter,
             EXTRACT(YEAR FROM date_key) AS year
      FROM generate_series('2020-01-01'::date,
                           '2025-01-01'::date,
                           '1 day'::interval) AS date_key;
  • To build fact_order_items table
    • SELECT
          MD5(CONCAT(OrderItems.order_id,
                     OrderItems.item_line_number)) AS fact_order_key,
          OrderItems.order_id,
          OrderItems.item_line_number,
          MD5(Orders.store_id) AS store_key,
          MD5(OrderItems.item_sku) AS item_key,
          Orders.order_date AS date_key,
          OrderItems.item_quantity,
          Items.price AS item_price
      FROM OrderItems
      JOIN Orders ON Orders.order_id = OrderItems.order_id
      JOIN Items ON Items.sku = OrderItems.item_sku;
  • Use dbt, which helps you model your data by abstracting away a lot of the heavy lifting involved in writing pure SQL code.
  • If you need to perform transformations on data that will move around → choose AWS Glue.

Another modeling example

  • A car rental company wants to determine peak booking times, identify the most popular cars being rented, and adjust car rental rates based on demand.
  • Identify the business process and the grain: Car rental transactions form the business process, with each fact table row representing one rental booking (identified by booking ID) for a specific car and customer.
  • Identify the dimension tables:
    • dim_customers that contains the customers' details (name, address, phone number, driver's license number);
    • dim_cars that contains the cars' information (VIN - Vehicle Identification Number, model, brand, make, color, purchase date);
    • dim_dates that contains the information of a given date (year, month, time, day, quarter, day of the week).
    • dim_stores that contains the information of the rental store (zip code, state, city, address)
  • Identify the fact table: each row represents one rental booking, identified by the booking ID. It also contains the dates that describe the rental period (rental start date, rental end date, and return date) and the foreign keys (customer key, car key, and store key). The business measures are the booking fee, insurance fee, fuel charge, extra rental days, fuel level, and total cost. (A DDL sketch of this fact table follows this list.)
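  • A minimal DDL sketch of the fact table just described, assuming PostgreSQL; the name fact_bookings and the types are assumptions, while the columns follow the description above:
      CREATE TABLE fact_bookings (
          booking_id        VARCHAR(32) PRIMARY KEY,  -- one row per rental booking
          customer_key      VARCHAR(32),              -- FK to dim_customers
          car_key           VARCHAR(32),              -- FK to dim_cars
          store_key         VARCHAR(32),              -- FK to dim_stores
          rental_start_date DATE,
          rental_end_date   DATE,
          return_date       DATE,
          booking_fee       NUMERIC(10, 2),
          insurance_fee     NUMERIC(10, 2),
          fuel_charge       NUMERIC(10, 2),
          extra_rental_days INT,
          fuel_level        NUMERIC(4, 2),
          total_cost        NUMERIC(10, 2)
      );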

Conversation about dbt with Drew Banin

Drew Banin is a co-founder of dbt Labs, the company behind the dbt tool.
  • What is dbt:
    • Transforms data within data warehouses using SQL and Python, without moving the data.
    • Provides secure workflows with version control and business logic management.
  • Before dbt:
    • Data processes were ad hoc and unstructured, lacking version control and proper tracking. ← bunch of SQL scripts.
  • After dbt:
    • Introduced standardized workflows, version control, and established the "Analytics Engineer" role.
  • For Learners:
    • Offers an easy starting point with scalable testing and documentation features.
    • Provides access to a supportive dbt community for resources and networking.
  • Best Practices:
    • Implement SQL style guides and write modular code.
    • Incorporate testing, code reviews, and continuous integration pipelines.
  • Software Skills:
    • Basic knowledge of version control (like Git) and testing principles is beneficial.
  • Community:
    • Participate in global meetups, Slack discussions, and the annual Coalesce conference.

Data Vault (DV) approach

  • Another data modeling approach besides Inmon and Kimball.
    • Inmon and Kimball focus on the structure of business logic in the data warehouse.
    • Data Vault focuses on separating the structural aspects of the data (the business entities and how they're related) from the descriptive attributes of the data itself.
  • DV allows for a more flexible, agile, and scalable data warehouse structure by keeping the data as closely aligned to the business as possible.
  • Build a Data Vault (a DDL sketch of hub, link, and satellite tables follows this list):
    • Step 1: Model the Hubs
      • The business keys (customer_id, store_id, sku, order_id) are generated by the business rather than being tied to the source system!
      • Each hub contains a PK (a hash of the business key), the business key itself, a load_date, and a record_source.
    • Step 2: Model the Links
      • Each link table must contain the business keys and primary (hash) keys from its parent hubs + load_date + record_source + its own PK (a hash of the parent hubs' business keys).
      • With link tables, you can easily add new relationships or update a current relationship without having to re-engineer the Data Vault.
        For example, if one day the company wants to offer Services instead of Items → don’t modify the Items hub; just create a Services hub and link it in.
    • Step 3: Satellites
      • Each satellite table must contain record_source + a PK (the hash key of its parent hub) + load_date, plus the descriptive attributes.
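  • A minimal sketch of one hub, one link, and one satellite for the orders/customers example, assuming PostgreSQL; the table and column names are illustrative, not taken from the lab:
      CREATE TABLE hub_orders (
          hub_order_key VARCHAR(32) PRIMARY KEY,   -- hash of the business key
          order_id      VARCHAR(20),               -- business key
          load_date     TIMESTAMP,
          record_source VARCHAR(50)
      );

      CREATE TABLE hub_customers (
          hub_customer_key VARCHAR(32) PRIMARY KEY,
          customer_id      VARCHAR(20),
          load_date        TIMESTAMP,
          record_source    VARCHAR(50)
      );

      CREATE TABLE link_customer_orders (
          link_customer_order_key VARCHAR(32) PRIMARY KEY,  -- hash of the parent business keys
          hub_customer_key        VARCHAR(32) REFERENCES hub_customers (hub_customer_key),
          hub_order_key           VARCHAR(32) REFERENCES hub_orders (hub_order_key),
          customer_id             VARCHAR(20),
          order_id                VARCHAR(20),
          load_date               TIMESTAMP,
          record_source           VARCHAR(50)
      );

      CREATE TABLE sat_orders (
          hub_order_key VARCHAR(32) REFERENCES hub_orders (hub_order_key),
          load_date     TIMESTAMP,
          record_source VARCHAR(50),
          order_status  VARCHAR(20),               -- descriptive attributes (assumed)
          order_date    DATE,
          PRIMARY KEY (hub_order_key, load_date)
      );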

One big table (OBT) approach

  • Kimball and Inmon emerged when data warehouses were expensive, on-premises, and heavily resource-constrained, with tightly coupled compute and storage.
  • One big table → a more relaxed approach → a wide table = many columns (thousands)
  • Highly denormalized and flexible
  • The wide table simply contains all of the data
    • An example of a wide table. You can think of one big table as the denormalized extension of Kimball's approach, where you have facts and dimensions represented in the same table.
  • Pros vs Cons
  • I suggest using a wide table when you have a lot of data that needs more flexibility than a traditional data modeling approach might provide. (A SQL sketch of building one big table from the earlier star schema follows.)
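  • A minimal sketch of building one big table by denormalizing the star-schema tables sketched earlier (fact_orders, dim_products, dim_locations); the name obt_orders and the column list are assumptions:
      CREATE TABLE obt_orders AS
      SELECT
          fact_orders.orderAmount,
          dim_products.productCode,
          dim_products.productLine,
          dim_locations.postalCode,
          dim_locations.city,
          dim_locations.country
      FROM fact_orders
      JOIN dim_products ON fact_orders.productCode = dim_products.productCode
      JOIN dim_locations ON fact_orders.postalCode = dim_locations.postalCode;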

Transforming data with dbt

  • We use dbt to model normalized data into a star schema.

Exercise — Transforming a normalized model into a star schema

  • DA requirement: Suppose you want to serve the modeled data to a data analyst who is interested in comparing sales across different locations and assessing employee sales performance by analyzing the customer purchases they facilitated.
  • Normalized model
  • Star schema
  • Use the macro below to generate surrogate keys (a small dbt model sketch follows):
    • {{ dbt_utils.generate_surrogate_key(['customers.customerNumber']) }}
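  • A minimal sketch of a dbt dimension model using that macro, assuming a models/dim_customers.sql file and that the customers relation is available via ref() (in the lab it may instead be a source()); the descriptive columns besides customerNumber are assumptions:
      -- models/dim_customers.sql (hypothetical)
      SELECT
          {{ dbt_utils.generate_surrogate_key(['customers.customerNumber']) }} AS customer_key,
          customers.customerNumber,
          customers.customerName,   -- assumed descriptive attributes
          customers.city,
          customers.country
      FROM {{ ref('customers') }} AS customers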

Lab 2 — Data modeling with dbt

Check the code of this lab.

Summary of Data Modeling Approaches