List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes; some points aren't repeated here because they're already covered there.
- The focus of this course is really on turning data into something useful and serving it in a way that creates business value.
- Some real scenarios:
- Modeling diverse user interactions from clickstream, purchase, and app data.
- Handling API calls triggered by user actions in data models.
- Incorporating survey data into existing models.
- Merging databases with overlapping users and different attributes.
- Deploying AI systems faces challenges beyond software engineering, including data-driven concept drift. This occurs when training data no longer reflects current realities. Systems must detect shifts, gather new data, and update models, adding complexity to real-world deployment.
- Data Modeling: Choosing a coherent data structure that aligns with the business goals and logic.
- Huge mistake: a DE starts building a data system without thinking about how to organize the data to make it useful for the business (thinking that data modeling is only for big companies).
- Plan of Course 4:
- When you model your data, you go from abstract concepts to concrete implementation.
- Conceptual → Entity-Relationship (ER) Diagram
- Logical: details about the implementation of the conceptual model ← add types, primary keys, and foreign keys
- Physical: details about the implementation of the logical model in a specific DBMS.
- Normalization is typically applied to relational databases.
- An example of non-normalized vs. normalized data (see the SQL sketch after the normalization steps below)
- If we want to change an address → change all matching rows in the non-normalized table, but only one row in the normalized table.
- If we want to add one more column → change the structure of the non-normalized table, but only add one more table containing the new column in the normalized model.
- Steps from denormalized form → first normal form → third normal form:
- Denormalized form → First normal form (1NF)
- 1NF → 2NF
- The requirements of 1NF must be met
- Partial dependencies must be removed. A partial dependency = a non-key column that depends on only part of the composite primary key.
- Transitive dependency: a non-key column depends on another non-key column.
- 2NF → 3NF
- The requirements of 2NF must be met.
- NO Transitive dependencies!
The requirements of 1NF: each column name must be unique, each column must hold a single value (no nested values), and each row must have a unique primary key.
Eg: the columns from CustomerId to OrderDate depend on OrderID.
Eg: “name” and “price” depend on “sku”; “CustomerName” and “address” depend on “CustomerID”.
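A minimal sketch of what this split could look like in SQL, assuming a denormalized orders table similar to the one in the lecture (the table and column names here are illustrative, not the lab's exact schema):

```sql
-- Denormalized: every order row repeats customer and product details
CREATE TABLE orders_denormalized (
    OrderID      INT,
    OrderDate    DATE,
    CustomerID   INT,
    CustomerName TEXT,
    Address      TEXT,
    Sku          TEXT,
    Name         TEXT,
    Price        NUMERIC,
    Quantity     INT
);

-- 3NF: each piece of information is stored once;
-- changing an address now touches a single row in Customers
CREATE TABLE Customers (
    CustomerID   INT PRIMARY KEY,
    CustomerName TEXT,
    Address      TEXT
);

CREATE TABLE Products (
    Sku   TEXT PRIMARY KEY,
    Name  TEXT,
    Price NUMERIC
);

CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    OrderDate  DATE,
    CustomerID INT REFERENCES Customers (CustomerID)
);

CREATE TABLE OrderItems (
    OrderID  INT REFERENCES Orders (OrderID),
    Sku      TEXT REFERENCES Products (Sku),
    Quantity INT,
    PRIMARY KEY (OrderID, Sku)
);
```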
- Convention: a normalized database means it’s in the third normal form (3NF).
- There's no one-size-fits-all solution, and you might encounter cases where denormalization actually has performance advantages.
Check the notebook.
- The star schema ~ a dimensional data model.
- Fact table = contains quantitative business measurements that result from a business event or process.
- Each row contains the facts of a particular business event.
- Immutable (append-only)
- Typically narrow and long
- Dimension tables = provide the reference data, attributes and relational context for the events in the fact table.
- Describe the what, who, where, and when of the events
- Typically wide and short
- Example
- Sometimes, we stack multiple star schemas with conformed dimensions
- Example
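To make the fact/dimension split concrete, here is a minimal sketch of the tables the query below assumes (fact_orders, dim_products, dim_locations); the column lists are illustrative guesses based on the join keys used in the query, not the course's full schema:

```sql
-- Dimension tables: wide and short, holding descriptive attributes
CREATE TABLE dim_products (
    productCode TEXT PRIMARY KEY,
    productLine TEXT,
    productName TEXT
);

CREATE TABLE dim_locations (
    postalCode TEXT PRIMARY KEY,
    city       TEXT,
    country    TEXT
);

-- Fact table: narrow and long, one row per business event,
-- append-only, with foreign keys into the dimensions plus numeric measures
CREATE TABLE fact_orders (
    orderNumber INT,
    orderDate   DATE,
    productCode TEXT REFERENCES dim_products (productCode),
    postalCode  TEXT REFERENCES dim_locations (postalCode),
    orderAmount NUMERIC
);
```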
- SQL: Find the total sales amount for each product line within the USA
- First step → “the total sales amount”
- Step 2 → “for each product line”
- Step 3 → “within the USA”
```sql
SELECT
    SUM(fact_orders.orderAmount) AS total_sales
FROM fact_orders
```
```sql
SELECT
    dim_products.productLine, -- step 2
    SUM(fact_orders.orderAmount) AS total_sales
FROM fact_orders
-- 👇 step 2
JOIN dim_products ON
    fact_orders.productCode = dim_products.productCode
GROUP BY dim_products.productLine
```
```sql
SELECT
    dim_products.productLine,
    SUM(fact_orders.orderAmount) AS total_sales
FROM fact_orders
JOIN dim_products ON
    fact_orders.productCode = dim_products.productCode
JOIN dim_locations ON -- step 3
    fact_orders.postalCode = dim_locations.postalCode -- step 3
WHERE dim_locations.country = 'USA' -- step 3
GROUP BY dim_products.productLine
```
- If we make the same request as the previous example but with the Normalized Model (3NF) → we need to join more tables! (see the sketch below)
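As an illustration of why more joins are needed, the same question against a normalized 3NF model might look like the query below; the table and column names assume a classicmodels-style schema (orders, orderdetails, products, customers) and are my assumption, not the course's exact model:

```sql
-- Same "total sales per product line within the USA" question,
-- but against normalized tables: four joins instead of two
SELECT
    products.productLine,
    SUM(orderdetails.quantityOrdered * orderdetails.priceEach) AS total_sales
FROM orderdetails
JOIN orders    ON orders.orderNumber       = orderdetails.orderNumber
JOIN products  ON products.productCode     = orderdetails.productCode
JOIN customers ON customers.customerNumber = orders.customerNumber
WHERE customers.country = 'USA'
GROUP BY products.productLine;
```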
- Star Schema vs Normalized Model (3NF)
- Normalized forms ensure data integrity and avoid data redundancy.
- Star schemas facilitate analytical workloads.
- Inmon Data Modeling Approach ← Check Conversation with Bill Inmon in C3W2
- Kimball Data Modeling Approach ← no need to normalize data in DW
- Choose which one?
- You might need to extract normalized data that's stored in a relational database and model the data into a star schema so it's easier to query before loading it into department-specific data marts.
- 4 key steps to convert!
- Step 1: business process → “company’s sales transactions”
- Step 2: atomic grain
- Total sales transactions on a particular day in each row for the fact table
- Single sales transaction
- Individual product item in a sales transaction
- Step 3: select dimensions → “which stores”, “given day”, “product brands”
- Step 4: identify the facts (the measurements to include in the fact table)
- What we will do, based on the needs of the DA:
- (Wiki) A surrogate key (or synthetic key, pseudokey, entity identifier, factless key, or technical key) in a database is a unique identifier for either an entity in the modeled world or an object in the database.
- Can use `MD5` to generate hash IDs as surrogate keys.
```sql
SELECT MD5(store_id) AS store_key,
       store_id,
       store_name,
       store_city,
       store_zipcode
FROM stores;
```
- Primary keys of tables in a star schema should be surrogate keys.
- One way to generate `dim_date`:
```sql
SELECT date_key,
       EXTRACT(DOW FROM date_key) AS day_of_week,
       EXTRACT(MONTH FROM date_key) AS month,
       EXTRACT(QUARTER FROM date_key) AS quarter,
       EXTRACT(YEAR FROM date_key) AS year
FROM generate_series('2020-01-01'::date,
                     '2025-01-01'::date,
                     '1 day'::interval) AS date_key;
```
- To build the `fact_order_items` table:
```sql
SELECT
    MD5(CONCAT(OrderItems.order_id,
               OrderItems.item_line_number)) AS fact_order_key,
    OrderItems.order_id,
    OrderItems.item_line_number,
    MD5(Orders.store_id) AS store_key,
    MD5(OrderItems.item_sku) AS item_key,
    Orders.order_date AS date_key,
    OrderItems.item_quantity,
    Items.price AS item_price
FROM OrderItems
JOIN Orders ON Orders.order_id = OrderItems.order_id
JOIN Items ON Items.sku = OrderItems.item_sku;
```
- Use dbt, which helps you model your data by abstracting away a lot of the heavy lifting of writing pure SQL code.
- If you need to perform transformations on data that will move around → choose AWS Glue.
- Rental company → want: determine peak booking times, identify the most popular cars being rented, and adjust car rental rates based on demand.
- Identify the business process and the grain: Car rental transactions form the business process, with each fact table row representing one rental booking (identified by booking ID) for a specific car and customer.
- Identify the dimension tables:
    - `dim_customers`: the customers' details (name, address, phone number, driver's license number)
    - `dim_cars`: the cars' information (VIN - Vehicle Identification Number, model, brand, make, color, purchase date)
    - `dim_dates`: the information of a given date (year, month, time, day, quarter, day of the week)
    - `dim_stores`: the information of the rental store (zip code, state, city, address)
- Identify the fact table: each row represents one rental booking, identified by the booking ID. It also contains the dates that describe the rental period (rental start date, rental end date, return date), the foreign keys (customer key, car key, store key), and the business measures: booking fee, insurance fee, fuel charge, extra rental days, fuel level, and total cost.
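A possible physical sketch of this rental star schema, assuming the dimensions and measures listed above (table names such as fact_bookings and the surrogate-key columns are illustrative choices, not the course's exact schema); dim_dates can be generated with generate_series as shown earlier:

```sql
-- Dimension tables: surrogate key + descriptive attributes
CREATE TABLE dim_customers (
    customer_key   TEXT PRIMARY KEY, -- e.g. MD5 of the customer id
    customer_name  TEXT,
    address        TEXT,
    phone_number   TEXT,
    license_number TEXT
);

CREATE TABLE dim_cars (
    car_key       TEXT PRIMARY KEY,
    vin           TEXT,
    model         TEXT,
    brand         TEXT,
    make          TEXT,
    color         TEXT,
    purchase_date DATE
);

CREATE TABLE dim_stores (
    store_key TEXT PRIMARY KEY,
    zip_code  TEXT,
    state     TEXT,
    city      TEXT,
    address   TEXT
);

-- Fact table: one row per rental booking
CREATE TABLE fact_bookings (
    booking_id        TEXT PRIMARY KEY,
    customer_key      TEXT REFERENCES dim_customers (customer_key),
    car_key           TEXT REFERENCES dim_cars (car_key),
    store_key         TEXT REFERENCES dim_stores (store_key),
    rental_start_date DATE,
    rental_end_date   DATE,
    return_date       DATE,
    booking_fee       NUMERIC,
    insurance_fee     NUMERIC,
    fuel_charge       NUMERIC,
    extra_rental_days INT,
    fuel_level        NUMERIC,
    total_cost        NUMERIC
);
```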
Drew Banin is a co-founder of the dbt tool.
- What is dbt:
- Transforms data within data warehouses using SQL and Python, without moving the data.
- Provides secure workflows with version control and business logic management.
- Before dbt:
- Data processes were ad hoc and unstructured, lacking version control and proper tracking. ← bunch of SQL scripts.
- After dbt:
- Introduced standardized workflows, version control, and established the "Analytics Engineer" role.
- For Learners:
- Offers an easy starting point with scalable testing and documentation features.
- Provides access to a supportive dbt community for resources and networking.
- Best Practices:
- Implement SQL style guides and write modular code.
- Incorporate testing, code reviews, and continuous integration pipelines.
- Software Skills:
- Basic knowledge of version control (like Git) and testing principles is beneficial.
- Community:
- Participate in global meetups, Slack discussions, and the annual Coalesce conference.
- Another data modeling approach besides Inmon and Kimball.
- Inmon and Kimball focus on the structure of business logic in the data warehouse.
- Data Vault focuses on separating the structural aspects of data, meaning the business entities and how they're related from the descriptive attributes of the data itself.
- DV allows for a more flexible, agile, and scalable data warehouse structure by keeping the data as closely aligned to the business as possible.
- Build Data Vault
- Step 1: Model the Hubs
- The business keys: `customer_id`, `store_id`, `sku`, `order_id` ← generated by the business rather than being tied to the source system!
- PK = hash of the business key + `load_date` + `record_source`
- Step 2: Model the Links
- Each link table must contain the primary and business keys from its parent hubs + `load_date` + `record_source` + PK (hash of the business keys of the parent hubs)
- Step 3: Model the Satellites
- Each satellite table must contain `record_source` + PK (hash key of the parent hub) + `load_date`.
With link tables, you can easily add new relationships or update a current relationship without having to re-engineer the Data Vault.
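A minimal DDL sketch of one hub, link, and satellite following the rules above (Postgres-style; table and attribute names are illustrative, not the course's exact model):

```sql
-- Hub: hashed business key + the business key itself + load metadata
CREATE TABLE hub_orders (
    hub_order_key TEXT PRIMARY KEY, -- hash of the business key (order_id)
    order_id      TEXT,
    load_date     TIMESTAMP,
    record_source TEXT
);

CREATE TABLE hub_customers (
    hub_customer_key TEXT PRIMARY KEY,
    customer_id      TEXT,
    load_date        TIMESTAMP,
    record_source    TEXT
);

-- Link: relates hubs; its PK is a hash of the parent hubs' business keys
CREATE TABLE link_customer_orders (
    link_customer_order_key TEXT PRIMARY KEY,
    hub_customer_key        TEXT REFERENCES hub_customers (hub_customer_key),
    hub_order_key           TEXT REFERENCES hub_orders (hub_order_key),
    load_date               TIMESTAMP,
    record_source           TEXT
);

-- Satellite: descriptive attributes attached to a hub
CREATE TABLE sat_orders (
    hub_order_key TEXT REFERENCES hub_orders (hub_order_key),
    load_date     TIMESTAMP,
    record_source TEXT,
    order_status  TEXT,    -- illustrative descriptive attributes
    order_amount  NUMERIC,
    PRIMARY KEY (hub_order_key, load_date)
);
```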
- Kimball and Inmon → designed when data warehouses were expensive, on-premises, and heavily resource-constrained, with tightly coupled compute and storage.
- One big table → a more relaxed approach → a wide table with many columns (thousands)
- Highly denormalized and flexible
- The wide table simply contains all of the data
- Pros vs Cons
- I suggest using a wide table when you have a lot of data that needs more flexibility than a traditional data modeling approach might provide.
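As a sketch of what this can look like in practice, the order data used earlier could be pre-joined once into a single wide table and then queried directly (the column list is illustrative, reusing the OrderItems/Orders/Items/stores tables from the examples above):

```sql
-- One Big Table: everything pre-joined into a single wide, denormalized table
CREATE TABLE obt_orders AS
SELECT
    Orders.order_id,
    Orders.order_date,
    stores.store_name,
    stores.store_city,
    stores.store_zipcode,
    Items.sku,
    Items.price AS item_price,
    OrderItems.item_line_number,
    OrderItems.item_quantity
FROM OrderItems
JOIN Orders ON Orders.order_id = OrderItems.order_id
JOIN Items  ON Items.sku       = OrderItems.item_sku
JOIN stores ON stores.store_id = Orders.store_id;
```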
- We use dbt to model some normalized data to star schema.
- DA requirement: Suppose you want to serve the modeled data to a data analyst who is interested in comparing sales across different locations and assessing employee sales performance by analyzing the customer purchases they facilitated.
- Normalized model
- Star schema
- Use below to generate surrogate keys
```sql
{{ dbt_utils.generate_surrogate_key(['customers.customerNumber']) }}
```
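For context, a dbt model that uses this macro to build a dimension might look roughly like the sketch below; the model name (`dim_customers.sql`) and the `ref('customers')` source are placeholders, not the lab's actual project structure:

```sql
-- models/star_schema/dim_customers.sql (hypothetical model)
SELECT
    {{ dbt_utils.generate_surrogate_key(['customers.customerNumber']) }} AS customer_key,
    customers.customerNumber,
    customers.customerName,
    customers.city,
    customers.country
FROM {{ ref('customers') }} AS customers
```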
Check the code of this lab.