DE by DL.AI - C4 W2 - Data Modeling & Transformations for Machine Learning

Anh-Thi Dinh
List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.

Modeling and Processing Tabular Data for ML

Week 2 Overview

  • Data modeling is different from ML modeling.
  • The roles of ML engineers, data scientists, and data engineers often overlap, with responsibilities varying significantly across organizations.
    • Basically, DEs help organizations adopt a data-centric approach to ML:
    • Enhance the ML system by collecting high-quality data
    • “Garbage in, garbage out”
    • → Extract accurate and meaningful insights.
  • The plan for this week:

ML Overview

Skipping notes for this section since I'm already familiar with the concepts.

Modeling Data for traditional ML algorithms

Skipping notes for this section since I'm already familiar with the concepts.

Conversation with Wes McKinney

  • Background
    • Wes McKinney is the creator of Pandas (2008), an open-source data manipulation library for Python.
    • Released the project in 2009 and authored Python for Data Analysis.
    • Contributed to other open-source projects like Apache Arrow and Ibis.
    • Invests in data companies and promotes open-source data science.
  • Pandas Overview
    • Purpose: Tabular data manipulation and management in Python.
    • Key Features:
      • DataFrame: A table-like structure for data operations (e.g., cleaning, merging, and exploratory analysis).
      • Integration: Works as a pre-step for machine learning libraries like Scikit-learn, TensorFlow, and PyTorch.
    • Popularity: Became a staple in data science due to its accessibility and alignment with the rise of Python.
  • Origins of Pandas
    • Created to meet the demands of fast-paced data analysis during McKinney's work at a quantitative hedge fund.
    • Inspired by a lack of Python tools comparable to MATLAB or R.
    • Named after "panel data" and "Python data analysis".
  • Reasons for Success
    • Right timing: A growing demand for data science tools in the early 2010s.
    • Open-source: Free access removed barriers to entry compared to proprietary software.
    • Community support: Boosted by the release of Python for Data Analysis in 2012.
  • Advice and Trends
    • For Aspiring Practitioners:
      • Focus on data manipulation and visualization.
      • Use interactive tools like Jupyter Notebook for iterative exploration.
    • Future of Data:
      • Python will remain central to data science and AI.
      • AI assistants like ChatGPT will enhance productivity by automating repetitive tasks.
  • Final Note
    • McKinney envisions an ecosystem where data scientists focus more on creative, value-adding tasks with the help of evolving tools and frameworks.

Demo: Processing tabular data with Scikit-Learn

Skipping notes for this section since I'm already familiar with the concepts.

Lab 1 — Feature Engineering for ML

  • For the test set, we only use the .transform method with the fitted parameters from the training phase to ensure consistent preprocessing.
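A minimal sketch of this train/test split discipline, using scikit-learn's StandardScaler and made-up data (the lab's actual features differ): fit on the training set, then reuse the fitted parameters on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the lab's features
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.0, 250.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from the training data
X_test_scaled = scaler.transform(X_test)        # applies the SAME mean/std — no refitting

print(X_train_scaled.mean(axis=0))  # ~0 per column, by construction
```

Calling `fit` (or `fit_transform`) on the test set would leak test statistics into preprocessing and make train/test features inconsistent.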

Modeling and Processing Unstructured Data for ML

Modeling image data for ML algorithms

  • Flattening images into single vectors consumes excessive resources and reduces ML algorithm performance—using CNN is the better approach.
  • In CNNs, the initial layers identify fundamental image characteristics, which is why ML engineers often use pre-trained models and fine-tune them with new images.
  • As a Data Engineer, your role is to provide ML Engineers with properly prepared images through Image Augmentation (including flipping, rotating, cropping, and brightness adjustments) using TensorFlow.

Code example: Image processing using Tensorflow
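As a stand-in for the lecture's TensorFlow demo, here is a NumPy sketch of the same augmentation operations on a toy image; the comments point to the corresponding `tf.image` functions (`flip_left_right`, `rot90`, `random_crop`, `adjust_brightness`) you would use in a real TensorFlow pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)  # toy 8x8 RGB image

flipped = img[:, ::-1, :]     # horizontal flip (cf. tf.image.flip_left_right)
rotated = np.rot90(img, k=1)  # 90-degree rotation (cf. tf.image.rot90)
cropped = img[2:6, 2:6, :]    # fixed 4x4 crop (cf. tf.image.random_crop)
brighter = np.clip(img.astype(np.int16) + 40, 0, 255).astype(np.uint8)  # brightness shift (cf. tf.image.adjust_brightness)

print(flipped.shape, rotated.shape, cropped.shape, brighter.shape)
```

Each transformed image is a new training sample, so augmentation multiplies the effective dataset size without collecting new data.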

Processing texts for Analysis and Text classification

  • As a DE, you should give the ML Eng team cleaned text data → text data needs preprocessing.
  • Processing text:
    • Cleaning: removing punctuation, extra spaces, and characters that add no meaning.
    • Normalization: converting text to a consistent format (lowercasing, converting numbers/symbols to words, expanding contractions,…).
    • Tokenization: splitting each review into individual tokens (words, subwords, short sentences,…).
    • Removal of stop words: removing frequently used words such as “is”, “are”, “the”, “for”, “a”,…
    • Lemmatization: replacing each word with its base form or lemma (e.g., getting/got → get) ← can use NLP libraries.
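The steps above can be sketched as a tiny pure-Python pipeline. The stop-word set and lemma map here are toy illustrations; a real pipeline would use an NLP library such as spaCy or NLTK for stop words and lemmatization.

```python
import re

STOP_WORDS = {"is", "are", "the", "for", "a", "an", "of"}       # tiny illustrative set
LEMMAS = {"getting": "get", "got": "get", "reviews": "review"}  # toy lemma lookup

def preprocess(text: str) -> list[str]:
    text = text.lower()                        # normalization: lower-case
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # cleaning: drop punctuation/symbols
    text = re.sub(r"\s+", " ", text).strip()   # cleaning: collapse extra spaces
    tokens = text.split()                      # tokenization: whitespace split
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [LEMMAS.get(t, t) for t in tokens]            # lemmatization via lookup

print(preprocess("Getting the reviews... is EASY, for a start!"))
# -> ['get', 'review', 'easy', 'start']
```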

Text Vectorization and Embedding

  • Traditional vectorization: ← weakness: doesn’t consider the semantic meaning of the words
    • Bag of words: each entry → number of occurrences → weakness: many high-frequency words carry little meaning, while low-frequency words can be more significant.
    • TF-IDF (Term Frequency–Inverse Document Frequency): TF (the number of times the term occurs in a document divided by the length of that document), IDF (how common or rare that word is in the entire corpus).
  • Word Embedding: a vector representation that captures the semantic meaning of words. Popular methods include word2vec and GloVe. Words with similar meanings have vectors that cluster close together in the vector space. ← weakness: word order in a sentence isn’t taken into account (“A man ate a snake” is treated the same as “A snake ate a man”) → sentence embedding.
  • Sentence embedding: if 2 sentences have similar meanings, their embedding vectors should be close to each other.
    • Lower dimension than the vector generated by TF-IDF.
    • Pretrained NLP models based on LLMs: SBERT (open source); OpenAI/Anthropic/Google offer closed-source ones.
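The "similar meanings cluster together" idea is usually measured with cosine similarity. A sketch with toy, hand-made 3-d "embeddings" (real word2vec/GloVe vectors have 100–300 dimensions and are learned, not hand-written):

```python
import numpy as np

# Toy hand-made "embeddings" — NOT real word2vec/GloVe vectors
emb = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.82, 0.15]),
    "apple": np.array([0.10, 0.20, 0.95]),
}

def cosine(a, b):
    # Cosine similarity: 1 means same direction, ~0 means unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["king"], emb["queen"]))  # close to 1: similar meaning
print(cosine(emb["king"], emb["apple"]))  # much smaller: unrelated words
```

The same comparison applies to sentence embeddings: two paraphrases should score near 1 even if they share few words.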

Code example: Vectorizing text with Scikit-learn
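A minimal sketch of both traditional schemes with scikit-learn, on a made-up three-document corpus. It also demonstrates the word-order weakness: the first two documents get identical bag-of-words vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "a man ate a snake",
    "a snake ate a man",     # same words, different meaning
    "the man saw the dog",
]

bow = CountVectorizer()      # bag of words: raw occurrence counts
tfidf = TfidfVectorizer()    # TF-IDF: down-weights corpus-wide common words

X_bow = bow.fit_transform(docs)
X_tfidf = tfidf.fit_transform(docs)

print(sorted(bow.vocabulary_))  # learned vocabulary
print(X_bow.toarray())          # rows 0 and 1 are identical: word order is lost
```

Note that scikit-learn's default tokenizer drops single-character tokens such as "a".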

Lab 2 — Modeling and Transforming Text Data for ML

Check the code.