List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.
- Data modeling is different from ML modeling.
- The roles of ML engineers, data scientists, and data engineers often overlap, with responsibilities varying significantly across organizations.
- Enhance the ML system by collecting high-quality data
- “Garbage in, garbage out”
Basically, DEs help organizations adopt a data-centric approach to ML:
→ Extract accurate and meaningful insights.
- The plan of this week:
Skipping notes for this section since I'm already familiar with the concepts.
Skipping notes for this section since I'm already familiar with the concepts.
- Background
- Wes McKinney is the creator of Pandas (2008), an open-source data manipulation library for Python.
- Released the project in 2009 and authored Python for Data Analysis.
- Contributed to other open-source projects like Apache Arrow and Ibis.
- Invests in data companies and promotes open-source data science.
- Pandas Overview
- Purpose: Tabular data manipulation and management in Python.
- Key Features:
- DataFrame: A table-like structure for data operations (e.g., cleaning, merging, and exploratory analysis).
- Integration: Works as a pre-step for machine learning libraries like Scikit-learn, TensorFlow, and PyTorch.
- Popularity: Became a staple in data science due to its accessibility and alignment with the rise of Python.
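As a tiny illustration of the DataFrame operations mentioned above (cleaning and merging), a minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical review data with a missing rating and a duplicate row
reviews = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "rating": [5.0, None, 4.0, 3.0],
})
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ann", "Bob", "Cat"]})

cleaned = reviews.dropna(subset=["rating"]).drop_duplicates()  # cleaning
merged = cleaned.merge(users, on="user_id")                    # merging two tables
print(merged["rating"].mean())                                 # quick exploratory statistic
```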
- Origins of Pandas
- Created to meet the demands of fast-paced data analysis during McKinney's work at a quantitative hedge fund.
- Inspired by a lack of Python tools comparable to MATLAB or R.
- Named after "panel data" and "Python data analysis".
- Reasons for Success
- Right timing: A growing demand for data science tools in the early 2010s.
- Open-source: Free access removed barriers to entry compared to proprietary software.
- Community support: Boosted by the release of Python for Data Analysis in 2012.
- Advice and Trends
- For Aspiring Practitioners:
- Focus on data manipulation and visualization.
- Use interactive tools like Jupyter Notebook for iterative exploration.
- Future of Data:
- Python will remain central to data science and AI.
- AI assistants like ChatGPT will enhance productivity by automating repetitive tasks.
- Final Note
- McKinney envisions an ecosystem where data scientists focus more on creative, value-adding tasks with the help of evolving tools and frameworks.
Skipping notes for this section since I'm already familiar with the concepts.
- Code for this lab.
- For the test set, we only use the `.transform` method with the fitted parameters from the training phase to ensure consistent preprocessing.
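A minimal sketch of this fit-on-train / transform-on-test pattern, using scikit-learn's `StandardScaler` as an example preprocessor:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training data only
X_test_scaled = scaler.transform(X_test)        # reuse fitted parameters — no fit here
```

Calling `fit` (or `fit_transform`) on the test set would leak test statistics into preprocessing and make train/test scaling inconsistent.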
- Flattening images into single vectors consumes excessive resources and reduces ML algorithm performance—using CNN is the better approach.
- In CNNs, the initial layers identify fundamental image characteristics, which is why ML engineers often use pre-trained models and fine-tune them with new images.
- As a Data Engineer, your role is to provide ML Engineers with properly prepared images through Image Augmentation (including flipping, rotating, cropping, and brightness adjustments) using TensorFlow.
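The lab itself uses TensorFlow (e.g. preprocessing layers such as `tf.keras.layers.RandomFlip`); the NumPy-only sketch below, on a fake image, just shows what each augmentation does:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float32)  # fake RGB image

flipped = np.fliplr(img)               # horizontal flip
rotated = np.rot90(img, k=1)           # 90-degree rotation
cropped = img[8:56, 8:56]              # center crop to 48x48
brighter = np.clip(img * 1.2, 0, 255)  # brightness adjustment, clipped to valid range
```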
Check this notebook.
- As a DE, you should give the ML Eng team cleaned text data → the text data needs preprocessing.
- Processing text:
- Cleaning: Removing punctuations, extra spaces, characters that add no meaning
- Normalization: converting texts to consistent format (to lower case, numbers/symbols to characters, expanding contractions,…)
- Tokenization: splitting each review into individual tokens (words, subwords, short sentences,…)
- Removal of stop words: removing frequently used words such as “is”, “are”, “the”, “for”, “a”,…
- Lemmatization: replacing each word with its base form or lemma (e.g. getting/got → get) ← can use NLP libraries.
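The preprocessing steps above can be sketched in plain Python (the stop-word list here is a tiny illustrative sample, and lemmatization is left to an NLP library such as NLTK or spaCy):

```python
import re

STOP_WORDS = {"is", "are", "the", "for", "a"}  # tiny example list, not exhaustive

def preprocess(review: str) -> list[str]:
    text = review.lower()                       # normalization: lowercase
    text = text.replace("don't", "do not")      # expanding a contraction (illustrative)
    text = re.sub(r"[^\w\s]", " ", text)        # cleaning: remove punctuation
    text = re.sub(r"\s+", " ", text).strip()    # cleaning: collapse extra spaces
    tokens = text.split()                       # tokenization into words
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal

print(preprocess("The movie IS great, don't miss it!"))
# → ['movie', 'great', 'do', 'not', 'miss', 'it']
```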
- Traditional vectorization: ← weakness: doesn’t consider the semantic meaning of the words
- Bag of words: each entry → number of occurrences → weakness: many high-frequency words carry little meaning, while low-frequency words can be more significant.
- TF-IDF (Term Frequency–Inverse Document Frequency): TF (the number of times the term occurs in a document divided by the length of that document), IDF (how common or rare that word is in the entire corpus)
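A by-hand sketch of these two schemes on tokenized documents; note IDF has several variants, and `log(N / df)` below is just one common choice:

```python
import math
from collections import Counter

docs = [["a", "man", "ate", "a", "snake"],
        ["a", "snake", "ate", "a", "man"],
        ["the", "snake", "slept"]]

# Bag of words: raw term counts per document
bow = [Counter(d) for d in docs]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)      # term frequency within this document
    df = sum(term in d for d in corpus)  # number of documents containing the term
    idf = math.log(len(corpus) / df)     # one common IDF variant
    return tf * idf

print(tf_idf("snake", docs[0], docs))  # appears in every document → IDF = 0
print(tf_idf("slept", docs[2], docs))  # rare word → higher weight
```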
- Word Embedding: A vector representation that captures the semantic meaning of words. Popular models include word2vec and GloVe. Words with similar meanings have vectors that cluster close together in the vector space. ← weakness: the position of words in the sentence isn’t taken into account (”A man ate a snake” looks the same as “A snake ate a man”) → sentence embedding
- Sentence embedding: if 2 sentences have similar meanings, their embedding vectors should be close to each other.
- Lower dimension than the vector generated by TF-IDF
- Pretrained NLP models based on LLMs: SBERT (open source); OpenAI/Anthropic/Google offer closed-source alternatives.
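"Close to each other" is typically measured with cosine similarity. The vectors below are made up for illustration only; a real model such as SBERT would produce vectors with hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up low-dimensional "sentence embeddings" (hypothetical values)
great_movie = np.array([0.9, 0.1, 0.2])    # "a great movie"
awesome_film = np.array([0.8, 0.2, 0.3])   # "an awesome film"
bad_weather = np.array([-0.1, 0.9, -0.4])  # "the weather is bad"

print(cosine_similarity(great_movie, awesome_film))  # similar meanings → near 1
print(cosine_similarity(great_movie, bad_weather))   # unrelated → much lower
```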
Check the code.