DeepLearning.AI MLOps Specialization notes
Course 1: Introduction to Machine Learning in Production
Week 1: Overview of the ML Lifecycle and Deployment
Concept drift / Data drift
- Data drift: the distribution of input data in production diverges from the training data.
- Concept drift: the mapping from inputs to outputs changes, so the same input should now lead to a different prediction.
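A minimal sketch of one way to flag data drift on a single numeric feature, assuming you keep a reference sample from training and collect a recent production sample (the feature, data and threshold are made up for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical samples of one numeric feature:
# a reference sample kept from training and a recent production sample.
train_feature = np.random.normal(loc=0.0, scale=1.0, size=5000)
prod_feature = np.random.normal(loc=0.3, scale=1.0, size=5000)

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
# production distribution has drifted away from the training distribution.
result = ks_2samp(train_feature, prod_feature)
if result.pvalue < 0.01:  # arbitrary threshold, tune per feature
    print(f"Possible data drift (KS statistic={result.statistic:.3f}, p={result.pvalue:.3g})")
```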
Steps of a ML project
- Scoping
- Data
- Modeling
- Deployment
This is an iterative process, not a linear one.
Deployment patterns
- Gradual (canary) deployment: send an increasing proportion of traffic to the new version (see the routing sketch after this list).
- Blue-green deployment: run the old (blue) and new (green) versions behind a router and switch all traffic at once; useful when all users must move to the new version at the same time, while keeping the ability to roll back instantly by switching back.
- Shadow mode: execute the new version in the background and compare its predictions to a human's or to the previous version's.
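A minimal sketch of canary routing, assuming two model callables and a traffic fraction that is ramped up over time (all names and the placeholder models are hypothetical):

```python
import random

CANARY_FRACTION = 0.05  # start small, increase as confidence in the new version grows

def predict_old(x):
    return "old-model prediction"   # placeholder for the current model

def predict_new(x):
    return "new-model prediction"   # placeholder for the candidate model

def route(x):
    # Send a small, growing share of traffic to the new version and
    # record which version answered so the two can be compared.
    if random.random() < CANARY_FRACTION:
        return "new", predict_new(x)
    return "old", predict_old(x)

version, prediction = route({"feature": 1.0})
print(version, prediction)
```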
Levels of automation
- Human
- Shadow mode: ML runs in parallel with a Human to assess its performance.
- AI assistance: user checks all ML results
- Partial automation: user checks only uncertain or specific predictions
- Full automation
Levels 3 and 4 (AI assistance and partial automation) are called human-in-the-loop AI.
Monitoring
Three kinds of metrics to monitor (with graphs, and alarms on thresholds you define):
- Software: server load…
- Input: missing values, statistical distribution…
- Output: the rate at which users override ML prediction…
When there is a pipeline with several ML components, we must define metrics for each one of them.
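For the input metrics above, a minimal monitoring sketch, assuming a pandas DataFrame of recent production inputs and an arbitrary alert threshold (column names and data are made up):

```python
import pandas as pd

# Hypothetical batch of recent production inputs.
recent_inputs = pd.DataFrame({
    "age": [34, None, 51, 28, None],
    "country": ["FR", "US", None, "FR", "DE"],
})

MISSING_THRESHOLD = 0.10  # alert if more than 10% of a column is missing

missing_rate = recent_inputs.isna().mean()  # fraction of missing values per column
for column, rate in missing_rate.items():
    if rate > MISSING_THRESHOLD:
        print(f"ALERT: {column} has {rate:.0%} missing values")
```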
Week 2: Select and Train a Model
Literature overview
Look for existing or comparable solutions (open-source, papers…).
Model-centric vs Data-centric approach
- Model-centric: improve the model while keeping the data fixed; most of the literature takes this approach.
- Data-centric: improve the data while keeping the model fixed. Often more useful in practice, because good data with a decent model is often better than poor data with an excellent model.
3 steps of modeling and common problems
- Doing well on training set
- Doing well on dev/test set
- Doing well on business/project metrics
Common problems between steps 2 and 3:
- Bad performance on disproportionately important cases
- Skewed data distribution: algorithm has a good average performance but is bad on rare classes.
- Bias / Discrimination
Establish a baseline
Possible baselines:
- Human Level Performance (HLP): often useful on unstructured data (text, images…), where humans are naturally good.
- Open-source solution
- Quick-and-dirty implementation
- Previous system
The baseline can be computed separately on some data categories. It is useful:
- To get an idea of what is possible, e.g. it’s hard to beat HLP on some tasks.
- To know what to focus on.
- To get an idea of the irreducible (Bayes) error
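For the quick-and-dirty baseline mentioned above, a minimal sketch using scikit-learn's DummyClassifier as a trivial reference point (the data here is synthetic, standing in for the real problem):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real problem.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predicting the majority class gives a floor any real model should beat.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))
```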
Sanity check
Start by trying to overfit a small dataset (or even a single sample) during training.
It helps find big issues or bugs quickly; there is no point continuing if the model can’t even fit one sample.
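A minimal sketch of this sanity check in Keras, assuming a tiny binary-classification setup (the architecture and data are placeholders); training accuracy should approach 1.0, otherwise something is likely broken:

```python
import numpy as np
import tensorflow as tf

# Hypothetical tiny subset: 8 samples, 20 features, binary labels.
x_small = np.random.rand(8, 20).astype("float32")
y_small = np.random.randint(0, 2, size=(8, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train long enough to memorize the tiny set: training accuracy should reach ~1.0.
history = model.fit(x_small, y_small, epochs=500, verbose=0)
print("final training accuracy:", history.history["accuracy"][-1])
```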
Error analysis
Error analysis is used to understand how a model makes errors and how to improve it.
Here is the process:
- Go manually through a random sample of bad predictions (10s, 100s)
- Note in a spreadsheet some distinctive tags/characteristics for each one of them: “background noise”, “bad labeling”, “value X on a specific feature”…
- Decide to work on a specific issue based on frequency, importance or ease of fix.
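Once the tags are in a spreadsheet, a minimal sketch of counting them with pandas to see which issue is most frequent (the example IDs and tags are made up):

```python
import pandas as pd

# Hypothetical review notes: one row per mis-predicted example,
# with a comma-separated list of tags written during manual inspection.
errors = pd.DataFrame({
    "example_id": ["a01", "a02", "a03", "a04"],
    "tags": ["background noise", "background noise, bad labeling", "bad labeling", "rare accent"],
})

tag_counts = (
    errors["tags"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(tag_counts)  # e.g. background noise: 2, bad labeling: 2, rare accent: 1
```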
Skewed dataset
Accuracy or loss alone can be misleading on a skewed (imbalanced) dataset; compute precision, recall and F1-score (for each class in a multi-class task) to assess quality.
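A minimal sketch using scikit-learn to get per-class precision, recall and F1 on a skewed dataset (the labels are made up):

```python
from sklearn.metrics import classification_report

# Hypothetical ground truth and predictions with a rare positive class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Per-class precision, recall and F1 are far more informative here than
# overall accuracy (which is 80% even though half of the rare class is missed).
print(classification_report(y_true, y_pred))
```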
Performance auditing & Data Slicing
- Brainstorm possible problems and subsets where the model could go wrong (bias, fairness…)
- Slice data in several subsets and check performance metrics on each one.
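A minimal slicing sketch with pandas, assuming an evaluation DataFrame with a column to slice on (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical evaluation results: one row per test example.
results = pd.DataFrame({
    "user_region": ["EU", "EU", "US", "US", "APAC", "APAC"],
    "correct":     [1,    1,    1,    0,    0,      1],
})

# Accuracy per slice: a good global average can hide a weak subset.
per_slice = results.groupby("user_region")["correct"].agg(["mean", "count"])
print(per_slice)
```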
Data Augmentation
Adding synthetic training samples (e.g. adding background noise for speech recognition). It’s useful for unstructured data if it respects 3 conditions:
- Synthetic samples are realistic
- The algorithm does poorly on them
- Humans (or another baseline) do well on them
It usually doesn’t degrade performance on other data (data very different from the synthetic samples) as long as the model is large enough (low bias).
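A minimal sketch of the background-noise example for speech, assuming waveforms are NumPy arrays at the same sample rate (the noise level and placeholder signals are arbitrary):

```python
import numpy as np

def add_background_noise(speech: np.ndarray, noise: np.ndarray, noise_level: float = 0.1) -> np.ndarray:
    """Mix a noise clip into a speech clip to create a synthetic training sample."""
    # Tile or trim the noise so both arrays have the same length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    return speech + noise_level * noise

# Placeholder waveforms; in practice these would come from audio files.
speech = np.random.randn(16000)      # ~1 second at 16 kHz
cafe_noise = np.random.randn(8000)
augmented = add_background_noise(speech, cafe_noise, noise_level=0.05)
```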
Adding features
Adding features is mainly useful for structured (and limited) data. E.g.: adding the percentage of a user’s eaten meals containing meat can help a recommender system stop recommending meat restaurants to vegans.
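A minimal sketch of deriving that feature, assuming an order-history table with a flag for whether each meal contained meat (all names and values are hypothetical):

```python
import pandas as pd

# Hypothetical order history: one row per meal ordered by a user.
orders = pd.DataFrame({
    "user_id":       [1, 1, 1, 2, 2],
    "contains_meat": [0, 0, 1, 1, 1],
})

# New per-user feature: fraction of past meals containing meat.
meat_ratio = orders.groupby("user_id")["contains_meat"].mean().rename("pct_meat_meals")
print(meat_ratio)  # user 1 -> 0.33, user 2 -> 1.0
```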
Experiment tracking
It’s important to keep track of experiments, either in a shared spreadsheet or in a dedicated experiment tracking system (like Weights & Biases).
Important features are:
- Info needed for replicability: code version, dataset version, hyperparameters…
- Results (metrics, trained model…)
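A minimal sketch of logging a run to Weights & Biases (the project name, config values and metrics are placeholders, and a W&B account/login is assumed):

```python
import wandb

# Record what is needed to reproduce the run: code version, data version, hyperparameters.
run = wandb.init(
    project="mlops-notes-demo",  # hypothetical project name
    config={"learning_rate": 1e-3, "dataset_version": "v2", "git_commit": "abc123"},
)

for epoch in range(3):
    # In a real run these values come from training/evaluation.
    wandb.log({"epoch": epoch, "val_accuracy": 0.80 + 0.01 * epoch})

run.finish()
```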
Good data
A good dataset:
- has good coverage (covers all important cases)
- has consistent and unambiguous labeling
- is monitored for data/concept drift
Week 3: Why is data definition hard?
Structured vs unstructured, small vs big data
- With unstructured data, you can often ask to label more data and use data augmentation.
- Consistent labeling is always useful, but crucial on a small dataset (<10k samples)
- On a small dataset, you can fix labels one by one; on a big dataset, you must improve the process (data extraction, labeling instructions)
- Big datasets can also have small-data problems due to a long tail of rare events (as in search or product recommendation)
Strategies to get consistent labeling
- Define clear labeling instructions
- Add a new class to capture uncertainty
- Improve input (more or better info) if labeling is hard
- have several labelers label the same example to detect ambiguities in labeling
- Merge classes if the distinction is unclear and not needed
- If ambiguities cannot be fixed, use voting from multiple labelers
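A minimal sketch of spotting ambiguous examples and resolving them by majority vote, assuming several labelers labeled the same examples (the example IDs and labels are made up):

```python
from collections import Counter

# Hypothetical labels from three labelers for the same examples.
labels_per_example = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat"],
    "img_003": ["dog", "cat", "bird"],
}

for example_id, labels in labels_per_example.items():
    label, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    if agreement < 1.0:
        print(f"{example_id}: ambiguous (agreement={agreement:.0%}), majority vote -> {label}")
```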
Human Level Performance (HLP)
- When ground truth is defined by humans, HLP is often just a measure of ambiguity in labeling.
- Labeling inconsistencies make HLP both less meaningful and too easy to beat
- It’s useful to improve labeling consistency to improve both the model performance and the HLP to compare with.
How much data to get?
- At the start, get the minimum you need to start iterating as quickly as possible on the ‘data -> model -> error analysis’ loop.
- Then increase if needed, e.g. by 2x (10x at most), so as not to spend more time and money than necessary.
Data Inventory process
4 possible sources of data:
- Already owned
- Crowdsourced
- Pay for labelers (in-house or outsourced)
- Purchase existing dataset
Compare each possible source on:
- Cost
- Time
- Quantity
- Quality
- Privacy and regulatory constraints
Who will label?
- Do you need SMEs (Subject Matter Experts)?
- Good practice: ML Engineers should do a small part of the labeling to understand the data and possible labeling issues.
Data Pipeline
Data often needs pre-processing steps (cleanup…). After the POC phase, it’s important to ensure that the data pipeline is replicable and identical between modeling (training data) and production (for inference).
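One way to keep preprocessing identical between training and inference is to package it with the model; a minimal sketch with a scikit-learn Pipeline (the steps and file name are placeholders):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Preprocessing and model are fit together, so the exact same
# transformation is replayed at inference time.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X, y)

joblib.dump(pipeline, "model_pipeline.joblib")   # artifact shipped to production
served = joblib.load("model_pipeline.joblib")
print(served.predict(X[:3]))
```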
Data lineage, data provenance, meta-data
The pipeline can be a complex graph of steps and ML algorithms. For future diagnostic purposes, it’s important to keep track of:
- Data provenance: where does the data come from
- Data lineage: the sequence of steps involved in producing each final data item or prediction
- Metadata: information associated with the production of the data (web browser, camera, location…)
Balanced train/dev/test set
On a small dataset, randomly splitting samples between train/dev/test can lead to different class proportions in each set. It’s better to use a stratified split so that each set has the same class proportions.
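A minimal sketch of a stratified split with scikit-learn, which preserves class proportions in each set (the data and split sizes are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (~90% / 10%).
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# First carve out the test set, then split the remainder into train/dev,
# stratifying on the labels so each set keeps roughly the same class proportions.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

for name, labels in [("train", y_train), ("dev", y_dev), ("test", y_test)]:
    print(name, np.bincount(labels) / len(labels))
```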
Scoping
To choose what to work on:
- Brainstorm business problems
- Brainstorm potential AI solutions
- Assess the technical feasibility (using literature, and benchmarks…) and business value of solutions (ROI)
- Define key metrics (ML, Software and business). Find shared metrics on which business and tech people can agree.
- Budget resources.
- Define a timeline and milestones