“The next 10x in real-world model performance comes from improving the data, not the architecture. Production AI teams that haven't internalised this still ship the wrong things.”
Andrew Ng's framing of data-centric AI — that the next 10x in real-world model performance comes from improving the data, not the architecture — is the lens this webinar uses to walk through what production AI teams actually spend their time on. Spoiler: it's not designing new transformer variants.
Data ingestion — the unsexy 60% of any AI project
Data ingestion is the act of moving data from many sources (CRM, ERP, sensor streams, scraped logs, partner APIs) into a centralised lake or warehouse where it can be queried at scale. It's also where AI projects most often die — not from a lack of intelligence, but from a lack of plumbing.
The hard parts:
- Schema drift — source systems silently change column types and you find out when a downstream join breaks two months later.
- Latency vs freshness trade-offs — batch is cheap and easy; streaming is expensive and complicated; most teams need both.
- Provenance — knowing which row came from which source on which day is non-negotiable for regulated industries (energy included).
A robust ingestion layer pays back compounding interest: every downstream model gets cleaner data, faster, and with audit trails the regulators can read.
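As a minimal sketch of catching schema drift at ingestion time rather than two months later, the snippet below compares each incoming batch against a declared contract. The column names (`meter_id`, `reading_kwh`, `read_at`) and the `EXPECTED_SCHEMA` contract are hypothetical, chosen only for illustration; a real deployment would more likely lean on the warehouse's own schema metadata.

```python
from typing import Dict, List

# Hypothetical contract for one source: column name -> expected Python type name.
EXPECTED_SCHEMA: Dict[str, str] = {
    "meter_id": "str",
    "reading_kwh": "float",
    "read_at": "str",
}


def infer_schema(rows: List[dict]) -> Dict[str, str]:
    """Infer column -> type name from the first non-null value seen per column.

    A column that is null in every row is not reported, so it will show up
    as missing in the drift check below.
    """
    schema: Dict[str, str] = {}
    for row in rows:
        for col, value in row.items():
            if value is not None and col not in schema:
                schema[col] = type(value).__name__
    return schema


def detect_schema_drift(expected: Dict[str, str], observed: Dict[str, str]) -> List[str]:
    """Return human-readable differences between the contract and a batch."""
    issues: List[str] = []
    for col, typ in expected.items():
        if col not in observed:
            issues.append(f"missing column: {col}")
        elif observed[col] != typ:
            issues.append(f"type changed: {col} {typ} -> {observed[col]}")
    for col in observed:
        if col not in expected:
            issues.append(f"unexpected column: {col}")
    return issues


if __name__ == "__main__":
    # The source system has silently started sending readings as strings.
    batch = [{"meter_id": "m1", "reading_kwh": "12.5", "read_at": "2024-01-01"}]
    for issue in detect_schema_drift(EXPECTED_SCHEMA, infer_schema(batch)):
        print(issue)
```

Failing the load, or quarantining the batch, when the returned list is non-empty turns a silent type change into a same-day alert.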
The data pipeline — turning raw bytes into model-ready features
Past ingestion, the pipeline does the work nobody puts on a slide deck:
- Cleaning — null-handling, deduplication, type normalisation.
- Transformation — joins, aggregations, derived features.
- Validation — schema contracts, statistical tests on incoming distributions.
- Feature engineering — domain-specific transformations that turn raw fields into model inputs.
Investing in this layer once means every new model you ship onboards in days instead of weeks. Skipping it means every team rebuilds the same plumbing in slightly incompatible ways and the org accumulates technical debt that becomes impossible to refactor by year three.
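The cleaning and validation steps listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference pipeline: the field names, the choice of `(meter_id, read_at)` as the deduplication key, and the mean-based check are all assumptions, and a production validator would normally use a proper statistical test on the incoming distribution.

```python
from typing import Iterable, List, Optional, Sequence


def clean_readings(rows: Iterable[dict]) -> List[dict]:
    """Drop rows with null keys, deduplicate, and normalise types."""
    seen = set()
    cleaned: List[dict] = []
    for row in rows:
        meter_id = row.get("meter_id")
        read_at = row.get("read_at")
        if meter_id is None or read_at is None:
            continue  # null-handling: a row without its key cannot be joined
        key = (str(meter_id).strip(), str(read_at).strip())
        if key in seen:
            continue  # deduplication: keep the first occurrence only
        seen.add(key)
        raw = row.get("reading_kwh")
        try:
            value: Optional[float] = float(raw) if raw is not None else None
        except (TypeError, ValueError):
            value = None  # type normalisation: unparseable readings become null
        cleaned.append({"meter_id": key[0], "read_at": key[1], "reading_kwh": value})
    return cleaned


def validate_mean(values: Sequence[Optional[float]], expected_mean: float,
                  tolerance: float) -> bool:
    """Crude distribution check: is the batch mean close to a reference mean?"""
    present = [v for v in values if v is not None]
    if not present:
        return False  # an all-null batch should never pass validation
    return abs(sum(present) / len(present) - expected_mean) <= tolerance


if __name__ == "__main__":
    raw = [
        {"meter_id": " m1 ", "read_at": "2024-01-01", "reading_kwh": "10.0"},
        {"meter_id": "m1", "read_at": "2024-01-01", "reading_kwh": "10.0"},
        {"meter_id": "m2", "read_at": "2024-01-01", "reading_kwh": "n/a"},
    ]
    rows = clean_readings(raw)
    print(rows)
    print(validate_mean([r["reading_kwh"] for r in rows], expected_mean=10.0, tolerance=0.5))
```

Keeping functions like these in one shared library is what stops each team from rebuilding slightly incompatible versions of them.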
MLOps — DevOps for the parts of the AI lifecycle that need it

MLOps is what you call DevOps once your artefacts include trained model weights, training datasets, and feature definitions in addition to source code. It covers:
- Reproducibility — given a commit hash + a dataset hash, can you regenerate this model exactly? If not, you don't have MLOps yet.
- Deployment — pushing models behind APIs, A/B harnesses, batch scoring jobs.
- Monitoring — drift detection, performance telemetry, distribution-shift alerts.
- Governance — access control, audit logs, model registry, lineage.
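One way to make the reproducibility test above concrete is to fingerprint every training run. The sketch below hashes the dataset rows and combines that hash with the commit hash. The function names are hypothetical, and including a random seed in the fingerprint is an added assumption: a commit hash plus a dataset hash is the minimum, but exact regeneration usually also depends on the seed and the software environment.

```python
import hashlib
import json
from typing import Iterable


def dataset_hash(rows: Iterable[dict]) -> str:
    """SHA-256 over a canonical JSON encoding of each row.

    Key order inside a row does not matter; row order does.
    """
    digest = hashlib.sha256()
    for row in rows:
        encoded = json.dumps(row, sort_keys=True, separators=(",", ":"))
        digest.update(encoded.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def run_fingerprint(commit_hash: str, rows: Iterable[dict], seed: int) -> str:
    """Combine code version, data version, and seed into one identifier."""
    payload = json.dumps(
        {"commit": commit_hash, "data": dataset_hash(rows), "seed": seed},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    rows = [{"meter_id": "m1", "reading_kwh": 10.0}]
    # "abc123" is a placeholder commit hash, used here for illustration only.
    print(run_fingerprint("abc123", rows, seed=0))
```

Storing this fingerprint next to the model in the registry lets anyone check whether two models really came from the same code and data.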
The non-negotiable rule: track business metrics alongside model metrics. If your fraud-detection model's AUC went up 2% but the actual loss rate stayed flat, the model metric is not measuring what the business cares about, and monitoring that watches AUC alone will never tell you.
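For the drift-detection part of monitoring, one commonly used statistic is the Population Stability Index (PSI), which compares the binned distribution of a feature in production against a reference sample. The webinar does not prescribe PSI, so treat this as one possible sketch. The bin count, the `eps` floor, and any alert threshold are assumptions to tune per feature.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `reference`.

    Bins are equal-width over the reference range; values outside that
    range are clamped into the first or last bin.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / n_bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor each proportion at eps so the logarithm below is defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


if __name__ == "__main__":
    reference = [i / 100 for i in range(100)]
    shifted = [0.5 + i / 200 for i in range(100)]
    print(psi(reference, reference))  # identical samples give no drift signal
    print(psi(reference, shifted))    # mass moved into the upper bins
```

A feature-level score like this is a model-side signal only, so it belongs on the same dashboard as the business metric rather than in place of it.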
Data augmentation — making the model see more without collecting more
For computer vision and NLP, augmentation transforms training samples (rotation, scaling, cropping, colour-jittering, masking) to expand the effective training set without collecting new data. Done right, augmentation:
- Reduces overfitting on small datasets.
- Makes models robust to common real-world distortions.
- Adds no data-collection or labelling cost once the augmentation pipeline is in place (on-the-fly transforms still use some training compute).
Done wrong, it bakes in biased augmentations — e.g. randomly cropping medical images in a way that throws away the diagnostic region. Always validate augmented samples against the same QA bar you apply to original data.
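One hedged way to guard against the cropping failure described above is to reject any random crop that does not fully contain an annotated region of interest. The sketch assumes each image comes with a bounding box for the diagnostic region, which is an assumption about the dataset; the `Box` layout and function names are hypothetical.

```python
import random
from typing import Optional, Tuple

# (left, top, right, bottom) in pixels, with right and bottom exclusive.
Box = Tuple[int, int, int, int]


def random_crop_box(width: int, height: int, crop_w: int, crop_h: int,
                    rng: random.Random) -> Box:
    """Sample a crop window uniformly inside a width x height image."""
    left = rng.randint(0, width - crop_w)  # randint is inclusive at both ends
    top = rng.randint(0, height - crop_h)
    return (left, top, left + crop_w, top + crop_h)


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def roi_preserving_crop(width: int, height: int, crop_w: int, crop_h: int,
                        roi: Box, rng: random.Random,
                        max_tries: int = 50) -> Optional[Box]:
    """Rejection-sample a crop that keeps the region of interest intact.

    Returns None if no valid crop is found, in which case the caller
    should fall back to the uncropped image.
    """
    for _ in range(max_tries):
        box = random_crop_box(width, height, crop_w, crop_h, rng)
        if contains(box, roi):
            return box
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    lesion = (40, 40, 60, 60)  # hypothetical diagnostic region
    print(roi_preserving_crop(100, 100, 80, 80, lesion, rng))
```

Logging how often the fallback path fires is a cheap QA signal: a high rate suggests the crop size is too aggressive for the data.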
AugMix — the underrated robustness method

AugMix (Hendrycks et al., ICLR 2020) is one of the most underrated augmentation methods we covered. Instead of picking one augmentation per training example, AugMix:
- Generates multiple augmentation chains of varying severity.
- Mixes their outputs with random convex weights, then blends that mixture with the original image.
- Trains the model to produce consistent predictions across the original and augmented versions (via a Jensen-Shannon-divergence consistency loss).
The result: a model that's noticeably more robust to corruptions it never saw in training, as measured on the standard ImageNet-C benchmark, and that produces better-calibrated uncertainty estimates.
For production deployments, where the inference distribution rarely matches the training distribution exactly, AugMix buys extra robustness for little engineering cost. The main price is training compute, because the consistency loss needs forward passes on the original image and on its augmented versions.
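The mixing and consistency steps can be sketched without any deep-learning framework. This is a simplified illustration, not the authors' implementation: images are flattened lists of intensities in [0, 1], the three operations are stand-ins for the image transforms used in the paper, and `width` and `alpha` are illustrative defaults.

```python
import math
import random
from typing import Callable, List, Sequence

Image = List[float]  # flattened pixel intensities in [0, 1]


# Stand-in operations; the paper uses image transforms such as rotate or shear.
def brighten(img: Image) -> Image:
    return [min(1.0, p + 0.1) for p in img]


def solarize(img: Image) -> Image:
    return [1.0 - p if p > 0.5 else p for p in img]


def posterize(img: Image) -> Image:
    return [round(p * 4) / 4 for p in img]


OPS: List[Callable[[Image], Image]] = [brighten, solarize, posterize]


def augmix(img: Image, rng: random.Random, width: int = 3, alpha: float = 1.0) -> Image:
    """Mix `width` random augmentation chains, then blend with the original."""
    # Dirichlet(alpha) convex weights, drawn as normalised Gamma samples.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(width)]
    weights = [g / sum(gammas) for g in gammas]
    mixed = [0.0] * len(img)
    for w in weights:
        chain_out = list(img)
        for _ in range(rng.randint(1, 3)):  # each chain applies 1 to 3 ops
            chain_out = rng.choice(OPS)(chain_out)
        mixed = [m + w * c for m, c in zip(mixed, chain_out)]
    m = rng.betavariate(alpha, alpha)  # blend factor between original and mixture
    return [m * p + (1.0 - m) * q for p, q in zip(img, mixed)]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence KL(p || q), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_consistency(p_clean: Sequence[float], p_aug1: Sequence[float],
                   p_aug2: Sequence[float]) -> float:
    """Jensen-Shannon divergence among predictions on three views of one sample."""
    dists = [p_clean, p_aug1, p_aug2]
    mean = [sum(vals) / 3 for vals in zip(*dists)]
    return sum(kl(p, mean) for p in dists) / 3


if __name__ == "__main__":
    rng = random.Random(0)
    image = [0.0, 0.25, 0.5, 0.75, 1.0]
    print(augmix(image, rng))
    # Hypothetical softmax outputs for the clean view and two AugMix views.
    print(js_consistency([0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.7, 0.2, 0.1]))
```

In training, the consistency term is added to the usual classification loss on the clean image, so the model is rewarded for giving the same answer across all three views.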
Glossary
- AugMix: Robust data-augmentation technique (Hendrycks et al., ICLR 2020). Mixes several augmentation chains per image and trains the model to predict consistently across the original and augmented versions, improving out-of-distribution robustness without hurting in-distribution accuracy.
- Data-centric AI: Andrew Ng's framing: invest in better data, not bigger models. The next 10x in real-world performance comes from labelling, sampling, and augmentation, not from architecture search.
- MLOps: DevOps applied to the ML lifecycle: automated training, deployment, monitoring, retraining. The connective tissue that turns a notebook into a production system.
- Schema drift: Silent change in source-data column types or shapes that breaks downstream joins. The single most under-reported failure mode in production data pipelines.
Closing thought
Data-centric AI is a re-framing, not a new technique. The components — robust ingestion, well-engineered pipelines, MLOps discipline, thoughtful augmentation — are old. What's new is the recognition that investing in these unglamorous layers delivers more business value than chasing the next 0.5% on a leaderboard.
For teams shipping AI in regulated industries — energy, finance, healthcare — this isn't optional. The model is the easy part.
Further reading
- Data augmentation guide — Machine Learning Mastery
- Data augmentation projects on GitHub
- AugMix paper (arXiv 1912.02781)
