Data-centric AI development: from Big Data to Good Data

Andrew Ng's reframing, that the next 10x in real-world model performance will come from improving the data rather than the model, changes how production teams build pipelines. A webinar recap covering data ingestion, the data pipeline, MLOps, data augmentation, and why AugMix is one of the most underrated augmentation methods.

By Tannistha Maiti · 5 min read
Figure: Modern data ingestion: many sources flowing into a single curated lake, ready for downstream model training.

The next 10x in real-world model performance comes from improving the data, not the architecture. Production AI teams that haven't internalised this still ship the wrong things.

Andrew Ng's framing of data-centric AI is the lens this webinar uses to walk through what production AI teams actually spend their time on. Spoiler: it's not designing new transformer variants.

Data ingestion — the unsexy 60% of any AI project

Data ingestion is the act of moving data from many sources (CRM, ERP, sensor streams, scraped logs, partner APIs) into a centralised lake or warehouse where it can be queried at scale. It's also where AI projects most often die — not from a lack of intelligence, but from a lack of plumbing.

The hard parts:

  • Schema drift — source systems silently change column types and you find out when a downstream join breaks two months later.
  • Latency vs freshness trade-offs — batch is cheap and easy; streaming is expensive and complicated; most teams need both.
  • Provenance — knowing which row came from which source on which day is non-negotiable for regulated industries (energy included).
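To make the schema-drift and provenance points concrete, here is a minimal sketch of a per-row contract check plus a provenance stamp. The column names, the `EXPECTED_SCHEMA` contract, and the `_source` / `_ingested_on` fields are all hypothetical, chosen only for illustration; a real ingestion layer would more likely lean on a schema registry or a validation library.

```python
from datetime import date

# Hypothetical contract: column name -> expected Python type.
EXPECTED_SCHEMA = {"well_id": str, "pressure_kpa": float, "reading_ts": str}


def check_schema(row: dict, expected: dict = EXPECTED_SCHEMA) -> list[str]:
    """Return human-readable drift problems for one incoming row."""
    problems = []
    for column, expected_type in expected.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            problems.append(
                f"{column}: expected {expected_type.__name__}, got {actual}"
            )
    # Columns the contract does not know about are also reported.
    for column in sorted(row.keys() - expected.keys()):
        problems.append(f"unexpected column: {column}")
    return problems


def with_provenance(row: dict, source: str, ingested_on: date) -> dict:
    """Copy the row and stamp which source it came from and on which day."""
    return {**row, "_source": source, "_ingested_on": ingested_on.isoformat()}
```

Running the check at ingestion time, before the provenance stamp is added, means a silent type change surfaces on the day it happens rather than when a downstream join breaks.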

A robust ingestion layer pays compounding returns: every downstream model gets cleaner data, faster, with audit trails that regulators can read.

The data pipeline — turning raw bytes into model-ready features

Past ingestion, the pipeline does the work nobody puts on a slide deck:

  • Cleaning — null-handling, deduplication, type normalisation.
  • Transformation — joins, aggregations, derived features.
  • Validation — schema contracts, statistical tests on incoming distributions.
  • Feature engineering — domain-specific transformations that turn raw fields into model inputs.
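As one illustration of the validation step, the sketch below scores how far an incoming batch of a numeric feature has moved from a reference sample using the population stability index (PSI). PSI is our choice of example, not something the webinar prescribed, and the bin count and the small floor for empty bins are assumptions.

```python
import math


def psi(reference: list[float], incoming: list[float], bins: int = 10) -> float:
    """Population stability index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Small floor so that empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    ref, inc = bin_shares(reference), bin_shares(incoming)
    return sum((i - r) * math.log(i / r) for r, i in zip(ref, inc))
```

Identical samples score 0. A commonly quoted rule of thumb treats values above roughly 0.2 as a shift worth investigating, but the right threshold depends on the feature and should be tuned against your own history.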

Investing in this layer once means every new model you ship onboards in days instead of weeks. Skipping it means every team rebuilds the same plumbing in slightly incompatible ways and the org accumulates technical debt that becomes impossible to refactor by year three.

MLOps — DevOps for the parts of the AI lifecycle that need it

Figure: The MLOps lifecycle: model training, deployment, monitoring, and a feedback loop into retraining.

MLOps is what you call DevOps once your artefacts include trained model weights, training datasets, and feature definitions in addition to source code. It covers:

  • Reproducibility — given a commit hash + a dataset hash, can you regenerate this model exactly? If not, you don't have MLOps yet.
  • Deployment — pushing models behind APIs, A/B harnesses, batch scoring jobs.
  • Monitoring — drift detection, performance telemetry, distribution-shift alerts.
  • Governance — access control, audit logs, model registry, lineage.
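The reproducibility question above can be turned into something checkable. The sketch below derives a single run fingerprint from a commit hash, a dataset hash, and the hyperparameters; the function names and the JSON-based hashing scheme are assumptions for illustration, not a reference to any particular model registry.

```python
import hashlib
import json


def dataset_hash(rows: list[dict]) -> str:
    """Order-independent SHA-256 over the canonical JSON form of each row."""
    row_digests = sorted(
        hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()
        for row in rows
    )
    return hashlib.sha256("".join(row_digests).encode()).hexdigest()


def run_fingerprint(commit: str, data_hash: str, params: dict) -> str:
    """One ID that changes if the code, the data, or the hyperparameters change."""
    payload = json.dumps(
        {"commit": commit, "data": data_hash, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

Storing the fingerprint next to the model weights lets you ask "which code and data produced this?" later. A matching fingerprint is necessary rather than sufficient for exact regeneration: random seeds, library versions, and non-deterministic hardware kernels also need pinning.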

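The non-negotiable rule: track business metrics alongside model metrics. If your fraud-detection model's AUC went up 2% but the actual loss rate stayed flat, the model metric is not measuring what the business cares about, and monitoring that only watches AUC will never tell you so.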
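Here is a toy version of that fraud example, with a model metric and a business metric computed side by side. The transactions, scores, and the 0.5 alert threshold are made up for illustration, and `fraud_loss_rate` is a hypothetical stand-in for whatever loss figure your finance team actually reports.

```python
def auc(scores: list[float], labels: list[int]) -> float:
    """Chance that a random fraud case outscores a random legitimate one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fraud_loss_rate(scores, labels, amounts, threshold: float) -> float:
    """Share of total transaction value lost to fraud that scored below the threshold."""
    missed = sum(
        a for s, y, a in zip(scores, labels, amounts) if y == 1 and s < threshold
    )
    return missed / sum(amounts)


labels = [1, 1, 0, 0, 0]                 # 1 = fraud
amounts = [900, 50, 100, 100, 100]       # transaction value
old_scores = [0.40, 0.90, 0.50, 0.20, 0.10]
new_scores = [0.45, 0.90, 0.30, 0.20, 0.10]

# AUC improves from 5/6 (about 0.83) to 1.0 ...
auc_old, auc_new = auc(old_scores, labels), auc(new_scores, labels)
# ... but the large fraud still scores below 0.5, so the loss rate stays at 0.72.
loss_old = fraud_loss_rate(old_scores, labels, amounts, threshold=0.5)
loss_new = fraud_loss_rate(new_scores, labels, amounts, threshold=0.5)
```

The new model ranks every fraud case above every legitimate one, yet the 900-value fraud still slips under the alert threshold. A dashboard that shows only AUC would report a win here.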

Data augmentation — making the model see more without collecting more

For computer vision and NLP, augmentation transforms training samples (rotation, scaling, cropping, colour-jittering, masking) to expand the effective training set without collecting new data. Done right, augmentation:

  • Reduces overfitting on small datasets.
  • Makes models robust to common real-world distortions.
  • Costs little beyond extra training compute once the augmentation pipeline is in place.

Done wrong, it bakes in biased augmentations — e.g. randomly cropping medical images in a way that throws away the diagnostic region. Always validate augmented samples against the same QA bar you apply to original data.
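One way to guard against the cropping failure described above is to reject any random crop that cuts into an annotated region of interest. The sketch below assumes each image comes with a bounding box for the diagnostic region; the box format and function names are hypothetical.

```python
import random

Box = tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def contains(crop: Box, roi: Box) -> bool:
    """True when the region of interest lies entirely inside the crop."""
    return (
        crop[0] <= roi[0]
        and crop[1] <= roi[1]
        and crop[2] >= roi[2]
        and crop[3] >= roi[3]
    )


def random_crop_keeping_roi(
    width: int,
    height: int,
    crop_w: int,
    crop_h: int,
    roi: Box,
    rng: random.Random,
    max_tries: int = 100,
) -> Box:
    """Sample crop boxes until one keeps the ROI; fail loudly instead of returning a bad crop."""
    for _ in range(max_tries):
        left = rng.randint(0, width - crop_w)
        top = rng.randint(0, height - crop_h)
        crop = (left, top, left + crop_w, top + crop_h)
        if contains(crop, roi):
            return crop
    raise ValueError("no crop of this size kept the region of interest")
```

Raising an error when no valid crop exists is deliberate: a silent fallback would put exactly the samples you were trying to avoid back into the training set.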

AugMix — the underrated robustness method

Figure: AugMix combines diverse augmentation chains and pairs them with a Jensen-Shannon consistency loss.

AugMix (Hendrycks et al., ICLR 2020) is one of the most underrated augmentation methods we covered. Instead of picking one augmentation per training example, AugMix:

  1. Generates several augmentation chains, each a short random sequence of operations.
  2. Mixes the chain outputs with random convex weights, then blends the mixture with the original image.
  3. Trains the model to produce consistent predictions across the original and two augmented versions (via a Jensen-Shannon-divergence consistency loss).

The result: a model that's noticeably more robust to corruptions it never saw in training, as measured on benchmarks such as ImageNet-C, and that produces better-calibrated uncertainty estimates.
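The three steps can be sketched in plain Python. This is a toy rendering under stated assumptions, not the reference implementation: the "image" is a flat list of pixel values in [0, 1], and `invert`, `darken`, and `shift` are made-up stand-ins for the real image operations. The mixing weights follow the paper's recipe of Dirichlet weights across chains and a Beta-distributed blend with the original.

```python
import math
import random


# Toy stand-ins for real image operations; each maps [0, 1] pixels to [0, 1] pixels.
def invert(x): return [1.0 - v for v in x]
def darken(x): return [0.7 * v for v in x]
def shift(x): return x[1:] + x[:1]

OPS = [invert, darken, shift]


def augmix(x, rng, width=3, max_depth=3, alpha=1.0):
    """Mix `width` random augmentation chains, then blend the mix with the original."""
    # Dirichlet(alpha, ..., alpha) weights via normalised Gamma draws.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(width)]
    weights = [g / sum(gammas) for g in gammas]
    mixed = [0.0] * len(x)
    for w in weights:
        chain_out = x
        for _ in range(rng.randint(1, max_depth)):  # chain of 1..max_depth ops
            chain_out = rng.choice(OPS)(chain_out)
        mixed = [m + w * c for m, c in zip(mixed, chain_out)]
    m = rng.betavariate(alpha, alpha)  # weight kept on the original image
    return [m * a + (1.0 - m) * b for a, b in zip(x, mixed)]


def js_consistency(p_clean, p_aug1, p_aug2, eps=1e-12):
    """Jensen-Shannon divergence among three predicted class distributions."""
    mixture = [(a + b + c) / 3.0 for a, b, c in zip(p_clean, p_aug1, p_aug2)]

    def kl(p, q):
        return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

    return (kl(p_clean, mixture) + kl(p_aug1, mixture) + kl(p_aug2, mixture)) / 3.0
```

In training, `augmix` would be called twice per image to get two augmented views, and `js_consistency` on the three softmax outputs would be added to the usual classification loss on the clean image, scaled by a weighting hyperparameter.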

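For production deployments, where the inference distribution rarely matches the training distribution exactly, AugMix buys extra robustness for little engineering effort. The main cost is training compute: the consistency loss needs a forward pass on the clean image and on two augmented views of it.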

Glossary

AugMix
Robust data-augmentation technique (Hendrycks et al., ICLR 2020). Mixes several augmentation chains per sample and enforces consistent predictions across the resulting views to improve out-of-distribution robustness without hurting in-distribution accuracy.
Data-centric AI
Andrew Ng's framing — invest in better data, not bigger models. The next 10× in real-world performance comes from labelling, sampling, and augmentation, not from architecture search.
MLOps
DevOps applied to the ML lifecycle — automated training, deployment, monitoring, retraining. The connective tissue that turns a notebook into a production system.
Schema drift
Silent change in source-data column types or shapes that breaks downstream joins. The single most under-reported failure mode in production data pipelines.

Closing thought

Data-centric AI is a re-framing, not a new technique. The components — robust ingestion, well-engineered pipelines, MLOps discipline, thoughtful augmentation — are old. What's new is the recognition that investing in these unglamorous layers delivers more business value than chasing the next 0.5% on a leaderboard.

For teams shipping AI in regulated industries — energy, finance, healthcare — this isn't optional. The model is the easy part.


© 2026 Copyright. Earthscan