BLOG · 7 min read

Optimisers and learning-rate schedulers: a practical deep dive

by Tannistha Maiti, Senior AI Researcher · 19 Apr 2023

Why the choice of optimiser (SGD, Adam, RMSprop) and the shape of your learning-rate schedule (step decay, exponential, cyclic) often matter more than another 1% of model capacity. A webinar recap, with the live optimiser-trajectory plot we built to make these trade-offs visible.

“Optimiser choice matters most early in training. LR-schedule choice matters most late. If you have to pick one to obsess over, pick the schedule.”

A practical webinar on the part of deep-learning training that doesn't get talked about enough: the optimiser and the learning-rate schedule are often the difference between a model that converges and one that doesn't - regardless of how much capacity you've thrown at the architecture.

We built a live optimiser-trajectory plot for this session so you can watch SGD, Adam, and RMSprop fight their way across a contoured loss surface in real time. Recording at the top.

The webinar in numbers

5 - optimiser families covered (SGD, SGD+momentum, Adam, AdamW, RMSprop)
5 - LR schedule shapes (step / exponential / cosine / cyclic / one-cycle)
1× - the hyperparameter that beats most architecture choices: the learning rate
Random > grid - Bergstra & Bengio's headline: random search beats grid at the same compute budget

Optimisers - picking the right gradient-descent flavour

An optimiser determines how the parameters are updated at each training step, using the gradients from backpropagation to reduce the loss. The popular choices and where each one earns its keep:

Stochastic Gradient Descent (SGD) - the workhorse. Slow and noisy, but its noise is a feature: SGD with momentum routinely matches Adam on image classification at convergence and often generalises better. Pick it when you have the training time and a large dataset.
Adam - adaptive per-parameter step sizes. Fast convergence, especially on sparse gradients (NLP, recommender systems). Pays for it with worse final-epoch generalisation than SGD on some computer-vision tasks.
RMSprop - the "Adam without the momentum" baseline. Still useful for RNNs, where the gradient scale varies wildly from step to step.
AdamW - Adam with decoupled weight decay. Closes most of the generalisation gap that vanilla Adam leaves on the table.

The right pick is task-dependent - but the meta-rule is simple: don't use Adam if SGD with momentum will converge in your training-time budget.
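To make the differences between these families concrete, here is a minimal pure-Python sketch of their update rules. It is not code from the webinar: the toy loss, the starting point, the learning rates and every function name are assumptions chosen for illustration, and real projects would use a framework's built-in optimisers instead.

```python
import math

def loss(p):
    """Toy loss (an assumption): an elongated bowl, 25x steeper in y than in x."""
    x, y = p
    return 0.5 * (x ** 2 + 25.0 * y ** 2)

def grad(p):
    """Analytic gradient of the toy loss above."""
    x, y = p
    return [x, 25.0 * y]

def sgd_momentum_step(p, state, lr=0.02, mu=0.9):
    """SGD with momentum: accumulate a velocity, then step along it."""
    g = grad(p)
    v = state.setdefault("v", [0.0] * len(p))
    for i in range(len(p)):
        v[i] = mu * v[i] + g[i]
        p[i] -= lr * v[i]

def rmsprop_step(p, state, lr=0.02, alpha=0.99, eps=1e-8):
    """RMSprop: scale each gradient by a running RMS of its own history."""
    g = grad(p)
    s = state.setdefault("s", [0.0] * len(p))
    for i in range(len(p)):
        s[i] = alpha * s[i] + (1.0 - alpha) * g[i] ** 2
        p[i] -= lr * g[i] / (math.sqrt(s[i]) + eps)

def adam_step(p, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, weight_decay=0.0):
    """Adam; with weight_decay > 0 it becomes the decoupled, AdamW-style variant."""
    g = grad(p)
    m = state.setdefault("m", [0.0] * len(p))
    v = state.setdefault("v", [0.0] * len(p))
    t = state["t"] = state.get("t", 0) + 1
    for i in range(len(p)):
        # Decoupled decay: shrink the weight directly, outside the adaptive update.
        p[i] -= lr * weight_decay * p[i]
        m[i] = b1 * m[i] + (1.0 - b1) * g[i]          # first moment (momentum)
        v[i] = b2 * v[i] + (1.0 - b2) * g[i] ** 2     # second moment (scale)
        m_hat = m[i] / (1.0 - b1 ** t)                # bias correction
        v_hat = v[i] / (1.0 - b2 ** t)
        p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

def run(step_fn, steps=200, **kwargs):
    """Run one optimiser from a fixed starting point and return the final loss."""
    p, state = [2.0, 2.0], {}
    for _ in range(steps):
        step_fn(p, state, **kwargs)
    return loss(p)

for name, fn, kw in [
    ("SGD+momentum", sgd_momentum_step, {}),
    ("RMSprop", rmsprop_step, {}),
    ("Adam", adam_step, {}),
    ("AdamW", adam_step, {"weight_decay": 0.01}),
]:
    print(f"{name:13s} final loss = {run(fn, **kw):.2e}")
```

Note that the sketch gives Adam a different default learning rate from SGD and RMSprop: the adaptive methods normalise each gradient by its own history, so their step sizes live on a different scale and a learning rate does not transfer between families.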

Learning-rate schedulers - making the optimiser navigate well

The learning rate is the step size. It's the single most influential hyperparameter in deep learning, and a fixed learning rate is almost never optimal: set it too high and early training diverges; set it too low and training plateaus before it reaches a good minimum.
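A toy illustration of how sensitive training is to the step size, assuming a one-dimensional quadratic loss f(x) = x² and plain gradient descent (neither is from the webinar; the function name and the three rates are picked for illustration only):

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimise f(x) = x**2 with a fixed learning rate; return the final |x|."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x        # derivative of x**2
        x = x - lr * grad     # each step multiplies x by (1 - 2 * lr)
    return abs(x)

for lr in (1.1, 0.1, 0.001):
    print(f"lr={lr}: |x| after 50 steps = {gradient_descent(lr):.3g}")
```

Each update multiplies x by (1 - 2·lr), so a rate of 1.1 grows the error every step, 0.1 shrinks it quickly, and 0.001 barely moves in 50 steps. Real loss surfaces are not quadratic, but the same three regimes show up in practice.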

The schedules we covered:

Step decay - drop the LR by a factor every N epochs. Simple, brittle. Works when you know the right N.
Exponential decay - multiply the LR by a constant factor every epoch, giving a continuous decay. Smoother than step.
Cosine annealing - decay along a half-cosine. Empirically strong on image and language pretraining.
Cyclic learning rates (Smith, 2017) - oscillate between a low and a high LR. Lets the optimiser escape sharp minima.
One-cycle policy - Smith's follow-up, a single triangular cycle. The most reliable "default schedule" for fine-tuning today.

Pick the schedule after you've picked the optimiser, and tune them together - a great LR for SGD is a terrible LR for Adam.
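The five schedule shapes above can be sketched as plain functions of the epoch or step index. This is an illustrative sketch, not the webinar's code: all names and default values are assumptions, the cyclic version is the basic triangular form, and the one-cycle version uses linear ramps (library implementations often anneal with a cosine instead).

```python
import math

def step_decay(epoch, base_lr=0.1, drop=0.1, every=30):
    """Multiply the LR by `drop` once every `every` epochs."""
    return base_lr * drop ** (epoch // every)

def exponential_decay(epoch, base_lr=0.1, gamma=0.95):
    """Multiply the LR by `gamma` every epoch."""
    return base_lr * gamma ** epoch

def cosine_annealing(epoch, total, base_lr=0.1, min_lr=0.0):
    """Decay from base_lr to min_lr along a half-cosine over `total` epochs."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / total))

def cyclic_triangular(step, low=0.001, high=0.1, half_cycle=10):
    """Oscillate linearly between `low` and `high`; one cycle is 2 * half_cycle steps."""
    pos = step % (2 * half_cycle)
    frac = pos / half_cycle if pos <= half_cycle else 2.0 - pos / half_cycle
    return low + (high - low) * frac

def one_cycle(step, total, max_lr=0.1, start_lr=0.004, final_lr=1e-5, warm_frac=0.3):
    """A single cycle: linear warm-up to max_lr, then linear decay to final_lr."""
    warm = max(1, int(total * warm_frac))
    if step < warm:
        return start_lr + (max_lr - start_lr) * step / warm
    frac = (step - warm) / max(1, total - warm)
    return final_lr + (max_lr - final_lr) * (1.0 - frac)

# Print each schedule at a few points to compare their shapes.
for epoch in (0, 15, 30, 60, 99):
    print(
        epoch,
        round(step_decay(epoch), 5),
        round(exponential_decay(epoch), 5),
        round(cosine_annealing(epoch, 100), 5),
        round(cyclic_triangular(epoch), 5),
        round(one_cycle(epoch, 100), 5),
    )
```

Writing schedules as pure functions of the step index makes them easy to plot and compare before committing any GPU time.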

Watching the trade-offs in action

The webinar walked through real examples across computer vision, NLP, and recommendation systems. Two patterns held up in every example:

Optimiser choice matters most early in training. The first few epochs are when you're far from any minimum and the geometry differences between SGD, Adam, and RMSprop are largest.
LR-schedule choice matters most late in training. Once you're near a minimum, the schedule decides whether you settle into a flat basin (good generalisation) or a sharp one (bad generalisation, even if the train loss is lower).

If you have to pick one to obsess over, pick the schedule.

Interactive figure (schematic): relative influence of optimiser choice and LR-schedule choice across training epochs. Optimiser choice dominates early, far from any minimum, where the SGD/Adam/RMSprop geometry gap is largest; schedule choice dominates late, when training settles into a flat basin (good generalisation) or a sharp one. The curve shapes and the crossover point are illustrative only; the article quantifies no influence values.

Hyperparameter tuning - not optional

Every optimiser and every schedule introduces hyperparameters: initial LR, momentum, β₁/β₂ for Adam, the cycle period for cyclic LR, etc. The webinar covered three approaches to finding good values:

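Grid search - brute force; only viable with about three hyperparameters or fewer, because the number of runs grows exponentially with each one added.
Random search - usually better than grid search at the same compute budget, because it tries more distinct values of the few hyperparameters that actually matter (Bergstra & Bengio, 2012). The default "no-thinking" baseline.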
Bayesian optimisation - Gaussian-process or TPE-based search that builds a model of the validation-loss surface as it goes. Pays for itself when each training run is expensive.

For most production projects, random search over a reasonable hypercube is the pragmatic choice. Reach for Bayesian optimisation when single training runs cost > 24 GPU-hours.
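A small sketch of the difference between the first two approaches at an equal budget of nine runs. Everything here is hypothetical: `validation_loss` is a cheap stand-in for a real training run, and the search ranges, grid values and seed are assumptions picked for illustration.

```python
import itertools
import math
import random

def validation_loss(lr, momentum):
    """Hypothetical stand-in for a training run: very sensitive to lr, barely to momentum."""
    return (math.log10(lr) - math.log10(3e-3)) ** 2 + 0.01 * (momentum - 0.9) ** 2

def grid_search(lrs, momenta):
    """Evaluate every combination of the listed values."""
    return [(validation_loss(lr, m), lr, m) for lr, m in itertools.product(lrs, momenta)]

def random_search(n_trials, seed=0):
    """Sample lr log-uniformly from [1e-5, 1e-1] and momentum uniformly from [0.5, 0.99]."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-5, -1)
        momentum = rng.uniform(0.5, 0.99)
        trials.append((validation_loss(lr, momentum), lr, momentum))
    return trials

grid_trials = grid_search([1e-5, 1e-3, 1e-1], [0.5, 0.75, 0.99])   # 3 x 3 = 9 runs
random_trials = random_search(9)                                    # also 9 runs

for name, trials in (("grid", grid_trials), ("random", random_trials)):
    best_loss, best_lr, best_m = min(trials)
    distinct_lrs = len({lr for _, lr, _ in trials})
    print(f"{name:6s} distinct lr values = {distinct_lrs}, "
          f"best loss = {best_loss:.4f} at lr = {best_lr:.2e}, momentum = {best_m:.2f}")
```

With the same nine runs, the grid only ever tries three learning rates, while random search tries nine. A single seed proves nothing about which one wins on a given problem, but when one hyperparameter dominates, that extra coverage is the intuition behind the Bergstra & Bengio result.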

Takeaways

Three things to remember when your model isn't converging:

Try SGD with momentum before Adam. It might just need patience.
One-cycle schedule is a strong default. Especially for fine-tuning.
Tune the optimiser and the schedule together, not separately. They interact.
