Optimisers and learning-rate schedulers: a practical deep dive

Why the choice of optimiser (SGD, Adam, RMSprop) and the shape of your learning-rate schedule (step decay, exponential, cyclic) often matter more than another 1% of model capacity. A webinar recap, with the live optimiser-trajectory plot we built to make these trade-offs visible.

By Tannistha Maiti · 7 min read
EarthScan insight

Optimiser choice matters most early in training. LR-schedule choice matters most late. If you have to pick one to obsess over, pick the schedule.

This post recaps a practical webinar on a part of deep-learning training that doesn't get talked about enough: the optimiser and the learning-rate schedule, which are often the difference between a model that converges and one that doesn't, regardless of how much capacity you've thrown at the architecture.

We built a live optimiser-trajectory plot for this session so you can watch SGD, Adam, and RMSprop fight their way across a contoured loss surface in real time. Recording at the top.

The webinar in numbers

  • 5 optimiser families covered (SGD, SGD+momentum, Adam, AdamW, RMSprop)
  • 5 LR schedule shapes (step / exponential / cosine / cyclic / one-cycle)
  • The hyperparameter that beats most architecture choices: the learning rate
  • Random > grid: Bergstra & Bengio's headline, that random search beats grid at the same compute budget

[Interactive figure: 5 LR schedules, 5 personalities (step, exponential, cosine, cyclic, one-cycle). Y-axis: learning rate (normalised). X-axis: training epochs (0 → 100). Curves are illustrative; the absolute values depend on the optimiser and the model.]

Optimisers — picking the right gradient-descent flavour

An optimiser determines how the model's parameters are updated after each backward pass so that the loss function is driven down. The popular choices and where each one earns its keep:

  • Stochastic Gradient Descent (SGD): the workhorse. Slow and noisy, but its noise is a feature: SGD with momentum routinely matches Adam on image classification at convergence and often generalises better. Pick it when you have the training time and a large dataset.
  • Adam: adaptive per-parameter step sizes. Fast convergence, especially on sparse gradients (NLP, recommender systems). Pays for it with worse final-epoch generalisation than SGD on some computer-vision tasks.
  • RMSprop: the "Adam without the momentum" baseline. Still useful for RNNs, where gradient magnitudes vary wildly from step to step.
  • AdamW: Adam with decoupled weight decay. Closes most of the generalisation gap that vanilla Adam leaves on the table.
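To make the differences between these update rules concrete, here is a minimal pure-Python sketch of SGD with momentum, RMSprop, and Adam (with an optional decoupled weight-decay term in the style of AdamW) on a toy, badly conditioned quadratic loss. The loss surface, the starting point, the hyperparameter values, and all function names are assumptions chosen for illustration; this is not the code behind the webinar's trajectory plot.

```python
import math

def loss(p):
    """Toy loss f(x, y) = 0.5 * (x**2 + 25 * y**2): steep in y, shallow in x."""
    x, y = p
    return 0.5 * (x * x + 25.0 * y * y)

def grad(p):
    x, y = p
    return [x, 25.0 * y]

def sgd_momentum(p, state, lr=0.02, mu=0.9):
    # Velocity accumulates past gradients; every parameter shares one step size.
    g = grad(p)
    v = state.get("v", [0.0, 0.0])
    state["v"] = [mu * vi + gi for vi, gi in zip(v, g)]
    return [pi - lr * vi for pi, vi in zip(p, state["v"])]

def rmsprop(p, state, lr=0.02, rho=0.9, eps=1e-8):
    # A running average of squared gradients rescales each parameter's step.
    g = grad(p)
    s = state.get("s", [0.0, 0.0])
    state["s"] = [rho * si + (1 - rho) * gi * gi for si, gi in zip(s, g)]
    return [pi - lr * gi / (math.sqrt(si) + eps)
            for pi, gi, si in zip(p, g, state["s"])]

def adam(p, state, lr=0.02, b1=0.9, b2=0.999, eps=1e-8, weight_decay=0.0):
    # Momentum (m) plus RMSprop-style scaling (v), both bias-corrected.
    g = grad(p)
    t = state["t"] = state.get("t", 0) + 1
    m = state.get("m", [0.0, 0.0])
    v = state.get("v", [0.0, 0.0])
    state["m"] = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    state["v"] = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - b1 ** t) for mi in state["m"]]
    v_hat = [vi / (1 - b2 ** t) for vi in state["v"]]
    # weight_decay > 0 gives an AdamW-style step: the decay acts on the
    # parameter directly instead of being folded into the gradient.
    return [pi - lr * (mh / (math.sqrt(vh) + eps) + weight_decay * pi)
            for pi, mh, vh in zip(p, m_hat, v_hat)]

def run(step, n_steps=300, start=(2.0, 1.0)):
    """Apply one optimiser for n_steps and return the final loss."""
    p, state = list(start), {}
    for _ in range(n_steps):
        p = step(p, state)
    return loss(p)

final_losses = {
    "sgd_momentum": run(sgd_momentum),
    "rmsprop": run(rmsprop),
    "adam": run(adam),
}
```

Printing `final_losses` shows how far each rule gets from the same start in the same number of steps. The identical learning rate is used for all three only to keep the sketch short; in practice each optimiser needs its own tuned value.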

The right pick is task-dependent — but the meta-rule is simple: don't use Adam if SGD with momentum will converge in your training-time budget.

Learning-rate schedulers — making the optimiser navigate well

The learning rate is the step size. It's the single most influential hyperparameter in deep learning, and a fixed learning rate is almost never optimal: set it too high and training diverges or bounces around a minimum without settling; set it too low and progress stalls.
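The step-size intuition can be checked on the simplest possible loss. The sketch below runs plain gradient descent on f(x) = x² with three fixed learning rates; the function name and the specific values are illustrative assumptions.

```python
def gradient_descent(lr, x0=1.0, n_steps=20):
    """Minimise f(x) = x**2 with a fixed learning rate; return |x| at the end."""
    x = x0
    for _ in range(n_steps):
        x -= lr * 2.0 * x  # f'(x) = 2x, so each step multiplies x by (1 - 2 * lr)
    return abs(x)

too_low = gradient_descent(lr=0.001)    # factor 0.998 per step: barely moves
reasonable = gradient_descent(lr=0.3)   # factor 0.4 per step: converges quickly
too_high = gradient_descent(lr=1.1)     # factor -1.2 per step: iterates blow up
```

On this loss the iterates shrink only when |1 - 2·lr| < 1, which is why the third run moves away from the minimum instead of towards it.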

The schedules we covered:

  • Step decay — drop the LR by a factor every N epochs. Simple, brittle. Works when you know the right N.
  • Exponential decay — continuous decay. Smoother than step.
  • Cosine annealing — decay along a half-cosine. Empirically strong on image and language pretraining.
  • Cyclic learning rates (Smith, 2017) — oscillate between a low and a high LR. Lets the optimiser escape sharp minima.
  • One-cycle policy — Smith's follow-up, a single triangular cycle. The most reliable "default schedule" for fine-tuning today.
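The five shapes above can each be written as a small function of the epoch. The following is a minimal sketch with made-up default values (decay factors, cycle lengths, warm-up fraction); the one-cycle variant here uses a linear warm-up and a linear anneal, which is one of several formulations in use.

```python
import math

def step_decay(epoch, lr0=0.1, factor=0.5, every=30):
    # Multiply the LR by `factor` once every `every` epochs.
    return lr0 * factor ** (epoch // every)

def exponential_decay(epoch, lr0=0.1, gamma=0.97):
    # Multiply the LR by `gamma` every epoch.
    return lr0 * gamma ** epoch

def cosine_annealing(epoch, lr0=0.1, total=100, lr_min=0.0):
    # Half-cosine from lr0 at epoch 0 down to lr_min at epoch `total`.
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / total))

def cyclic_triangular(epoch, base_lr=0.01, max_lr=0.1, half_cycle=10):
    # Triangle wave: base_lr -> max_lr over half_cycle epochs, then back down.
    cycle = math.floor(1 + epoch / (2 * half_cycle))
    x = abs(epoch / half_cycle - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

def one_cycle(epoch, max_lr=0.1, total=100, warmup_frac=0.3,
              start_div=25.0, final_div=1e4):
    # One triangle: linear warm-up to max_lr, then linear anneal to near zero.
    warm = warmup_frac * total
    start_lr, final_lr = max_lr / start_div, max_lr / final_div
    if epoch <= warm:
        return start_lr + (max_lr - start_lr) * epoch / warm
    return max_lr + (final_lr - max_lr) * (epoch - warm) / (total - warm)

# Tabulate every schedule over a 100-epoch run for plotting or inspection.
schedules = {
    "step": step_decay,
    "exponential": exponential_decay,
    "cosine": cosine_annealing,
    "cyclic": cyclic_triangular,
    "one_cycle": one_cycle,
}
curves = {name: [fn(e) for e in range(101)] for name, fn in schedules.items()}
```

Each entry in `curves` is a list of 101 learning rates, one per epoch, which can be plotted against the epoch index to compare the shapes.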

Pick the schedule after you've picked the optimiser, and tune them together — a great LR for SGD is a terrible LR for Adam.
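One reason a learning rate does not transfer between the two: SGD's step is proportional to the gradient's magnitude, whereas Adam divides by a running estimate of that magnitude, so its step is roughly the learning rate itself. The sketch below compares the very first update of each rule across three gradient scales; the function names and values are illustrative assumptions.

```python
import math

def sgd_first_step(g, lr):
    # Plain SGD: step size scales linearly with the gradient.
    return lr * g

def adam_first_step(g, lr, b1=0.9, b2=0.999, eps=1e-8):
    # First Adam update (t = 1), including bias correction.
    m = (1 - b1) * g
    v = (1 - b2) * g * g
    m_hat = m / (1 - b1)
    v_hat = v / (1 - b2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

lr = 0.1
grads = (0.01, 1.0, 100.0)
sgd_steps = [sgd_first_step(g, lr) for g in grads]
adam_steps = [adam_first_step(g, lr) for g in grads]
```

The SGD steps span four orders of magnitude across these gradients, while the Adam steps all stay close to `lr`. A value that suits one rule can therefore be far too large or far too small for the other.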

Watching the trade-offs in action

The webinar walked through real examples across computer vision, NLP, and recommendation systems. Two patterns held up in every example:

  1. Optimiser choice matters most early in training. The first few epochs are when you're far from any minimum and the geometry differences between SGD, Adam, and RMSprop are largest.
  2. LR-schedule choice matters most late in training. Once you're near a minimum, the schedule decides whether you settle into a flat basin (good generalisation) or a sharp one (bad generalisation, even if the train loss is lower).

If you have to pick one to obsess over, pick the schedule.

Hyperparameter tuning — not optional

Every optimiser and every schedule introduces hyperparameters: initial LR, momentum, β₁/β₂ for Adam, the cycle period for cyclic LR, etc. The webinar covered three approaches to finding good values:

  • Grid search: brute-force, only viable when you have ≤ 3 hyperparameters.
  • Random search: usually finds better configurations than grid search at the same compute budget (Bergstra & Bengio, 2012), because only a few hyperparameters tend to matter and a grid wastes trials on the rest. The default "no-thinking" baseline.
  • Bayesian optimisation: Gaussian-process or TPE-based search that builds a model of the validation-loss surface as it goes. Pays for itself when each training run is expensive.

For most production projects, random search over a reasonable hypercube is the pragmatic choice. Reach for Bayesian optimisation when single training runs cost > 24 GPU-hours.
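As a rough illustration of why random search spends the same budget better, the sketch below generates nine trials each way over two hyperparameters (learning rate and momentum). The ranges, the log-uniform sampling of the learning rate, and the helper names are assumptions for the example; the training and evaluation step is left out.

```python
import itertools
import math
import random

def grid_trials(lrs, momenta):
    # Every combination of the listed values.
    return list(itertools.product(lrs, momenta))

def random_trials(n, rng, lr_range=(1e-4, 1e-1), momentum_range=(0.8, 0.99)):
    # Learning rate sampled log-uniformly, momentum sampled uniformly.
    lo, hi = math.log10(lr_range[0]), math.log10(lr_range[1])
    return [(10 ** rng.uniform(lo, hi), rng.uniform(*momentum_range))
            for _ in range(n)]

grid = grid_trials([1e-4, 1e-3, 1e-2], [0.8, 0.9, 0.99])
rand = random_trials(9, random.Random(0))

# Count how many different learning rates each strategy actually tries.
distinct_lrs_grid = len({lr for lr, _ in grid})
distinct_lrs_rand = len({lr for lr, _ in rand})
```

With nine trials the grid only ever tests three distinct learning rates, while the random draw tests a different one on every trial. If the learning rate is the axis that matters, the random budget explores it three times as densely.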

Takeaways

Three things to remember when your model isn't converging:

  1. Try SGD with momentum before Adam. It might just need patience.
  2. One-cycle schedule is a strong default. Especially for fine-tuning.
  3. Tune the optimiser and the schedule together, not separately. They interact.

Key takeaways

  1. Try SGD with momentum before Adam. It might just need patience, and it often generalises better at convergence on computer-vision tasks.
  2. One-cycle policy is a strong default for fine-tuning: warm up, then anneal — Smith's super-convergence recipe gets to a strong minimum in fewer epochs.
  3. Tune the optimiser AND the schedule together, not separately. They interact — a great LR for SGD is a terrible LR for Adam.
  4. Random search beats grid search at the same compute budget (Bergstra & Bengio, 2012). Reach for Bayesian optimisation when single training runs cost > 24 GPU-hours.

Glossary

Adam
Adaptive Moment Estimation (Kingma & Ba, 2014) — combines momentum with per-parameter adaptive step sizes. Fast convergence on sparse-gradient problems (NLP, recommenders); sometimes worse final-epoch generalisation than SGD on image tasks.
AdamW
Adam with decoupled weight decay (Loshchilov & Hutter, 2017). Closes most of the generalisation gap that vanilla Adam leaves on the table — the modern transformer default.
Bayesian optimisation
Hyperparameter-search technique that builds a probabilistic model of the validation-loss surface as it tries hyperparameter combinations. Pays for itself when single training runs are expensive (>24 GPU-hours).
Cosine annealing
A learning-rate schedule that decays the LR following a half-cosine curve from initial value to zero across the training horizon. Smoother than step decay, less aggressive than exponential — the empirical winner for image and language pretraining since 2018.
Cyclic LR
A learning-rate schedule (Smith, 2017) that oscillates between a base and max value following a triangle wave. The high LRs help the optimiser escape sharp local minima and find flatter ones (better generalisation).
One-cycle policy
Smith's super-convergence schedule — a single triangular cycle that warms up to a max LR then anneals to nearly zero. Reaches strong minima in fewer epochs than constant-LR training. The modern fine-tuning default.
Random search
Hyperparameter-search baseline that samples random points in the hyperparameter hypercube. Bergstra & Bengio (2012) showed it is usually more efficient than grid search at the same compute budget, because most hyperparameters don't matter equally and grid search wastes samples on the unimportant axes.
RMSprop
Root Mean Square propagation — Adam without the momentum term. Maintains a running average of squared gradients to normalise step sizes per parameter. Used heavily in RNN training where gradient magnitudes vary wildly per step.
SGD
Stochastic Gradient Descent — updates model parameters in the direction of the negative gradient computed on a mini-batch. The classical, noise-friendly optimiser; with momentum, still competitive on image classification at convergence.

© 2026 Copyright. Earthscan