Averaging the Endpoints: Weight Soups and SWA for Small-Data Dense Prediction

Abstract

A practitioner who wants the stability of an ensemble but cannot afford its inference cost has two very different averages to reach for, and the literature does not always keep them apart. One averages predictions: train several networks, run them all at test time, and average their outputs. The other averages parameters: take several weight vectors and average them into a single network that is then served alone. This survey reads the published work on the second idea, weight-space averaging, in its three best-known forms, stochastic weight averaging, the exponential moving average of weights, and model soups, and distinguishes it carefully from prediction-space ensembling. We trace weight averaging to its origin in the averaging of iterates in stochastic approximation, credit the period-correct work that established why averaging weights along a single trajectory lands in a wider, flatter region of the loss surface, and set out the mode-connectivity geometry that makes averaging two weight vectors a sensible operation at all. We read the synthesis against a real small-data dense-prediction baseline from our raster well-log digitiser, VeerNet, a fifty-epoch multiclass segmentation run at batch size one that reaches a background-mask intersection over union of 0.94 and a best single-curve goodness-of-fit of R-squared 0.9891 in roughly 550 minutes of wall clock on one GPU. The central finding is that on a memory-bound, single-GPU run, averaging the weights of late checkpoints captures much of the variance reduction and flat-minimum benefit that a prediction ensemble of the same checkpoints would give, while serving as one network at one network's cost, and that this is precisely the regulariser a one-GPU dense-prediction project should want before it pays for an ensemble it then has to deploy.

The idea of averaging the iterates of a stochastic optimiser is older than deep learning by a wide margin. Polyak and Juditsky showed, in the context of stochastic approximation, that averaging the successive iterates of a noisy gradient procedure yields an estimate that converges faster and is more robust than any single iterate, because the average cancels the per-step noise that the optimiser injects [1]. That result is the conceptual root of everything this survey covers: a trained network at the end of a run is one noisy iterate, and the averaging trick says that the mean of several late iterates may be a better solution than the best of them.
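The claim that the mean of late iterates can beat any single iterate is easy to check on a toy problem. The sketch below is an illustration under stated assumptions, not a reproduction of the cited analysis: the quadratic objective, the Gaussian gradient noise, the step size, and the choice to average only the second half of the run are all hypothetical.

```python
import random


def noisy_sgd_with_averaging(steps=2000, lr=0.1, noise=1.0, seed=0):
    """Minimise f(x) = 0.5 * x**2 with noisy gradients.

    Returns (last iterate, mean of the late iterates). The optimum is x = 0,
    so the squared value of each output is its squared error.
    """
    rng = random.Random(seed)
    x = 5.0
    avg, count = 0.0, 0
    burn_in = steps // 2  # assumption: only the second half of the run is averaged
    for t in range(steps):
        grad = x + rng.gauss(0.0, noise)  # true gradient plus injected noise
        x -= lr * grad
        if t >= burn_in:
            count += 1
            avg += (x - avg) / count  # running mean of the late iterates
    return x, avg


runs = [noisy_sgd_with_averaging(seed=s) for s in range(20)]
mse_last = sum(last * last for last, _ in runs) / len(runs)
mse_avg = sum(mean * mean for _, mean in runs) / len(runs)
print(f"squared error, last iterate:     {mse_last:.4f}")
print(f"squared error, averaged iterate: {mse_avg:.4f}")
```

With a constant step size the last iterate keeps jittering around the optimum, while the average of many late iterates cancels most of that jitter.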

Deep learning rediscovered this in two nearly simultaneous lines of work. The first kept a running, decayed average of the weights during training rather than averaging snapshots after the fact. Tarvainen and Valpola's mean-teacher method maintains an exponential moving average of the student network's weights and uses that averaged copy as a more stable teacher whose predictions supervise the student, and they report that the weight-averaged teacher gives better and more stable targets than the raw student [2]. The exponential moving average of weights is, in this framing, just the online form of the same averaging Polyak described, with a decay that weights recent iterates more heavily.
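The online form of the average can be written in one line. The following is a minimal sketch in which weights are plain lists of floats; the function name `ema_update` and the decay value are illustrative assumptions, not taken from the cited paper's code.

```python
def ema_update(ema_weights, weights, decay=0.99):
    """One EMA step: decay * previous average + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Hypothetical usage: the averaged copy trails the raw weights and smooths them.
ema = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.2, 1.8], [0.9, 2.1]):
    ema = ema_update(ema, step_weights, decay=0.9)
print(ema)
```

A decay closer to one averages over a longer window of recent iterates; a smaller decay tracks the raw weights more closely.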

The second line averaged the trajectory explicitly and named the procedure. Izmailov and colleagues introduced stochastic weight averaging, which runs the optimiser through the late phase of training at a cyclical or relatively high constant learning rate, snapshots the weights periodically, and averages those snapshots into a single set of parameters [3]. Their central empirical claim is geometric: the averaged solution sits in a wider, flatter region of the loss surface than any individual snapshot, and wider minima are associated with better generalisation. Crucially, the averaged network is one network. Inference costs a single forward pass; the only extra work is a one-time pass over the training data to recompute batch-normalisation statistics for the averaged weights.
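The snapshot average does not require keeping every snapshot in memory, which matters on a memory-bound run. A minimal sketch of the running-mean form follows, with weights as plain lists of floats; the function name `swa_update` and the toy snapshots are assumptions for illustration.

```python
def swa_update(swa_weights, n_averaged, snapshot):
    """Fold one more snapshot into the running mean of n_averaged snapshots."""
    updated = [
        (s * n_averaged + w) / (n_averaged + 1)
        for s, w in zip(swa_weights, snapshot)
    ]
    return updated, n_averaged + 1


# Hypothetical late-phase snapshots of a two-parameter model.
snapshots = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
swa, n = snapshots[0], 1
for snap in snapshots[1:]:
    swa, n = swa_update(swa, n, snap)
print(swa, n)
```

Only the running mean and a counter are stored, so the memory overhead is one extra copy of the weights regardless of how many snapshots are folded in.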

Why averaging two weight vectors should produce a usable third network at all is not obvious, and the period's mode-connectivity work is what makes it defensible. Garipov and colleagues showed that the distinct minima found by independent runs are not isolated basins separated by high barriers but are connected to one another by simple low-loss curves through weight space [4]. Frankle and colleagues sharpened the precondition into a property they called linear mode connectivity: under the right conditions two solutions can be joined by a straight line in weight space along which the loss does not rise, which is exactly the condition under which their average is also a low-loss solution [8]. Averaging weights is only sensible inside a region where the loss surface is connected and roughly convex, and these results map out when that holds.
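The linear-connectivity condition can be probed directly by walking the straight line between two weight vectors and recording the loss. The sketch below uses one common convention for the barrier (highest loss on the segment minus the mean of the endpoint losses); the two toy loss functions and all names are hypothetical stand-ins for a real network and validation loss.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line from theta_a (alpha=0) to theta_b (alpha=1)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Highest loss on the sampled segment minus the mean endpoint loss."""
    path = [
        loss_fn(interpolate(theta_a, theta_b, i / (steps - 1)))
        for i in range(steps)
    ]
    return max(path) - 0.5 * (path[0] + path[-1])


def bowl(theta):
    """Toy convex loss: both endpoints sit in one basin."""
    return sum(t * t for t in theta)


def double_well(theta):
    """Toy loss with two minima at -1 and +1 separated by a ridge at 0."""
    return (theta[0] ** 2 - 1.0) ** 2


same_basin = loss_barrier(bowl, [1.0, 0.0], [0.0, 1.0])
across_ridge = loss_barrier(double_well, [-1.0], [1.0])
print(same_basin, across_ridge)
```

A barrier near zero suggests the midpoint, which is the two-model average, is safe to serve; a large barrier means the average lands on the ridge.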

Model soups carried the same operation into the fine-tuning era. Wortsman and colleagues observed that when many copies of a model are fine-tuned from the same pre-trained initialisation under different hyperparameters, the fine-tuned weights tend to land in one linearly-connected region, so their weights can simply be averaged into a single model that often beats the best individual run and, again, costs nothing extra at inference [7]. The name captures the recipe: pour many runs into one pot and serve the blend. The greedy variant they propose, adding a run to the soup only if it does not hurt held-out accuracy, is the practical safeguard against a bad ingredient.
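The greedy recipe can be sketched in a few lines. This is a simplified illustration under assumptions, not the authors' implementation: weights are plain lists of floats, `val_score` is a hypothetical held-out scoring function where higher is better, and the toy candidates are invented so that one of them is a deliberately bad ingredient.

```python
def average(weight_list):
    """Uniform mean of several equally shaped weight vectors."""
    k = len(weight_list)
    return [sum(ws) / k for ws in zip(*weight_list)]


def greedy_soup(candidates, val_score):
    """Add runs in order of held-out score, keeping each only if the soup does not get worse."""
    ranked = sorted(candidates, key=val_score, reverse=True)
    soup = [ranked[0]]
    best = val_score(ranked[0])
    for cand in ranked[1:]:
        score = val_score(average(soup + [cand]))
        if score >= best:  # keep the ingredient only if held-out score holds or improves
            soup.append(cand)
            best = score
    return average(soup), best


def val_score(weights, target=(1.0, 1.0)):
    """Hypothetical held-out score: negative squared distance to a good solution."""
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


runs = [[0.8, 1.1], [1.2, 0.9], [3.0, 3.0]]  # the last run is a bad ingredient
soup_weights, soup_score = greedy_soup(runs, val_score)
print(soup_weights, soup_score)
```

In this toy setup the two nearby runs blend into a better solution than either alone, and the outlier is rejected because adding it lowers the held-out score.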

These weight-space methods sit next to, and are frequently confused with, prediction-space ensembling, which is the baseline they must be measured against. Lakshminarayanan and colleagues established deep ensembles, several independently trained networks whose predictions are averaged, as a simple and strong method for both accuracy and calibrated uncertainty [5]. Snapshot ensembles made that idea cheaper to train by collecting several checkpoints along a single cyclic-learning-rate schedule and ensembling their predictions rather than training separate models [6]. Both methods average predictions, not weights, and both therefore pay N forward passes at test time for N members. The line this survey draws is between that family and the weight-averaging family, because the deployment economics differ entirely. Two further results extend weight averaging in directions a dense-prediction practitioner should know about: the SWAG method fits a Gaussian over the optimiser's iterates around the weight-average solution to recover calibrated uncertainty from a single run [9], and SWAD adapts the averaging schedule to seek flat minima densely and with overfit awareness, reporting strong domain-generalisation gains [10].

Method

This is a structured reading of the published weight-averaging literature, not a new experiment, and the procedure was kept deliberately narrow so the claims stay defensible. We began from the operation common to all three weight-space methods, the averaging of several weight vectors into one, and traced it to its stochastic-approximation origin [1]. We then organised the deep-learning literature into the two families the rest of the survey turns on. The weight-space family was read in its three forms: the online exponential moving average of weights [2], the post-hoc averaging of trajectory snapshots in stochastic weight averaging [3], and the averaging of many fine-tuning runs in model soups [7], together with the mode-connectivity results that justify the operation [4] [8] and the two extensions that add uncertainty and a flatness-seeking schedule [9] [10]. The prediction-space family was read as the baseline against which the first must be judged: independently trained deep ensembles [5] and the cheaper single-schedule snapshot ensembles [6]. For each method we extracted three things: what is averaged, what it costs at training and at inference, and the regime its authors claim it for.

To keep the survey anchored to a real task rather than to abstractions, we read it against one concrete reference point drawn from the engagement archive: a small-data dense-prediction run from VeerNet, our raster well-log digitiser. The reference is a multiclass segmentation of a scanned log into background and two thin curves, trained for fifty epochs at a physical batch size of one, memory-bound by the variable and very wide log images, for roughly 550 minutes of wall clock on the fifteen-thousand-log synthetic set. The numbers we quote from it, a background-mask intersection over union of 0.94 and a best single-curve goodness-of-fit of R-squared 0.9891, are real and used as a worked example of where weight averaging would and would not help. They are not a new benchmark, and the survey does not re-measure any published method against the others. The interactive exhibit below is built on the same footing: a real anchor metric, with the per-checkpoint scatter and the averaging lift flagged as illustrative basin geometry rather than as a re-run ablation.

Weights versus predictions

The whole survey turns on a distinction that is easy to state and easy to lose. A prediction ensemble of N models computes N separate forward passes at test time and averages their outputs; it almost always helps, and it costs N times the inference of one model and N times the memory to hold them. A weight average of N models computes one set of parameters before any test input arrives and serves a single network; it can only help when the N weight vectors live in a connected, roughly convex region, but when it does help it costs exactly one model's inference [3] [7]. The same word, averaging, names two operations on opposite sides of the network: one before the nonlinearity stack, in parameter space, and one after it, in output space.
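The two operations can be placed side by side on a toy model. The sketch below assumes a hypothetical one-parameter network, `tanh(w * x)`, purely to show that the two averages differ in cost and, once the weights are far apart, in output.

```python
import math


def forward(w, x):
    """A toy one-parameter network: tanh(w * x)."""
    return math.tanh(w * x)


def predict_ensemble(weights, x):
    """Output-space average: one forward pass per member."""
    return sum(forward(w, x) for w in weights) / len(weights)


def predict_weight_average(weights, x):
    """Parameter-space average: one forward pass through the mean weight."""
    w_bar = sum(weights) / len(weights)
    return forward(w_bar, x)


nearby = [0.9, 1.0, 1.1]  # weights that are close together
far_apart = [0.0, 6.0]    # weights that are far apart

for weights in (nearby, far_apart):
    gap = abs(predict_ensemble(weights, 1.0) - predict_weight_average(weights, 1.0))
    print(f"{len(weights)} passes vs 1 pass, output gap = {gap:.4f}")
```

When the weights are close the nonlinearity is nearly linear over their spread and the two averages almost agree; when they are far apart the outputs diverge, which is the toy analogue of averaging across a region that is not connected.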

That difference is not a footnote on a memory-bound dense-prediction run. Our reference task already sits at a physical batch size of one because a single twelve-thousand-pixel-wide log nearly fills the device; holding several such models resident to ensemble their predictions is not a free lunch we can quietly order. A method that buys ensemble-like stability while leaving the served artifact a single network is therefore not a minor optimisation but the difference between a regulariser we can deploy inside a closed operator network and one we cannot.

What averaging in weight space actually requires

The precondition for averaging weights is a geometric one, and skipping it is how the operation goes wrong. Two weight vectors can be averaged into a useful third only if the straight line between them in parameter space does not cross a ridge of high loss, the linear-mode-connectivity condition Frankle and colleagues isolated [8]. Snapshots taken from one continuous trajectory in the late phase of a single run satisfy this almost by construction, because they are near neighbours in the same basin, which is why stochastic weight averaging and the exponential moving average of weights work within a single run without any special care [2] [3]. Independently initialised runs do not generally satisfy it, which is why model soups depend on a shared pre-trained starting point that lands all the fine-tuning runs in the same connected region [7], and why the greedy soup that drops any ingredient that hurts held-out accuracy is the right default rather than blind averaging [7].

Weight-space average of k checkpoints

\[
\theta_{\text{soup}} = \frac{1}{k}\sum_{i=1}^{k} \theta_i, \qquad \text{served as one network at } 1\times \text{ inference cost}
\]

The expression is deceptively plain. It is an arithmetic mean of parameter vectors, and its entire validity rests on the k weight vectors being close enough in the loss landscape that their mean is also low-loss. One practical consequence follows immediately and is worth stating because it catches people: after averaging, any normalisation layer that tracks running statistics holds the statistics of the individual checkpoints, not of the averaged weights, so the averaged network needs one forward pass over training data to recompute them before it is evaluated [3]. The average is cheap; forgetting the statistics recomputation is the usual reason a soup underperforms its ingredients on the first try.
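The statistics mismatch can be seen on a deliberately small example. The sketch below assumes a hypothetical one-weight layer whose pre-activation is `z = w * x`, followed by a normalisation layer that stores the mean and variance of `z`; the data and the two checkpoint weights are invented and exaggerated so the effect is visible.

```python
import statistics


def preactivation_stats(w, xs):
    """Mean and population variance of z = w * x over the data xs."""
    zs = [w * x for x in xs]
    return statistics.mean(zs), statistics.pvariance(zs)


xs = [1.0, 2.0, 3.0, 4.0]  # stand-in for the training data
checkpoints = [1.0, 3.0]   # stand-in for k = 2 checkpoint weights
w_soup = sum(checkpoints) / len(checkpoints)  # arithmetic mean of the weights

# Stale statistics: what the layer holds if the stored values are simply averaged.
per_ckpt = [preactivation_stats(w, xs) for w in checkpoints]
stale_mean = sum(m for m, _ in per_ckpt) / len(per_ckpt)
stale_var = sum(v for _, v in per_ckpt) / len(per_ckpt)

# Fresh statistics: one pass over the data with the averaged weight.
fresh_mean, fresh_var = preactivation_stats(w_soup, xs)

print("stale:", stale_mean, stale_var)
print("fresh:", fresh_mean, fresh_var)
```

In this linear toy the means happen to agree but the variances do not, so the normalisation layer would rescale its input incorrectly until the statistics are recomputed for the averaged weights.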

The basin the averaging lives in

The reason averaging late iterates helps, when it helps, is that the optimiser does not converge to a point. In the late phase of training it wanders the floor of a wide basin, and each saved epoch is one noisy sample of that wander [1] [3]. Averaging those samples cancels the component of each that is per-step noise and leaves the component they share, which is the centre of the basin, and the centre of a wide basin is exactly the flat solution that the stochastic-weight-averaging work associates with better generalisation [3]. The mode-connectivity results are the licence for believing the samples share a basin in the first place [4] [8], and the flat-minima reading is what the schedule-shaping extensions push on hardest, with SWAD averaging densely and with overfit awareness precisely to settle deeper into the flat region [10].

This is also the cleanest way to see why a small-data run stands to gain more than a large one. With few training images the late-phase wander is noisier and the single best checkpoint is a chancier draw, so the variance the average cancels is a larger share of the total error, and the gap between a lucky checkpoint and an unlucky one is wider. A run with abundant data has a smoother trajectory and a more reliable final iterate, so the averaging buys less. The reference task, fifty epochs at batch size one on synthetic logs, is squarely in the regime where the wander is real and the average has noise to remove.
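The variance argument can be made explicit with a short worked equation. It is an idealisation, not a result from the cited papers: it assumes each checkpoint is the basin centre plus zero-mean noise of equal variance, treats one coordinate at a time, and assumes every pair of checkpoints shares the same correlation. The symbols \theta^{\ast}, \varepsilon_i, \sigma^2, and \rho are introduced here for illustration.

```latex
\theta_i = \theta^{\ast} + \varepsilon_i, \qquad \mathbb{E}[\varepsilon_i] = 0, \qquad \operatorname{Var}(\varepsilon_i) = \sigma^2, \qquad \operatorname{Corr}(\varepsilon_i, \varepsilon_j) = \rho \ (i \neq j)

\theta_{\text{soup}} = \frac{1}{k}\sum_{i=1}^{k} \theta_i = \theta^{\ast} + \frac{1}{k}\sum_{i=1}^{k} \varepsilon_i

\operatorname{Var}(\theta_{\text{soup}}) = \frac{1}{k^2}\Bigl(k\sigma^2 + k(k-1)\rho\,\sigma^2\Bigr) = \frac{\sigma^2}{k}\bigl(1 + (k-1)\rho\bigr)
```

With uncorrelated checkpoints the variance falls as \sigma^2 / k; with strongly correlated neighbours on one trajectory it falls more slowly and approaches \sigma^2 as \rho approaches one. A noisier run has a larger \sigma^2, so the same k removes more absolute error, which is the small-data case the paragraph describes.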

Results

The reference baseline gives the survey one honest anchor rather than a sweep of fabricated numbers. The fifty-epoch multiclass run reaches a background-mask intersection over union of 0.94 and, on the cleanest curve example, a single-curve goodness-of-fit of R-squared 0.9891; the run is memory-bound at a physical batch size of one and takes on the order of 550 minutes of wall clock for the fifteen-thousand-log set. The two metrics matter in different ways to the weight-averaging question. The background intersection over union of 0.94 is the stable, easy quantity, the kind of metric whose late-checkpoint variance is small and whose average is therefore a modest tidy-up. The best single-curve R-squared of 0.9891 is the opposite: it is a best case across examples, which is to say it is exactly the kind of high-variance, lucky-draw quantity that weight averaging is supposed to make less of a lottery, by pulling the served model toward the basin centre instead of betting on one fortunate epoch.

Late in a single-GPU run the optimiser does not converge to one point; it wanders the floor of a wide, flat basin, and every saved epoch is one sample of that wander. This instrument lays the last sixteen checkpoints out across the basin and lets you average a contiguous window of them in weight space into one parameter set, a checkpoint soup, then compares three ways to spend the same window: the best single checkpoint (the baseline a soup must beat, pinned to the sourced multiclass background IoU of 0.94), the averaged soup (which serves as one network at 1x inference cost), and a prediction ensemble of the same checkpoints (the strong baseline, which serves at N-times the cost). Drag the two levers to move the window start and widen the soup, and watch a wider average climb toward the flat-minimum ceiling and approach the ensemble while still serving at 1x. The run facts and the 0.94 anchor are sourced from the engagement archive; the per-checkpoint scatter, the soup lift, and the ensemble edge are an illustrative flat-basin model and are flagged as such. The orange accent marks the one element that carries the argument: the averaged soup that buys ensemble-like accuracy without ensemble serving cost.

The exhibit above renders the mechanism the stochastic-weight-averaging line describes [3], using the background metric as its sourced anchor. The string of late checkpoints scatters around the basin floor; selecting a contiguous window and averaging it in weight space produces a single result, drawn in the orange accent, and the comparison column scores that result against the best single checkpoint and against a prediction ensemble of the same window. The shape of the lift is the argument: a wider averaging window climbs toward the flat-minimum ceiling and closes most of the distance to the ensemble while the served model stays one network at 1x inference cost, whereas the ensemble buys its last sliver of accuracy at N times the serving cost. The per-checkpoint values, the lift curve, and the ensemble edge are an illustrative basin model and are flagged as such on the canvas; only the run facts and the 0.94 anchor are sourced. The exhibit is there to make the trade legible, not to assert a measured ablation our archive does not contain.

The honest reading of the two reference metrics through this lens is asymmetric, and saying so is the point. On the background intersection over union of 0.94, weight averaging is a small win at best, because there is little late-phase variance left to cancel on an easy, near-solved class. On the best single-curve R-squared of 0.9891, weight averaging is the more interesting proposition, not because it would necessarily raise that particular best-case number, but because it would make the served model less dependent on having drawn the one epoch that produced it; the value of the average there is the reduction in spread across examples, the conversion of a lucky single checkpoint into a dependable one. That is the regime the small-data, single-GPU practitioner actually cares about.

Reading the lift against an honest ensemble baseline

A survey of weight averaging that did not measure it against prediction ensembling would be flattering it. The honest baseline is the deep ensemble, and on accuracy alone an ensemble of independently trained models is usually at least as good as a weight average of one run's checkpoints, because it captures genuinely different functions rather than samples of one basin [5]. The snapshot-ensemble result narrows the training-cost gap, since it harvests several checkpoints from one cyclic schedule rather than training N models, but it still ensembles predictions and therefore still pays N forward passes at inference [6]. The case for weight averaging is not that it beats the ensemble on raw accuracy. It is that it recovers most of the ensemble's stability and flat-minimum benefit while collapsing the N forward passes back to one, which on a memory-bound box is the difference between a method that ships and one that does not.

There is one place the ensemble retains a clear edge that weight averaging does not match for free, and the period's work is explicit about it: predictive uncertainty. A deep ensemble gives calibrated uncertainty almost as a side effect of disagreement among its members [5]. A plain weight average gives a single point estimate and says nothing about its own confidence. The bridge in the literature is SWAG, which fits a Gaussian over the optimiser's iterates around the weight-average solution and samples from it to recover an approximate posterior, buying back ensemble-like uncertainty from the same single run that produced the average [9]. For a digitiser whose downstream consumers need a confidence on each recovered curve value, that distinction decides whether a weight average is sufficient on its own or wants the SWAG layer on top.
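The idea of fitting a Gaussian over the iterates can be sketched in a diagonal-only form. This is a simplified illustration, not the published method in full: the low-rank covariance term is omitted, weights are plain lists of floats, and the function names and toy iterates are assumptions.

```python
import random


def fit_diagonal_gaussian(iterates):
    """Per-coordinate mean and variance of the saved iterates."""
    k = len(iterates)
    mean = [sum(ws) / k for ws in zip(*iterates)]
    second_moment = [sum(w * w for w in ws) / k for ws in zip(*iterates)]
    # Clamp at zero to guard against small negative values from rounding.
    var = [max(s - m * m, 0.0) for s, m in zip(second_moment, mean)]
    return mean, var


def sample_weights(mean, var, rng):
    """Draw one weight vector from the fitted diagonal Gaussian."""
    return [rng.gauss(m, v ** 0.5) for m, v in zip(mean, var)]


# Hypothetical late iterates of a two-parameter model.
iterates = [[1.0, 0.0], [3.0, 0.0]]
mean, var = fit_diagonal_gaussian(iterates)
rng = random.Random(0)
samples = [sample_weights(mean, var, rng) for _ in range(5)]
print(mean, var)
print(samples)
```

The mean is the plain weight average; the spread of predictions across sampled weight vectors is what stands in for the ensemble's disagreement. Each sample needs its own forward pass, so the uncertainty is not free at inference even though it comes from a single training run.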

Discussion

Read together, the three weight-space methods are one operation wearing three schedules. The exponential moving average of weights runs the averaging online with a decay [2]; stochastic weight averaging runs it post-hoc over late snapshots of one trajectory [3]; model soups run it across the endpoints of many fine-tuning runs that share an initialisation [7]. All three rest on the same geometric fact, that the relevant weight vectors lie in a connected, roughly convex region whose mean is itself low-loss [4] [8], and all three deliver a single network that serves at the cost of one. The prediction-space family is the baseline that keeps the comparison honest: it is generally stronger on raw accuracy and uniquely good at uncertainty [5], but it pays a serving cost that scales with the number of members [5] [6], a cost a single-GPU dense-prediction deployment may simply not be able to pay.

Where our own work sits relative to this literature is worth marking, since it is the line between this survey and our applied writing. This is a reading of how the public field averages, in weights and in predictions. Our reference numbers are downstream of the survey: VeerNet's late-phase, batch-of-one training on small synthetic data is the exact regime where the stochastic-weight-averaging argument predicts the most benefit, because the wander is noisy and the single best checkpoint is a chancy draw, and where the prediction-ensemble alternative is the most expensive to serve, because the memory is already nearly full. The survey explains why reaching for a weight average before an ensemble is not a quirk of our pipeline but the deployment-economics-aware default the field's own results recommend for a one-GPU run. It also marks the limit of the free lunch: when the consumer needs calibrated confidence on each curve value, the weight average wants the uncertainty machinery that SWAG adds [9], rather than the bare point estimate the plain average returns.

Limitations

This is a survey and inherits a survey's limits. It synthesises what the published weight-averaging and ensembling literature reports and does not re-implement or re-measure any of the methods it discusses; where it quotes numbers, those are the real metrics of a single multiclass run from one engagement and one architecture, a background-mask intersection over union of 0.94 and a best-case single-curve R-squared of 0.9891 at fifty epochs and batch size one over roughly 550 minutes, used as a worked illustration rather than as a fresh head-to-head benchmark of weight averaging against ensembling. We did not run a weight-averaging ablation on the reference task, so the survey makes no measured claim about the size of the lift on these specific metrics; the asymmetry it argues, a small win on the easy background metric and a variance-reduction win on the high-variance best-case curve metric, is a prediction from the cited mechanism, not a result we recorded. The interactive exhibit's per-checkpoint scatter, its averaging lift, and its ensemble edge are an illustrative flat-basin model and are flagged as such on the canvas; the true late-checkpoint metrics for this run were not saved per epoch and were not reconstructed. The survey also scopes itself to the three best-known weight-space methods and the two ensembling baselines the period treats as canonical, and it stops at the close of its own quarter, so later refinements of model merging and weight interpolation that the field has since explored are out of frame. A reader should take this as a map of when averaging weights rather than predictions is the right move on a constrained run, not as a substitute for running the ablation on their own task and metric.

What to carry from the survey

The literature hides two different averages under one word. Averaging predictions (a deep ensemble) runs N networks at test time and almost always helps but costs N forward passes; averaging weights (stochastic weight averaging, the EMA of weights, model soups) builds one network before inference and serves it at the cost of one, but only helps when the weight vectors share a connected, roughly convex region.
Weight averaging is an old idea rediscovered: Polyak and Juditsky's averaging of stochastic-approximation iterates is the root, and mode-connectivity results (Garipov and colleagues; Frankle and colleagues) are the licence that makes averaging two weight vectors land on a low-loss solution rather than a ridge.
The three weight-space methods are one operation on three schedules: an online decayed average during training, a post-hoc average over late snapshots of one run, and an average across many fine-tuning runs that share an initialisation. All three deliver a single network at 1x inference cost.
On the reference VeerNet run (50 epochs, batch size 1, ~550 min, background IoU 0.94, best single-curve R-squared 0.9891) the expected benefit is asymmetric: little to gain on the easy, low-variance background metric, but a real variance-reduction win on the high-variance best-case curve metric, which is exactly where a small-data, single-GPU run wants to stop betting on one lucky checkpoint.
The ensemble keeps two edges weight averaging does not match for free: usually higher raw accuracy and calibrated uncertainty. SWAG buys back the uncertainty from the same single run, so the decision rule for a memory-bound dense-prediction deployment is to reach for a weight average first and add the SWAG layer only when each output needs a confidence.

The smallest habit this survey would install is a question to ask before the ensemble is even built: on this run, is the late trajectory noisy enough that the served model is really just one lucky draw? If it is, the cheaper fix is to average the draws in weight space and serve the one network that results, keeping the ensemble in reserve for the day the outputs need a calibrated confidence the average alone cannot give.

References

[1] Polyak, B. T., and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838-855 (1992). The averaging of stochastic-approximation iterates that deep-learning weight averaging rediscovers. https://doi.org/10.1137/0330046

[2] Tarvainen, A., and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NeurIPS (2017). Maintains an exponential moving average of the weights as a more stable teacher. https://arxiv.org/abs/1703.01780

[3] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging Weights Leads to Wider Optima and Better Generalization. UAI (2018). Introduces stochastic weight averaging and argues it finds wider, flatter minima at no inference cost. https://arxiv.org/abs/1803.05407

[4] Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D., and Wilson, A. G. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. NeurIPS (2018). Shows distinct minima are joined by low-loss paths, the geometry weight averaging depends on. https://arxiv.org/abs/1802.10026

[5] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NeurIPS (2017). The strong prediction-space baseline that averages several independently trained models and yields calibrated uncertainty. https://arxiv.org/abs/1612.01474

[6] Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. Snapshot Ensembles: Train 1, Get M for Free. ICLR (2017). Collects checkpoints along a cyclic schedule and ensembles their predictions, narrowing the training cost of an ensemble. https://arxiv.org/abs/1704.00109

[7] Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. ICML (2022). Averages the weights of many fine-tuning runs that share an initialisation into one served model. https://arxiv.org/abs/2203.05482

[8] Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. Linear Mode Connectivity and the Lottery Ticket Hypothesis. ICML (2020). Isolates the linear-mode-connectivity condition under which two solutions can be interpolated without a loss barrier. https://arxiv.org/abs/1912.05671

[9] Maddox, W. J., Garipov, T., Izmailov, P., Vetrov, D., and Wilson, A. G. A Simple Baseline for Bayesian Uncertainty in Deep Learning. NeurIPS (2019). Fits a Gaussian over the SGD iterates around the weight-average solution to recover calibrated uncertainty from one run. https://arxiv.org/abs/1902.02476

[10] Cha, J., Chun, S., Lee, K., Cho, H.-C., Park, S., Lee, Y., and Park, S. SWAD: Domain Generalization by Seeking Flat Minima. NeurIPS (2021). A dense, overfit-aware weight-averaging schedule aimed at flat minima for generalisation. https://arxiv.org/abs/2102.08604