GroupNorm Over BatchNorm: Normalising When Your Batch Size Is One

Most computer-vision code reaches for batch normalisation without a second thought, and for a decade that reflex has been right. Drop a BatchNorm after each convolution, watch the loss fall faster and the network tolerate a larger learning rate, and move on. The layer earned that trust by doing something genuinely useful: it normalises each channel's activations using the mean and variance computed across the images in the mini-batch ^[1], which keeps the distribution feeding the next layer roughly fixed as the weights move. The quiet assumption buried in that sentence is that there is a mini-batch to average over. On the raster well-log digitisation work we are describing here, that assumption broke, and it broke in a way that is worth walking through because the fix is a small, well-understood substitution that a lot of teams reach for too late.

Why the batch collapsed to a single image

The inputs were scanned raster well logs, and the thing that made them awkward is that no two are the same size. The synthetic logs in our training corpus ranged from roughly 3,200 by 480 pixels at the small end to 12,800 by 640 at the large end, a fourfold spread in width alone. There is no honest way to pack images that different into a rectangular tensor without either resizing them, which destroys the thin curves the model has to trace, or padding them all to the largest size, which wastes most of the batch on blank pixels. For the binary segmentation phase we took a third option and trained one image at a time, batch size one, accepting the throughput cost because the alternatives degraded the signal we were trying to learn. The images are single-channel grayscale, one input channel, so even the channel dimension gave us nothing extra to lean on.
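The cost of the padding option can be checked with a few lines of arithmetic. The sketch below uses only the two sizes quoted above; the function name `padding_fraction` is ours, introduced for illustration.

```python
# Sizes quoted in the text, as (width, height) in pixels.
SMALL = (3200, 480)
LARGE = (12800, 640)


def padding_fraction(image, canvas):
    """Fraction of `canvas` left blank when `image` is padded up to it."""
    image_px = image[0] * image[1]
    canvas_px = canvas[0] * canvas[1]
    return 1.0 - image_px / canvas_px


width_spread = LARGE[0] / SMALL[0]
wasted = padding_fraction(SMALL, LARGE)

print(f"width spread: {width_spread:.0f}x")                      # 4x
print(f"blank pixels, smallest padded to largest: {wasted:.2%}")  # 81.25%
```

Padding the smallest log up to the largest leaves a little over four fifths of that tensor blank, which is the waste the paragraph above refers to.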

A batch of one is where BatchNorm stops being a normaliser and starts being a liability. With a single image in the batch, the per-channel statistics it computes are the statistics of that one image, so every update to its running mean and variance comes from a sample of size one. During training the layer normalises with those per-image statistics, but at inference it switches to the running buffers. Those buffers are updated with a momentum term, and in the legacy v1 residual UNet we inherited that momentum was set to 0.01, which means each step nudges the running estimate by one percent toward the current single image. The result is an estimator that is both noisy, because it is built from one sample at a time, and biased, because the buffers used at inference track a slow trajectory through single images rather than a stable population. The network would train, but the activation distributions it saw at test time did not match the ones it saw while learning, and that gap is exactly the failure mode BatchNorm is supposed to remove.
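A minimal sketch of that buffer update, assuming the convention described above (each step moves the running value one percent toward the current statistic). The per-image means are synthetic numbers drawn for illustration, not measurements from the build, and the function names are hypothetical.

```python
import random

MOMENTUM = 0.01  # value quoted for the legacy v1 residual UNet


def update_running(running, current, momentum=MOMENTUM):
    """One buffer update: move `momentum` of the way toward `current`."""
    return (1.0 - momentum) * running + momentum * current


def simulate(image_means, start=0.0):
    """Feed one image per step; record the train/inference gap at each step."""
    running = start
    gaps = []
    for mean in image_means:
        # Training normalises with `mean` (this image's own statistic);
        # inference would normalise with `running` instead.
        gaps.append(abs(mean - running))
        running = update_running(running, mean)
    return running, gaps


rng = random.Random(0)
# Hypothetical per-image channel means; each "batch" is a single image.
image_means = [rng.gauss(0.5, 0.2) for _ in range(500)]
final_running, gaps = simulate(image_means)

print(f"running mean after {len(image_means)} images: {final_running:.3f}")
print(f"average |image mean - running mean|: {sum(gaps) / len(gaps):.3f}")
```

The gap recorded at each step is the difference between the statistic used to normalise that image during training and the one the buffer would supply at inference.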

The substitution the literature already had waiting

This is not a novel problem and we do not want to dress it up as one. The small-batch failure of BatchNorm was diagnosed and answered directly by group normalisation, which proposed dividing a layer's channels into groups and computing the normalisation statistics within each group of a single sample, with no dependence on the batch dimension whatsoever ^[2]. Because the statistics come from inside one image, the accuracy stops sliding as the batch shrinks toward one, which is the precise regime we were stuck in. The decision to swap BatchNorm for GroupNorm in our encoder-decoder was therefore not invention; it was reading the right paper and applying its central result. What was ours to decide was the one free parameter the swap leaves open: how many channels go in each group.
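To make the batch independence concrete, here is a small pure-Python sketch of the statistics GroupNorm computes for one sample. It is an illustration of the idea rather than the build's layer: it omits the learnable per-channel scale and shift, represents each channel as a flat list of pixel values, and the function name `group_norm` and the toy sample are ours.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalise one sample (a list of channels, each a flat list of pixels).

    Mean and variance are computed over all pixels of all channels in a
    group. Nothing here refers to any other image.
    """
    channels = len(sample)
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = channels // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * size:(g + 1) * size]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in channel] for channel in group])
    return out


# Toy sample: 4 channels of 3 pixels each, split into 2 groups of 2 channels.
sample = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [10.0, 10.0, 10.0],
    [20.0, 20.0, 20.0],
]
normed = group_norm(sample, num_groups=2)
group0_sum = sum(v for channel in normed[:2] for v in channel)
print(f"sum over first group after normalising: {group0_sum:.6f}")
```

Because the function takes a single sample, the batch size never enters the computation at all.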

GroupNorm has two natural extremes. One group containing every channel is layer normalisation; one channel per group is instance normalisation; everything useful lives in between, and the right point depends on how wide the layer is. A fixed group count is wrong because the encoder-decoder doubles its channel width stage by stage in the UNet lineage our backbone descends from ^[3], so a constant number of groups would put four channels per group in a narrow early layer and sixty-four per group deep in the network. We wanted a rule that adapts to the layer width while never letting a group get so small that its statistics become as noisy as the batch-of-one problem we were escaping.
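The drift under a fixed group count is easy to tabulate. In the sketch below the stage widths and the fixed count of sixteen groups are hypothetical values chosen to reproduce the four-to-sixty-four spread described above; they are not taken from the build.

```python
# Hypothetical channel widths for a backbone that doubles at each stage.
WIDTHS = [64, 128, 256, 512, 1024]
FIXED_GROUPS = 16  # hypothetical constant group count


def channels_per_group(channels, num_groups):
    """Group size implied by splitting `channels` into `num_groups` groups."""
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    return channels // num_groups


per_group = {c: channels_per_group(c, FIXED_GROUPS) for c in WIDTHS}
for width, size in per_group.items():
    print(f"{width:5d} channels -> {FIXED_GROUPS} groups of {size}")

# The two extremes named in the text, for a 128-channel layer:
layer_norm_like = channels_per_group(128, 1)       # one group holds every channel
instance_norm_like = channels_per_group(128, 128)  # one channel per group
```

With the group count held constant, the group size grows sixteenfold from the first stage to the last, so the normalisation behaves quite differently at the two ends of the network.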

The half_or_16 rule, in one line

The heuristic we settled on is compact enough to state as a sentence: give each group sixteen channels, unless the layer has fewer than sixteen channels in total, in which case put them all in one group. In other words, the group size is the smaller of sixteen and the channel count, and the group count is the channel width divided by that group size. We took to calling it the half_or_16 rule, and its effect is to hold the per-group channel count at the sixteen-channel floor in every layer wide enough to allow it. A 128-channel layer, which is the embedding dimension at our model's core, becomes eight groups of sixteen. A 64-channel layer becomes four groups, a 256-channel layer sixteen groups, and so on up and down the stages, so that in any layer at least sixteen channels wide the per-group statistics are estimated from sixteen channels of one image. Sixteen is a deliberately unglamorous number: it is large enough that the within-group mean and variance are stable, and small enough that the normalisation still adapts locally rather than washing the whole layer into one distribution.
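Stated as code, the rule is a few lines. This is a sketch of the rule as described in the paragraph above, not the build's own source: the function name `half_or_16_groups` is hypothetical, and because the text does not say how widths that are not a multiple of sixteen are handled, the sketch simply rejects them.

```python
GROUP_SIZE = 16  # the per-group channel floor described in the text


def half_or_16_groups(channels, group_size=GROUP_SIZE):
    """Return (num_groups, channels_per_group) for a layer of `channels`."""
    if channels < 1:
        raise ValueError("a layer needs at least one channel")
    # Group size is the smaller of sixteen and the layer width.
    size = min(group_size, channels)
    if channels % size != 0:
        # Assumption: non-multiples are out of scope for this sketch.
        raise ValueError(f"{channels} channels do not split into groups of {size}")
    return channels // size, size


for width in (1, 8, 64, 128, 256):
    groups, size = half_or_16_groups(width)
    print(f"{width:4d} channels -> {groups:2d} group(s) of {size}")
```

If the model were built in PyTorch (an assumption; the text does not name a framework), the returned group count would be the first argument to `torch.nn.GroupNorm(num_groups, num_channels)`.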

The instrument below makes the rule tangible. Drag the layer width and watch the group count follow half_or_16 while the per-group floor stays pinned at sixteen, and read the right-hand panel for the reason the substitution was forced in the first place: GroupNorm averages inside a single image, so a batch of one is fine, while BatchNorm has only that one image to average over and its 0.01-momentum running statistics collapse onto a lone sample.

Interactive figure: a probe over the half_or_16 GroupNorm group-size rule at a batch size of one. Drag the layer width and the teal bars show how many normalisation groups the rule produces: it holds the group size at a 16-channel floor, so the group count is just the channel width divided by 16 once a layer is at least that wide. The orange marker tracks the selected width, and the dashed line marks where the build's 128-dimensional core sits on the sweep, at eight groups of sixteen. The right panel is the reason the normaliser changed at all. GroupNorm computes its statistics inside one image, splitting the channels into groups, so it never needs a batch to average over and is unmoved by the memory-constrained batch of one. BatchNorm computes per-channel statistics across the batch, and at batch one there are no other images to average across: its running statistics, updated at a momentum of 0.01 in the legacy v1 ResUNet, collapse onto a single image and turn into noise. The fixed figures (16-channel group floor, 128-dim core, 1 input channel grayscale, batch 1, BatchNorm momentum 0.01) are the build's own; the drawn bar geometry is schematic, showing the group-count cadence the rule produces across widths rather than a measured activation histogram.

What stayed stable, and what we would still watch

The payoff was undramatic in the best way. With GroupNorm in place of BatchNorm, the activation statistics no longer depended on a batch we did not have, the train-time and test-time distributions stopped diverging, and we could train one variable-sized image at a time without the normalisation layer quietly poisoning the inference path. The thin curves survived because we were not resizing to force a batch, and the normaliser survived because it had stopped asking for one. None of that required a new idea; it required matching the layer to the constraint we were actually under.

The sixteen-channel floor is a choice tied to our widths, not a universal constant, and that is the part worth carrying to another project. A network whose layers are mostly narrower than sixteen channels would spend most of its depth in the single-group, layer-normalisation corner, where the rule stops adapting; a network whose layers are enormous might prefer a larger floor so the group count does not explode. The portable part is not the integer. It is the reasoning: when the data shape forbids a batch, stop normalising across the batch, normalise inside the sample instead, and pick a group size that holds steady across the widths your architecture actually uses. The number sixteen happened to fit a grayscale, variable-size, batch-of-one log digitiser; the discipline of choosing it from the constraint rather than from habit is what travels.

Key takeaways

Variable-size raster logs, from about 3,200 by 480 to 12,800 by 640 pixels, could not be batched without resizing away the thin curves, so the binary segmentation phase trained one grayscale image at a time, batch size one.
At batch one, BatchNorm has no other images to average over: its per-channel statistics are estimated from a single image, and with the legacy v1 ResUNet momentum of 0.01 its running buffers track a trajectory through single images, so train-time and test-time activation distributions diverge.
GroupNorm, from Wu and He (2018), was the documented answer: it computes statistics within channel groups of one sample, with no dependence on the batch dimension, so accuracy does not degrade as the batch shrinks toward one. The swap was reading the right paper, not inventing a method.
The one free parameter, group size, we set with a half_or_16 rule: sixteen channels per group unless the layer is narrower than sixteen, so the 128-dim core becomes eight groups of sixteen and the per-group floor holds across the encoder-decoder's doubling widths.
Sixteen is tied to our widths, not a universal constant. The transferable lesson is the reasoning: when the data shape forbids a batch, normalise inside the sample and choose a group size that stays stable across the widths your architecture actually uses.

References

[1] Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML (2015). The layer that made deep networks train fast by normalising activations using statistics computed across the mini-batch. https://arxiv.org/abs/1502.03167

[2] Wu, Y. and He, K. Group Normalization. ECCV (2018). Replaces the batch dimension with channel groups inside a single sample, so accuracy stops degrading as the batch shrinks toward one. https://arxiv.org/abs/1803.08494

[3] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The symmetric encoder-decoder whose channel-doubling stages set the widths a per-layer group rule has to cover. https://arxiv.org/abs/1505.04597
