Padding to a Multiple of 32: Small Detail, Big Inference Bug

The model was good. That was the confusing part. On the held-out validation split it scored exactly where we expected, the loss curves were boring in the way good loss curves are, and a row of side-by-side predictions looked clean enough to put in a report. Then we pointed it at a freshly scanned log that had never been through our pipeline, and the predicted curve mask came back with its right edge sheared off, as if someone had taken scissors to the last column of the image. No crash. No warning. No NaN. Just a mask that was subtly, consistently the wrong width, and a digitised curve that drifted away from the paper at the deep end of every log.

This is the story of that bug. It is a small one, the kind that does not make it into architecture papers because it has nothing to do with architecture, and it cost us most of a day before we saw it. The punchline is thirty-two pixels of padding that our training data loader quietly added and our inference path quietly skipped. Everything downstream followed from that one asymmetry.

A model that lies about being fine

The first instinct, when a model that validates well fails in production, is to suspect the model. We did not. The validation numbers were real, drawn from the same held-out logs we always used, and they had not moved. The second instinct is to suspect the data: maybe the new scan was corrupt, or rotated, or a different bit depth. It was not. It was a perfectly ordinary single-channel grayscale log, the same kind we had trained on by the thousands.

What made this hard is that the failure is not loud. A fully convolutional segmenter, in the lineage of Long, Shelhamer, and Darrell [2], will happily accept an input of nearly any spatial size and emit a correspondingly sized output. That flexibility is exactly what you want for variable-width scanned logs, and it is also exactly what lets a malformed width sail straight through. The network does not assert that its input is well shaped. It just computes, and if the shapes are slightly off it computes a slightly wrong answer with total confidence. There is no exception to catch because, from the network's point of view, nothing went wrong.

The dangerous bugs are the ones that return a number instead of an error

A shape mismatch that throws is a gift: the stack trace points at the line. A shape mismatch that silently broadcasts, truncates, or rounds is the expensive kind, because the program keeps running and the only symptom is that the output is a little bit wrong. Our torn mask was the second kind. The pipeline ran end to end, produced a mask of plausible size, and never once suggested anything was amiss.

Following the width down the network

To see where the edge went, you have to follow the input width through the encoder one stage at a time. Our segmenter is a symmetric encoder-decoder of the U-Net family [1]: the encoder downsamples, the decoder upsamples back, and skip connections splice the encoder's high-resolution features into the matching decoder stage. The encoder has five stride-2 stages. Each one halves the spatial resolution, so by the bottleneck the width has been divided by two five times over, a total downsample of 32.

Halving is fine when the number is even. The trouble starts when it is odd. Convolutional downsampling does integer division, and integer division floors. Feed in a width of, say, 6448 pixels and the first stride-2 stage produces 3224, the second 1612, the third 806, the fourth 403, and the fifth floors 403 down to 201. That single floored pixel is the whole bug. The decoder now starts from 201 and doubles five times to 6432, which is sixteen pixels short of the 6448 we put in. When the decoder tries to concatenate its upsampled feature map with the encoder feature it saved on the way down, the two no longer share a width, and depending on how the framework reconciles them you get either a thrown error in the lucky case or, in the unlucky case we hit, a silent crop that lops the last columns off the prediction.

“
Five halvings can only round-trip cleanly if the width is divisible by two five times, which is to say divisible by 32. Any other width loses at least one pixel in the floor and never gets it back.
”

— The arithmetic of the bug

The drop happens at the rightmost columns because that is where the accumulated rounding lands: depth runs down the page, the curve tracks run across it, and the deep end of the log sits at the far edge of the width axis. A mask that is sixteen pixels too narrow is a mask that simply stops short of the last interval, which is why the digitised curve always drifted off the paper at the bottom of the log and nowhere else.

Train and infer through the same door

So why did training never show this? Because the training data loader was doing the right thing without anyone meaning it to. Our multiclass pipeline batched variable-width logs through a custom collate function that, among other jobs, padded every batch up to a width the network could fold cleanly, rounding to the next multiple of 32. Every image the model saw during training was therefore pre-aligned. The model learned on padded inputs and validated on padded inputs, because validation ran through the same loader. The 32-pixel rule was satisfied everywhere we looked.

Inference ran through a different door. The serving path took a single raw image, normalised it, and handed it straight to the network with no batch, no collate function, and crucially no alignment pad. For any log whose raw width happened to be a multiple of 32 the prediction was perfect, which is why a good fraction of our spot checks passed and lulled us. For every other width the floor crept in. We had, without realising it, built two preprocessing pipelines that agreed on everything except the one step that mattered.

Train and serve skew is the textbook name for this, and the textbook is right: the fix is not to be cleverer, it is to make the two paths share the exact same preprocessing. We pulled the alignment step out of the collate function and into a single shared transform that both the loader and the serving path call. After that, every input the network ever sees, in training or in production, has already been snapped up to the next multiple of 32.

The instrument below lets you reproduce the failure and the fix by hand. Drag the width ruler to any input width and watch the five halvings floor stage by stage; the reconstructed width comes back short and the mask edge tears. Then switch the alignment padding on, watch the width snap up to the next multiple of 32, and see every stage divide cleanly so the round trip is exact.

A symmetric encoder-decoder halves the feature-map width once per stride-2 stage and doubles it back up; with 5 stages it downsamples by 32, so an input width that is not a multiple of 32 cannot survive the round trip. Drag the width ruler to choose any input width, then toggle alignment padding. With padding off, watch one of the five integer halvings floor away a fractional pixel that the decoder can never recover, so the reconstructed mask comes back short and its right edge is corrupted. With padding on, the width is first snapped up to the next multiple of 32, every stage divides cleanly, and the round trip is exact. The 5 stride-2 stages, the 32x downsample, the 3200 to 12800 px width range, the single grayscale input channel, and the GroupNorm group size of 16 are sourced from the engagement archive; the stage widths are live floor-halving arithmetic, and the pad-geometry strip is an illustrative schematic.

A divisibility rule worth naming out loud

The deeper lesson is that the 32 is not a magic number and it is not ours to negotiate. It is a property of the encoder we chose. Five stride-2 stages buy a large receptive field and a compact bottleneck, and they charge a 32-pixel alignment rule in return. Pick a four-stage encoder and the rule becomes 16; pick six stages and it becomes 64. Segmentation frameworks make this explicit through an output-stride parameter, the single number that says by how much the network downsamples before it climbs back up, which DeepLabv3+ exposes precisely so that input and output dimensions stay commensurable [3]. We had simply never written our number down. It lived implicitly inside a collate function, where it was easy to satisfy by accident in one path and forget entirely in another.

There is a small irony in all of this. The whole point of moving from a classical, rule-based curve extractor like the gridline-elimination approach of Yuan and Yang [4] to a learned segmenter was to stop hand-coding brittle geometric assumptions about the page. And here we were, tripped up by a geometric assumption we had hand-coded anyway, just buried one layer deeper than we were looking.

How we keep this one from coming back

We did three things, in order of how much we trust them. First, the real fix: one shared preprocessing transform, called identically by the training loader and the serving path, so the alignment pad can never again exist in one and not the other. Second, a cheap guard at the top of inference that checks the incoming width against the network's downsample factor and pads it if it is not aligned, so even a future caller that bypasses the shared transform cannot feed the network a width it cannot fold. Third, a single assertion deep in the model that the decoder output width equals the original input width, which turns the silent crop back into the loud error it should have been all along.

The one we lean on hardest is the assertion, because it converts a class of invisible bug into a visible one. A torn mask that scores fine on aligned validation data can hide for weeks. A failed assertion that fires the instant a width does not round-trip cannot hide at all. So when a network will silently accept the inputs it ought to reject, give it the one check that forces it to speak up, and the day you lose to a torn mask becomes a day you never lose twice.

Key takeaways

The model trained and validated cleanly yet produced torn masks on real scans, because the failure returned a wrong-sized number rather than throwing an exception. Fully convolutional segmenters accept almost any input size, so a malformed width sails through without complaint.
With 5 stride-2 encoder stages the network downsamples by 32. Any input width that is not divisible by 32 loses at least one pixel to a floored integer halving on the way down, and the decoder, doubling back up, lands short and can never recover it.
The lost pixel surfaces as a sheared right edge on the mask, because that is where the accumulated rounding lands; the digitised curve drifted off the page only at the deep end of every log.
The root cause was train and serve skew: the training loader padded widths up to the next multiple of 32 inside a custom collate function, but the single-image inference path skipped that step entirely. Logs whose raw width happened to be a multiple of 32 looked fine and masked the bug.
The durable fix is one shared preprocessing transform used by both paths, backed by a guard that aligns the width and an assertion that the output width equals the input width, which turns a silent crop back into a loud, catchable error.

References

[1] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The symmetric encoder-decoder whose copy-and-concatenate skip connections require input dimensions divisible by the total downsampling factor. https://arxiv.org/abs/1505.04597

[2] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR (2015). A convolutional segmenter accepts an input of arbitrary spatial size, which is what lets a bad width slip through at inference. https://arxiv.org/abs/1411.4038

[3] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV (2018). The output-stride parameter names the downsampling factor an input width must be commensurable with. https://arxiv.org/abs/1802.02611

[4] Yuan, B., and Yang, Q. Digitization of Well-Logging Parameter Graphs Based on Gridlines-Elimination Approach. Journal of Petroleum Exploration and Production Technology (2019). A classical, non-learning baseline for extracting curves from raster log graphs. https://doi.org/10.1007/s13202-019-0625-x

Padding to a Multiple of 32: Small Detail, Big Inference Bug

A model that lies about being fine

Following the width down the network

Train and infer through the same door

A divisibility rule worth naming out loud

How we keep this one from coming back

References

Continuous AI for explorers

About Earthscan

Products

Legal

Follow us on