The model worked, and that was the easy part. We were digitizing scanned raster well-logs for an anonymised Texas onshore operator, and the segmenter we built, which we call VeerNet, had reached a peak R-squared of 0.9891 on a recovered curve. The harder question arrived after the accuracy did. The deployment that served it was serverless, the kind that bills for the footprint a model carries into memory rather than charging a flat monthly rate for a box, and a research-grade encoder-decoder with attention is not a small footprint. So we set ourselves a narrow, slightly nerve-racking trial: how much could we compress this network, by pruning weights and quantising what was left, before the thing we had spent months building stopped recovering the curves it was built to recover?
A 0.9891 model that costs too much to keep resident
VeerNet is shaped for the awkward geometry of a scanned log: very tall, variable-width grayscale images with one or two thin curves to pull out of a wall of background. Its skeleton is a 5-stage stride-2 encoder, a bottleneck refined by 2 transformer attention layers, and a 5-stage upsampling decoder, all carrying a 128-dimensional feature representation. That depth is what buys the 0.9891 peak R-squared, and it is also what makes the model heavy. On the serverless path we had standardised on, the cost we cared about was not training, which is a one-time bill, but serving: the per-inference footprint of the resident model, set against the GPU baselines the deployment was meant to avoid, which ran between 750 and 1800 EUR per month.
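One consequence of the five stride-2 stages is worth making concrete: if each stage halves the resolution, the bottleneck the attention layers refine is 2^5 = 32 times smaller along each axis than the scan that went in. The sketch below works through that arithmetic. It assumes every stage exactly halves height and width and that a scan is padded up to the next multiple of 32 so the decoder can restore the original size. The padding policy, the helper names and the example scan size are illustrative assumptions, not VeerNet's actual preprocessing.

```python
ENCODER_STAGES = 5  # from the text: five stride-2 encoder stages
STRIDE = 2          # from the text: each stage uses stride 2


def total_downsampling(stages=ENCODER_STAGES, stride=STRIDE):
    """Overall reduction factor along each spatial axis."""
    return stride ** stages


def pad_to_multiple(size, factor):
    """Smallest multiple of `factor` that is >= `size` (assumed padding policy)."""
    return -(-size // factor) * factor


def bottleneck_shape(height, width, stages=ENCODER_STAGES, stride=STRIDE):
    """Spatial shape seen by the bottleneck for a padded input scan."""
    factor = total_downsampling(stages, stride)
    return (pad_to_multiple(height, factor) // factor,
            pad_to_multiple(width, factor) // factor)


# Hypothetical tall, narrow scan of 6000 x 700 pixels (not a real log size).
factor = total_downsampling()
shape = bottleneck_shape(6000, 700)
print(factor, shape)
```

Under these assumptions, a curve one or two pixels wide is much narrower than a single bottleneck cell, which is one way to read why the thin curve masks have so little slack.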
The compression literature has been clear since the middle of the last decade that a trained network is usually larger than it needs to be, and that pruning the smallest weights and quantising the rest can shrink it substantially with little accuracy lost [1]. The lottery-ticket work pushed that further, arguing that a dense network contains a much sparser subnetwork that trains to the same accuracy if you can find it [3]. None of that was in dispute. What we did not know was how those general results would land on this specific model, whose output is not a class label but a pixel-thin curve, and whose accuracy floor lived somewhere we had to go and find.
The goal, and the one number we refused to move
The trial had a single objective and a single non-negotiable. The objective was to lower the per-inference serving footprint so the model sat further inside the 750-to-1800 EUR envelope rather than pressing against its ceiling. The non-negotiable was the curves. A digitizer that returns a slightly cheaper background mask is worthless; the entire product is the recovered curve, and the recovered curve was already the model's weakest output. Even at full size, the two curve masks scored an IoU of only 0.26 for the first curve and 0.21 for the second, against a background mask that the network handled comfortably. Those two faint numbers, not the comfortable background, were the accuracy floor we were compressing toward.
Where the floor actually sits
The instinct is to watch the headline R-squared and stop when it dips. That instinct is wrong for this model. The dense background mask has enormous slack and tolerates aggressive compression without complaint; the thin curve masks, already at IoU 0.26 and 0.21, have almost none. Compression had to be governed by the weakest mask, not the average.
Pruning, quantising, and watching the wrong metric first
We ran the two standard levers together. Pruning removed the lowest-magnitude weights and retrained the survivors so the remaining connections could absorb the slack [3]. Quantisation then narrowed the numeric precision of those survivors, trading floating-point range for a smaller, faster integer-arithmetic representation [2]. Stacked, the two compounded: pruning made the model sparser, quantisation made each surviving weight cheaper, and the resident footprint came down faster than either lever alone would have moved it.
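The two levers can be sketched on a flat list of weights. This is a minimal illustration, not the pipeline used in the trial: it assumes one-shot global magnitude pruning with no retraining step, and symmetric per-tensor 8-bit quantisation, which is simpler than the affine scheme with a zero-point described in [2]. The function names and the toy weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantisation to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


# Toy weights, made up for illustration only.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)   # drop the smallest half
q, scale = quantize_int8(pruned)                  # survivors become int8
restored = dequantize(q, scale)
print(pruned, q, restored)
```

Pruned weights quantise to exactly zero, so the sparsity survives the second step. Each surviving weight is off by at most half a quantisation step, and it is this small per-weight error, accumulated across the network, that the thin masks have to absorb.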
Our first instinct, and our first mistake, was to track the aggregate metric. Watching the overall R-squared, the model looked almost untouchable. We could prune and quantise well into the regime where the network was a fraction of its original size and the headline number barely flinched, because the background mask, which dominates the pixel count, kept absorbing the loss. It was only when we broke the metric out per mask that the real picture appeared. The background held. The first curve mask softened. The second curve mask, the faintest structure in the image and the one already sitting at IoU 0.21, fell apart first. The aggregate had been lying to us by averaging a robust mask with a fragile one.
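A toy example shows how an aggregate can hide this. The sketch below uses plain pixel accuracy as a stand-in for the aggregate metric (the trial's headline number was R-squared, which is computed differently) and compares it with per-mask IoU. The 100-pixel image and its labels are synthetic, chosen only to mimic a background-dominated scan; they are not measurements from the trial.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def per_mask_iou(pred, truth, labels):
    """Intersection-over-union computed separately for each label."""
    ious = {}
    for label in labels:
        inter = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
        union = sum(1 for p, t in zip(pred, truth) if p == label or t == label)
        ious[label] = inter / union if union else float("nan")
    return ious


# Synthetic labels: 0 = background, 1 = curve 1, 2 = curve 2.
truth = [0] * 94 + [1] * 4 + [2] * 2
# A hypothetical over-compressed prediction: background intact,
# one curve-1 pixel lost, curve 2 lost entirely.
pred = [0] * 94 + [1, 1, 1, 0] + [0, 0]

accuracy = pixel_accuracy(pred, truth)
ious = per_mask_iou(pred, truth, labels=[0, 1, 2])
weakest = min(ious, key=ious.get)
print(accuracy, ious, weakest)
```

Here the aggregate stays at 0.97 while the curve-2 mask scores an IoU of zero. Reporting the minimum across masks, rather than an average weighted by pixel count, is what surfaces that failure.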
That reordered the whole trial. The question stopped being "how small can the model get" and became "at what compression ratio does the curve-2 mask start to fragment", because that point, and not the headline R-squared, was the wall. Past it, the cheaper model returned a curve with gaps and breaks where the real curve was continuous, which for a digitizer is not a small accuracy regression but a broken product.
The envelope, and the point where thin curves break
What the trial produced was an envelope rather than a single number: as the compression ratio climbs, retained accuracy bends down and per-inference serving cost falls, so every saving is paid for in accuracy, cheaply at first and then steeply past the point where the thin curves give. The instrument below makes that trade draggable. Pull the compression lever right and watch the teal accuracy curve, plotted against the 0.9891 peak, hold flat through a gentle erosion and then knee sharply downward at the break, while the orange serving-cost curve, anchored to the 750 and 1800 EUR footprints, falls the whole way. The orange marker is the wall: the compression ratio where the curve-2 mask fragments first, dragging the curve-1 mask down behind it.
The operating point we shipped sat just to the safe side of that marker. We took the compression that the curve masks tolerated without fragmenting, banked the serving-cost reduction that came with it, and stopped. We deliberately left compression on the table, because the marginal megabyte past the break would have been bought with a broken curve, and a broken curve is the one thing the deployment could not return.
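The stopping rule can be written down as a small selection over a compression sweep. This is a sketch under stated assumptions: the function name, the relative tolerance and the sweep values are hypothetical placeholders, not the trial's results. Only the two full-size baselines, 0.26 and 0.21 IoU, come from the text.

```python
def pick_operating_point(sweep, baseline, tolerance):
    """Return the highest compression ratio reached before any mask breaks.

    `sweep` maps compression ratio -> {mask name: IoU}. A mask "breaks" when
    its IoU drops below (1 - tolerance) times its uncompressed baseline.
    Ratios are visited in increasing order and the search stops at the first
    break, so a lucky recovery at a higher ratio is never selected.
    """
    safe = None
    for ratio in sorted(sweep):
        ious = sweep[ratio]
        if any(ious[m] < baseline[m] * (1 - tolerance) for m in baseline):
            break
        safe = ratio
    return safe


# Baselines from the text: full-size IoU of the two curve masks.
baseline = {"curve_1": 0.26, "curve_2": 0.21}

# Made-up placeholder sweep, NOT measurements from the trial.
sweep = {
    2: {"curve_1": 0.26, "curve_2": 0.21},
    4: {"curve_1": 0.255, "curve_2": 0.205},
    8: {"curve_1": 0.25, "curve_2": 0.20},
    16: {"curve_1": 0.22, "curve_2": 0.12},   # curve 2 fragments here
}

chosen = pick_operating_point(sweep, baseline, tolerance=0.05)
print(chosen)
```

The check runs over every mask, so the weakest one decides. A threshold on IoU is itself only a proxy for fragmentation: a curve can keep most of its pixels and still break into segments, so a production rule would likely add a continuity check.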
What the compression trial settled
- The accuracy floor was the thin curve masks, not the dense background: at full size the background mask had enormous slack while the curve masks already scored only 0.26 and 0.21 IoU, so compression had to be governed by the weakest mask.
- The aggregate R-squared was a misleading guide because it averages a robust background mask with fragile curve masks, hiding the moment the faint curve-2 structure begins to fragment under heavier compression.
- The shippable operating point sits just before the break: take the compression the curve masks tolerate, bank the serving-cost reduction toward the 750 to 1800 EUR footprint, and refuse the marginal saving that would cost a broken curve.
What the faint masks taught us about where to stop
The lesson we carried out of this trial was not about compression ratios, which are specific to this network, but about which metric is allowed to authorise a stop. We had treated the model as one object with one accuracy number, and the model had quietly told us it was really two objects: a sturdy background segmenter that compresses like the textbooks promise, and a pair of fragile curve segmenters that do not. Every decision that mattered in this trial came from refusing to let the sturdy one speak for the fragile ones. The serving-cost reduction was real and we kept it; the temptation to chase a smaller model past the break was the trap, and the per-mask breakout was what kept us out of it. When the cheapest model your budget will reward is also the one that quietly drops the only output your users came for, the right footprint is the one that stops at the faintest curve.
References
[1] Han, S., Mao, H., Dally, W. J. (2016). Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR 2016. https://arxiv.org/abs/1510.00149
[2] Jacob, B. et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR 2018. https://arxiv.org/abs/1712.05877
[3] Frankle, J., Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR 2019. https://arxiv.org/abs/1803.03635