Serverless Inference on a Budget: Running Vision Models in AWS Lambda

The serving side of a machine-learning project rarely gets a cost argument of its own. The training bill is visible, the GPU rent is a line item someone signs off on, and the assumption that follows is that once a model is trained you keep the box it trained on and serve from there. For a workload that runs constantly that is fine. For the one we were serving, a raster well-log digitiser that turns scanned paper into depth-indexed curves, it was the expensive default, because of the shape of the traffic rather than anything about the model. Operators send us scans in bursts: a few hundred arrive, get digitised, and then nothing comes for days. A permanently-on GPU box charges the same rent every month whether it is saturated or asleep. So the first serving question we asked was not how fast the model runs but how much of the month we would be paying for it to do nothing.

This note is deliberately narrow. It is not the VeerNet deployment write-up, which is about packaging an encoder-decoder into a function runtime and getting a large model to load inside a serverless container at all. This is the cost geometry underneath that decision: two ways to pay for the same inference, one with a fixed monthly floor and one with a per-call meter, and the volume at which they cross. The crossover is the whole decision, and most of the confusion around serverless inference comes from arguing about it without drawing it.

Two cost shapes, not two technologies

The honest way to frame the choice is not Lambda versus a GPU box. It is a flat cost versus a sloped one. The always-on box is a horizontal line: rent is fixed at 750 EUR per month for the high-end tier or 1800 EUR for the advanced one, and that number does not move with how many scans you clear. Run one scan and it costs you the whole month's rent; run a hundred thousand and the same rent is spread thin.

The serverless option is a sloped line from the origin. An EFS-backed Lambda charges per invocation and nothing at rest, so its monthly cost is the per-call price times the month's volume. At zero traffic it costs zero; as volume climbs it crosses the flat rent of the box. Below that crossover the pay-per-call setup is cheaper, because the box would have spent most of its rent idling; above it the box wins, because its fixed cost has been amortised over enough runs. Neither shape is universally better. The winner is decided by where your monthly volume sits relative to that crossing point.
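The crossing point described above is a single division: flat rent over per-call price. A minimal sketch, assuming one constant per-call price; the two rents are the tier prices quoted in this note, while `PER_CALL_EUR` is an illustrative value chosen for the example, not a billed figure.

```python
def serverless_monthly_cost(volume: float, per_call_eur: float) -> float:
    """Sloped line from the origin: per-call price times monthly volume."""
    return volume * per_call_eur


def crossover_volume(monthly_rent_eur: float, per_call_eur: float) -> float:
    """Monthly volume at which per-call charges equal the flat rent."""
    if per_call_eur <= 0:
        raise ValueError("per-call price must be positive")
    return monthly_rent_eur / per_call_eur


# Tier rents come from the text; the per-call price is an assumption.
TIERS_EUR = {"high-end": 750.0, "advanced": 1800.0}
PER_CALL_EUR = 0.005  # illustrative only

for tier, rent in TIERS_EUR.items():
    calls = crossover_volume(rent, PER_CALL_EUR)
    print(f"{tier}: serverless stays cheaper below about {calls:,.0f} calls per month")
```

Below the returned volume the sloped line is under the flat one; above it the box wins.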

The cold start is a real cost, and it moves the line

Serverless inference has a specific tax a warm box does not pay, and for a large vision model it is not a rounding error. A Lambda container that has handled a request recently keeps the model resident in memory, so the next request runs warm and cheap. A container that has gone cold, which happens whenever traffic pauses long enough for the platform to reclaim it, has to reload the model first. Ours is a single-channel grayscale segmentation network, smaller than a three-channel equivalent but still a multi-hundred-MB artifact, and we serve it from an EFS model store mounted into the function rather than baking it into the deployment package. Every cold invocation pays the cost of pulling that model across the file system before it does any inference.
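The warm-versus-cold distinction follows from where the model lives in the function's process. The sketch below is not our handler; it is a minimal illustration of the pattern, assuming a hypothetical mount path under `/mnt` and a stand-in loader that only reads bytes where a real function would deserialize weights.

```python
import os
import time

# Hypothetical EFS mount path and file name; neither is taken from our deployment.
DEFAULT_MODEL_PATH = "/mnt/models/digitiser.bin"

# Module-level state survives between invocations on a warm container
# and is empty again whenever the platform starts a fresh one.
_model = None


def load_model(path: str) -> bytes:
    """Stand-in loader: reads raw bytes from the mounted file system."""
    with open(path, "rb") as fh:
        return fh.read()


def handler(event, context):
    """Lambda-style entry point that loads the model only on a cold container."""
    global _model
    cold = _model is None
    started = time.perf_counter()
    if cold:
        path = os.environ.get("MODEL_PATH", DEFAULT_MODEL_PATH)
        _model = load_model(path)  # the cold-start tax is paid here
    load_seconds = time.perf_counter() - started
    # Inference is elided; the sketch only reports what the call paid for.
    return {
        "cold_start": cold,
        "load_seconds": load_seconds,
        "model_bytes": len(_model),
    }
```

Logging the `cold_start` flag per call is one way to measure the cold-start share rather than guess it.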

For a bursty, low-volume profile this matters more than for a busy one, because burstiness is exactly what keeps containers cold. Steady traffic keeps a pool of warm containers alive; ours does the opposite, so a batch arriving after a quiet stretch lands mostly on cold containers and the cold-start share is high. That share sets the slope of the serverless line. Mostly-warm traffic gives a low blended per-call cost and a gentle rise; traffic dominated by cold starts is several times more expensive per call and rises steeply, dragging the crossover leftward toward volumes we might actually reach.
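How the cold-start share sets the slope can be written as a weighted average of a warm price and a cold price. This is a sketch under stated assumptions: `WARM_EUR` and `COLD_EUR` are illustrative unit prices, with the cold call set to five times the warm one to stand in for "several times more expensive", and only the 750 EUR rent comes from the text.

```python
def blended_per_call(cold_share: float, warm_eur: float, cold_eur: float) -> float:
    """Average per-call price when a fraction cold_share of calls run cold."""
    if not 0.0 <= cold_share <= 1.0:
        raise ValueError("cold_share must be between 0 and 1")
    return cold_share * cold_eur + (1.0 - cold_share) * warm_eur


def crossover_volume(rent_eur: float, cold_share: float,
                     warm_eur: float, cold_eur: float) -> float:
    """Monthly volume at which the blended per-call line meets the flat rent."""
    return rent_eur / blended_per_call(cold_share, warm_eur, cold_eur)


# Illustrative unit prices, not billed figures.
WARM_EUR = 0.002
COLD_EUR = 0.010
RENT_EUR = 750.0  # high-end tier from the text

for share in (0.1, 0.5, 0.9):
    calls = crossover_volume(RENT_EUR, share, WARM_EUR, COLD_EUR)
    print(f"cold share {share:.0%}: crossover near {calls:,.0f} calls per month")
```

Raising the share raises the blended price, so the printed crossover falls as the share climbs, which is the leftward slide described above.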

The exhibit below is that geometry made draggable. Pick the always-on tier you are racing against, then drag the cold-start share and watch the crossover marker slide. At a low share the pay-per-call line is shallow and the crossover sits far to the right, past any volume our workload produces. Push the share up to something honest for spiky traffic and the line steepens and the crossover pulls in, but for our tier and our volumes it never pulled in far enough to make the box the cheaper choice.

The serving decision for a spiky, low-volume digitisation workload, drawn as a crossover. The teal flat line is a permanently-on GPU box whose monthly rent is fixed whether it runs or idles: 750 EUR per month for the high-end tier or 1800 EUR for the advanced one. The rising teal line is an EFS-backed Lambda that pays only per invocation, so its monthly cost grows with volume from a floor of zero. Lever A toggles the always-on tier the serverless line is raced against; Lever B drags the cold-start share, the fraction of calls that must reload the multi-hundred-MB grayscale model from the EFS store instead of hitting a warm container, which steepens the serverless line and slides the crossover left. The orange marker is the only element that argues: the crossover volume where the two lines meet. Left of it the pay-per-call setup is cheaper because the always-on box is mostly paying to idle; right of it the fixed rent is finally repaid by enough runs and the always-on box wins. The two always-on tier prices, the grayscale single-channel model, and the EFS model store are sourced from the engagement archive; the per-invocation prices, the cold-start share band, and the volume axis are illustrative unit inputs chosen to show the cost geometry, not billed figures.

Why the pay-per-call line stayed under the box for us

Drawing the crossover rather than asserting a preference turns a religious argument into a measurement. We knew our approximate monthly volume, we knew it arrived in bursts that would run largely cold, and we knew the two always-on tiers we would otherwise be renting. Put together, the crossover volume, where the always-on rent would finally be repaid by per-call charges, sat above the volume we were actually serving. Below it the box is paying to idle, and a digitisation workload that goes quiet for days at a time idles a lot. That is the entire case for serverless here, and it is about traffic shape, not the elegance of functions.
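The check itself fits in a few lines. The sketch below assumes the same weighted-average pricing as the exhibit; the monthly volume, cold-start share, and unit prices in the usage lines are hypothetical placeholders, since this note does not publish our billed figures. Only the two tier rents are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    serverless_eur: float   # monthly pay-per-call cost at the given volume
    rent_eur: float         # flat monthly rent of the always-on tier
    crossover_calls: float  # volume at which the two costs are equal
    headroom: float         # crossover / volume; above 1 means serverless is cheaper


def check(volume: float, cold_share: float, rent_eur: float,
          warm_eur: float, cold_eur: float) -> Verdict:
    """Compare the pay-per-call cost with the flat rent at one monthly volume."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    per_call = cold_share * cold_eur + (1.0 - cold_share) * warm_eur
    crossover = rent_eur / per_call
    return Verdict(volume * per_call, rent_eur, crossover, crossover / volume)


# Hypothetical inputs: 2,000 scans a month, 80% of them cold.
for tier, rent in {"high-end": 750.0, "advanced": 1800.0}.items():
    v = check(volume=2000, cold_share=0.8, rent_eur=rent,
              warm_eur=0.002, cold_eur=0.010)
    winner = "serverless" if v.headroom > 1 else "always-on"
    print(f"{tier}: {winner} is cheaper, headroom {v.headroom:.1f}x")
```

The headroom figure is the useful output: it says how many times the volume could grow, at an unchanged cold-start share, before the answer flips.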

The failure condition is the same crossover read from the other side. If our volume grew, or the traffic smoothed into a steady stream, our position on the volume axis would climb toward the crossover, even as warmer containers lowered the blended per-call cost and pushed the crossover itself to the right. Serverless wins on spiky, low, idle-heavy traffic and loses on steady, high, always-busy traffic, and the crossover is where the two regimes meet. We were firmly in the first, and the instrument is how we showed that we checked rather than assumed it.

A memory constraint keeps the per-call side honest. We serve at a batch size of one, the same constraint that pinned training, because scans vary in size and a serverless container has a fixed, modest memory ceiling. That rules out the throughput trick a GPU box can play, where batching many scans drives the per-scan cost down. Serverless gives up that lever and wins anyway, not by being more efficient per scan but by not charging us for the idle days.

What this does not settle

A cost model is not an operations plan. Cold starts are a latency cost as well as a money cost: the first scan in a burst waits for the model to load from EFS, and for an interactive workflow that wait can matter even when the money says serverless is cheaper. The model also says nothing about the packaging work that makes serving a multi-hundred-MB artifact inside a function possible at all, which is the subject of the VeerNet deployment note. What it does settle is the framing. The serving decision for a spiky digitisation workload is not a benchmark or a preference. It is a single question, drawn as a single crossing: at your real monthly volume, with your real cold-start share, is the pay-per-call line still under the flat rent? For us it was, comfortably, and the traffic shape is what made it so.

Limitations

This is a cost model for one workload with one traffic profile, not a general verdict on serverless inference. The two always-on tier prices, 750 and 1800 EUR per month, and the two serving facts, the grayscale single-channel model and the EFS model store, are archive-sourced; the per-invocation prices, the cold-start share range, and the volume axis are illustrative inputs chosen to make the crossover geometry legible, so the exact volume the exhibit prints is a shape, not a quote. It captures only compute-style cost, and it assumes the per-call price and the cold-start share hold steady across the month, when both drift with request size and with how the platform recycles containers. The crossover tells you which way to pay, once you have decided the model is worth serving at all.
