Standing Up Serverless Curve Inference on AWS Lambda and EFS

A model that beats the benchmark in a notebook is not a thing anyone can use. Between the trained checkpoint and a colleague clicking upload on a scanned log sits a deployment question, and for an early-stage research team that question has an awkward shape: the model is heavy, the demand is bursty, and there is no budget for a GPU that bills around the clock while it waits for the next request. We were digitizing raster well-logs for an anonymised Texas onshore operator, the segmentation model worked, and we needed it callable from a web demo without standing up always-on infrastructure. The deployment we landed on ran the whole inference path on AWS Lambda, with the model weights mounted from Elastic File System rather than packaged into the function, and it served the demo without the roughly 750 to 1800 EUR per month that a standing GPU would have cost.

At a glance

The deployment turned a research checkpoint into an on-demand inference endpoint without a single hour of idle GPU.

Inference runtime: Lambda + EFS, no always-on GPU
Monthly compute footprint: 750 to 1800 EUR GPU baselines avoided
CurveNet feature depth: 128-dim, 5+5 stages, 2 attention layers

The constraint: a heavy model, bursty demand, and no GPU to spare

The model we needed to serve, which we call CurveNet, is a segmentation network built for the awkward geometry of a scanned log: very tall, variable-width grayscale images with one or two thin curves to recover out of a wall of background. Its shape is a 5-stage stride-2 encoder, a bottleneck refined by 2 transformer attention layers, and a 5-stage upsampling decoder, all carrying a 128-dimensional feature representation. The attention layers earn their place because a well-log is long-range structure: a curve at the top of a 12,800-pixel-tall image is the same curve at the bottom, and convolution alone reasons locally [3]. The encoder-decoder skeleton is the standard one for dense per-pixel prediction [2]. None of that is exotic. What made it awkward to deploy was simply its weight.
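To make the encoder arithmetic concrete: five stride-2 stages shrink each axis by a factor of 2^5 = 32, so the attention layers see a much smaller grid than the scan itself. The sketch below computes that grid. It assumes each stage halves both axes exactly and that inputs are padded up to a multiple of 32, neither of which is confirmed here; the function name `bottleneck_shape` and the 500-pixel short axis are hypothetical.

```python
def bottleneck_shape(height, width, stages=5, stride=2):
    """Return (padded_h, padded_w, grid_h, grid_w, positions).

    Assumption: every encoder stage divides both axes by `stride`, and the
    input is padded up to the next multiple of stride**stages beforehand.
    """
    factor = stride ** stages  # 2**5 == 32 for a 5-stage stride-2 encoder
    # Integer ceiling division avoids floating-point rounding.
    padded_h = -(-height // factor) * factor
    padded_w = -(-width // factor) * factor
    grid_h, grid_w = padded_h // factor, padded_w // factor
    return padded_h, padded_w, grid_h, grid_w, grid_h * grid_w


# The 12,800-pixel long axis is from the text; the 500-pixel short axis is
# a made-up value for illustration.
print(bottleneck_shape(12800, 500))  # → (12800, 512, 400, 16, 6400)
```

Under those assumptions a 12,800 by 500 pixel scan reaches the bottleneck as a 400 by 16 grid, which leaves 6,400 positions for the attention layers to relate to one another.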

A checkpoint for a network this deep is not small, and the obvious serverless path runs straight into a wall. A Lambda function ships as a deployment package with a hard size ceiling, and a multi-hundred-megabyte model plus its dependency stack does not fit inside that ceiling. The textbook answer is to put the model behind a container on an always-on service with a GPU attached. For a funded team serving steady traffic that is the right answer. For us it was the wrong one twice over: the traffic was a demo and a handful of pilot conversations, not a stream, and a standing GPU bills every hour whether or not anyone uploads a log. We were looking at GPU baselines in the 750 to 1800 EUR per month range for hardware that would sit idle most of the time.

The two questions the deployment had to answer

How do you fit a model that exceeds the function package limit into a serverless function? And how do you serve bursty, low-volume inference without paying for a GPU that never sleeps? The whole design follows from refusing to accept the obvious trade between the two.

The unlock: mount the weights on EFS instead of baking them in

The decision that made serverless viable was to stop treating the model as part of the code. Instead of packaging the weights into the Lambda deployment artefact, we put them on an Elastic File System volume and mounted that volume into the function at runtime [1]. The function package stays small, well under the limit, because it carries only the inference code and the framework. The weights, which are the heavy part, live on a network file system the function reaches as if it were a local disk.

This inverts the usual serverless model-serving headache. The package-size ceiling stops being a constraint on model size at all, because the model is no longer in the package. A larger checkpoint, a swapped-in retrained version, a second model for a different curve layout: all of those become a file on EFS, not a redeploy of the function. The separation between code and weights is the entire trick, and it is the reason a research-grade encoder-decoder with attention could run inside a runtime designed for lightweight handlers.
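A minimal sketch of what "the model is a file on EFS" can look like in code. It assumes the volume is mounted at `/mnt/models` (Lambda requires the local mount path to start with `/mnt/`; the directory name is our invention) and that the checkpoint is selected through environment variables. The variable names `MODEL_ROOT`, `MODEL_NAME`, and `MODEL_VERSION`, and the `name/version.pt` layout, are hypothetical rather than the deployment's actual configuration.

```python
import os
from pathlib import Path

# Assumption: EFS access point mounted here via the function's
# file system configuration. The path is illustrative.
DEFAULT_ROOT = "/mnt/models"


def checkpoint_path(env=None):
    """Resolve the checkpoint file from environment-style settings."""
    env = os.environ if env is None else env
    root = Path(env.get("MODEL_ROOT", DEFAULT_ROOT))
    name = env.get("MODEL_NAME", "curvenet")        # hypothetical variable
    version = env.get("MODEL_VERSION", "latest")    # hypothetical variable
    return root / name / f"{version}.pt"


def read_checkpoint(path):
    """Read raw checkpoint bytes from the mounted volume.

    A PyTorch deployment would more likely call
    torch.load(path, map_location="cpu") here; plain bytes keep the
    sketch free of framework dependencies.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"checkpoint not found on mounted volume: {path}")
    return path.read_bytes()
```

With a layout like this, swapping in a retrained model means writing a new file to the volume and changing one environment variable, while the function package itself stays untouched.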

The inference path, end to end

With the weights on EFS, the live path is a short, legible chain, and it is the chain the budget meter below walks through.

Scan upload. A scanned raster TIF arrives through an API request. At this point nothing is loaded; the request is just an instruction to digitize an image.
Cold start. Because no GPU and no warm container are standing by, the function cold-starts on demand. This is the honest cost of the design. The first request after an idle period waits while the runtime initialises and the EFS mount and weights come into memory. We accepted that latency deliberately, because the alternative was paying for a warm machine to avoid it.
EFS-mounted weights. The function reads the CurveNet checkpoint from the mounted EFS volume and loads it. The model that exceeded the package limit is now resident, served from a file system rather than from the deployment artefact.
Mask. The model runs a single forward pass over the uploaded image: 5 encoder stages down, 2 attention layers across the bottleneck, 5 decoder stages back up, emitting a per-pixel mask of where each curve runs.
S3 result. The mask is post-processed into a depth-indexed curve and written to S3 as the downloadable artefact the user came for, a clean CSV or LAS rather than a picture of a graph.
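The last step, turning a mask into a depth-indexed curve, is not described in detail here, so the following is only one simple way it could be done: take the mean column of the curve pixels in each row and map it linearly onto the track's value scale. The function names, the linear scale (some log tracks are logarithmic), and the single-curve binary mask are all assumptions made for illustration.

```python
import csv
import io


def mask_to_curve(mask, top_depth, depth_per_row, left_value, right_value):
    """Convert a binary mask (rows of 0/1) into (depth, value) samples.

    Assumptions: one curve per mask, rows map linearly to depth, and
    columns map linearly from left_value to right_value.
    """
    width = len(mask[0])
    samples = []
    for row_index, row in enumerate(mask):
        cols = [c for c, on in enumerate(row) if on]
        if not cols:
            continue  # no curve pixels in this row: leave a gap
        centre = sum(cols) / len(cols)  # mean column of curve pixels
        frac = centre / (width - 1) if width > 1 else 0.0
        value = left_value + frac * (right_value - left_value)
        depth = top_depth + row_index * depth_per_row
        samples.append((depth, value))
    return samples


def curve_to_csv(samples, mnemonic="CURVE"):
    """Serialise samples as CSV text with a DEPTH column and one curve column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["DEPTH", mnemonic])
    for depth, value in samples:
        writer.writerow([f"{depth:.2f}", f"{value:.3f}"])
    return buf.getvalue()
```

The resulting text is what would be uploaded to S3 as the CSV artefact; a LAS writer would follow the same shape with a different serialiser.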

Figure: the budget meter (interactive in the original). The deployment let a research vision model serve a live demo without an always-on GPU: a scan is uploaded, a Lambda function cold-starts, mounts the CurveNet weights from EFS rather than baking them into the package, runs the 5-stage encoder / 2-attention / 5-stage decoder forward pass, and writes the result to S3. Scrubbing the monthly invocation volume shows the serverless pay-per-invocation line crossing the two standing-GPU baselines, the 750 EUR/month high-end box and the 1800 EUR/month advanced box: below the crossover, paying only for the inferences you run wins; above it, a fixed GPU is cheaper. The 750 / 1800 EUR baselines, the EFS-mount path, and the CurveNet shape (128 dims, 5 + 5 stages, 2 attention layers) are the engagement's own figures; the per-invocation price and the resulting cost curve are an illustrative model of pay-per-use economics, and the crossover is schematic.

Where serverless wins, and where it does not

The reason to read this as an economic decision rather than a purely technical one is that serverless is not unconditionally cheaper. Pay-per-invocation pricing is a line that climbs with volume, and a standing GPU is a flat line that does not. The two cross. Below the crossover, paying only for the inferences you actually run is cheaper than renting a box by the month. Above it, the box wins, because a high enough request rate keeps a GPU busy enough to justify its standing cost and you stop benefiting from paying per call.

For us the volume sat well to the left of that crossover. A demo and a few pilot logs a week is exactly the regime where an idle 750 to 1800 EUR GPU is mostly burning money on availability nobody is using. Lambda let us convert that fixed monthly cost into a near-zero floor that only moved when someone uploaded a log. The budget meter above makes the shape of that argument explicit: scrub the invocation volume and the serverless line stays under the GPU baselines until the request rate gets high, at which point the standing box becomes the rational choice. We were nowhere near that point, and that is precisely why the serverless deployment was correct for the stage we were at.

It is worth being clear about what the meter is and is not. The 750 and 1800 EUR monthly GPU baselines, the EFS-mounted inference path, and the CurveNet shape are the engagement's real figures. The per-invocation unit price, and therefore the exact location of the crossover, is an illustrative model of pay-per-use economics rather than a billed line item. The point it makes is structural and holds regardless of the precise price: a flat cost and a volume-scaling cost cross, and which side you are on is a function of how often the model actually runs.
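The structural argument reduces to one line of arithmetic: if a standing GPU costs G per month and each invocation costs p, the two lines cross at N = G / p invocations per month. The sketch below encodes that, with an optional fixed floor for storage and similar charges. The 750 and 1800 EUR baselines come from the text; the 0.25 EUR per-invocation price is a made-up illustrative value, not a billed figure, and the function names are ours.

```python
def serverless_monthly_cost(invocations, price_per_invocation, fixed_floor=0.0):
    """Pay-per-use cost: a small fixed floor plus a line that climbs with volume."""
    return fixed_floor + invocations * price_per_invocation


def break_even_invocations(gpu_monthly, price_per_invocation, fixed_floor=0.0):
    """Monthly invocation count at which serverless cost equals the GPU baseline."""
    if price_per_invocation <= 0:
        raise ValueError("price_per_invocation must be positive")
    return max(0.0, (gpu_monthly - fixed_floor) / price_per_invocation)


def cheaper_option(invocations, gpu_monthly, price_per_invocation, fixed_floor=0.0):
    """Name the cheaper side for a given monthly volume."""
    serverless = serverless_monthly_cost(invocations, price_per_invocation, fixed_floor)
    return "serverless" if serverless < gpu_monthly else "gpu"


# Illustrative price only: 0.25 EUR per invocation is an assumption.
for baseline in (750, 1800):
    print(baseline, break_even_invocations(baseline, 0.25))
# → 750 3000.0
# → 1800 7200.0
```

Changing the assumed price moves the crossover left or right but does not remove it, which is the sense in which the crossover in the meter is schematic.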

What the serverless deployment bought us

The package-size ceiling stops constraining model size once the weights live on EFS and are mounted at runtime, so a 128-dim, 5+5-stage, 2-attention encoder-decoder runs inside a runtime built for lightweight handlers.
Bursty, low-volume inference is the regime where pay-per-invocation beats a standing GPU: a demo and a handful of pilot logs sat far below the cost crossover, so an idle 750 to 1800 EUR per month box would have been availability nobody was using.
Cold start is the deliberate price of having no warm machine: the first request after an idle period waits for the runtime, mount, and weights, which is the trade we accepted to keep the compute floor near zero.

The cold-start trade, named honestly

No deployment is free of a downside, and serverless inference on a heavy model has an obvious one. When the function has been idle and a request arrives, it cold-starts: the runtime initialises, the EFS volume mounts, and the multi-hundred-megabyte checkpoint loads before any pixel is processed. That first response is slow in a way a warm GPU never is. For a live demo and pilot validation this was an acceptable cost, because the latency a reviewer feels once in a while is cheaper than a machine we pay for continuously. It is also the clearest signal of when this architecture stops being the right one. The moment inference volume climbs enough that cold starts are frequent and felt, the same crossover that governs cost governs experience, and the answer becomes a warm, always-on service. The serverless build was the correct deployment for a research model proving itself, not the permanent home of a production digitizer.
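One common way to confine that cost to the first request is to hold the loaded model in module-level state, which Lambda keeps alive for as long as it reuses the same execution environment. The sketch below shows the pattern and records whether a given call paid the load. The `loader` argument stands in for whatever reads the checkpoint from the mount, and the response fields are hypothetical; none of this is confirmed as the deployment's actual handler.

```python
import time

# Module-level state: initialised once per execution environment and
# reused by every warm invocation that lands on the same environment.
_MODEL = None


def get_model(loader, clock=time.perf_counter):
    """Return (model, load_seconds, was_cold), loading at most once."""
    global _MODEL
    if _MODEL is not None:
        return _MODEL, 0.0, False  # warm path: nothing to load
    start = clock()
    _MODEL = loader()  # cold path: read the checkpoint from the mounted volume
    return _MODEL, clock() - start, True


def handler(event, context=None, loader=lambda: "curvenet-placeholder"):
    """Lambda-style entry point; the default loader is a stand-in only."""
    model, load_seconds, cold = get_model(loader)
    # A real handler would run the forward pass with `model` here.
    return {"cold_start": cold, "load_seconds": round(load_seconds, 3)}
```

Logging the `cold_start` flag per request gives a direct measure of how often users actually feel the slow path, which is the signal for when to move to a warm service.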

The same two questions beyond well-logs

The pattern here is not specific to well-logs. Any team that has a heavy vision or sequence model, demand that is bursty rather than steady, and no appetite for an always-on GPU faces the same two questions we did, and the same two answers apply. Mounting weights from a network file system rather than baking them into the deployment artefact decouples model size from package limits, and pay-per-invocation pricing turns idle capacity from a fixed cost into a near-zero floor. The discipline is to know which side of the cost crossover you are on, and to move to warm infrastructure deliberately when volume crosses it rather than defaulting to a GPU before the traffic justifies it.

The honest limit is the one the cold start makes plain. Serverless is the right tool for proving a model in front of users on a thin budget; it is not the right tool for serving a high-throughput digitizer to a working asset team. For the stage this engagement was at, the trade was exactly the one to make: a 750 to 1800 EUR per month footprint avoided, a research checkpoint made callable, and a live demo that ran without a standing GPU.

References

[1] Amazon Web Services (2020). Using Amazon EFS for AWS Lambda in your serverless applications. https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/
[2] Chen, L.-C. et al. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV 2018. https://arxiv.org/abs/1802.02611
[3] Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762
