Inference Output Storage: Writing Results Back to S3 Cleanly

Read enough deployment post-mortems and a pattern shows up that no architecture diagram ever draws. The box marked model is small, well understood, and almost never the thing that failed. The arrow leaving that box, the one pointing at a cylinder labelled storage, is where the incident actually started. Inference ran. The prediction was correct. Then the result was written somewhere, by some code nobody reviewed as carefully as the network, and what landed in storage was subtly not what the model produced. The reader downstream picked it up, trusted it, and the wrongness propagated from there. The serving step that nobody specs is not the inference. It is the writeback.

This is a field note about that step for one concrete service: a system that digitises raster well logs, turning the thin ink traces on a scanned paper log back into digital curves. The model is a multiclass segmenter; the product is not the mask it emits but the files a petrophysicist later opens. Between those two things sits a writeback path, and over the life of the service that path broke more often, and more quietly, than the model ever did. The literature already named most of the reasons why. What we want to do here is credit that literature, walk the actual failure modes we hit, and lay out the handful of rules that made the writes boring again.

The arrow nobody reviews

The clearest statement of why this happens is a decade old. Sculley and colleagues, surveying the hidden technical debt in machine-learning systems, made the observation that has aged better than almost anything else in MLOps: only a small fraction of a real ML system is the model, and the surrounding majority is plumbing, configuration, and glue code that moves data and predictions from one place to another [1]. The writeback is glue code by that definition. It is the least glamorous line in the service and it carries more of the system's correctness than its line count suggests, precisely because nobody treats it as a place where correctness lives.

Our writeback had a deceptively simple job. A request arrives with one scanned log. The segmenter produces a mask with three classes: a background plane and two curve planes, one per logging curve on the track. Each curve plane is then traced into a centreline and resampled onto a fixed depth grid of three hundred points, so the deliverable for one request is a small, fixed set of objects: the mask raster, plus two CSVs of depth against value, three hundred rows each. That is the whole payload. It is not large. It is not complicated. And it broke anyway, because the thing that breaks is never the size of the payload. It is the contract the payload is supposed to satisfy.

Two pipelines that disagree about one file

The first failures we chased were the classic ones, and the data-validation literature had already drawn the map. Breck and colleagues, writing about validating data in production ML, put train and serve skew near the centre of their account of silent error: when the path that produces data for training and the path that produces it at serving time differ in even one step, the model is fed something subtly different from what it learned on, and nothing throws [2]. Our skew was on the output side rather than the input side, but it was the same disease. The training and evaluation code wrote CSVs one way, with a header row, a particular depth ordering, and a fixed column naming. The serving code, written later by a different hand, wrote them a slightly different way. Both produced files that opened cleanly. Only one matched what the reader expected, and the reader was the validation notebook that everyone trusted.

The fix for that class was not clever and we will not pretend it was. We wrote the output contract down, not in prose but in code: a single serialiser that both the offline evaluation path and the online serving path call, so there is exactly one place that decides how a curve becomes a CSV. After that, a file written during training and a file written in production have exactly the same layout (header, column names, depth ordering), because they came out of the same function. The principle is identical to the one the skew literature prescribes for inputs, applied to the side everyone forgets: make the two paths share the exact serialisation, not merely agree on it informally.
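A minimal sketch of that shared serialiser, assuming a two-column layout. The column names `depth` and `value`, the ascending depth order, and the function names are illustrative assumptions, not the service's actual code; the point is only that both paths call one function.

```python
import csv
import io
from typing import Sequence

# Hypothetical column names; the text only says the naming is fixed.
HEADER = ("depth", "value")


def serialise_curve(depths: Sequence[float], values: Sequence[float]) -> bytes:
    """Turn one traced curve into CSV bytes. The only place the layout is decided."""
    if len(depths) != len(values):
        raise ValueError("depth and value columns must have the same length")
    # Sort by depth so the row order never depends on the caller (assumed ascending).
    rows = sorted(zip(depths, values))
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADER)
    for depth, value in rows:
        # repr() of a float is the shortest string that round-trips it exactly.
        writer.writerow((repr(float(depth)), repr(float(value))))
    return buf.getvalue().encode("utf-8")


def write_eval_csv(depths: Sequence[float], values: Sequence[float]) -> bytes:
    """Offline evaluation path: delegates, adds no formatting of its own."""
    return serialise_curve(depths, values)


def write_serving_csv(depths: Sequence[float], values: Sequence[float]) -> bytes:
    """Online serving path: delegates to the same function."""
    return serialise_curve(depths, values)
```

With this shape, a difference between an offline file and a production file can only come from the data, never from two hands formatting the same curve differently.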

A file that opens is not a file that is correct

Every failure in this story produced an object that parsed without error. A CSV with two hundred and twelve rows is a perfectly valid CSV. A mask written under a key the reader does not list on is a perfectly valid object in a place nobody looks. The output contract is the set of properties a successful write must have beyond being syntactically well formed, and it is invisible until something violates it, because nothing in the storage layer enforces it for you.

The write that arrives half-finished

The failure that taught us the most was the one we did not see coming, and it is the one the serverless model makes likely rather than rare. We ran the service as functions, and the Berkeley view of serverless computing spells out exactly why that choice shapes the writeback: a function has a hard execution limit and no durable local state, so any work in flight when the clock runs out is simply abandoned [4]. If a function is streaming an object body into storage when it hits its timeout, the body stops mid-stream. The object that lands is not absent, which would be easy to detect. It is present and short.

That is the worst kind of artifact, because it has the shape of success. The status was fine. The key exists. The file parses. It just has fewer than three hundred rows, because the writer was cut off after row two hundred and twelve, and the truncation is invisible to anything that does not already know the row count was supposed to be three hundred. The model was right. The object is wrong. And the reader downstream, dutifully parsing what it found, has no way to distinguish a curve that genuinely ended early from a curve whose tail was eaten by a timeout, until something much further downstream produces a number that makes no geological sense.
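The point that a short file still parses can be shown in a few lines. This is a sketch with made-up depth and value numbers and an assumed `depth,value` header; it cuts a 300-row body after row 212 and reads both versions the way a trusting reader would, with no expectation about the row count.

```python
import csv
import io
from typing import Dict, List

EXPECTED_ROWS = 300  # depth points per curve, from the text

# A well-formed body: one header line plus 300 data rows (values are made up).
lines = ["depth,value"] + [
    f"{1000.0 + i * 0.5},{i * 0.1:.3f}" for i in range(EXPECTED_ROWS)
]
full_body = "\n".join(lines) + "\n"

# The same body as a writer cut off after row 212 would leave it.
truncated_body = "\n".join(lines[: 1 + 212]) + "\n"


def parse_rows(body: str) -> List[Dict[str, str]]:
    """Parse with no knowledge of how many rows there should be."""
    return list(csv.DictReader(io.StringIO(body)))


full_rows = parse_rows(full_body)
short_rows = parse_rows(truncated_body)

# Neither call raises; only the counts differ.
print(len(full_rows), len(short_rows))  # 300 212
```

Nothing in the parser distinguishes the two bodies. The only signal is a row count the reader has to already know.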

[Interactive figure: the per-request write ledger, with the integrity check run on each persisted artifact. A batch lever scales the write volume from 1 to 16 requests (16 is the multiclass batch size the service used), and a toggle switches the write mode between an atomic put, where each object lands whole, and a streamed write, where a timeout cuts the body short.]

A multiclass request produces a 3-class mask (one background plane plus two curve planes), and each curve plane is traced and resampled to a fixed 300-point depth grid before it is written to object storage as a CSV. So one request fans out into three artifacts: the mask raster and two depth-versus-value CSVs, each of which must carry exactly 300 rows, a monotonic depth column, no NaN holes, and the request key prefix the reader lists on. With an atomic put every object lands whole and passes the contract. With a streamed write a function timeout can cut the body short, so a CSV lands with fewer than 300 rows: the model was right, the object is corrupt, and the reader cannot tell until it parses a short file. The same correct model produces a clean ledger in one mode and a corrupt one in the other, with no change to the prediction at all. The 3 output classes, the 2 CSVs per request, the 300 depth points per curve, and the batch size of 16 are sourced from the engagement archive; the per-row byte size, the streamed short-row counts, and the key strings are illustrative.

The ledger makes the asymmetry concrete. The mask raster is a single put either way, so it survives. The CSVs are where the streamed mode lands short, and the integrity check is the only thing standing between a truncated file and a reader that trusts it. The check is unglamorous: every CSV must carry exactly three hundred rows, the depth column must increase monotonically, there must be no NaN holes, and the key must sit under the request prefix the reader will list on. None of those properties is exotic. All of them are properties the model cannot give you, because the model finished its job before the write began.
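The four properties in that check can be sketched as one function that returns a list of violations. The header row, the two-column layout, the reading of "monotonic" as strictly increasing, and the name `contract_violations` are assumptions made for the example.

```python
import csv
import io
import math
from typing import List

EXPECTED_ROWS = 300  # depth points per curve, from the text


def contract_violations(body: bytes, key: str, request_prefix: str) -> List[str]:
    """Return every way this CSV object breaks the output contract (empty if none)."""
    problems: List[str] = []
    if not key.startswith(request_prefix):
        problems.append(f"key {key!r} is outside prefix {request_prefix!r}")
    try:
        reader = csv.reader(io.StringIO(body.decode("utf-8")))
        next(reader, None)  # skip the header (assumed to be exactly one line)
        rows = [(float(r[0]), float(r[1])) for r in reader]
    except (UnicodeDecodeError, ValueError, IndexError) as exc:
        problems.append(f"body does not parse as depth,value floats: {exc}")
        return problems
    if len(rows) != EXPECTED_ROWS:
        problems.append(f"expected {EXPECTED_ROWS} rows, found {len(rows)}")
    if any(math.isnan(d) or math.isnan(v) for d, v in rows):
        problems.append("NaN hole in depth or value column")
    depths = [d for d, _ in rows]
    # Assumption: "monotonic" means strictly increasing depth.
    if any(b <= a for a, b in zip(depths, depths[1:])):
        problems.append("depth column is not strictly increasing")
    return problems
```

Returning all violations rather than stopping at the first keeps the ledger entry informative: a truncated file under the wrong prefix reports both problems.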

Writing as if the network will betray you

Once you accept that a write can arrive half-finished, the rest of the design follows from a question the storage and distributed-systems literature settled long ago: what happens when you retry? You will retry, because the function timed out, or the network blipped, or the queue redelivered the message, and the moment you retry you are in at-least-once territory. Helland's account of idempotence is the one to read here, and its title is the whole lesson: idempotence is not a medical condition, it is a property you have to engineer, and the way you engineer it for a writer is a stable key [3]. If the same request always writes to the same deterministic key, a retry overwrites rather than duplicates, and the second attempt that succeeds simply replaces the first attempt that did not.

We leaned on that hard. The key for every artifact is derived deterministically from the request, never from a clock or a random token, so a retried request lands exactly where the first attempt would have. Combined with the property that S3 now offers strong read-after-write consistency, which AWS shipped at the end of 2020 and which removed the old eventual-consistency excuse for a reader seeing a stale object [5], this gives a clean story: a writer that always targets the same key, a store that returns the most recent successful put for that key, and a retry policy that can fire as often as it needs to without ever leaving two competing versions of the same curve behind. The consistency guarantee does not make the write correct. It only means that once the write is correct, the reader is guaranteed to see it. The correctness is still entirely the writer's job.
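A sketch of deterministic key derivation, with a plain dictionary standing in for the bucket. The `results/` layout, the artifact file names, and the use of a SHA-256 digest of a request identifier are all hypothetical; the text only requires that the key depend on the request and on nothing else.

```python
import hashlib
from typing import Dict

# Hypothetical artifact names for the mask raster and the two curve CSVs.
ARTIFACTS = ("mask.png", "curve_1.csv", "curve_2.csv")


def artifact_key(request_id: str, artifact: str) -> str:
    """Derive the object key from the request alone: no clock, no random token."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()[:16]
    return f"results/{digest}/{artifact}"


# In-memory stand-in for object storage: a put replaces whatever is at the key.
store: Dict[str, bytes] = {}


def put(key: str, body: bytes) -> None:
    store[key] = body


# First attempt is cut short; the retry targets the same key and replaces it.
key = artifact_key("scan-0001", "curve_1.csv")
put(key, b"short body from a timed-out attempt")
put(artifact_key("scan-0001", "curve_1.csv"), b"full body from the retry")

print(len(store))  # 1
```

Against a real bucket the `put` stand-in would be an actual upload call such as boto3's `put_object`; the derivation of the key is the part that makes the retry safe.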

The last rule we added is the cheapest and the one we trust most. The writer validates the object against the contract before it acknowledges success, and it acknowledges nothing until the validation passes. A CSV with the wrong row count is never marked done; the request is failed and retried rather than quietly recorded as a success that happens to be short. This is the move that converts the streamed-truncation failure from an invisible corruption into a loud, catchable error, which is exactly what the data-validation literature argues you should do at every boundary an ML system writes across [2]. The validation lives in the writer because the writer is the last actor that knows what the object was supposed to contain. Push it any further downstream and you are validating a file whose author has already forgotten what it should have been.
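One way to sketch validate-before-acknowledge is a writer that puts the object, reads it back, checks it, and returns only when the check passes. The read-back step, the bounded retry loop, and the `FlakyStore` class that truncates its first few puts are assumptions made to keep the example self-contained; the check here covers only the row count.

```python
from typing import Callable, Dict, List

EXPECTED_ROWS = 300  # depth points per curve, from the text


class ContractError(Exception):
    """Raised instead of acknowledging a write that violates the contract."""


class FlakyStore:
    """In-memory stand-in for object storage that truncates its first N puts."""

    def __init__(self, truncated_puts: int) -> None:
        self.objects: Dict[str, bytes] = {}
        self.truncated_puts = truncated_puts

    def put(self, key: str, body: bytes) -> None:
        if self.truncated_puts > 0:
            self.truncated_puts -= 1
            # Keep the header plus 212 data rows, as if cut off mid-stream.
            body = b"".join(body.splitlines(keepends=True)[:213])
        self.objects[key] = body

    def get(self, key: str) -> bytes:
        return self.objects[key]


def row_count_violations(body: bytes) -> List[str]:
    data_rows = len(body.splitlines()) - 1  # subtract the header line
    if data_rows == EXPECTED_ROWS:
        return []
    return [f"expected {EXPECTED_ROWS} rows, found {data_rows}"]


def write_and_acknowledge(
    store: FlakyStore,
    key: str,
    body: bytes,
    check: Callable[[bytes], List[str]],
    max_attempts: int = 3,
) -> int:
    """Write, read back, validate. Return the attempt number that passed."""
    problems: List[str] = []
    for attempt in range(1, max_attempts + 1):
        store.put(key, body)              # same key every time, so a retry overwrites
        problems = check(store.get(key))  # validate what actually landed
        if not problems:
            return attempt                # only now is the write acknowledged
    raise ContractError(f"{key}: {problems}")


body = (
    "depth,value\n" + "".join(f"{i},{i * 0.1:.1f}\n" for i in range(EXPECTED_ROWS))
).encode("utf-8")
```

With `FlakyStore(truncated_puts=1)` the first attempt lands short, fails the check, and the second attempt replaces it; a store that truncates every put ends in `ContractError` rather than in a short file marked done.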

What the curve actually has to survive

Step back from the storage mechanics and the durable idea is about where you draw the boundary of correctness. It is tempting to draw it at the model: get the prediction right and the rest is transport. Every failure in this note lived strictly after the prediction was right, which means the boundary is in the wrong place. Correctness ends where the consuming process can finally read a whole, well-formed, contract-satisfying object, and everything between the model and that point, the serialisation, the key, the retry, the validation, is inside the boundary whether you reviewed it or not.

So the discipline we would hand to anyone shipping a model behind object storage is short and it is about the artifact, not the model. Decide the exact shape of every object a request must produce and write that contract as one serialiser both paths share. Make every key deterministic so a retry is an overwrite rather than a duplicate. Validate each object against its contract in the writer, before you call the write a success, so a half-finished body is failed loudly instead of recorded quietly. The model will get most of the attention and deserve some of it. The thing that decides whether a petrophysicist opens a real curve or a truncated one is the three hundredth row of a CSV that either made it to storage or did not, and that row has never once cared how good the model was.

Key takeaways

The serving step that breaks serverless deployments is the writeback, not the model. Every failure in this note happened strictly after the prediction was correct, which is exactly the plumbing-not-model debt that Sculley et al. identified as the bulk of a real ML system.
One request produces a fixed, small payload: a 3-class mask (background plus 2 curve planes) and two CSVs of 300 depth points each. The output contract is the set of properties beyond being well-formed (300 rows, monotonic depth, no NaN, the right key prefix) that the storage layer does not enforce for you.
Output-side train and serve skew is real: training and serving code that serialise the same curve two different ways both produce files that open, but only one matches what the reader expects. The fix is one shared serialiser, the output-side analogue of the skew discipline Breck et al. prescribe for inputs.
Serverless writes can arrive half-finished. A function timeout cuts a streamed body mid-stream, so a CSV lands present and short rather than absent, which is the worst case because it has the shape of success. The Berkeley serverless survey names the short execution limit and stateless model that make this likely.
Determinism plus validation makes the writeback boring. A stable key turns a retry into an idempotent overwrite rather than a duplicate (Helland), S3 strong read-after-write consistency guarantees the reader sees the latest good put, and validating each object against its contract before acknowledging success converts a silent truncation into a loud, retriable error.

References

[1] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., and Dennison, D. Hidden Technical Debt in Machine Learning Systems. NeurIPS (2015). The paper that named the small fraction of an ML system that is the model and the large fraction that is plumbing and glue code moving predictions to where they are consumed. https://proceedings.neurips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html

[2] Breck, E., Polyzotis, N., Roy, S., Whang, S. E., and Zinkevich, M. Data Validation for Machine Learning. SysML (2019). A production account of validating the data flowing through an ML system, with train and serve skew as a first-order source of silent error. https://mlsys.org/Conferences/2019/doc/2019/167.pdf

[3] Helland, P. Idempotence Is Not a Medical Condition. ACM Queue, 10(4) (2012). The reference statement of why at-least-once delivery forces every writer to be idempotent, and how a stable key turns a retry into a no-op rather than a duplicate. https://queue.acm.org/detail.cfm?id=2187821

[4] Jonas, E., Schleier-Smith, J., Sreekanti, V., Tsai, C.-C., Khandelwal, A., Pu, Q., Shankar, V., Carreira, J., Krauth, K., Yadwadkar, N., Gonzalez, J. E., Popa, R. A., Stoica, I., and Patterson, D. A. Cloud Programming Simplified: A Berkeley View on Serverless Computing. UC Berkeley Technical Report (2019). The survey of the constraints a function model imposes, including short execution limits and the absence of durable local state that lets a write be cut mid-body. https://arxiv.org/abs/1902.03383

[5] Amazon Web Services. Amazon S3 Update: Strong Read-After-Write Consistency. AWS News Blog (2020). The announcement that S3 GET, PUT, and LIST operations are strongly consistent, which removes the eventual-consistency excuse and puts the burden of a correct write back on the writer. https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
