The usual way to ship a model is to wait until it is good enough, then put a button in front of it. We did the opposite. We took VeerNet, the network we built to read scanned well-logs, and we wrapped it in a four-step self-serve dashboard while the model was still improving, and we let interpreters use it. Upload a scan, run segmentation, validate the overlay, download the CSV. The bet was that a tool a person can drive, with the model's accuracy shown to them rather than hidden from them, is more useful at an imperfect accuracy than a perfect model nobody can reach. This is the product story behind that bet, and what we learned by making it.
At a glance
The dashboard is four clicks, and the third one is where the model's honesty lives.
- 4 steps from scan to CSV
- 0.9891 peak R-squared shown at validate
- 300 depth points in the validation grid
The inverse of waiting for perfect accuracy
There is a default instinct in applied machine learning that says the product is the model, and the interface is something you bolt on at the end once the metrics are respectable. We think that instinct is backwards for a research-stage tool, and the digitisation work is a clean example of why. Our segmenter was good in places and rough in others. Peak goodness-of-fit on a well-recovered curve reached an R-squared of 0.9891, but the mask-overlap metric, intersection over union, peaked at 0.51, which is the honest signature of a model that finds curves reliably but does not trace every thin, faded pixel of them. If we had waited for IoU to climb before letting anyone touch the tool, the interpreters who actually needed digitised logs would have waited too, and we would have learned nothing about how the tool behaves in their hands.
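To make the two metrics concrete, here is a minimal sketch of how mask overlap and goodness-of-fit can disagree on the same trace. It is an illustration under stated assumptions, not VeerNet's evaluation code: the helper names (`one_pixel_mask`, `iou`, `r_squared`) and the toy six-row curve are hypothetical, and the formulas are the textbook definitions of intersection over union and the coefficient of determination.

```python
from typing import List, Sequence


def one_pixel_mask(columns: Sequence[int], width: int) -> List[List[int]]:
    """Build a binary mask with one curve pixel per row (toy data only)."""
    return [[1 if j == c else 0 for j in range(width)] for c in columns]


def iou(pred: Sequence[Sequence[int]], truth: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


def r_squared(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Coefficient of determination of predicted values against a reference."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for p, r in zip(predicted, reference))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    if ss_tot == 0:
        raise ValueError("reference curve is constant; R-squared is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical toy curve: the column index of the curve pixel in each row.
truth_cols = [1, 2, 3, 4, 5, 6]
pred_cols = [1, 3, 3, 5, 5, 6]  # two rows are traced one pixel off

mask_iou = iou(one_pixel_mask(pred_cols, 8), one_pixel_mask(truth_cols, 8))
fit = r_squared(pred_cols, truth_cols)
print(f"IoU = {mask_iou:.2f}, R-squared = {fit:.3f}")
```

On this toy curve the trace lands one pixel off in two of six rows, which halves the mask overlap (IoU of 0.5) while the recovered values still follow the reference closely (R-squared of about 0.886). A thin curve punishes small pixel offsets in IoU far more than in the values read off it.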
So we inverted it. We treated the dashboard as the product and the model as a component inside it that would keep getting better underneath. The released artefact was not a checkpoint; it was a workflow a petrophysicist could complete alone, from a raw scan to a downloadable depth-indexed CSV, without a calibration session, a support call, or a wait for us to run a notebook. The model's current accuracy became a fact the user could see and act on, not a gate we held shut on their behalf. This is the build-measure-learn discipline applied to a model rather than a feature: ship the smallest thing a user can actually run, then let real use tell you what to fix [2].
Why ship the imperfect tool
A perfect model behind a hard-to-reach interface helps no one this quarter. An imperfect model behind a four-step self-serve flow lets an interpreter digitise a log today, see exactly how confident the result is, and decide for themselves whether to trust it. The second one teaches you what to build next; the first one teaches you nothing.
The four steps, and what each one is for
The whole dashboard is four steps, deliberately. Each one maps to a single decision the user makes, and nothing on the screen asks them to understand the model to get their answer.
- Step 1, upload the scan. A scanned raster image goes in. There is no calibration step, no header rectangle to draw, no axis values to type. The work an interpreter used to do by hand before any tracing could begin is simply gone from the user's side of the screen.
- Step 2, run segmentation. The model runs on demand and returns a per-pixel mask of where each curve runs. The user does not configure it. They press one control and wait for the mask.
- Step 3, validate the overlay. This is the step the whole product is built around. The predicted curve is laid over the scan, and the model's real accuracy is put on the screen next to it, so the user can compare what the model claims against what the scan shows.
- Step 4, download the CSV. The validated curve comes out as a depth-indexed CSV, one value per depth, ready to load into whatever the interpreter works in next. The output is data, not a picture of a graph.
The exhibit above is the dashboard walked end to end. Stepping through it makes the asymmetry visible: three of the four steps are quiet plumbing, and one of them, validate, is where we deliberately exposed the model rather than smoothing over it.
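The step from a per-pixel mask to a depth-indexed CSV can be sketched as follows. This is a simplified illustration, assuming linear depth and value axes whose scales are already known to the tool; the function names and the scale parameters (`top_depth`, `depth_per_row`, `left_value`, `value_per_col`) are hypothetical and do not describe the dashboard's actual implementation.

```python
import csv
import io
from typing import List, Optional, Sequence, Tuple

Row = Tuple[float, Optional[float]]


def mask_to_rows(
    mask: Sequence[Sequence[int]],
    top_depth: float,
    depth_per_row: float,
    left_value: float,
    value_per_col: float,
) -> List[Row]:
    """Turn a binary curve mask into (depth, value) pairs, one per pixel row.

    Assumes linear axes. A row with no curve pixels yields a value of None,
    so a gap in the trace stays visible in the output instead of being filled.
    """
    rows: List[Row] = []
    for i, pixel_row in enumerate(mask):
        cols = [j for j, px in enumerate(pixel_row) if px]
        depth = top_depth + i * depth_per_row
        if cols:
            # Use the mean column of the curve pixels as the curve position.
            value: Optional[float] = left_value + (sum(cols) / len(cols)) * value_per_col
        else:
            value = None
        rows.append((depth, value))
    return rows


def rows_to_csv(rows: Sequence[Row]) -> str:
    """Serialise (depth, value) pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["depth", "value"])
    for depth, value in rows:
        writer.writerow([f"{depth:.2f}", "" if value is None else f"{value:.3f}"])
    return buf.getvalue()


# Hypothetical 3-row mask: the last row has no curve pixels.
toy_mask = [
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
toy_rows = mask_to_rows(toy_mask, top_depth=1000.0, depth_per_row=0.5,
                        left_value=0.0, value_per_col=10.0)
toy_csv = rows_to_csv(toy_rows)
print(toy_csv)
```

Leaving untraced depths as empty cells rather than interpolating over them is a design choice in this sketch. It keeps the download consistent with what the validate overlay showed.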
The validate step is where we put the model's honesty
If there is a single design decision that defines this product, it is that the model's metrics live on the validate step, facing the user, rather than in a report facing us. When the overlay appears, the user sees the recovered curve on top of their scan and, beside it, the numbers that say how much to trust it: a peak R-squared of 0.9891 for goodness-of-fit on a well-recovered curve, a peak IoU of 0.51 for mask overlap, and the 300-point validation grid every digitised curve is scored against. We did not pick the flattering number and hide the rough one. We showed both, because the gap between them is exactly the information the interpreter needs.
That gap has a plain reading. A high R-squared with a middling IoU tells the user that where the model traced the curve, it tracked the real values closely, but it did not claim every faded pixel, so a thin or broken interval is a place to look harder. An interpreter who can see that does not need us in the room. They glance at the overlay, glance at the metrics, and decide whether this particular curve on this particular scan is good enough for what they are about to do with it. The design literature on human-AI interaction makes this the central obligation of a system that can be wrong: make clear how well the system can do what it does, so the person can calibrate their own trust rather than over- or under-relying on it [1]. Surfacing the metrics at the moment of use is how we discharge that obligation, and it is the opposite of the all-or-nothing gate we would have built if we had waited for the model to be perfect.
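One way to turn "a thin or broken interval is a place to look harder" into something a screen can show is to list the depth ranges where the trace claimed no pixels. The sketch below assumes the 300 validation points are evenly spaced in depth and that an untraced point is represented as `None`; both are assumptions made for illustration, and `depth_grid` and `untraced_intervals` are hypothetical names.

```python
from typing import List, Optional, Sequence, Tuple

GRID_POINTS = 300  # grid size quoted in the article; even spacing is an assumption


def depth_grid(top: float, bottom: float, n: int = GRID_POINTS) -> List[float]:
    """Evenly spaced depths from top to bottom, inclusive."""
    step = (bottom - top) / (n - 1)
    return [top + i * step for i in range(n)]


def untraced_intervals(
    depths: Sequence[float], values: Sequence[Optional[float]]
) -> List[Tuple[float, float]]:
    """Return (start_depth, end_depth) for each contiguous run of missing values."""
    gaps: List[Tuple[float, float]] = []
    start: Optional[int] = None
    for i, value in enumerate(values):
        if value is None and start is None:
            start = i  # a gap opens here
        elif value is not None and start is not None:
            gaps.append((depths[start], depths[i - 1]))  # the gap closed at the previous point
            start = None
    if start is not None:
        gaps.append((depths[start], depths[-1]))  # the gap runs to the bottom of the grid
    return gaps


# Hypothetical curve: fully traced except for two faded intervals.
depths = depth_grid(1000.0, 1299.0)
values: List[Optional[float]] = [1.0] * GRID_POINTS
for i in list(range(120, 130)) + list(range(295, 300)):
    values[i] = None

gaps = untraced_intervals(depths, values)
coverage = sum(v is not None for v in values) / GRID_POINTS
print(f"traced {coverage:.0%} of the grid; look harder at: {gaps}")
```

Here 285 of the 300 grid points carry a value, and the two reported intervals are exactly the places where an interpreter would compare the overlay against the scan before trusting the curve.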
It also changes what an error costs. In a hidden-model product, a wrong curve is a silent defect that the user discovers downstream, if at all. In this dashboard, a wrong curve is a visible disagreement between the overlay and the scan, sitting next to a metric that already warned the user the trace was uncertain. The failure is legible at the point of decision, which is the only place it is cheap to catch.
What the self-serve dashboard is built on
- The product is the four-step workflow, not the checkpoint: a petrophysicist goes from a raw scan to a depth-indexed CSV alone, with no calibration session and no wait for the team to run a notebook.
- The validate step exposes the model's real accuracy to the user (peak R-squared 0.9891, peak IoU 0.51, scored on a 300-point grid) so they calibrate their own trust, rather than hiding the numbers behind a pass/fail gate.
- Shipping while the model improved, the inverse of waiting for perfect accuracy, is what let real use, not our assumptions, tell us where the tool actually needed work.
Why self-serve at all, and what it bought us
Self-serve was not a convenience feature. It was the mechanism that made the build-measure-learn loop close at all. As long as digitising a log required us, every result was filtered through our hands and our context, and we never saw how the tool behaved when a real interpreter, with their own scans and their own judgement, drove it. Putting the whole flow behind four buttons removed us from the loop, and that is precisely what turned the dashboard from a demo into an instrument we could learn from.
It bought us three things. It bought us volume of real use, because a tool someone can run alone gets run, while a tool that needs a support call gets shelved. It bought us honest feedback, because an interpreter validating their own overlay against their own scan tells you far more than a metric on a held-out set ever will. And it bought us a posture: by showing the model's accuracy rather than asserting it, we made the product credible at an imperfect stage, which is the only kind of credibility a research tool can earn. The serving side made this affordable: the same on-demand inference path that let a heavy model run without an always-on machine also let us put it in front of users on a research-stage budget [3]. The product side is what made that affordability worth anything.
The honest limits of shipping early
Shipping while the model improves is the right call, but it is not a free one, and pretending otherwise would undercut the whole argument. The IoU of 0.51 is a real ceiling on this version: there are scans, the thin and faded and overlapping ones, where the overlay will be visibly wrong, and the dashboard's job in those cases is to make that wrongness obvious rather than to hide it. That works only because the validate step is doing its job. A self-serve tool that shipped early without surfacing its own accuracy would not be honest; it would be reckless. The metrics on the validate step are the thing that earns the right to ship before perfection.
The other limit is that self-serve raises the bar on the parts of the workflow that are not the model. If a user is alone, the upload has to be forgiving, the segmentation control has to be unambiguous, and the validate overlay has to be readable without us narrating it. The model can be improving underneath, but the four steps around it have to work on the first try, because there is no one in the room to recover a confused user. That is the trade we accepted: we let the model be imperfect and demanded that the product not be.
For the stage this engagement was at, it was the right trade. We put a working digitisation tool in interpreters' hands quarters earlier than a wait-for-perfect plan would have, we showed them exactly how far to trust it, and we let their real use, not our assumptions, point at what to fix next. A four-step dashboard with an honest validate step beat a perfect model nobody could reach, which was the whole bet.
References
[1] Amershi, S. et al. (2019). Guidelines for Human-AI Interaction. CHI 2019. https://www.microsoft.com/en-us/research/publication/guidelines-for-human-ai-interaction/
[2] Ries, E. (2011). The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. Crown Business. https://theleanstartup.com/book
[3] Amazon Web Services (2020). Using Amazon EFS for AWS Lambda in your serverless applications. https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/