There is a quiet failure that hides at the very end of a raster well-log digitisation pipeline, after the hard parts are done. The segmentation network has run, the mask has been traced into a centreline, and a smooth curve has been fitted, so a predicted petrophysical curve finally exists as a column of numbers indexed by depth. Sitting next to it is the ground truth, the digital LAS curve someone produced by hand long ago. The obvious next move is to subtract one from the other and report the mean absolute error. The obvious next move is also wrong, because the two curves almost never live on the same depth grid. The predicted curve inherits its spacing from the image: one sample per pixel row, at whatever scan resolution the raster happened to use. The LAS curve was recorded at the logging tool's own sampling rate, often every half foot. Subtract them sample for sample and you are differencing two arrays that do not describe the same depths. The number you get back is not error. It is noise wearing the costume of error.
The fix is unglamorous and we apply it on every recovered curve we hand to a validation notebook: resample both the prediction and the ground truth onto one shared depth grid before computing a single metric. In our pipeline that shared grid is 300 depth points spanning the comparison interval, and every MAE or MSE we quote is computed only after both curves have been interpolated onto it. This piece is about why that step is load-bearing, why it is so easy to skip, and what the metric actually means once you do it right.
Two curves, two grids, one meaningless subtraction
Be precise about the geometry, because the geometry is the whole argument. A predicted curve coming out of the digitisation stack is sampled at the image's vertical pixel cadence. A scanned log a few thousand pixels tall yields a few thousand predicted samples across the interval. The ground-truth LAS for the same interval might carry only a few hundred samples, recorded at the tool's depth step. The two arrays have different lengths, different start depths, and different spacing. There is no element i in one that corresponds to element i in the other.
If you ignore that and zip the arrays together by index, several things go wrong at once. The shorter array runs out first, so you either truncate the longer one or pad the shorter one, and both choices fabricate data. Even where both arrays have values, index i of the prediction sits at a shallower or deeper depth than index i of the truth, so you are comparing the curve at one depth to the curve at another. On a smoothly varying log that mismatch alone can manufacture an error of the same magnitude as the real reconstruction error, which means your reported MAE is dominated by a registration artefact you introduced in the last line of the notebook. The model could be excellent and the metric would still look mediocre. Worse, the model could be mediocre and a lucky alignment could make the metric look excellent. Either way the number has stopped measuring what you think it measures.
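To make the index-zip failure concrete, here is a minimal sketch on synthetic data. The `curve` function, the depth interval, and the sample counts are invented for illustration and are not taken from our pipeline. The prediction is constructed to be perfect, so any error the naive path reports is pure registration artefact.

```python
import numpy as np

def curve(depth):
    # Hypothetical smooth log response; a stand-in for a real curve.
    return np.sin(depth / 15.0)

# Prediction: one sample per pixel row (assumed: 2000 rows over 1000-1500 ft).
pred_depth = np.linspace(1000.0, 1500.0, 2000)
pred = curve(pred_depth)  # perfect by construction

# Ground truth: LAS-style samples every half foot over the same interval.
true_depth = np.arange(1000.0, 1500.5, 0.5)
truth = curve(true_depth)

# Naive path: truncate to the shorter array and subtract by index.
n = min(len(pred), len(truth))
naive_mae = np.mean(np.abs(pred[:n] - truth[:n]))

# Depth-aware path: evaluate the prediction at the truth's own depths first.
pred_at_truth = np.interp(true_depth, pred_depth, pred)
aligned_mae = np.mean(np.abs(pred_at_truth - truth))

print(f"naive MAE:   {naive_mae:.4f}")
print(f"aligned MAE: {aligned_mae:.6f}")
```

Under these assumptions the naive figure comes out large even though the prediction is perfect, while the depth-aware figure is close to zero, limited only by interpolation error.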
Key figures: 300 shared depth points that both curves are resampled onto; CSV MAE after alignment (Dice loss) of 0.11 on curve-1 and 0.12 on curve-2.
Resample first, then measure
The discipline is one sentence: never compute a per-point metric across two curves until both have been placed on the same depth axis. Concretely, pick a target depth grid, interpolate each curve onto it, and only then subtract. We use a fixed 300-point grid over the comparison interval. The choice of 300 is deliberate but not magic; it is dense enough to preserve the shape of the recovered curve without oversampling beyond what the source data can support, and fixing it means every curve, every well, and every loss-function experiment reports its error on identical footing. The grid becomes a contract. Once both curves agree on the depth axis, every point on the prediction has a real, depth-matched partner on the ground truth, and the absolute difference at each of the 300 points is a true residual rather than a registration ghost.
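The resample-then-subtract step can be sketched as follows. This is an illustration rather than our production code: the function names are hypothetical, linear interpolation via `numpy.interp` is an assumed choice, and taking the comparison interval to be the depth range both curves cover is also an assumption.

```python
import numpy as np

N_GRID = 300  # shared grid size quoted in the text

def shared_grid(depth_a, depth_b, n=N_GRID):
    """Return n evenly spaced depths over the interval both curves cover."""
    top = max(np.min(depth_a), np.min(depth_b))
    base = min(np.max(depth_a), np.max(depth_b))
    if top >= base:
        raise ValueError("curves do not overlap in depth")
    return np.linspace(top, base, n)

def resample(depth, values, grid):
    """Linearly interpolate a curve onto grid.

    np.interp requires increasing x, so sort by depth first.
    """
    depth = np.asarray(depth, dtype=float)
    values = np.asarray(values, dtype=float)
    order = np.argsort(depth)
    return np.interp(grid, depth[order], values[order])

def aligned_mae(pred_depth, pred, true_depth, truth):
    """MAE computed only after both curves sit on the same depth grid."""
    grid = shared_grid(pred_depth, true_depth)
    residual = resample(pred_depth, pred, grid) - resample(true_depth, truth, grid)
    return float(np.mean(np.abs(residual)))
```

Restricting the grid to the overlapping depth range keeps `np.interp` from extrapolating, since outside the input range it simply repeats the end values.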
The interpolation itself is the least interesting part of the story, which is the point. One-dimensional interpolation of an irregularly sampled curve onto a regular grid is a solved problem and a single call in the standard scientific-Python stack [1][2][3]. The intellectual work is not in the resampling; it is in remembering to do it, and in resisting the urge to read a metric off two arrays just because they happen to be the same length after a careless truncation. The credit for the resampling machinery belongs to that ecosystem; what is ours is the architecture that produces the curves, VeerNet, and the validation discipline that wraps a fixed 300-point grid around every error we report.
The instrument below makes the failure visible. It overlays a predicted curve and a ground-truth curve at their native, mismatched sampling, where a per-sample subtraction is genuinely undefined. Toggle to the shared 300-point grid and the curves snap onto a common axis where MAE finally has a meaning, settling on the reported figures of 0.11 for curve-1 and 0.12 for curve-2 under Dice loss. Then drag the depth-offset slider to mis-register the prediction on purpose and watch the same MAE inflate above its true value, which is exactly the silent bug a common grid removes.
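The effect of deliberately mis-registering the prediction can also be reproduced offline. The sketch below uses synthetic stand-in curves and a hypothetical helper, `mae_with_offset`; none of the numbers it produces are the reported figures, and linear interpolation is an assumed choice.

```python
import numpy as np

def mae_with_offset(pred_depth, pred, true_depth, truth, offset, n=300):
    """MAE on a shared n-point grid after shifting the prediction's depths."""
    shifted = pred_depth + offset
    top = max(shifted.min(), true_depth.min())
    base = min(shifted.max(), true_depth.max())
    grid = np.linspace(top, base, n)
    pred_g = np.interp(grid, shifted, pred)
    truth_g = np.interp(grid, true_depth, truth)
    return float(np.mean(np.abs(pred_g - truth_g)))

# Synthetic curves (assumed shapes, not real log data).
true_depth = np.arange(1000.0, 1500.5, 0.5)
truth = np.sin(true_depth / 15.0)
pred_depth = np.linspace(1000.0, 1500.0, 2000)
pred = np.sin(pred_depth / 15.0)

offsets = [0.0, 1.0, 2.0, 5.0]  # depth shift in feet
maes = [mae_with_offset(pred_depth, pred, true_depth, truth, o) for o in offsets]
for o, m in zip(offsets, maes):
    print(f"offset {o:4.1f} ft -> MAE {m:.4f}")
```

With zero offset the MAE is near zero; each increase in offset raises it, even though the predicted values themselves never change.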
Why MAE and MSE both need the same grid
It is worth being clear that this is not an MAE-specific concern. Every per-point error metric inherits the same dependency on alignment. Mean squared error, which squares each residual, is if anything more sensitive, because squaring weights large residuals disproportionately, and a depth mis-registration produces its largest residuals exactly where the curve changes fastest. The recovered curves in our pipeline carry both metrics under Dice loss: an MAE of 0.11 and 0.12, and an MSE of 0.03 and 0.04, on the two curves. Both numbers are only trustworthy because they were computed on the shared 300-point grid. If you reported MAE on the aligned grid but MSE on the raw arrays, the two metrics would tell contradictory stories about the same prediction, and you would waste an afternoon trying to reconcile a discrepancy that exists only because half your numbers were computed against a depth axis the other half did not use.
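One way to keep the two metrics from drifting apart is to derive both from a single residual vector. A minimal sketch, assuming both inputs have already been resampled onto the shared grid; `pointwise_errors` is a hypothetical name, not a function from our codebase.

```python
import numpy as np

def pointwise_errors(pred_on_grid, truth_on_grid):
    """Compute MAE and MSE from one residual vector on the shared grid."""
    pred_on_grid = np.asarray(pred_on_grid, dtype=float)
    truth_on_grid = np.asarray(truth_on_grid, dtype=float)
    # Equal shapes are necessary but not sufficient: two arrays can match in
    # length after a careless truncation and still describe different depths.
    if pred_on_grid.shape != truth_on_grid.shape:
        raise ValueError("resample both curves onto the shared grid first")
    residual = pred_on_grid - truth_on_grid
    return {
        "mae": float(np.mean(np.abs(residual))),
        "mse": float(np.mean(residual ** 2)),
    }
```

Because both values come from the same `residual`, they cannot end up computed against different depth axes.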
There is also a subtler reason to fix the grid across experiments rather than letting it float. When we compared segmentation losses, each candidate produced its own recovered curves, and the only way the resulting MAE values were comparable to each other was that every one of them was measured on the same 300-point grid. Change the grid between experiments and you reintroduce, at the comparison level, the very mismatch you removed at the curve level. A fixed grid is what lets a loss-function bake-off be a fair fight rather than a set of numbers measured in different units.
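A fixed grid across experiments can be enforced by building the grid once and passing it to every scoring call. The sketch below is illustrative: `score_candidates`, the candidate dictionary layout, and linear interpolation are all assumptions rather than a description of our bake-off code.

```python
import numpy as np

def score_candidates(candidates, true_depth, truth, grid):
    """Return MAE per candidate, every one measured on the same grid.

    candidates maps a label (e.g. a loss-function name) to a
    (depth, values) pair of arrays for that candidate's recovered curve.
    """
    truth_on_grid = np.interp(grid, true_depth, truth)
    scores = {}
    for name, (depth, values) in candidates.items():
        pred_on_grid = np.interp(grid, depth, values)
        scores[name] = float(np.mean(np.abs(pred_on_grid - truth_on_grid)))
    return scores
```

Since `grid` is an explicit argument rather than something each call derives for itself, a candidate cannot quietly be scored on a different depth axis from its rivals.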
The unglamorous step is the one that makes the rest count
The lesson generalises beyond well logs to any time series you reconstruct and then score against a reference. Predicted signal and reference signal rarely share a sampling rate, and the temptation to subtract them as-is is strong precisely because it is one line of code. That one line silently assumes an alignment that is not there. The honest version is two lines: resample both onto a shared axis, then subtract. It feels like bookkeeping, and it is, but it is the bookkeeping that determines whether the impressive metric at the bottom of your notebook is measuring your model or measuring your own carelessness.
None of the architecture work matters if the validation step that grades it is broken. A model that recovers a curve beautifully will be reported as mediocre if its output is differenced against ground truth on the wrong grid, and a project can spend weeks chasing model improvements that are really just absorbing a fixed registration error that a single resampling call would have removed. Putting both curves on a common 300-point depth grid is the least glamorous line in the pipeline and one of the most consequential, because it is the line that lets every other number be believed.
Key takeaways
- A predicted curve from a raster log digitiser is sampled at the image's pixel cadence; the ground-truth LAS is sampled at the logging tool's depth step. The two arrays have different lengths, start depths, and spacing, so subtracting them sample for sample compares the curve at one depth to the curve at another and reports a registration artefact, not error.
- The fix is to resample both the prediction and the ground truth onto one shared depth grid before computing any per-point metric. We use a fixed 300-point grid over the comparison interval, dense enough to preserve curve shape without oversampling, so every point on the prediction has a real depth-matched partner on the truth.
- The resampling itself is a solved, one-line problem in the standard scientific-Python stack; the discipline is remembering to do it and resisting the urge to read a metric off two arrays just because a careless truncation made them the same length.
- Once aligned on the 300-point grid, the recovered curves report an honest CSV MAE of 0.11 (curve-1) and 0.12 (curve-2) and MSE of 0.03 and 0.04 under Dice loss. MSE is even more sensitive to mis-registration than MAE because squaring gives disproportionate weight to the largest residuals a depth offset creates.
- Fix the grid across experiments, not just within one. A loss-function comparison is only a fair fight if every candidate's MAE is measured on the same 300-point grid; let the grid float and you reintroduce, at the comparison level, the mismatch you removed at the curve level.