Measuring a Digitiser With R-Squared, MAE, and MSE Against Ground Truth

Once VeerNet has turned a scanned well log into masks and those masks back into a depth-indexed curve, the curve sits next to the ground-truth curve and somebody has to say how close it is. The honest answer is not one number. It is at least three, and the reason we carry three is that they measure different things and, on a hard curve, disagree with each other loudly enough that reporting only one would mislead whoever reads it. This note is a plain-English account of what the coefficient of determination, mean absolute error, and mean squared error each actually measure on a recovered one-dimensional curve, why they can point in different directions, and how to sample the curve so the comparison is fair. It assumes the depth axes are already aligned, so every point on the recovered curve is being compared against the truth at the same depth. Getting them onto the same axis is a separate job with its own failure modes; here the axes agree and the only question left is how well the values agree.

Start with the fact the three numbers exist to summarise. At every one of the depths we sample, there is a residual: the recovered value minus the true value. On a curve read over the 300 interpolated depth points a validation run uses, that is 300 residuals, most small, some not. All three metrics are just different ways of collapsing that cloud of residuals into a single scalar, and the choice of how you collapse it is the choice of what you are willing to notice. That is the whole story, and everything below is a consequence of it.
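The collapse described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's code: the function names and the five-point curves are invented for the example, and the R-squared form used here (one minus the ratio of squared residuals to the variance of the truth) is an assumption chosen because it matches the "zero is no better than the mean" reading.

```python
from statistics import fmean


def residuals(recovered, truth):
    """Recovered minus true value at each shared depth sample."""
    if len(recovered) != len(truth):
        raise ValueError("curves must be sampled at the same depths")
    return [r - t for r, t in zip(recovered, truth)]


def summarise(recovered, truth):
    """Collapse one residual cloud into R-squared, MAE and MSE."""
    res = residuals(recovered, truth)
    mae = fmean(abs(e) for e in res)   # average size of the miss
    mse = fmean(e * e for e in res)    # average squared miss
    truth_mean = fmean(truth)
    ss_res = sum(e * e for e in res)
    ss_tot = sum((t - truth_mean) ** 2 for t in truth)
    # Assumed form: 1 - SS_res / SS_tot (undefined if the truth is flat).
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "mse": mse}


# Invented five-point curves, already on a shared depth grid.
truth = [0.10, 0.30, 0.50, 0.70, 0.90]
recovered = [0.12, 0.28, 0.50, 0.73, 0.90]
print(summarise(recovered, truth))
```

All three numbers come out of the same list of residuals; only the way the list is averaged changes.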

R-squared answers a question about shape, not about size

The coefficient of determination, R-squared, is the one people quote first, and it is also the one most often quoted alone, which is where trouble starts. R-squared reports how much of the variation in the true curve the recovered curve accounts for, on a scale where one is perfect tracking and zero is no better than predicting the mean of the truth everywhere. Read plainly, it is a question about shape: does the recovered curve go up where the truth goes up and down where it goes down, in the right proportion? A curve that follows every wiggle of the truth scores near one even if it sits a little high or low, and a curve that ignores the truth and draws a flat line through the average scores near zero no matter how close that flat line happens to be on average.

That framing is exactly why R-squared can flatter a model that is quietly wrong, a point Legates and McCabe made carefully for hydrologic models and which transfers directly to a recovered log curve [3]. When it is computed as the squared correlation between recovered and true values, R-squared is blind to a consistent offset or a constant scale factor: a recovered curve that is shifted by a fixed amount at every depth, or scaled by a constant, can still report a high R-squared while being systematically off. The residual-based form, one minus the ratio of squared error to the variance of the truth, does fall when the level is wrong, but a small offset set against a large swing in the truth still costs it little. Either way it rewards getting the shape right and is lenient about getting the level wrong. On our runs the best recovered curve reported an R-squared of 0.9891, which says the shape tracking was excellent, and taken alone it would be tempting to stop there. It is a real and good number. It just does not tell you how far off the curve is in the units a petrophysicist reads.
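How much an offset is forgiven depends on which R-squared is being computed. The sketch below, with invented data and hypothetical function names, shifts a curve by a fixed 0.25 at every depth and scores it two ways: as a squared Pearson correlation and as one minus the ratio of squared residuals to the variance of the truth. Which form a given pipeline uses is not stated here, so both are shown.

```python
from statistics import fmean


def r2_correlation(recovered, truth):
    """Squared Pearson correlation: blind to constant offset and scale."""
    mr, mt = fmean(recovered), fmean(truth)
    cov = sum((r - mr) * (t - mt) for r, t in zip(recovered, truth))
    var_r = sum((r - mr) ** 2 for r in recovered)
    var_t = sum((t - mt) ** 2 for t in truth)
    return cov * cov / (var_r * var_t)


def r2_residual(recovered, truth):
    """1 - SS_res / SS_tot: falls when the level is wrong."""
    mt = fmean(truth)
    ss_res = sum((r - t) ** 2 for r, t in zip(recovered, truth))
    ss_tot = sum((t - mt) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


truth = [0.10, 0.30, 0.50, 0.70, 0.90]
shifted = [t + 0.25 for t in truth]  # right shape, wrong level

print("squared correlation:", r2_correlation(shifted, truth))
print("residual-based R2:  ", r2_residual(shifted, truth))
print("MAE:                ", fmean(abs(s - t) for s, t in zip(shifted, truth)))
```

On this toy curve the squared correlation stays at essentially one while the residual-based form drops to roughly 0.22 and the MAE reads 0.25, which is the gap a second metric is there to catch.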

MAE answers the question a reader actually feels

Mean absolute error answers the size question that R-squared skips. It is the average of the absolute residuals: take the miss at every depth, drop the sign, average them. In plain terms it is the typical distance between the recovered value and the truth, in the same units as the curve. There is nothing clever in it, and that is its virtue. If MAE is 0.02, then across the sampled depths the recovered curve is off by about 0.02 on average, and that is a sentence a geoscientist can act on. On the same good curve where R-squared was 0.9891, the MAE was 0.0132, so the typical miss was small in absolute terms too, and the two numbers agreed that the fit was good.

The reason MAE deserves its own slot, rather than being replaced by a squared-error summary, is that it treats every unit of error the same. A miss of 0.1 counts exactly ten times a miss of 0.01, no more. Willmott and Matsuura made the case that this linearity is what makes MAE a clean statement of average error, because it does not smuggle the variance of the error distribution into the headline the way a squared summary does [1]. When we want to know how wrong the curve is on a normal depth, the depth a reader will land on most of the time, MAE is the number that says it without exaggeration and without hiding anything.

MSE answers a question about the tail

Mean squared error takes the same residuals, squares each one before averaging, and by squaring it changes what the number is sensitive to. Squaring makes a miss of 0.1 count a hundred times a miss of 0.01, not ten times, so a handful of large residuals dominate the average while the many small ones fade. MSE is therefore a question about the tail of the error distribution: are there rare, large misses I should be afraid of? A curve that is close almost everywhere but blows out at three depths will have a modest MAE and a disproportionately large MSE, and that gap between them is the signal. On the good curve the MSE was 0.0004, tiny, which confirmed there were no large excursions hiding behind the small average miss.
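A small sketch makes the tail sensitivity concrete. Both residual clouds below are invented: one misses by the same small amount at every depth, the other is nearly exact at 297 depths and blows out at three, with the sizes chosen so that the two share the same MAE. The helper names are hypothetical.

```python
from math import sqrt
from statistics import fmean


def mae(res):
    """Mean absolute error of a list of residuals."""
    return fmean(abs(e) for e in res)


def mse(res):
    """Mean squared error of a list of residuals."""
    return fmean(e * e for e in res)


n = 300
# Hypothetical cloud A: every depth misses by the same small amount.
even = [0.02] * n
# Hypothetical cloud B: 297 depths nearly exact, three depths blow out,
# sized so the MAE matches cloud A.
spiky = [0.01] * (n - 3) + [1.01] * 3

for name, res in (("even", even), ("spiky", spiky)):
    ratio = sqrt(mse(res)) / mae(res)  # equals 1 only when all misses are equal in size
    print(f"{name:6s} MAE={mae(res):.4f}  MSE={mse(res):.4f}  RMSE/MAE={ratio:.2f}")
```

The two clouds are indistinguishable by MAE, yet the spiky one has an MSE more than twenty times larger. The ratio of the square root of MSE to MAE is a scale-free way to read the same gap.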

None of this makes MSE a worse metric than MAE, and it is worth being explicit about that, because the two are sometimes pitted against each other as if one had to win. Chai and Draxler argued the opposite: when the large errors are precisely the ones that matter most, penalising them harder is the correct behaviour, not a distortion [2]. On a well log, a large miss at a single depth can be the difference between calling a pay zone and missing it, so a metric that screams when the tail is bad is doing its job. The point is not that MSE is right and MAE is wrong. It is that they answer different questions, and you want to hear both answers.

The same recovered 1D curve read three ways against ground truth, on evenly spaced depth samples, with depth-axis alignment assumed. Toggle the curve under the lens: on the easy curve all three summaries agree the fit is good (R-squared 0.9891, MAE 0.0132, MSE 0.0004). On the harder second curve they disagree sharply: R-squared falls to 0.5461 because the recovered series stops tracking the shape, yet the typical miss (MAE 0.1241) is only about ten times larger while the squared-error summary (MSE 0.0253) is roughly sixty times larger, with squaring giving the larger depth misses extra weight. The residual whiskers on the left are the raw material all three metrics summarise differently; the three gauges on the right each carry the one question its metric answers. The single orange element is the disagreement bracket that appears on the harder curve, where R-squared has collapsed but the typical miss has stayed in a workable range. Every reported number is sourced from the engagement archive; the drawn curve geometry is illustrative shape only, and the reader should report all three rather than lean on any one.

When the three disagree, the disagreement is the information

On an easy curve the three numbers move together and the choice between them feels academic. The interesting case is the hard one, and our archive has a clean example. On a harder recovered curve, the second curve in a two-curve read, the coefficient of determination fell to 0.5461. Taken alone that reads as a near-failure: the recovered curve was only loosely tracking the shape of the truth. But the mean absolute error on that same curve was 0.1241, only about ten times the 0.0132 of the good curve, which says the typical depth was still recovered to within roughly 0.12 in the curve's own units. And the mean squared error was 0.0253, roughly sixty times the good curve's 0.0004. Part of that larger factor is units, since a squared quantity grows a hundredfold when every residual grows tenfold, so the like-for-like figure is its square root: about 0.16, clearly above the 0.1241 typical miss.

Read together, those three numbers tell a story no one of them tells alone. The shape tracking genuinely degraded, which is the R-squared collapse. The typical miss grew but stayed in a workable range, which is the modest MAE growth. And the squared-error summary sits above what the typical miss alone would give: a uniform miss of 0.1241 at every depth would produce an MSE of about 0.0154, not 0.0253. That is the pattern to expect when a minority of depths go badly wrong while most stay near the typical miss, dragging the squared average up while leaving the absolute average comparatively calm. It is a specific, actionable working diagnosis: not a uniformly bad curve, but a mostly acceptable curve with a limited number of large local failures, to be confirmed against the residuals themselves. If we had reported only R-squared we would have written the curve off; if we had reported only MAE we would have missed the outliers; if we had reported only MSE we would have known something was badly wrong without knowing it was localised. The disagreement between them is the diagnosis.
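Because MSE is in squared units, its growth factor is easier to set beside MAE's after taking a square root. The arithmetic below uses only the archive values quoted in this note. They are already rounded, so the derived figures are approximate, the good curve's especially, since its MSE carries a single significant digit.

```python
from math import sqrt

# Archive values as quoted in the text (already rounded).
good = {"r2": 0.9891, "mae": 0.0132, "mse": 0.0004}
hard = {"r2": 0.5461, "mae": 0.1241, "mse": 0.0253}

mae_growth = hard["mae"] / good["mae"]
mse_growth = hard["mse"] / good["mse"]

# Square roots put the squared-error summary back in the curve's own units.
rmse_good = sqrt(good["mse"])
rmse_hard = sqrt(hard["mse"])
rmse_growth = rmse_hard / rmse_good

print(f"MAE growth  x{mae_growth:.1f}")
print(f"MSE growth  x{mse_growth:.1f}")
print(f"RMSE growth x{rmse_growth:.1f}")
print(f"hard curve: RMSE {rmse_hard:.3f} against MAE {hard['mae']:.4f}")
```

The factors come out near 9.4 for MAE, 63 for MSE and 8 for the square root of MSE. Growth factors alone do not locate the large misses; for that the residuals have to be read against depth.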

Sampling the curve fairly is what makes the numbers mean anything

All three metrics assume the residuals they average are a fair sample of the curve, and this is the assumption most easily broken in practice. Because we compare on a fixed grid of interpolated depth points, evenly spaced along the depth axis, every stretch of the curve contributes in proportion to how much of the depth interval it occupies, not to how many raw pixels it happened to span on the scan. A raster log does not have uniform data density: a curve can be drawn densely in one interval and sparsely in another depending on how the original was plotted. If we averaged over raw samples instead of even depths, a densely drawn interval would vote more than a sparse one and the metric would quietly describe the busy part of the log rather than the whole of it. Sampling evenly across depth is what keeps each metric an honest statement about the entire curve. It does depend on the depth axis already being aligned, which we are assuming here; a misaligned axis would compare each recovered value against the truth at the wrong depth and corrupt all three metrics at once, which is exactly why alignment is treated as its own upstream step and not folded into this comparison.
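The effect of even-depth sampling can be shown with a toy case. The sketch below assumes linear interpolation, which the text does not specify, and uses invented curves and hypothetical helper names: the recovered curve is exact over a densely drawn first tenth of the interval and off by 0.1 over a sparsely drawn remainder.

```python
from bisect import bisect_right


def interp(depth, depths, values):
    """Linear interpolation at one depth; depths must be strictly ascending."""
    if depth <= depths[0]:
        return values[0]
    if depth >= depths[-1]:
        return values[-1]
    i = bisect_right(depths, depth)
    d0, d1 = depths[i - 1], depths[i]
    w = (depth - d0) / (d1 - d0)
    return values[i - 1] + w * (values[i] - values[i - 1])


def even_grid(top, bottom, n=300):
    """n evenly spaced depths from top to bottom inclusive."""
    step = (bottom - top) / (n - 1)
    return [top + k * step for k in range(n)]


# Invented truth: a flat curve over 0 to 100 depth units.
true_d, true_v = [0.0, 100.0], [0.5, 0.5]
# Invented recovery: 100 dense samples (exact) then 10 sparse samples (off by 0.1).
rec_d = [k * 0.1 for k in range(100)] + [10.0 * k for k in range(1, 11)]
rec_v = [0.5] * 100 + [0.6] * 10

# Averaging over raw samples lets the densely drawn interval outvote the rest.
raw_mae = sum(abs(v - interp(d, true_d, true_v))
              for d, v in zip(rec_d, rec_v)) / len(rec_d)

# Averaging over an even depth grid weights each stretch by its length.
grid = even_grid(0.0, 100.0, 300)
grid_mae = sum(abs(interp(d, rec_d, rec_v) - interp(d, true_d, true_v))
               for d in grid) / len(grid)

print(f"raw-sample MAE {raw_mae:.4f}")
print(f"even-grid MAE  {grid_mae:.4f}")
```

The raw-sample average comes out near 0.009 because the well-drawn interval dominates the count, while the even-grid average comes out at 0.09, reflecting that nine tenths of the depth interval is off by 0.1.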

Limitations

These are three summaries of a residual cloud, and they inherit the limits of any summary. The specific values quoted, R-squared 0.9891 with MAE 0.0132 and MSE 0.0004 on the good curve and R-squared 0.5461 with MAE 0.1241 and MSE 0.0253 on the harder one, are real archive numbers, but they are point readings from particular recovered curves and do not transfer as constants to a different log, a different curve type, or a different operator's plotting conventions. The instrument's drawn curve geometry is illustrative shape only, chosen to make the disagreement visible; the reported metric values are the sourced ones. All three metrics are silent on where along the curve the errors fall, so a good MAE says nothing about whether the misses cluster in a zone that matters, and reporting all three still does not substitute for looking at the residuals against depth. The whole analysis assumes an aligned depth axis; if that assumption fails, none of these numbers can be trusted, and the failure will not always announce itself in the metric. Finally, agreeing with a ground-truth curve on held-out depths is not the same as being useful to the person who reads the log, which depends on whether the curve that scored well is the curve they needed.
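Since none of the three summaries says where the errors fall, a cheap companion check is to list the depths with the largest misses. A minimal sketch with invented data and a hypothetical helper name:

```python
def worst_depths(depths, recovered, truth, k=3):
    """Return the k (depth, residual) pairs with the largest absolute miss."""
    pairs = [(d, r - t) for d, r, t in zip(depths, recovered, truth)]
    pairs.sort(key=lambda p: abs(p[1]), reverse=True)
    return pairs[:k]


# Invented samples on a shared, aligned depth grid.
depths = [1000.0, 1000.5, 1001.0, 1001.5, 1002.0, 1002.5]
truth = [0.40, 0.42, 0.45, 0.50, 0.48, 0.44]
recovered = [0.41, 0.42, 0.44, 0.80, 0.43, 0.44]

for depth, miss in worst_depths(depths, recovered, truth, k=2):
    print(f"depth {depth:.1f}  residual {miss:+.2f}")
```

Reading the signed residual next to its depth shows whether the large misses cluster in one zone, which is the question the scalar summaries cannot answer.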

What the three numbers are for

The habit worth keeping is to refuse to let one metric stand in for the comparison. R-squared tells you whether the recovered curve has the right shape, MAE tells you how far off it typically is in the reader's own units, and MSE tells you whether a few large misses are hiding behind a calm average. On an easy curve they agree and the redundancy costs nothing. On a hard curve they disagree, and the pattern of their disagreement is a more precise description of what went wrong than any single number could be, provided the residuals were sampled evenly across an aligned depth axis in the first place. Three questions, three answers, reported together.

References

[1] Willmott, C. J., and Matsuura, K. Advantages of the Mean Absolute Error (MAE) over the Root Mean Square Error (RMSE) in Assessing Average Model Performance. Climate Research 30 (2005), pp. 79-82. Why a squared-error summary folds the variance of the error distribution into the headline, so MAE and a squared-error metric can disagree on the same residuals. https://www.int-res.com/abstracts/cr/v30/cr030079

[2] Chai, T., and Draxler, R. R. Root Mean Square Error (RMSE) or Mean Absolute Error (MAE)? Arguments Against Avoiding RMSE in the Literature. Geoscientific Model Development 7 (2014), pp. 1247-1250. The case that penalising large errors harder is correct behaviour when large errors are the ones that matter, framing the MAE-versus-MSE choice as a question of what you want noticed. https://gmd.copernicus.org/articles/7/1247/2014/

[3] Legates, D. R., and McCabe, G. J. Evaluating the Use of Goodness-of-Fit Measures in Hydrologic and Hydroclimatic Model Validation. Water Resources Research 35(1) (1999), pp. 233-241. Why correlation-based measures such as R-squared are insensitive to bias and can look high even when a model systematically misses, so they should not be reported alone. https://agupubs.onlinelibrary.wiley.com/doi/10.1029/1998WR900018
