When More Data Hurts: Class Imbalance Explained for Practitioners

There is a moment in most segmentation projects where someone reasonable asks for more data, and on most problems that request is right. Under extreme class imbalance it is not. When the thing you are trying to find occupies a sliver of every image, most of what a bigger dataset adds is more of the class the model already gets right, and almost none of it is more of the class the model is failing on. The average error goes down while the error that matters stays exactly where it was. This is a primer on that seesaw, aimed at practitioners who have watched a metric look healthy while the model quietly did nothing useful, grounded in one concrete case: digitising the curves off scanned raster well logs, where the background is roughly 97 percent of every frame.

The mechanics below are the field's, not our discovery. The systematic demonstration that imbalance degrades a classifier, and that the damage interacts with dataset size and problem difficulty rather than being a fixed penalty, is Japkowicz and Stephen's [1]. The organised map of remedies and the case against accuracy as the yardstick is He and Garcia's survey [2]. The argument that a precision-recall view exposes what a headline score conceals is Davis and Goadrich's [3], and the convolutional-network confirmation that the fixes are resampling and cost-sensitive weighting rather than more of the same data is Buda, Maki, and Mazurowski's [4]. What we add is a walked example with real numbers from our own runs.

Why a bigger dataset does not touch the problem

Start with the counterintuitive claim, because it is the reason this note exists. A pixel-wise loss optimises an average over pixels. On a scanned log the average is dominated by background, so the cheapest way to lower the loss is to predict background well and ignore the thin curves entirely. Adding more logs to the training set does not change that arithmetic. Each new log is also 97 percent background, so it pours more of the class the model already predicts correctly into the average and only a trickle of the curve pixels the model is starving for. The gradient the optimiser feels stays pointed the same way. Japkowicz and Stephen showed the more careful version of this: the harm from imbalance is not a constant, it grows with the complexity of the concept and interacts with how much data you have, so you cannot assume that scaling the set up dilutes the skew away [1]. It does not, because the skew is a property of each example, not of the count of examples.
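The arithmetic in the paragraph above can be checked in a few lines. This is a minimal sketch under stated assumptions: the log size, the 3 percent curve share, and the function names are hypothetical choices for illustration, not values or code from the runs described here.

```python
import math

def curve_share(num_logs, pixels_per_log=1_000_000, curve_fraction=0.03):
    """Fraction of all training pixels that are curve, for a set of num_logs logs.

    Assumption: every log has the same (hypothetical) size and curve share.
    """
    curve_pixels = num_logs * pixels_per_log * curve_fraction
    total_pixels = num_logs * pixels_per_log
    return curve_pixels / total_pixels

def all_background_bce(curve_fraction=0.03, p_curve=0.01):
    """Mean unweighted BCE for a model that outputs p_curve on every pixel.

    A small p_curve everywhere is the 'predict background, ignore the curves'
    shortcut: it pays heavily on the few curve pixels and almost nothing on
    the many background pixels.
    """
    loss_on_curve = -math.log(p_curve)
    loss_on_background = -math.log(1.0 - p_curve)
    return curve_fraction * loss_on_curve + (1.0 - curve_fraction) * loss_on_background

# The curve share, and therefore the average loss of the shortcut,
# is the same whether the set holds ten logs or ten thousand.
for n in (10, 100, 10_000):
    print(n, round(curve_share(n), 4), round(all_background_bce(), 4))

# For comparison: a model that hedges at 0.5 on every pixel pays log(2).
print("uninformed baseline:", round(math.log(2.0), 4))
```

In this toy setting the all-background shortcut already beats the uninformed baseline by a wide margin, and adding logs changes neither number, because the ratio is set inside each example.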

That is the sense in which more data hurts. It does not literally make the model worse, but it burns annotation budget and training time without moving the number that matters, and it lets a rising frame-averaged score persuade a team that things are improving when the minority class is exactly where it was. King and Zeng made the same point decades ago in statistics: for rare events the estimate is governed by the scarce class, and what recovers it is correction or weighting, not an ocean of additional common-class observations [5].

The average lies, so read per class

The first discipline is to stop trusting any single blended number. On our weighted raster-log run the per-mask F1 on the thin curves came in at 0.37, 0.26, and 0.55, three values that scatter across a wide band even within the one thing we cared about, while the background class, which fills 97 percent of every frame, is trivially easy and would score near the top. A frame-averaged number, weighted by pixel count, would have been dragged almost entirely toward that easy background and would have reported a healthy score for a model that had, for practical purposes, learned little about the curves. He and Garcia's survey is blunt that accuracy and its blended cousins are the wrong instruments under skew precisely because they let the majority class speak for the whole [2]. The fix is not clever; it is procedural. Always report per-class precision, recall, and F1, and treat the minority-class rows as the real scoreboard. The background row is a sanity check that the easy thing still works, nothing more.
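A per-class report of the kind described above takes only a few lines. The sketch below is illustrative: the 100-pixel frame and the helper names are hypothetical, and the prediction is the all-background shortcut rather than output from our model.

```python
def per_class_report(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2.0 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def pixel_accuracy(y_true, y_pred):
    """The blended number: share of pixels labelled correctly, regardless of class."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical frame: 97 background pixels (0) and 3 curve pixels (1).
y_true = [0] * 97 + [1] * 3
# The all-background shortcut: never predict curve.
y_pred = [0] * 100

print("pixel accuracy:", pixel_accuracy(y_true, y_pred))
print("background row:", per_class_report(y_true, y_pred, positive=0))
print("curve row:     ", per_class_report(y_true, y_pred, positive=1))
```

The blended accuracy and the background row both look healthy, while the curve row, the one that matters, is all zeros.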

Davis and Goadrich sharpen why the precision-recall pair, specifically, is the right lens [3]. Under heavy skew a model can look strong on a ROC curve, because the vast pool of true negatives keeps the false-positive rate small even when the model fires carelessly, while the precision-recall view, which never counts those true negatives, shows the failure plainly. For a curve one to three pixels wide against a near-empty frame, precision and recall are the two honest questions: of the pixels you called curve, how many were, and of the curve pixels that existed, how many did you catch.
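The contrast between the two views shows up directly in the confusion counts. The counts below are hypothetical, chosen only to illustrate the effect on a frame with 3 percent curve pixels; they are not logged values from our run.

```python
def rates(tp, fp, fn, tn):
    """The ROC axes (recall, false-positive rate) next to precision."""
    return {
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),  # denominator includes the true negatives
        "precision": tp / (tp + fp),            # true negatives never appear here
    }

# Hypothetical frame of 1,000,000 pixels: 30,000 curve, 970,000 background.
# The model catches 28,800 curve pixels and wrongly flags 97,000 background pixels.
counts = {"tp": 28_800, "fn": 1_200, "fp": 97_000, "tn": 873_000}

for name, value in rates(**counts).items():
    print(f"{name}: {value:.3f}")
```

A false-positive rate of one in ten looks tolerable on a ROC plot because it is measured against the huge pool of background, yet those same false positives outnumber the true curve pixels more than three to one, which is what precision reports.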

The correction, and the seesaw it sets off

Once you accept that the model has collapsed to the majority, the standard first move is cost-sensitive weighting: make mistakes on the minority class more expensive so the optimiser stops finding the all-background shortcut cheap [2] [4]. We reached for a weighted binary cross-entropy and pushed the positive-class weight to 42, which is roughly the inverse class frequency at 97 percent background and the natural starting point. It worked in the direction it was supposed to. Recall on the two curve masks climbed to 0.96 and 0.97. The model stopped ignoring the curves.
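To make the lever concrete, here is a minimal sketch of a weighted binary cross-entropy in plain Python. It is not the training code from the run, which the text does not show; the function names, the 100-pixel frame, and the constant prediction are assumptions for illustration. The weight convention used is negatives per positive, which is one common way to read "inverse class frequency".

```python
import math

def weighted_bce(y_true, p_pred, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy with errors on positive pixels scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def inverse_frequency_weight(foreground_share):
    """Negatives per positive: the weight that balances the two classes' total mass."""
    return (1.0 - foreground_share) / foreground_share

def implied_foreground_share(pos_weight):
    """Foreground share at which pos_weight equals negatives per positive."""
    return 1.0 / (1.0 + pos_weight)

# Hypothetical frame: 97 background pixels, 3 curve pixels,
# scored by a model that outputs 0.01 everywhere (the all-background shortcut).
y_true = [0] * 97 + [1] * 3
p_pred = [0.01] * 100

print("unweighted loss:", round(weighted_bce(y_true, p_pred), 4))
print("loss at w=42:   ", round(weighted_bce(y_true, p_pred, pos_weight=42.0), 4))
print("weight at 3% foreground:", round(inverse_frequency_weight(0.03), 1))
print("foreground share implied by w=42:", round(implied_foreground_share(42.0), 4))
```

Under this convention a weight of 42 corresponds to a foreground share a little under 3 percent, which is consistent with the "roughly 97 percent" background figure. The same shortcut that was cheap under the unweighted loss becomes expensive once the weight is applied.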

But a weight that large is a sharp instrument, and it sets off the tradeoff that is the whole subject of this note. Every curve pixel now screams, so the model fires generously to avoid the penalty, and generous firing produces false positives. Recall went up because the model catches nearly everything; precision went down because it also flags a lot that is not curve. We do not have to guess how far precision fell, because the run logged both recall and F1 and the two together pin it. F1 is the harmonic mean of precision and recall, so precision is what F1 and recall force it to be: writing P for precision and R for recall, rearranging F1 = 2PR / (P + R) gives P = (F1 × R) / (2R - F1). On the first curve mask, recall 0.96 with a logged F1 of 0.37 pins precision at about 0.23. On the second, recall 0.97 with a logged F1 of 0.26 pins precision near 0.15. And here is the part that surprises people the first time: the harmonic mean is not the arithmetic mean of the two; it is dragged toward whichever is smaller. So even with recall almost pegged at the top, the composite F1 sat down at 0.37 and 0.26, tracking the collapsed precision rather than the soaring recall. We paid a large weight, we transformed the model's behaviour, and the score that summarises it barely rose off the floor.
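The derivation above is easy to reproduce. The sketch below applies the harmonic-mean identity to the two logged (recall, F1) pairs and prints the arithmetic mean alongside, to show how far the harmonic mean sits below it; the function names are ours, introduced for illustration.

```python
def precision_from_f1_and_recall(f1, recall):
    """Invert F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denominator = 2.0 * recall - f1
    if denominator <= 0.0:
        raise ValueError("F1 and recall are inconsistent: need F1 < 2 * recall")
    return f1 * recall / denominator

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# The two logged (recall, F1) pairs from the w=42 run.
logged = [(0.96, 0.37), (0.97, 0.26)]

for recall, f1 in logged:
    precision = precision_from_f1_and_recall(f1, recall)
    arithmetic_mean = (precision + recall) / 2.0
    print(
        f"recall={recall:.2f}  F1={f1:.2f}  "
        f"precision={precision:.2f}  arithmetic mean={arithmetic_mean:.2f}"
    )
```

Because the logged inputs are rounded to two decimals, the derived precision carries the same rounding; treat it as accurate to about two decimals rather than as an exact measurement.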

Figure: The composite-metric trap under extreme foreground scarcity, made physical. On a scanned well log about 97 percent of pixels are background, so the two thin curve classes are a sliver of every frame. The single lever is the positive-class weight w handed to a weighted binary cross-entropy. As you drag w up the model fires more generously on the curve: recall climbs toward its ceiling while precision collapses the other way, and because F1 is the harmonic mean, it is dragged toward whichever component is weaker. The plank tilts, and the orange F1 read-out is the element that argues, tracing a hump: it rises while a little weight rescues recall, then peaks and falls as heavier weight lets precision collapse faster than recall gains. So there is a moderate weight that maximises F1 and a heavier weight that quietly makes it worse. Every figure comes from one run, the engagement's own weighted-BCE at w=42, roughly the inverse class frequency. Two quantities were logged directly on the curve masks: recall 0.96 and 0.97, and F1 0.37 and 0.26. Precision is not separately logged; it is fixed once recall and F1 are, through the harmonic-mean identity, at about 0.23 and 0.15 for that same run, so nothing is borrowed from another loss function. The read-out at w=42 shows F1 0.37, the sourced value, already down the far side of the hump while recall is 0.96. The recall, precision, and F1 paths across w are illustrative geometry pinned to that sourced point, drawn to show the direction of the tradeoff, not a logged per-weight sweep.

The instrument above is that seesaw made physical. Drag the class weight from one and watch the plank tilt: the recall end climbs toward its ceiling as the weight rises, the precision end sinks the opposite way, and the F1 read-out on the fulcrum, the only element in orange, traces a hump. It rises at first, when a little weight rescues recall while precision is still healthy, then peaks and falls back as the weight is pushed further and precision collapses faster than recall can gain. The vertical marker at 42 is our sourced operating point, and by that weight the F1 has already slid down the far side of the hump to 0.37, dragged there by a precision near 0.23 even though recall is 0.96. The lesson it argues is not that weighting is useless. It is that weighting relocates the model along a precision-recall frontier, and because F1 follows the weaker of the two components, there is a moderate weight that maximises it and a heavier weight that quietly makes it worse. If F1 has stopped rising as you crank the weight, more weight is not the tool you need next.
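The hump can be reproduced qualitatively with a toy model. Everything in the sketch below is an assumption made for illustration and none of it comes from our logs: curve and background logits are taken to be equal-variance Gaussians with hypothetical means, the foreground share is set to 3 percent, and the model is idealised so that a positive-class weight w simply adds log(w) to every logit, which is what the optimum of a weighted cross-entropy does to a calibrated score. The resulting numbers are therefore not the run's numbers; only the shape is the point.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def toy_operating_point(w, prior=0.03, mu_curve=0.0, mu_background=-5.0, sigma=1.5):
    """Precision, recall, and F1 of the toy model at positive-class weight w.

    Hypothetical setup: logits are Gaussian per class, and the weight shifts
    every logit up by log(w), so 'predict curve' means logit > -log(w).
    """
    threshold = -math.log(w)
    recall = 1.0 - normal_cdf((threshold - mu_curve) / sigma)
    false_positive_rate = 1.0 - normal_cdf((threshold - mu_background) / sigma)
    true_positive_mass = prior * recall
    false_positive_mass = (1.0 - prior) * false_positive_rate
    precision = true_positive_mass / (true_positive_mass + false_positive_mass)
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1

# Sweep the weight and watch recall rise, precision fall, and F1 trace a hump.
for w in (1, 2, 4, 8, 16, 42):
    precision, recall, f1 = toy_operating_point(w)
    print(f"w={w:>2}  precision={precision:.2f}  recall={recall:.2f}  F1={f1:.2f}")
```

In this toy setting F1 improves at moderate weights and then falls well below its starting value by w=42, while recall keeps climbing throughout. Where the peak sits depends entirely on the assumed distributions, so the sketch says nothing about where it would sit for a real model.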

What actually moves the frontier

If weighting only slides you along the curve, the practical question is what lifts the curve, and the literature is consistent that the answers live on the data and objective side, not in a bigger pile of the same distribution [2] [4]. Resampling changes what the optimiser sees per step: oversampling the minority class, or cropping to tiles where the curve fills a larger share of the frame, both hand the loss a gentler imbalance to start from, which Buda and colleagues found among the more reliable single remedies for convolutional networks [4]. Overlap-aware losses attack the same problem from the objective side, scoring intersection rather than per-pixel correctness so the minority class is not drowned in the average. Each of these can genuinely improve the minority-class F1 rather than trading one component for another, which is precisely what a large class weight, on its own, cannot do. Weighting is the diagnostic that tells you the model has stopped ignoring the curves; it is rarely the thing that makes the curves good.
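Two of the levers named above can be sketched in plain Python. Both are illustrative rather than the pipeline from the engagement: the tiny mask, the tile size, and the share threshold are hypothetical, and soft Dice is our choice of example for an overlap-aware loss, since the text does not name a specific one.

```python
def foreground_share(mask):
    """Fraction of pixels equal to 1 in a mask given as a list of rows."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total

def tiles_with_curve(mask, tile, min_share):
    """Non-overlapping tile x tile crops whose curve share is at least min_share.

    Returns (top, left, share) for each kept tile.
    """
    height, width = len(mask), len(mask[0])
    kept = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            crop = [row[left:left + tile] for row in mask[top:top + tile]]
            share = foreground_share(crop)
            if share >= min_share:
                kept.append((top, left, share))
    return kept

def soft_dice_loss(y_true, p_pred, eps=1e-7):
    """1 - Dice overlap on flat lists; background pixels add nothing to the numerator."""
    intersection = sum(y * p for y, p in zip(y_true, p_pred))
    return 1.0 - (2.0 * intersection + eps) / (sum(y_true) + sum(p_pred) + eps)

# Hypothetical 8 x 32 mask with a one-pixel-wide vertical curve in column 5.
mask = [[1 if col == 5 else 0 for col in range(32)] for _ in range(8)]
flat = [float(v) for row in mask for v in row]

print("whole-frame curve share:", foreground_share(mask))
print("kept tiles:", tiles_with_curve(mask, tile=8, min_share=0.05))
print("Dice loss, all-background prediction:", round(soft_dice_loss(flat, [0.0] * len(flat)), 4))
print("Dice loss, perfect prediction:", round(soft_dice_loss(flat, flat), 4))
```

Cropping to the tile that contains the curve raises its share of the pixels the loss sees, and the Dice loss gives the all-background shortcut close to the worst possible score instead of a comfortable one.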

The through-line for a practitioner is a short checklist. Do not assume more data helps until you have confirmed the imbalance is a count problem rather than a per-example one; under foreground scarcity it is almost always the latter. Never read a blended score; read the minority-class row. Use precision and recall as the pair, and expect a large weight to swap them rather than improve both. Watch F1 as the honest witness: it follows the weaker component, so it peaks at a moderate weight and then falls, and if it has stopped rising the frontier has not moved and the next lever belongs to sampling, tiling, or the loss, not the weight.

Limitations

This is a teaching note built on one engagement's numbers, and the numbers are a snapshot, not a sweep. The figures come from a single run, a weighted binary cross-entropy at class weight 42, and they are all from that one run rather than stitched across configurations. Two quantities are directly logged: the curve recall of 0.96 and 0.97, and the curve F1 of 0.37 and 0.26. The class weight of 42 and the roughly 97 percent background share are logged too. The precision figures of about 0.23 and 0.15 are not separately logged; they are derived from the recall and the F1 through the harmonic-mean identity, which determines precision uniquely once recall and F1 are fixed, so they describe the same run and are not borrowed from another loss function. The recall, precision, and F1 paths the instrument draws across the full range of weights are illustrative geometry pinned to that single measured point at weight 42, not a logged per-weight experiment, so read them as the shape of the tradeoff rather than a sweep we ran. The precise curvature, and where exactly the F1 hump peaks, depends on the model, the dataset, and the loss, and ours would differ from another operator's logs or another curve count. The claim that more data does not help is specific to per-example imbalance under a pixel-wise objective; where the minority is genuinely scarce in count rather than in every frame, collecting more of it is exactly right. And F1 is one summary among several; a task that cares far more about recall than precision, or the reverse, should weight the two accordingly rather than reading their harmonic mean.
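One common way to weight precision and recall unequally is the F-beta score, which we name here as an example; the text above does not prescribe a specific metric. The sketch below evaluates it at the first curve mask's operating point, using the derived precision of about 0.23 and the logged recall of 0.96.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favours recall, beta < 1 favours precision.

    beta = 1 recovers the ordinary F1 (harmonic mean).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_squared = beta * beta
    return (1.0 + beta_squared) * precision * recall / (beta_squared * precision + recall)

# First curve mask at w=42: derived precision (rounded) and logged recall.
precision, recall = 0.23, 0.96

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {f_beta(precision, recall, beta):.2f}")
```

The same operating point scores very differently depending on which error the task can tolerate, which is the reason to choose the summary deliberately rather than default to F1.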

The one habit worth keeping

The habit this leaves you with is to treat imbalance as a question about each example before it is a question about the dataset size. When the target is a sliver of every frame, the majority class owns the average, more data of the same kind mostly feeds that average, and the standard weighting fix moves you along a precision-recall frontier rather than out to a better one. Read the minority row, watch F1 as the honest witness that peaks and then falls as the weight climbs, and spend the next effort on the levers that lift the frontier instead of the one that only tilts the plank.

References

[1] Japkowicz, N., and Stephen, S. The Class Imbalance Problem: A Systematic Study. Intelligent Data Analysis 6(5) (2002), pp. 429-449. The controlled demonstration that imbalance degrades a classifier and that the harm interacts with training-set size and concept complexity. https://content.iospress.com/articles/intelligent-data-analysis/ida00103

[2] He, H., and Garcia, E. A. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21(9) (2009), pp. 1263-1284. The survey of resampling, cost-sensitive weighting, and the case against accuracy as the metric under skew. https://ieeexplore.ieee.org/document/5128907

[3] Davis, J., and Goadrich, M. The Relationship Between Precision-Recall and ROC Curves. Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), pp. 233-240. Why the precision-recall view exposes failures that a ROC curve hides under heavy class skew. https://dl.acm.org/doi/10.1145/1143844.1143874

[4] Buda, M., Maki, A., and Mazurowski, M. A. A Systematic Study of the Class Imbalance Problem in Convolutional Neural Networks. Neural Networks 106 (2018), pp. 249-259. The CNN-era confirmation that imbalance harms deep models and that resampling and weighting, not more same-distribution data, are the practical remedies. https://www.sciencedirect.com/science/article/pii/S0893608018302107

[5] King, G., and Zeng, L. Logistic Regression in Rare Events Data. Political Analysis 9(2) (2001), pp. 137-163. The statement that rare-event estimation is governed by the scarce class and that correction or weighting recovers minority-class signal, not additional majority-class data. https://gking.harvard.edu/files/0s.pdf
