Active Learning for Pixel-Level Tasks: A Decade of Acquisition Functions

Abstract

Active learning answers one question: which unlabeled example would teach the model the most if a human labeled it? It answers with an acquisition function that scores every example in a pool. The whole apparatus was designed for whole-example tasks, where each item carries a single label, and it works beautifully there. Dense prediction breaks the assumption: a segmentation model labels every pixel, the acquisition signal is now a field over the image rather than a scalar over the example, and on a thin-structure task the field is overwhelmingly flat because almost every pixel is trivially background. This survey traces acquisition design from the uncertainty rule of the mid-1990s, through margin and entropy, into the decade in which those scores met deep networks: the mutual-information rule abbreviated BALD and the core-set reframing that made diversity a first-class criterion. It credits each idea to where it was first shown. It then asks the question a practitioner faces and the bibliography rarely does: which of these acquisition functions survive being averaged over an image that is 97 percent background? We ground the question in a measured regime from our own VeerNet well-log pipeline, a three-class segmenter whose curve-1 and curve-2 F1 sit at 0.37 and 0.32 across a scale-up from 2,000 to 15,000 training instances, and we read the acquisition literature against that regime rather than against the clean classification benchmarks it was tuned on. The finding is consistent across the families: an acquisition function is a theory of where the model is wrong, and a dense imbalanced task punishes any theory that lets the empty majority of the image dilute the signal.

A decade of acquisition, briefly and with credit

Pool-based active learning assumes a large pool of unlabeled examples and a budget that buys only a few labels, and it spends that budget on whatever an acquisition function ranks highest. The reference that maps the whole family, with its theory and its taxonomy, is the survey by Settles, and we lean on it rather than re-deriving the catalogue (Settles, 2009). The oldest member of the family, and still the most used, is uncertainty sampling, introduced by Lewis and Gale for text classifiers: label the examples the current model is least confident about, on the reasoning that a confident prediction teaches little and an uncertain one sits near a decision boundary where a label moves the most (Lewis and Gale, 1994).

Uncertainty has three classic readings, and they are three of the five rules the instrument below ranks. Least-confident scores an example by one minus the probability of its top predicted class. Margin sampling, from Scheffer and colleagues, scores it by the gap between the top two class probabilities, with a smaller gap ranking higher, so a near-tie is queried first (Scheffer et al., 2001). Entropy sampling scores it by the Shannon entropy of the full predictive distribution, which takes every class probability into account rather than only the top one or two and is the right choice when more than two classes are genuinely in contention (Shannon, 1948). The three agree on the easy cases and disagree on the interesting ones, which is why all three are still in use.
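The three readings can be written down in a few lines. The sketch below is a minimal NumPy illustration, not code from the VeerNet pipeline; the function names and the three example probability rows are hypothetical, chosen only to show a case where the rules agree and a case where they do not.

```python
import numpy as np

def least_confident(probs):
    """One minus the top class probability; higher means more uncertain."""
    return 1.0 - probs.max(axis=-1)

def margin_score(probs):
    """Negated gap between the two largest probabilities, so a near-tie scores highest."""
    top2 = np.sort(probs, axis=-1)[..., -2:]
    return -(top2[..., 1] - top2[..., 0])

def entropy_score(probs, eps=1e-12):
    """Shannon entropy (in nats) of the full predictive distribution."""
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

# Hypothetical 3-class predictions; each row sums to 1.
probs = np.array([
    [0.98, 0.01, 0.01],  # confident: every rule ranks this last
    [0.50, 0.49, 0.01],  # two-way near-tie: margin ranks this first
    [0.40, 0.30, 0.30],  # three-way contention: entropy and least-confident rank this first
])

rules = [("least-confident", least_confident),
         ("margin", margin_score),
         ("entropy", entropy_score)]
for name, rule in rules:
    # Indices sorted from most to least worth querying under each rule.
    print(name, np.argsort(-rule(probs)))
```

All three rules put the confident row last. They split on the other two: margin prefers the two-way near-tie, while entropy and least-confident prefer the row where all three classes contend.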

The decade that matters for dense prediction begins when these scores meet deep networks. A network does not hand you a calibrated posterior, so the information-theoretic acquisitions need a way to estimate uncertainty. The mutual-information criterion that scores an example by how much labeling it would reduce uncertainty about the model parameters, rather than about the prediction itself, was proposed by Houlsby and colleagues and is now universally abbreviated BALD (Houlsby et al., 2011). Gal and colleagues then made it practical for image models by estimating the required predictive uncertainty with Monte-Carlo dropout, which let BALD and its relatives run on ordinary convolutional networks that have no explicit posterior (Gal et al., 2017). The distinction BALD draws, between what the model does not know and what is simply ambiguous in the data, is the one that matters most on a noisy boundary, because a fuzzy curve edge is ambiguous to everyone and labeling it again teaches the model nothing about its own ignorance.
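The distinction between model uncertainty and data ambiguity is easiest to see in the estimator itself. The sketch below assumes the stochastic forward passes have already been run and stacked into an array; `bald_score` and the toy `mc_probs` values are hypothetical names and numbers for illustration, not output from any real network.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) along the last axis."""
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def bald_score(mc_probs):
    """Mutual-information estimate from stochastic forward passes.

    mc_probs has shape (T, N, C): T dropout passes, N items, C classes.
    Score = entropy of the mean prediction minus the mean per-pass entropy.
    """
    predictive_entropy = entropy(mc_probs.mean(axis=0))
    expected_entropy = entropy(mc_probs).mean(axis=0)
    return predictive_entropy - expected_entropy

# Toy example with T=4 passes, N=2 items, C=2 classes (made-up values).
# Item 0: every pass says 50/50, so the passes agree that the data is ambiguous.
# Item 1: the passes are individually confident but contradict each other.
mc_probs = np.array([
    [[0.5, 0.5], [0.95, 0.05]],
    [[0.5, 0.5], [0.05, 0.95]],
    [[0.5, 0.5], [0.95, 0.05]],
    [[0.5, 0.5], [0.05, 0.95]],
])
scores = bald_score(mc_probs)
print(scores)
```

Both items have the same mean prediction and therefore the same predictive entropy, so entropy sampling cannot tell them apart. The mutual-information score is near zero for the item the passes agree is ambiguous and clearly positive for the item on which they disagree.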

The last move of the decade is the one that breaks with pure uncertainty. Sener and Savarese pointed out that filling a batch with the most-uncertain examples returns near-duplicates, so a batch of uncertain queries is mostly the same query, and they reframed acquisition as a core-set problem that explicitly rewards diversity over the feature space rather than uncertainty over the prediction (Sener and Savarese, 2018). That reframing is what the arena below is built to show: as the labeling budget grows, the diversity-aware rule stops wasting the batch on redundancy that the uncertainty rules keep buying.

When the example becomes a field of pixels

Everything above scores a whole example. Dense prediction does not have whole examples in the same sense, because the unit of supervision is the pixel and a single image carries millions of them. The naive port of an acquisition function to segmentation reduces the per-pixel score to one number per image, usually by averaging, and that average is where the trouble starts. On a thin-curve log the foreground is one to three pixels wide threading down a tall scan, so the overwhelming majority of pixels are confidently classified background. Averaging the per-pixel uncertainty over such an image drowns the handful of genuinely informative boundary pixels under a sea of near-zero scores, and two images with very different boundary difficulty can post nearly identical mean uncertainty. The acquisition signal the literature relies on is, for this task, almost entirely a measurement of how much background an image contains.
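The dilution is easy to reproduce on synthetic maps. The sketch below builds two hypothetical per-pixel entropy maps, each 97 percent background, and compares mean pooling with a pooling over only the highest-scoring pixels. The entropy values are made up to illustrate the effect, and the top-fraction pooling is our own illustrative alternative, not a rule taken from the surveyed papers.

```python
import numpy as np

def synthetic_entropy_map(height, width, curve_width, bg_entropy, curve_entropy):
    """Hypothetical entropy map: flat background plus one vertical curve band."""
    score_map = np.full((height, width), bg_entropy)
    score_map[:, :curve_width] = curve_entropy
    return score_map

def mean_pool(score_map):
    """Naive image-level score: average over every pixel."""
    return float(score_map.mean())

def top_fraction_pool(score_map, fraction=0.01):
    """Average over only the highest-scoring fraction of pixels."""
    flat = np.sort(score_map.ravel())
    k = max(1, int(round(fraction * flat.size)))
    return float(flat[-k:].mean())

# Both images: 1000 x 100 pixels with a 3-pixel-wide curve (97 percent background).
hard = synthetic_entropy_map(1000, 100, 3, bg_entropy=0.010, curve_entropy=0.9)
easy = synthetic_entropy_map(1000, 100, 3, bg_entropy=0.032, curve_entropy=0.2)

print("mean pooling:", mean_pool(hard), mean_pool(easy))
print("top-1% pooling:", top_fraction_pool(hard), top_fraction_pool(easy))
```

A slightly noisier background on the easy image is enough to make the two means almost equal, even though the curve pixels differ sharply in difficulty. Pooling over the top-scoring pixels keeps the two images apart.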

The dense-prediction branch of the literature met this directly. Suggestive annotation kept the uncertainty idea but paired it with a representativeness term, querying examples that are both uncertain and typical of the unlabeled pool, so that the budget is not spent on rare outliers that happen to confuse the model (Yang et al., 2017). That pairing is the dense analogue of combining uncertainty with the core-set diversity criterion, arrived at independently in the medical-imaging setting. A later line went further and abandoned whole-image scoring altogether, learning a policy that selects the most informative regions of an image to label rather than whole images, on the argument that the right unit of acquisition for a dense task is a region, not an example (Casanova et al., 2020). Both are answers to the same problem: the example-level acquisition function has the wrong granularity for a task whose informative signal is local and sparse.
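A change of granularity can be illustrated without any learned policy. The sketch below scores fixed, non-overlapping tiles of a per-pixel uncertainty map and returns the highest-scoring ones. It is a deliberately simple heuristic of our own for illustration: the region-selection work cited above learns its policy, which this does not attempt, and `region_scores`, `top_regions`, and the toy map are hypothetical.

```python
import numpy as np

def region_scores(score_map, tile):
    """Mean per-pixel score inside each non-overlapping tile x tile region.

    Trailing rows and columns that do not fill a whole tile are dropped.
    """
    height, width = score_map.shape
    rows, cols = height // tile, width // tile
    trimmed = score_map[:rows * tile, :cols * tile]
    return trimmed.reshape(rows, tile, cols, tile).mean(axis=(1, 3))

def top_regions(score_map, tile, k):
    """(row, col) tile indices of the k highest-scoring regions."""
    scores = region_scores(score_map, tile)
    order = np.argsort(scores.ravel())[::-1][:k]
    return [tuple(int(v) for v in np.unravel_index(i, scores.shape)) for i in order]

# Toy 8 x 8 uncertainty map with a single uncertain 2 x 2 patch.
toy_map = np.zeros((8, 8))
toy_map[2:4, 4:6] = 1.0
print(top_regions(toy_map, tile=2, k=1))
```

The query unit here is a tile rather than an image, so an annotator would be asked to label only the patch where the score concentrates instead of the whole scan.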

Method

We surveyed the public acquisition-function literature published through the fourth quarter of 2022 and organised it along the axis a practitioner actually has to reason about for a dense task: how each rule behaves when the per-pixel score is dominated by a large, easy, confidently-classified majority class. For each rule we recorded the assumption it makes about where the model is wrong, the regime in which it was originally demonstrated, and the failure mode it incurs when that regime is replaced by a 97 percent background image. We credit each idea at its origin rather than at its later popularisation, since the point of a survey is to draw the lineage correctly.

To keep the survey honest we anchored it to a measured regime rather than a purely bibliographic ranking. The regime is the multiclass segmentation stage of our own VeerNet pipeline, the encoder-decoder convolutional network with a transformer attention stage on the bottleneck that we built to read curves off raster well-log images. That stage trains three classes, background and two curves, on a dataset that scaled from a 2,000-instance binary base to 15,000 multiclass instances, and it reports per-class F1 of 0.97 for background, 0.37 for curve 1, and 0.32 for curve 2. Those three numbers fix the shape of the acquisition problem precisely: the class the model has already mastered is 97 percent of the pixels, and the two classes worth any further annotation are exactly the two whose F1 is stuck in the thirties.

The survey instrument ranks the five acquisition functions by a labeling efficiency, the share of newly queried pixels that carry foreground signal rather than redundant easy background, swept across the budget from the 2,000-instance base to the 15,000-instance scale-up. Those budget endpoints and the per-class F1 are the sourced quantities. The per-function efficiency response across the budget axis is illustrative, encoding the documented behaviour of each rule under heavy redundancy rather than a re-run benchmark, and the instrument flags that on its own canvas. The right panel allocates the next batch of annotation across the three masks by reading difficulty against saturation off the measured F1, which is the arena's instantiation on our regime rather than a logged query trace.
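The efficiency measure can be made concrete as a ratio of pixel counts. The sketch below is one possible reading of it, assuming that a ground-truth foreground mask is an acceptable stand-in for "carries foreground signal"; the function name and the toy masks are hypothetical and do not reproduce the instrument's traces.

```python
import numpy as np

def labeling_efficiency(queried_mask, foreground_mask):
    """Share of queried pixels that are foreground; 0.0 if nothing was queried."""
    queried = np.asarray(queried_mask, dtype=bool)
    foreground = np.asarray(foreground_mask, dtype=bool)
    n_queried = int(queried.sum())
    if n_queried == 0:
        return 0.0
    return float((queried & foreground).sum() / n_queried)

# Toy 10 x 10 image whose foreground is the first column only.
foreground = np.zeros((10, 10), dtype=bool)
foreground[:, 0] = True

whole_image = np.ones((10, 10), dtype=bool)    # query every pixel
narrow_band = np.zeros((10, 10), dtype=bool)   # query the curve plus one column
narrow_band[:, :2] = True

print(labeling_efficiency(whole_image, foreground))
print(labeling_efficiency(narrow_band, foreground))
```

Querying the whole toy image scores 0.1, while querying a band around the curve scores 0.5, which is the sense in which a rule that avoids easy background spends its budget more efficiently.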

Results

The five rules sort into three behaviours under the dense imbalanced regime, and the measured F1 pins where each one would spend the next label.

The three uncertainty rules, least-confident, margin, and entropy, share a failure mode that the survey literature already names for the batch setting and that imbalance only sharpens. Because they score by the current model's per-pixel doubt, and because the boundary band of a thin curve is the only place that doubt concentrates, they query that same band repeatedly. In a batch this returns near-duplicates, so the marginal information of the second, third, and tenth boundary query collapses even though each one still looks maximally uncertain to the rule (Settles, 2009). Entropy is the strongest of the three on our task because all three classes can genuinely contend at a curve crossing, which is exactly the situation entropy was built to read, but it inherits the redundancy of its siblings once the budget grows.

BALD behaves better for a reason specific to a noisy dense task. Its mutual-information score separates the uncertainty that comes from the model not knowing the answer from the uncertainty that comes from the data being genuinely ambiguous, and a fuzzy rasterised curve edge is ambiguous to any labeler (Houlsby et al., 2011). A rule that refuses to spend the budget on irreducible boundary ambiguity, and spends it instead where the model's own parameters are unsettled, holds its labeling efficiency longer as the budget grows, which is what the arena shows past the mid-range of the sweep. The dropout machinery that makes this estimable on a convolutional segmenter is the contribution that carried the idea from theory into image practice (Gal et al., 2017).

The core-set rule is the one that holds the most efficiency at the largest budget, because it optimises for coverage of the feature space rather than for per-pixel doubt, so a batch is forced to be diverse by construction and cannot collapse onto the boundary band (Sener and Savarese, 2018). The dense-segmentation literature reached the same conclusion from the other direction by pairing uncertainty with representativeness, which is diversity under another name (Yang et al., 2017). The arena makes the trade-off operable: drag the budget and watch the diversity-aware and information-theoretic rules outlast the pure-uncertainty rules as redundancy grows.

Figure: an arena that ranks the published pool-based acquisition functions by how efficiently each spends a labeling budget on a dense pixel-level task whose pixels are 97 percent trivially background. The five teal traces are least-confident sampling, margin sampling, entropy sampling, BALD, and the core-set diversity rule. Drag the budget slider along the bottom track from the small binary base of 2,000 instances up to the multiclass scale-up of 15,000, and each trace reports a labeling efficiency, the share of newly queried pixels that carry foreground signal rather than redundant easy background the model already knows. The leading function at the current budget is drawn with the heavier teal stroke; as the budget grows the diversity-aware and information-theoretic rules outlast pure uncertainty, which keeps querying the same near-duplicate band. The right-hand panel instantiates the arena on our own regime: it shows where the leader would spend its next annotation across the three masks, with each bar sized by difficulty over saturation and pinned to the measured per-class F1 (background 0.97, curve 1 0.37, curve 2 0.32). The single scarce orange note is curve 2, the hardest residual at F1 0.32, the target the winning acquisition rule is steered toward. The F1 figures, the 2,000-to-15,000 budget endpoints, and the 97 percent background reading are sourced from the engagement archive; the per-function efficiency response across the budget axis and the next-batch allocation are illustrative of each acquisition rule's documented behaviour under heavy redundancy, not a re-run benchmark.

The right panel is where the survey meets our numbers. Background is 97 percent of the pixels and already at F1 0.97, so it is both saturated and trivial, and a sane acquisition rule sends almost none of the next batch there. Curve 1 at F1 0.37 and curve 2 at F1 0.32 are the residual, and curve 2, the harder of the two, earns the largest slice of the next annotation precisely because it is where the model is most wrong and least redundant. That allocation is the whole argument in one reading: on this task the only labels worth buying are the labels on the two thin classes that together occupy a sliver of the image, and the acquisition functions that win are the ones whose granularity and diversity criteria stop the empty majority from spending the budget for them.
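One way to turn "difficulty over saturation" into numbers is sketched below. The exact form the instrument uses is not stated here, so the sketch assumes difficulty is one minus F1 and saturation is F1 itself; that ratio, and the `allocate` helper, are our assumptions for illustration. Only the three F1 values are taken from the measured regime.

```python
# Measured per-class F1 from the regime described above.
f1 = {"background": 0.97, "curve 1": 0.37, "curve 2": 0.32}

def allocate(f1_by_class):
    """Split the next batch in proportion to (1 - F1) / F1 per class.

    Assumption: difficulty = 1 - F1 and saturation = F1. This is one plausible
    reading of "difficulty over saturation", not the instrument's definition.
    """
    weights = {}
    for name, score in f1_by_class.items():
        if not 0.0 < score <= 1.0:
            raise ValueError(f"F1 for {name!r} must be in (0, 1]")
        weights[name] = (1.0 - score) / score
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

shares = allocate(f1)
for name, share in shares.items():
    print(f"{name}: {share:.3f}")
```

Under this assumed form, background receives about one percent of the batch and curve 2 receives the largest share, which matches the qualitative reading above. The instrument's own bars may be computed differently.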

Acquisition function	Origin	Behaviour on the 97% background regime
Least-confident sampling	Lewis and Gale, 1994 [2]	Redundant; re-queries the same boundary band
Margin sampling	Scheffer et al., 2001 [3]	Slightly more selective; same redundancy
Entropy sampling	Shannon, 1948 [4]; Settles, 2009 [1]	Strongest uncertainty rule at class crossings
BALD (MC-dropout)	Houlsby et al., 2011 [5]; Gal et al., 2017 [6]	Skips irreducible edge ambiguity; holds efficiency
Core-set diversity	Sener and Savarese, 2018 [7]	Resists batch redundancy; best at large budgets

Discussion

The decade reads as a steady migration of the acquisition question away from the prediction and toward the model and the data distribution, and a dense imbalanced task is what makes the reason for that migration concrete. Uncertainty sampling asks where the prediction is shaky, which is the cheapest question and the one most easily fooled by an image that is mostly background or by a boundary that is irreducibly fuzzy. BALD asks where the model itself is unsettled, which costs an uncertainty estimate but refuses to be drawn onto ambiguity no label can resolve. The core-set rule asks where the labeled set fails to cover the data, which costs a feature-space computation but cannot be tricked into buying the same query ten times. Each step up that ladder buys robustness to a specific way the cheap rule fails, and every one of those failures is amplified, not invented, by 97 percent background.

Where our own pipeline sits in this map is at the friendly end for one reason and the hard end for another. The friendly part is that our training masks come from a procedural renderer, so we can synthesise as much of the easy class as we like for nearly nothing, and the active-learning question for synthetic supply is not which pixels to label but which real scans to validate against, a different problem we treat elsewhere. The hard part is that the two classes worth annotating are precisely the two whose F1 the entire engagement could not push past the thirties, which means the informative pixels are not only rare but intrinsically difficult, the worst case for any acquisition rule that assumes a label, once bought, settles the question. On a task where curve 1 and curve 2 sit at 0.37 and 0.32, the honest reading of the survey is that acquisition can decide where to spend the next label but cannot manufacture a clean label out of an ambiguous edge, and the literature that helps most is the branch that knows the difference.

The wider point the survey kept surfacing is about granularity. The example-level acquisition function is a comfortable abstraction inherited from classification, and dense prediction has spent the decade discovering that the comfortable abstraction is the wrong one. Whether the field reaches the region-level policies that learn what to label (Casanova et al., 2020) or stays with image-level scores paired with a diversity term, the direction of travel is toward matching the unit of acquisition to the unit of supervision, and on a thin-curve task that unit is a few pixels of boundary, not a whole scan.

Limitations

This is a survey, and it carries its sources' assumptions forward rather than re-running them. The ranking of acquisition behaviours under heavy imbalance is our reading of the public literature through the fourth quarter of 2022, organised for a practitioner choosing a rule for a dense imbalanced task, not a controlled head-to-head benchmark under a fixed budget and a fixed segmenter. The per-function efficiency traces and the next-batch allocation in the instrument are illustrative by design: the efficiency curves encode how each rule is documented to behave as redundancy grows, not measured degradation, and the allocation reads difficulty against saturation off the F1 rather than replaying a logged query history. Only the per-class F1 of 0.97, 0.37, and 0.32, the 2,000-to-15,000 budget endpoints, and the 97 percent background reading are sourced, and even those are a single pipeline on one dataset, so they fix the regime the survey is read against rather than prove a general rate. The synthetic-supply framing that makes our own acquisition question unusual, abundant cheap labels from a renderer against a few expensive real validations, is exactly what limits how far our experience transfers: a team labeling real scans from scratch faces the textbook pool-based problem the surveyed rules were built for, and the comfort of cheap foreground that we describe does not carry over to it.

References

[1] B. Settles. Active Learning Literature Survey. University of Wisconsin-Madison, Computer Sciences Technical Report 1648, 2009. https://minds.wisconsin.edu/handle/1793/60660

[2] D. D. Lewis, W. A. Gale. A Sequential Algorithm for Training Text Classifiers. SIGIR 1994. https://arxiv.org/abs/cmp-lg/9407020

[3] T. Scheffer, C. Decomain, S. Wrobel. Active Hidden Markov Models for Information Extraction. IDA 2001. https://link.springer.com/chapter/10.1007/3-540-44816-0_31

[4] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 27, 1948. https://ieeexplore.ieee.org/document/6773024

[5] N. Houlsby, F. Huszar, Z. Ghahramani, M. Lengyel. Bayesian Active Learning for Classification and Preference Learning. arXiv 2011. https://arxiv.org/abs/1112.5745

[6] Y. Gal, R. Islam, Z. Ghahramani. Deep Bayesian Active Learning with Image Data. ICML 2017. https://arxiv.org/abs/1703.02910

[7] O. Sener, S. Savarese. Active Learning for Convolutional Neural Networks: A Core-Set Approach. ICLR 2018. https://arxiv.org/abs/1708.00489

[8] L. Yang, Y. Zhang, J. Chen, S. Zhang, D. Z. Chen. Suggestive Annotation: A Deep Active Learning Framework for Biomedical Image Segmentation. MICCAI 2017. https://arxiv.org/abs/1706.04737

[9] A. Casanova, P. O. Pinheiro, N. Rostamzadeh, C. J. Pal. Reinforced Active Learning for Image Segmentation. ICLR 2020. https://arxiv.org/abs/2002.06583

[10] O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. https://arxiv.org/abs/1505.04597
