Few-Shot and Small-Data Vision: A Reading List for 2023

A practitioner's reading list for the label-scarce regime, ordered by the one thing that decides whether the reading pays off: realised leverage on a real pipeline. The short version is uncomfortable for anyone who came for the exotic few-shot tricks. On raster well-log digitisation, where hand-drawn per-pixel masks are exactly the labels nobody wants to draw, the payoff did not come from a clever k-shot method. It came from manufacturing data and reusing representation: procedural synthesis scaled from 2,000 to 15,000 to 20,000 instances, an autoencoder warm-start, and the plain discipline of an honest 80/20 split, run through a batch-size-1, memory-constrained pipeline to a peak held-out R-squared of 0.9891 and a lowest MAE of 0.0132. This is the list we would hand a team starting from a few hundred labels, arranged so the highest-leverage reading is not buried under the most-cited. It is a companion to, not a repeat of, our survey of few-shot segmentation method families; that piece maps the borrowed-support world, this one argues for the manufactured-support one.

By The EarthScan Team · 9 min read
EarthScan insight

Most reading lists for the small-data problem are organised by novelty: the newest few-shot method at the top, the foundational papers underneath, and somewhere at the bottom the unglamorous work of building a dataset. We want to invert that order, because the order in which you read is a claim about where the leverage is, and on the one pipeline we can speak to with numbers the leverage was not where the citation counts said it would be. This is the list we would hand a team with a few hundred labels and a deadline, arranged so the reading that actually moved a metric sits on top and the exotic tricks sit last.

The setting is raster well-log digitisation: lifting analogue curves off scanned paper logs, where the label is a dense per-pixel mask a human has to trace by hand. That is the most expensive kind of scarce label, and it is exactly the kind our earlier survey of few-shot segmentation method families was about. This note is the companion to that one, not a reprise. The survey mapped the borrowed-support world, where you assume a handful of real labelled examples exist and compete to squeeze more generalisation out of them. This list argues for the world next to it, where the image is a rendering of source data you control, so the support set can be built rather than borrowed. Read whichever matches your case; our images were renderable, and that changed which reading paid.

Read one: the frame that says manufacture, do not out-clever

Start with the argument that reorders the whole list. The general lesson of the field, stated most bluntly by Sutton, is that methods which scale with data and computation win over the long run against methods that encode human cleverness [1]. It reads as a claim about foundation models, but it lands just as hard one rung down, on a single segmentation task with a few hundred real labels. The clever move is to reach for a low-data method that learns from k examples; the scaling move is to ask whether you can manufacture the examples instead, and then make a lot of them. On a rendered-source problem that manufacturing move is available in a way it is not for natural photographs, so Sutton's frame is the reason to prefer it before you have benchmarked anything. It tells you where to spend the first week: not on a method, on a data generator. Accept it and the few-shot papers drop to a contingency you reach for only if manufacturing fails.

Read two: reuse representation before you spend labels

The second rung is representation reuse, the oldest idea on the list. Before you spend a single hand-drawn mask, you can learn a representation from unlabelled input alone. The denoising-autoencoder line of work gave the constructive version of this: corrupt an input, train a network to reconstruct the clean version, and the hidden layer is forced to learn structure that transfers [2]. The empirical follow-up asked why unsupervised pretraining helps, and found that it acts as both a regulariser and a better initialisation, with the benefit largest exactly when labelled data is scarce [3].

In our pipeline this is an autoencoder warm-start: the encoder learns to represent log imagery before the segmentation head ever trains on a mask, so the labels it eventually sees are spent on the hard part, the decision boundary, not on relearning what a log looks like. Read two before you write a training loop, because cold-starting the whole network on the handful of labels you fought to collect throws away the free supervision in your unlabelled images, and the penalty grows as your labels shrink [3].
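The corrupt-then-reconstruct recipe is easier to judge once it is written down. What follows is a minimal sketch under loud assumptions, not the pipeline's code: a toy linear autoencoder in plain Python, trained on made-up 8-value rows standing in for unlabelled log patches, with masking noise as the corruption. Every name in it (`mask_noise`, `train_denoising_autoencoder`, `toy_unlabelled_rows`) is hypothetical, and the real encoder would be a convolutional network rather than two small matrices.

```python
import random


def mask_noise(x, drop_prob, rng):
    """Masking corruption: zero each entry independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def reconstruction_loss(w_enc, w_dec, noisy, clean):
    """Half squared error between the reconstruction of `noisy` and the clean target."""
    recon = matvec(w_dec, matvec(w_enc, noisy))
    return 0.5 * sum((r - c) ** 2 for r, c in zip(recon, clean))


def train_denoising_autoencoder(data, hidden=2, drop_prob=0.25,
                                lr=0.02, epochs=50, seed=0):
    """Plain SGD on a linear encoder/decoder pair; no labels are used anywhere."""
    rng = random.Random(seed)
    dim = len(data[0])
    w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]
    for _ in range(epochs):
        for clean in data:
            noisy = mask_noise(clean, drop_prob, rng)   # corrupt the input
            code = matvec(w_enc, noisy)                 # encode the corrupted input
            recon = matvec(w_dec, code)                 # decode
            err = [r - c for r, c in zip(recon, clean)] # compare against the CLEAN input
            # Gradient w.r.t. the code, computed before the decoder is updated.
            grad_code = [sum(w_dec[i][j] * err[i] for i in range(dim))
                         for j in range(hidden)]
            for i in range(dim):
                for j in range(hidden):
                    w_dec[i][j] -= lr * err[i] * code[j]
            for j in range(hidden):
                for k in range(dim):
                    w_enc[j][k] -= lr * grad_code[j] * noisy[k]
    return w_enc, w_dec


def toy_unlabelled_rows(n, seed=1):
    """Hypothetical stand-in for unlabelled patches: two flat segments per row."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a, b = rng.random(), rng.random()
        rows.append([a] * 4 + [b] * 4)
    return rows


if __name__ == "__main__":
    rows = toy_unlabelled_rows(200)
    w_enc, _ = train_denoising_autoencoder(rows)
    # Warm-start: the downstream model's encoder begins from the pretrained
    # weights instead of a fresh random initialisation.
    segmenter_encoder = [row[:] for row in w_enc]
    print(len(segmenter_encoder), "encoder rows copied for the warm-start")
```

The warm-start itself is the last step: the pretrained encoder weights are copied in as the starting point for the supervised model, so the scarce masks only have to tune the decision, not rediscover the structure of the input.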

Read three: the generator is the model that matters

The third rung, and the one that carried the pipeline, is scaled synthetic data. The systematic case is domain randomization: render a lot of varied synthetic scenes, and a network trained on them transfers to the real distribution, with the transfer improving as you scale and randomise the generation [4]. For a printed log this is unusually clean, because the log is a deterministic rendering of known source curves, so a procedural renderer emits an image together with its pixel-perfect mask for free, at whatever volume your compute budget allows. There is no reality gap of the photographic kind to bridge, only a rendering to make faithful, which is why the synthetic route dominates our reading list specifically and why we would not promise it dominates yours if your images are photographs of the world.
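To make "an image together with its pixel-perfect mask for free" concrete, here is a hedged sketch of what a procedural renderer can look like. It is an illustration under assumptions, not the generator used on the engagement: the curve model (a sum of random sinusoids), the strip layout (one curve value per column), the randomised parameters, and the names `render_log` and `synth_corpus` are all hypothetical.

```python
import math
import random


def render_log(width, height=64, seed=0):
    """Render one synthetic log strip plus its mask.

    Returns (image, mask, curve). image holds grey levels in [0, 1],
    mask holds 1 on curve pixels and 0 elsewhere, and curve is the
    source signal the image was rendered from.
    """
    rng = random.Random(seed)

    # Source curve: a few random sinusoids, normalised to [0, 1].
    waves = [(rng.uniform(0.2, 1.0), rng.uniform(0.02, 0.2),
              rng.uniform(0.0, 2.0 * math.pi)) for _ in range(3)]
    raw = [sum(a * math.sin(f * x + p) for a, f, p in waves) for x in range(width)]
    lo, hi = min(raw), max(raw)
    span = (hi - lo) or 1.0
    curve = [(v - lo) / span for v in raw]

    # Randomised rendering parameters, in the spirit of domain randomization.
    grid_step = rng.randint(8, 16)
    paper = rng.uniform(0.85, 1.0)
    grid_ink = rng.uniform(0.6, 0.8)
    noise = rng.uniform(0.0, 0.05)
    half_width = rng.randint(1, 2)   # curve is 1 or 3 pixels thick

    image = [[paper] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]

    # Background grid, as on printed log paper.
    for y in range(height):
        for x in range(width):
            if y % grid_step == 0 or x % grid_step == 0:
                image[y][x] = grid_ink

    # Draw the curve and record exactly the same pixels in the mask.
    for x, v in enumerate(curve):
        y0 = round((1.0 - v) * (height - 1))
        for y in range(max(0, y0 - half_width + 1), min(height, y0 + half_width)):
            image[y][x] = 0.0
            mask[y][x] = 1

    # Scanner-like noise on the image only; the mask stays exact.
    for y in range(height):
        for x in range(width):
            jitter = rng.uniform(-noise, noise)
            image[y][x] = min(1.0, max(0.0, image[y][x] + jitter))

    return image, mask, curve


def synth_corpus(n, seed=0):
    """Yield n labelled instances with varying widths."""
    rng = random.Random(seed)
    for _ in range(n):
        yield render_log(width=rng.randint(96, 256), seed=rng.randrange(2 ** 31))
```

The design point is that the mask is written in the same loop that draws the curve, so label cost is zero and label error is zero by construction. A faithful renderer would also join steep segments between neighbouring columns and model the real paper, ink, and scan artefacts; this sketch does neither.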

The numbers are the argument. The corpora grew across the engagement, from an initial 2,000 binary-segmentation instances to a 15,000-instance multiclass set and a 20,000-log two-curve set, and the realised performance grew with them, to a peak held-out R-squared of 0.9891 and a lowest MAE of 0.0132, all on a batch-size-1 pipeline pinned there by the variable image widths. No few-shot trick is doing that work; manufacturing more, and more varied, labelled instances is. The exhibit below is that claim made draggable: the rungs of this list stacked in payoff order, with a lever that moves the synthetic corpus across its three sourced scale points and the payoff marker climbing toward 0.9891 as it goes.

[Interactive exhibit: "Small-data vision, read in order of realised payoff. Manufacture and reuse data first; the exotic few-shot tricks come last." Left panel, reading ladder, bottom rung first: (1) split discipline, a clean 80/20 train-validation split; (2) autoencoder warm-start; (3) procedural synthesis, scaled from 2,000 to 20,000 instances. Right panel, realised payoff: held-out R-squared (axis 0.50 to 1.00) against synthetic corpus size (2,000, 15,000, 20,000), with a synthetic-scale lever; the readout shows 0.5461 at 2,000 instances and a peak of 0.9891.]
A reading list for the label-scarce regime, ordered by realised payoff on one raster-log-digitisation pipeline rather than by novelty. The three rungs read bottom-up: a clean 80/20 train-validation split, an autoencoder warm-start that reuses a learned representation before any labelling, and procedural synthesis scaled from 2,000 to 15,000 to 20,000 instances. The synthetic-scale lever drags the corpus size across those three sourced scale points and the orange marker climbs the payoff axis toward the peak held-out R-squared of 0.9891, which is the whole argument: when real labels are scarce, the leverage is in manufacturing and reusing data, not in chasing exotic few-shot tricks, and the numbers show synthetic scale doing the heavy lifting. The three scale points, the 80/20 split, the batch size of 1, the peak R-squared of 0.9891, and the lowest MAE of 0.0132 are sourced from the engagement archive; the marker is anchored at the two measured end-points and the curve drawn between them is illustrative interpolation, not a logged per-scale series.

The point is not that 0.9891 is a good number in the abstract; it is that the number is produced by the bottom of the reading list, the data and the discipline, not the top. The lever moves one thing, the count of manufactured instances, and the payoff tracks it.
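For readers who want the two headline metrics pinned down, the sketch below assumes the standard definitions: R-squared as the coefficient of determination and MAE as the mean absolute difference, both taken between reference values and recovered values. The archive figures do not say which quantities were compared or how they were normalised, so treat that pairing, and the function names, as assumptions.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and recovered values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    1.0 means a perfect match; 0.0 means no better than always
    predicting the mean of the reference values.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    reference = [0.0, 0.25, 0.5, 1.0]   # made-up values, for illustration only
    recovered = [0.0, 0.25, 0.5, 0.75]
    print(r_squared(reference, recovered), mean_absolute_error(reference, recovered))
```

Both functions are only meaningful on data the model did not train on, which is why the split discipline in the next section carries as much weight as the metrics themselves.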

Read four: the discipline that keeps the numbers honest

Underneath the three headline rungs is the reading nobody puts on a list because it is not a method: how you collect, split, and account for data. The survey view of where labelled data comes from, with augmentation and generation as first-class sources rather than afterthoughts, is the map that makes a data-centric reading list coherent instead of a pile of tricks [5], and it tells you that manufacturing data is a legitimate answer to the collection problem, not a shortcut around it.

The concrete discipline this rung buys is honest evaluation. We held out a straight 80/20 train-validation split and reported the held-out numbers, which is the only reason the 0.9891 means anything: the same figure measured on training data would be about memorisation, not about digitising a log the model has not seen. None of this is exotic, and that is the point of putting it on the list. The reading that keeps you honest is duller than the reading that makes you clever, and it matters more.
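A straight 80/20 split is a few lines of code, which is part of why it gets skipped. The sketch below is one way to do it, assuming instances can be referred to by an identifier and that a fixed seed is acceptable for reproducibility; `train_val_split` is a hypothetical helper, not a function from the pipeline.

```python
import random


def train_val_split(items, val_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed, then cut: no item lands in both sets."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(items) * val_fraction)
    val = [items[i] for i in indices[:n_val]]
    train = [items[i] for i in indices[n_val:]]
    return train, val


if __name__ == "__main__":
    instance_ids = list(range(20_000))   # e.g. the 20,000-log corpus, by id
    train_ids, val_ids = train_val_split(instance_ids)
    print(len(train_ids), len(val_ids))
```

Fixing the seed means the held-out set stays the same across runs, so a metric reported today can be compared with one reported next month. If several rendered images share one source curve, splitting on the curve identifier rather than the image identifier is the safer variant; the source does not say whether that case arose here.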

What the list leaves off, on purpose

There is a whole shelf this list does not recommend reading first, and naming it makes the omission a choice. The k-shot method families, matching networks and prototypical methods and meta-learners, are genuinely good work, and our companion survey credits them in full. They are not at the top here because they answer a question we did not have: how much you can extract from a small, real, borrowed support set. We had no borrowed support set to optimise; we had a renderer. If your labels are scarce but real and cannot be manufactured, invert this list, put the method families on top, and reach for the generation papers only if you find a way to render. The ordering is a function of your data, not a universal ranking.

Limitations

This is one pipeline's reading list, not a benchmark, and the ordering is contingent on a property most vision problems do not share: our images are deterministic renderings of source data, so synthetic generation is unusually cheap and faithful. On natural photographs the reality gap is wider, and the synthetic rung would likely sit lower relative to the few-shot and pretraining rungs. The sourced numbers are real archive figures: the 2,000, 15,000, and 20,000 instance scales, the 80/20 split, the batch size of 1, the peak R-squared of 0.9891, and the lowest MAE of 0.0132. The smooth payoff curve the instrument draws between the two measured end-points is not: it is illustrative interpolation, not a logged score at every intermediate corpus size, and a real scale-up rarely climbs that cleanly. The cited works supply the frame and the mechanism; none of them is cited as having studied well logs. And a high held-out R-squared on rendered curves is necessary, not sufficient, for a usable log: whether the synthetic corpus covered the field's real failure modes is the question this list does not answer, and the reason we treat manufacturing as leverage rather than a finish line.

References

[1] Sutton, R. The Bitter Lesson. Incomplete Ideas blog (2019). The long-run argument that general methods scaling with data and compute beat human-engineered cleverness. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[2] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and Composing Robust Features with Denoising Autoencoders. ICML 2008. Learning a transferable representation from unlabelled input before labels are spent. https://dl.acm.org/doi/10.1145/1390156.1390294

[3] Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. Why Does Unsupervised Pre-training Help Deep Learning? Journal of Machine Learning Research 11 (2010), pp. 625-660. Pretraining as regularizer and initialisation, with the largest benefit when labelled data is scarce. https://www.jmlr.org/papers/v11/erhan10a.html

[4] Tremblay, J., et al. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. CVPR Workshops 2018. Scaled, randomised synthetic data training real-world networks. https://arxiv.org/abs/1804.06516

[5] Roh, Y., Heo, G., and Whang, S. E. A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective. IEEE TKDE 33(4) (2021), pp. 1328-1347. The map of where labelled data comes from, with generation and augmentation as first-class sources. https://arxiv.org/abs/1811.03402

© 2026 Copyright. Earthscan