Few-Shot and Small-Data Vision: A Reading List for 2023

Most reading lists for the small-data problem are organised by novelty: the newest few-shot method at the top, the foundational papers underneath, and somewhere at the bottom the unglamorous work of building a dataset. We want to invert that order, because the order in which you read is a claim about where the leverage is, and on the one pipeline we can speak to with numbers the leverage was not where the citation counts said it would be. This is the list we would hand a team with a few hundred labels and a deadline, arranged so the reading that actually moved a metric sits on top and the exotic tricks sit last.

The setting is raster well-log digitisation: lifting analogue curves off scanned paper logs, where the label is a dense per-pixel mask a human has to trace by hand. That is the most expensive kind of scarce label, and it is exactly the kind our earlier survey of few-shot segmentation method families was about. This note is the companion to that one, not a reprise. The survey mapped the borrowed-support world, where you assume a handful of real labelled examples exist and compete to squeeze more generalisation out of them. This list argues for the world next to it, where the image is a rendering of source data you control, so the support set can be built rather than borrowed. Read whichever matches your case; ours was the renderable one, and that changed which reading paid.

Read one: the frame that says manufacture, do not out-clever

Start with the argument that reorders the whole list. The general lesson of the field, stated most bluntly by Sutton, is that methods which scale with data and computation win over the long run against methods that encode human cleverness [1]. It reads as a claim about foundation models, but it lands just as hard one rung down, on a single segmentation task with a few hundred real labels. The clever move is to reach for a low-data method that learns from k examples; the scaling move is to ask whether you can manufacture the examples instead, and then make a lot of them. On a rendered-source problem that manufacturing move is available in a way it is not for natural photographs, so Sutton's frame is the reason to prefer it before you have benchmarked anything. It tells you where to spend the first week: not on a method, on a data generator. Accept it and the few-shot papers drop to a contingency you reach for only if manufacturing fails.

Read two: reuse representation before you spend labels

The second rung is representation reuse, the oldest idea on the list. Before you spend a single hand-drawn mask, you can learn a representation from unlabelled input alone. The denoising-autoencoder line of work gave the constructive version of this idea: corrupt an input, train a network to reconstruct the clean version, and the hidden layer is forced to learn structure that transfers [2]. The empirical follow-up asked why unsupervised pretraining helps, and found it acts as both a regulariser and a better initialisation, with the help largest exactly when labelled data is scarce [3].

In our pipeline this is an autoencoder warm-start: the encoder learns to represent log imagery before the segmentation head ever trains on a mask, so the labels it eventually sees are spent on the hard part, the decision boundary, not on relearning what a log looks like. Read two before you write a training loop, because cold-starting the whole network on the handful of labels you fought to collect throws away the free supervision in your unlabelled images, and the penalty grows as your labels shrink [3].
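To make the corrupt-then-reconstruct recipe concrete, here is a minimal NumPy sketch of a one-hidden-layer denoising autoencoder trained with plain gradient descent. It is a toy under stated assumptions, not the pipeline's network: the function name `train_denoising_autoencoder`, the masking-noise corruption, the layer sizes, and the learning rate are all illustrative choices, and a real warm-start would more plausibly use a convolutional encoder in a deep-learning framework.

```python
import numpy as np

def train_denoising_autoencoder(patches, hidden=16, epochs=200, lr=0.3,
                                drop_prob=0.3, seed=0):
    """Toy denoising autoencoder (hypothetical helper, not the pipeline's code).

    patches: (n, d) float array of UNLABELLED inputs scaled to [0, 1].
    Returns ((W_enc, b_enc), losses): the encoder parameters to reuse as a
    warm-start, and the reconstruction loss recorded at every epoch.
    """
    rng = np.random.default_rng(seed)
    n, d = patches.shape
    W_enc = rng.normal(0.0, 0.1, size=(d, hidden))
    b_enc = np.zeros(hidden)
    W_dec = rng.normal(0.0, 0.1, size=(hidden, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(epochs):
        # Corrupt: masking noise zeroes a random subset of input entries.
        keep = rng.random(patches.shape) >= drop_prob
        corrupted = patches * keep
        # Encode the corrupted input, decode, compare against the CLEAN input.
        h = np.tanh(corrupted @ W_enc + b_enc)
        recon = h @ W_dec + b_dec
        err = recon - patches
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean-squared reconstruction error.
        g_recon = 2.0 * err / err.size
        g_W_dec = h.T @ g_recon
        g_b_dec = g_recon.sum(axis=0)
        g_pre = (g_recon @ W_dec.T) * (1.0 - h ** 2)  # tanh derivative
        g_W_enc = corrupted.T @ g_pre
        g_b_enc = g_pre.sum(axis=0)
        W_enc -= lr * g_W_enc
        b_enc -= lr * g_b_enc
        W_dec -= lr * g_W_dec
        b_dec -= lr * g_b_dec
    return (W_enc, b_enc), losses
```

In a warm-start, only the returned encoder parameters would be carried into the segmentation model; the decoder is discarded once it has served as the reconstruction target. No mask appears anywhere in the loop, which is the sense in which the supervision is free.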

Read three: the generator is the model that matters

The third rung, and the one that carried the pipeline, is scaled synthetic data. The systematic case is domain randomization: render many varied synthetic scenes, and a network trained on them transfers to the real distribution, with the transfer improving as you scale and randomise the generation [4]. For a printed log this is unusually clean, because the log is a deterministic rendering of known source curves, so a procedural renderer emits an image together with its pixel-perfect mask for free, at whatever volume you can afford to compute. There is no reality gap of the photographic kind to bridge, only a rendering to make faithful, which is why the synthetic route dominates our reading list specifically and why we would not promise it dominates yours if your images are photographs of the world.
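A procedural renderer of this kind can be sketched in a few lines. The version below is an illustration under assumptions, not the engagement's generator: `render_log`, its parameters, the sinusoidal stand-in for source curves, the horizontal orientation, and the additive-noise degradation are all hypothetical. What it does show is the property the argument rests on: the image and its per-pixel mask come out of the same draw, so the label costs nothing extra.

```python
import numpy as np

def render_log(width, height=64, n_curves=2, line_px=1, noise=0.05, seed=None):
    """Render a toy log strip and its pixel-perfect mask (hypothetical helper).

    Returns (image, mask, curves):
      image  float32 (height, width), 1.0 = paper, 0.0 = ink, plus noise
      mask   uint8   (height, width), 0 = background, k = pixels of curve k
      curves float   (n_curves, width), the source values in [0, 1]
    """
    rng = np.random.default_rng(seed)
    x = np.arange(width)
    image = np.ones((height, width), dtype=np.float32)
    mask = np.zeros((height, width), dtype=np.uint8)
    curves = []
    for c in range(n_curves):
        # Stand-in source curve: a randomised sinusoid (an assumption; real
        # source curves would come from the digital log data you control).
        freq = rng.uniform(0.5, 3.0)
        phase = rng.uniform(0.0, 2.0 * np.pi)
        values = 0.5 + 0.4 * np.sin(2.0 * np.pi * freq * x / width + phase)
        rows = np.round((1.0 - values) * (height - 1)).astype(int)
        # Draw the trace with the requested thickness; the mask is written
        # in the same step, so image and label cannot disagree.
        for dy in range(-(line_px // 2), line_px // 2 + 1):
            r = np.clip(rows + dy, 0, height - 1)
            image[r, x] = 0.0
            mask[r, x] = c + 1
        curves.append(values)
    # Randomised degradation applied to the image only, never to the mask.
    noisy = image + rng.normal(0.0, noise, size=image.shape)
    image = np.clip(noisy, 0.0, 1.0).astype(np.float32)
    return image, mask, np.stack(curves)
```

Calling `render_log(width, seed=i)` with a different `width` and `seed` per instance yields a corpus of variable-width images with exact masks. Scaling the corpus is then a loop over seeds, and randomising it is a matter of widening the ranges the generator samples from.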

The numbers are the argument. The corpora grew across the engagement, from an initial 2,000 binary-segmentation instances to a 15,000-instance multiclass set and a 20,000-log two-curve set, and the realised performance grew with them, to a peak held-out R-squared of 0.9891 and a lowest MAE of 0.0132, all on a batch-size-1 pipeline pinned there by the variable image widths. No few-shot trick is doing that work; manufacturing more, and more varied, labelled instances is. The exhibit below is that claim made draggable: the rungs of this list stacked in payoff order, with a lever that moves the synthetic corpus across its three sourced scale points and the payoff marker climbing toward 0.9891 as it goes.

Exhibit (interactive): a reading list for the label-scarce regime, ordered by realised payoff on one raster-log-digitisation pipeline rather than by novelty. The three rungs read bottom-up: a clean 80/20 train-validation split, an autoencoder warm-start that reuses a learned representation before any labelling, and procedural synthesis scaled from 2,000 to 15,000 to 20,000 instances. The synthetic-scale lever drags the corpus size across those three sourced scale points and the orange marker climbs the payoff axis toward the peak held-out R-squared of 0.9891, which is the whole argument: when real labels are scarce, the leverage is in manufacturing and reusing data, not in chasing exotic few-shot tricks, and the numbers show synthetic scale doing the heavy lifting. The three scale points, the 80/20 split, the batch size of 1, the peak R-squared of 0.9891, and the lowest MAE of 0.0132 are sourced from the engagement archive; the marker is anchored at the two measured end-points and the curve drawn between them is illustrative interpolation, not a logged per-scale series.

The point is not that 0.9891 is a good number in the abstract; it is that the number is produced by the bottom of the reading list, the data and the discipline, not the top. The lever moves one thing, the count of manufactured instances, and the payoff tracks it.

Read four: the discipline that keeps the numbers honest

Underneath the three headline rungs is the reading nobody puts on a list because it is not a method: how you collect, split, and account for data. The survey view of where labelled data comes from, with augmentation and generation as first-class sources rather than afterthoughts, is the map that makes a data-centric reading list coherent instead of a pile of tricks [5], and it tells you that manufacturing data is a legitimate answer to the collection problem, not a shortcut around it.

The concrete discipline this rung buys is honest evaluation. We held out a straight 80/20 train-validation split and reported the held-out numbers, which is the only reason the 0.9891 means anything: the same figure measured on training data would be about memorisation, not about digitising a log the model has not seen. None of this is exotic, and that is the point of putting it on the list. The reading that keeps you honest is duller than the reading that makes you clever, and it matters more.
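The bookkeeping behind that paragraph fits in a short sketch. The helper names below are hypothetical, and the article does not say which quantity the R-squared and MAE were computed over, so treating them as scores on predicted versus true curve values is an assumption. The definitions themselves are the standard ones: R-squared is one minus the ratio of residual to total sum of squares, and MAE is the mean absolute difference.

```python
import random

def train_val_split(items, val_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed, then cut: returns (train, validation)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Usage: split instance indices once, train only on `train_ids`,
# and report both metrics only on predictions for `val_ids`.
train_ids, val_ids = train_val_split(range(2000))
```

Fixing the seed matters as much as the ratio: a split that is reshuffled between runs lets validation instances leak into training across experiments, and the held-out figure stops being held out.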

What the list leaves off, on purpose

There is a whole shelf this list does not recommend reading first, and naming it makes the omission a choice. The k-shot method families, matching networks and prototypical methods and meta-learners, are genuinely good work, and our companion survey credits them in full. They are not at the top here because they answer a question we did not have: how much you can extract from a small, real, borrowed support set. We had no borrowed support set to optimise; we had a renderer. If your labels are scarce but real and cannot be manufactured, invert this list, put the method families on top, and reach for the generation papers only if you find a way to render. The ordering is a function of your data, not a universal ranking.

Limitations

This is one pipeline's reading list, not a benchmark, and the ordering is contingent on a property most vision problems do not share: our images are deterministic renderings of source data, so synthetic generation is unusually cheap and faithful. On natural photographs the reality gap is wider, and the synthetic rung would likely sit lower relative to the few-shot and pretraining rungs. The sourced numbers are real archive figures: the 2,000, 15,000, and 20,000 instance scales, the 80/20 split, the batch size of 1, the peak R-squared of 0.9891, and the lowest MAE of 0.0132. The smooth payoff curve the instrument draws between the two measured end-points, however, is illustrative interpolation, not a logged score at every intermediate corpus size, and a real scale-up rarely climbs that cleanly. The cited works supply the frame and the mechanism; none of them is cited as having studied well logs. And a high held-out R-squared on rendered curves is necessary, not sufficient, for a usable log: whether the synthetic corpus covered the field's real failure modes is the question this list does not answer, and the reason we treat manufacturing as leverage rather than a finish line.

References

[1] Sutton, R. The Bitter Lesson. Incomplete Ideas blog (2019). The long-run argument that general methods scaling with data and compute beat human-engineered cleverness. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[2] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and Composing Robust Features with Denoising Autoencoders. ICML 2008. Learning a transferable representation from unlabelled input before labels are spent. https://dl.acm.org/doi/10.1145/1390156.1390294

[3] Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. Why Does Unsupervised Pre-training Help Deep Learning? Journal of Machine Learning Research 11 (2010), pp. 625-660. Pretraining as regulariser and initialisation, with the largest benefit when labelled data is scarce. https://www.jmlr.org/papers/v11/erhan10a.html

[4] Tremblay, J., et al. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. CVPR Workshops 2018. Scaled, randomised synthetic data training real-world networks. https://arxiv.org/abs/1804.06516

[5] Roh, Y., Heo, G., and Whang, S. E. A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective. IEEE TKDE 33(4) (2021), pp. 1328-1347. The map of where labelled data comes from, with generation and augmentation as first-class sources. https://arxiv.org/abs/1811.03402
