Data Augmentation Policies Learned and Hand-Tuned: A Survey Across Vision Tasks

Abstract

Choosing an augmentation policy has split into two traditions that rarely cite each other. One learns the policy: it defines a space of transforms and magnitudes and runs an outer optimiser over it, keeping the policy that trains the best child model. The other hand-sets the policy: a practitioner picks a short list of transforms by inspecting how the deployment data varies, fixes reasonable ranges, and trains once. This survey reads both, tracing the learned line from AutoAugment through its cheaper successors to the tuning-free methods that ended it, and sets that line against the hand-tuned tradition of plain geometric and photometric transforms plus random erasing. We ground the reading in one real small-data dense-prediction baseline from VeerNet, our raster well-log digitiser: a five-transform hand-tuned stack of sharpness, blur, jitter, perspective, and random erasing, trained on 15,000 synthetic multiclass instances at an 80/20 train-validation split, where perspective and erasing carry most of the bridge from clean synthetic pages to 136,771 real scans. The central finding, on which the field's own later results converge, is that a short hand-tuned stack buys most of the sim-to-real robustness a learned search promises at a small fraction of the search cost. This is a survey of the public field; our numbers are a worked example, not a new benchmark.

The two traditions and why they diverged

The learned tradition began with a bet that augmentation is a search problem. Cubuk and colleagues framed it exactly that way in AutoAugment: define a discrete space of sub-policies, each a pair of image operations with a probability and a magnitude, and search that space with reinforcement learning, scoring each candidate by the validation accuracy of a child model trained under it [1]. The result was strong and the cost was severe, because every candidate policy required training a model to score it, and the search ran over thousands of candidates. The bet paid in accuracy and charged in compute.
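The sub-policy structure described above can be sketched as a small data type. This is a minimal illustration, not the AutoAugment implementation: the image is a toy list of pixel rows, the two operations, the `Op` field names, and the `TOY_OPS` table are all hypothetical stand-ins, and drawing one sub-policy per image is how we read the published method rather than something this survey re-ran.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]  # toy grayscale image, pixel values in [0, 1]


@dataclass(frozen=True)
class Op:
    """One operation slot in a sub-policy (field names are hypothetical)."""
    name: str
    prob: float       # probability that the operation is applied
    magnitude: float  # strength of the operation, here scaled to [0, 1]


def brighten(img: Image, m: float) -> Image:
    return [[min(1.0, px + m) for px in row] for row in img]


def invert(img: Image, m: float) -> Image:
    return [[1.0 - px for px in row] for row in img]  # magnitude unused


# Toy operation table; a real search space holds many more operations.
TOY_OPS: Dict[str, Callable[[Image, float], Image]] = {
    "brighten": brighten,
    "invert": invert,
}

SubPolicy = Tuple[Op, Op]  # a pair of operations, as in the text


def apply_sub_policy(img: Image, sub_policy: SubPolicy, rng: random.Random) -> Image:
    """Apply each operation in order, each firing with its own probability."""
    for op in sub_policy:
        if rng.random() < op.prob:
            img = TOY_OPS[op.name](img, op.magnitude)
    return img


def pick_and_apply(img: Image, policy: List[SubPolicy], rng: random.Random) -> Image:
    """Draw one sub-policy at random for this image, then apply it."""
    return apply_sub_policy(img, rng.choice(policy), rng)
```

What the search optimises is the contents of `policy`: which operations fill each slot and at what probability and magnitude. Every candidate `policy` has to be scored by training a child model, which is where the cost comes from.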

The hand-tuned tradition never treated augmentation as a search at all. It treats augmentation as a statement about the deployment distribution: if real inputs will arrive rotated, blurred, unevenly lit, or partly occluded, then the training inputs should be perturbed in those same ways, at ranges a human can set from looking at a handful of real examples. Random erasing is the archetype of the tradition's recent additions, a single transform that masks a rectangular region of the input to force robustness to occlusion, with no outer search whatsoever [6]; cutout is its near-twin, masking a fixed-size square [7]. The domain-randomisation line makes the tradition's logic explicit for the synthetic-to-real case: randomise the rendering parameters widely enough and the real image looks to the network like just another sample of the training distribution [8].
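To make the occlusion transform concrete, here is a minimal sketch of rectangular erasing on a toy image stored as a list of pixel rows. The function name, the default probability, and the area and aspect-ratio ranges are assumptions chosen for illustration, not values taken from [6] or from the VeerNet stack.

```python
import math
import random
from typing import List, Tuple

Image = List[List[float]]  # toy grayscale image, pixel values in [0, 1]


def random_erase(
    img: Image,
    rng: random.Random,
    p: float = 0.5,                                # assumed probability of erasing
    area_frac: Tuple[float, float] = (0.02, 0.4),  # assumed erased-area range
    aspect: Tuple[float, float] = (0.3, 3.3),      # assumed height/width ratio range
    fill: float = 0.0,
    attempts: int = 10,
) -> Image:
    """Mask one random rectangle; return the image unchanged if none fits."""
    if rng.random() >= p:
        return img
    h, w = len(img), len(img[0])
    for _ in range(attempts):
        target_area = rng.uniform(*area_frac) * h * w
        ratio = rng.uniform(*aspect)
        eh = int(round(math.sqrt(target_area * ratio)))
        ew = int(round(math.sqrt(target_area / ratio)))
        if 0 < eh <= h and 0 < ew <= w:
            top = rng.randint(0, h - eh)
            left = rng.randint(0, w - ew)
            out = [row[:] for row in img]  # copy so the input is not mutated
            for y in range(top, top + eh):
                for x in range(left, left + ew):
                    out[y][x] = fill
            return out
    return img
```

Every knob here is something a person can set by looking at real inputs: how often occlusion occurs, how large it tends to be, and what shape it takes. Nothing in the function is searched.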

The two lines diverged on a single question, which this survey keeps in front of everything else: is the extra accuracy a search recovers worth the search? For a lab optimising the last point of top-1 accuracy on a public benchmark, the answer was often yes. For a small-data, single-GPU project shipping into a closed operator network, the arithmetic is different, and the field's own later results, as we read them below, moved toward the hand-tuned answer.

The learned line, from expensive to almost free

What is striking about the learned tradition is that its most important results are about making the search cheaper, and each step toward cheapness moved the method closer to the hand-tuned stack it was supposed to beat. Population Based Augmentation replaced the offline reinforcement-learning search with population based training, learning a schedule of augmentations during a single training run at a small fraction of AutoAugment's cost [2]. Fast AutoAugment removed the repeated child-model retraining entirely by matching the density of augmented and clean data, treating policy search as a density-matching problem rather than a train-and-score loop [3]. Both kept the premise that the policy is worth searching for; both spent most of their effort on paying less for the search.

RandAugment is the pivot. Cubuk and colleagues observed that the enormous AutoAugment search space could be collapsed to two integers, the number of transforms applied per image and a single shared magnitude, and that grid-searching those two numbers matched or beat the learned policy [4]. That is close to an admission that the expensive search was recovering something a two-parameter sweep already contained. TrivialAugment closed the argument: Müller and Hutter showed that applying one transform, chosen uniformly at random at a strength chosen uniformly at random, with no tuning of anything, matches the tuned policies on standard benchmarks [5]. The end state of the learned tradition is a method that does no search at all, which is to say it is a hand-set policy with the ranges left wide. The survey by Shorten and Khoshgoftaar, written mid-trajectory, already catalogues both families side by side and notes how much of augmentation's value comes from the plain geometric and photometric transforms rather than from the search machinery on top [9].
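The two reductions are small enough to write out. The sketch below follows the descriptions in this section rather than either paper's reference code: the toy operations, the `MAX_MAGNITUDE` scale, and the function names are assumptions made for illustration.

```python
import random
from typing import Callable, List

Image = List[List[float]]  # toy grayscale image, pixel values in [0, 1]
MAX_MAGNITUDE = 30  # assumed integer magnitude scale, not taken from the source


def brighten(img: Image, s: float) -> Image:
    return [[min(1.0, px + 0.5 * s) for px in row] for row in img]


def darken(img: Image, s: float) -> Image:
    return [[px * (1.0 - 0.5 * s) for px in row] for row in img]


def shift_right(img: Image, s: float) -> Image:
    k = round(s * (len(img[0]) - 1))  # columns to roll, with wrap-around
    return [row[-k:] + row[:-k] if k else row[:] for row in img]


# Toy operation list; real implementations use a longer, fixed list.
TOY_OPS: List[Callable[[Image, float], Image]] = [brighten, darken, shift_right]


def rand_augment(img: Image, n: int, m: int, rng: random.Random) -> Image:
    """RandAugment-style: n ops drawn uniformly, all at one shared magnitude m."""
    strength = m / MAX_MAGNITUDE
    for op in rng.choices(TOY_OPS, k=n):
        img = op(img, strength)
    return img


def trivial_augment(img: Image, rng: random.Random) -> Image:
    """TrivialAugment-style: one op and one strength, both drawn uniformly."""
    op = rng.choice(TOY_OPS)
    strength = rng.randint(0, MAX_MAGNITUDE) / MAX_MAGNITUDE
    return op(img, strength)
```

In this form the whole policy space of `rand_augment` is the pair `(n, m)`, so tuning it is a small grid of training runs, and `trivial_augment` has no arguments left to tune at all.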

Method

This is a structured reading of the published augmentation-policy literature, not a new experiment, and the scope was kept narrow so the claims stay defensible. We organised the field into two families. The learned family was read in its cost-descent order: the reinforcement-learning search of AutoAugment [1], the schedule-learning of Population Based Augmentation [2], the density-matching of Fast AutoAugment [3], the two-parameter reduction of RandAugment [4], and the no-search end state of TrivialAugment [5]. The hand-tuned family was read through its recent, well-cited additions, random erasing [6] and cutout [7], set inside the domain-randomisation rationale that governs synthetic-to-real transfer [8], with the broad survey of Shorten and Khoshgoftaar as the cross-family map [9]. For each method we extracted three things: what is chosen, what does the choosing, and the search cost that choice incurs, measured in trial training runs.

To anchor the reading to a real task, we set it against one reference from the engagement archive: the augmentation stack of VeerNet, our raster well-log digitiser. The stack is five hand-tuned transforms, sharpness, blur, jitter, perspective, and random erasing, applied to 15,000 synthetic multiclass training instances split 80/20 into train and validation. The two that matter most for transfer are perspective, which mimics the keystoning of a photographed page, and erasing, which mimics the smudges, staples, and torn corners of a scanned one; together they carry the bulk of the bridge from clean synthetic renders to the 136,771 real scans the model must eventually read. These figures are real and used as a worked example, not a re-run of any learned method. The interactive exhibit sits on the same footing: the stack size, instance count, split, and real-scan count are sourced; the robustness fractions and the learned-search curve are flagged as illustrative policy-design geometry.
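The sourced figures reduce to a few lines of arithmetic. The sketch below only restates the numbers given in this section; the constant and function names are hypothetical, and no transform ranges are shown because the text does not state them.

```python
from typing import Tuple

# Transform names and counts are the ones stated in the text.
STACK: Tuple[str, ...] = ("sharpness", "blur", "jitter", "perspective", "erasing")
N_SYNTHETIC = 15_000
N_REAL_SCANS = 136_771
TRAIN_PERCENT = 80


def split_counts(n: int, train_percent: int) -> Tuple[int, int]:
    """Integer train/validation counts for a percentage split."""
    n_train = n * train_percent // 100
    return n_train, n - n_train


n_train, n_val = split_counts(N_SYNTHETIC, TRAIN_PERCENT)
real_per_synthetic = N_REAL_SCANS / N_SYNTHETIC
print(len(STACK), n_train, n_val, round(real_per_synthetic, 2))
# prints: 5 12000 3000 9.12
```

The last figure is the point of the exercise: the real scans outnumber the synthetic instances by roughly nine to one, so the stack has to carry the model across a gap far wider than the data it trains on.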

What the search is actually buying

The clean way to compare the two traditions is to plot robustness against search cost, and the shape of that plot is the whole argument. Search cost is the number of trial training runs a policy consumes before it is fixed. A hand-tuned stack is set by inspection, so it costs one run, the run you were going to train anyway. A learned search costs as many runs as it evaluates candidates, which for the original AutoAugment was very large [1] and for its successors fell steadily but never to one [2] [3] [4]. Robustness, read as the fraction of the sim-to-real gap a policy closes on held-out real inputs, rises with search but saturates, because there are only so many useful ways to perturb an image and the good ones are found early.

Exhibit. Robustness bought per unit of augmentation search. The x-axis is search cost, the number of trial training runs a policy spends before it is fixed; the y-axis is a robustness proxy, the fraction of the sim-to-real gap a policy closes on held-out real scans. The five-transform hand-tuned stack (sharpness, blur, jitter, perspective, erasing) is set by inspection, so it sits at a search cost of one run yet already reaches most of the way to the ceiling a learned search approaches. Drag the search-budget lever to grant a learned search more trial runs; its robustness climbs with diminishing returns while the hand-stack point never moves, because it never searched. The orange segment is the only element that argues: the robustness the hand stack concedes to a learned search of the chosen budget, annotated with the search-cost multiple that concession costs. The stack transforms, the 15,000 synthetic training instances, the 80/20 train-validation split, and the 136,771 real scans that perspective and erasing bridge to are sourced from the engagement archive; the robustness fractions and the learned-search curve shape are illustrative policy-design geometry, not a recorded sweep.

The exhibit above renders that comparison with the sourced stack as its anchor. The hand-tuned point sits at a search cost of one and lands most of the way to the ceiling a learned search approaches, because the five transforms it contains are exactly the ones a human reads off the real scans: sharpness and blur for scan quality, jitter for lighting, perspective for the tilt of a photographed page, erasing for occlusion. Dragging the search-budget lever grants a learned search more trial runs, and its robustness climbs with diminishing returns while the hand-tuned point never moves, because it never searched. The orange segment is the only element that argues: it measures the robustness the hand stack concedes to a learned search of the chosen budget, and annotates the search-cost multiple that concession costs. The concession stays small across a wide budget range, which is the plot's point and the field's finding: RandAugment reducing the search to two numbers [4] and TrivialAugment reducing it to none [5] are the published confirmations that the concession really is small. The robustness fractions and the curve shape are illustrative, flagged on the canvas; only the stack, the instance count, the split, and the real-scan count are sourced.
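The geometry of the exhibit can be written down as a toy model. Everything numeric in the sketch below is a placeholder chosen only to show the shape: the saturating form, the ceiling, the half-budget constant, and the hand-stack value are assumptions, not measurements from VeerNet or from any published sweep.

```python
def learned_robustness(budget_runs: int, ceiling: float, half_budget: float) -> float:
    """Saturating curve: approaches `ceiling`, reaching half of it at `half_budget` runs."""
    return ceiling * budget_runs / (budget_runs + half_budget)


def concession(budget_runs: int, hand_robustness: float,
               ceiling: float, half_budget: float) -> float:
    """Robustness the hand stack gives up to a search of this budget (never negative)."""
    gap = learned_robustness(budget_runs, ceiling, half_budget) - hand_robustness
    return max(0.0, gap)


def cost_multiple(budget_runs: int, hand_runs: int = 1) -> float:
    """How many times more trial runs the search spends than the hand stack."""
    return budget_runs / hand_runs


# PLACEHOLDER values, illustrative only; none of these were measured.
HAND, CEILING, HALF = 0.80, 0.95, 20.0

for budget in (10, 100, 1000):
    print(budget, round(concession(budget, HAND, CEILING, HALF), 3), cost_multiple(budget))
```

Under any saturating curve of this kind the concession is bounded by the distance between the hand-stack point and the ceiling, while the cost multiple grows linearly with the budget. That asymmetry is what the orange segment is drawn to show.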

Why the hand stack wins on this particular task

The general finding sharpens on a small-data dense-prediction task like ours. A learned augmentation search needs a reliable validation signal to score candidate policies, and it needs to run that scoring many times; on 15,000 synthetic instances at an 80/20 split, the scoring loop would multiply the multiclass training budget by the candidate count, which on one memory-bound GPU is not a cost the project can absorb. More to the point, the augmentation gap on this task is not mysterious. We know how a real scan differs from a synthetic render, because we can look at both: the render is square, clean, and evenly lit, and the scan is tilted, smudged, and variably sharp. A search would spend its budget rediscovering perspective and erasing as the transforms that matter, which is exactly what we set by hand. The domain-randomisation logic says the same from the other direction: randomise the axes real inputs vary in most and the transfer follows [8].
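The multiplication in the paragraph above is simple enough to state as code. The sketch assumes each candidate policy is scored with one full training run, as the text describes; the epoch count and the candidate counts are hypothetical, and only the 12,000 training instances (80% of 15,000) come from the sourced configuration.

```python
def training_passes(n_train: int, epochs: int, runs: int) -> int:
    """Total instance passes: one full training run per candidate scored."""
    return n_train * epochs * runs


N_TRAIN = 12_000  # 80% of the 15,000 synthetic instances
EPOCHS = 50       # hypothetical epoch count, for illustration only

hand = training_passes(N_TRAIN, EPOCHS, runs=1)  # the hand stack trains once

for candidates in (10, 100, 1000):  # hypothetical search budgets
    learned = training_passes(N_TRAIN, EPOCHS, runs=candidates)
    print(candidates, learned, learned // hand)
```

Whatever the epoch count, the ratio in the last column equals the candidate count, so the search cost is measured directly in multiples of the one run the project was going to pay for anyway.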

Here hand-tuning is not a compromise forced by a small compute budget but the better epistemics. When the deployment distribution is legible, a human reading a dozen real scans encodes the right prior faster and more reliably than a search inferring that prior from validation accuracy alone. The learned tradition's own convergence toward tuning-free methods [5] is the field admitting, at scale, what the small-data practitioner knew from the task: most of the value is in choosing the right few transforms, not in searching the space of all of them.

Discussion

Read together, the two traditions are closer than their literatures suggest, and the gap shrank from the learned side. AutoAugment opened with an expensive search [1]; each successor bought the same accuracy for less search cost [2] [3] [4]; TrivialAugment ended with no search, which is a hand-set policy by another name [5]. The hand-tuned tradition never moved, because it was already at the destination the learned line was walking toward: a short list of transforms chosen for the deployment distribution, with random erasing [6] and cutout [7] as recent standard additions and domain randomisation [8] as the rationale for synthetic-to-real transfer. The practical rule follows directly. Reach for the hand-tuned stack first when the deployment distribution is legible and the compute budget is small, and reserve a learned search for when the gap is genuinely unknown and the compute to search it is genuinely available.

The VeerNet stack is a worked example downstream of that reading: a five-transform hand-tuned policy on small synthetic data, which the survey's logic predicts should recover most of the achievable robustness at one run of search, precisely because the sim-to-real gap on scanned well logs is legible and the perturbations that close it are the ones a human reads off the scans.

Limitations

This is a survey and inherits a survey's limits. It synthesises what the published augmentation-policy literature reports and does not re-implement or re-measure any of the methods it discusses; where it names numbers, those are the real facts of one VeerNet configuration, a five-transform hand-tuned stack over 15,000 synthetic multiclass instances at an 80/20 split with perspective and erasing bridging to 136,771 real scans, used as a worked illustration rather than a fresh head-to-head benchmark of learned against hand-tuned policies. We did not run a learned-policy search on the reference task, so the survey makes no measured claim about the exact robustness a search would recover on it; the assertion that the concession is small is read from the field's own cost-descent results, above all RandAugment's two-parameter reduction and TrivialAugment's no-search result, not from an ablation we recorded. The interactive exhibit's robustness fractions and its learned-search curve are an illustrative policy-design model and are flagged as such on the canvas; the true per-transform robustness contributions for this run were not measured in isolation. The survey also scopes itself to the best-known learned methods and the hand-set transforms treated as canonical in that period, and it stops at the end of the quarter in which it was written, so later augmentation and mixing methods the field has since explored are out of frame. A reader should take this as a map of when hand-tuning an augmentation stack beats searching for one, not as a substitute for running the comparison on their own task and distribution.

References

[1] Cubuk, E. D., Zoph, B., Mané, D., Vasudevan, V., and Le, Q. V. AutoAugment: Learning Augmentation Strategies from Data. CVPR (2019). Searches a discrete space of augmentation sub-policies with reinforcement learning. https://arxiv.org/abs/1805.09501

[2] Ho, D., Liang, E., Chen, X., Stoica, I., and Abbeel, P. Population Based Augmentation: Efficient Learning of Augmentation Policy Schedules. ICML (2019). Learns an augmentation schedule with population based training at a fraction of AutoAugment's cost. https://arxiv.org/abs/1905.05393

[3] Lim, S., Kim, I., Kim, T., Kim, C., and Kim, S. Fast AutoAugment. NeurIPS (2019). Uses density matching to find augmentation policies without repeated retraining of the child model. https://arxiv.org/abs/1905.00397

[4] Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. CVPR Workshops (2020). Collapses the policy search to two integer hyperparameters, the number of transforms and a shared magnitude. https://arxiv.org/abs/1909.13719

[5] Müller, S. G., and Hutter, F. TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation. ICCV (2021). Applies a single random transform at a random strength with no search, matching tuned policies. https://arxiv.org/abs/2103.10158

[6] Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. Random Erasing Data Augmentation. AAAI (2020). Randomly masks a rectangular region of the input, a hand-set transform that improves robustness to occlusion. https://arxiv.org/abs/1708.04896

[7] DeVries, T., and Taylor, G. W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv (2017). Masks a fixed-size square region during training, a single hand-set augmentation with no search. https://arxiv.org/abs/1708.04552

[8] Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS (2017). Randomises rendering parameters so a network trained on synthetic data transfers to real inputs. https://arxiv.org/abs/1703.06907

[9] Shorten, C., and Khoshgoftaar, T. M. A survey on Image Data Augmentation for Deep Learning. Journal of Big Data (2019). A broad survey of augmentation methods across geometric, photometric, and learned families. https://doi.org/10.1186/s40537-019-0197-0
