Weak Supervision at Scale: Programmatic Labeling Since Snorkel

Abstract

Weak supervision replaces the hand-drawn label with a program: instead of an annotator marking each example, an engineer writes labeling functions that vote noisily on many examples at once, and a label model reconciles those votes into probabilistic training targets. This survey reads the programmatic-weak-supervision literature as one arc rather than a list, from the data-programming formulation that started it, through the labeling-function abstraction that made it usable, the label-model theory that made the votes trustworthy, the synthesis methods that tried to write the functions automatically, and the benchmark that finally let the field measure itself. We credit that public lineage as it stood in its period, then apply it to the question that matters for our own work: a labeling function needs a signal it can encode as a rule, and that presupposes a feature space where weak rules exist. Our raster well-log digitisation regime is the test case, where the hand-labelled floor is a 2,000-instance binary segmentation set and the route that actually produced usable training data was procedural synthesis, with 20,000 logs generated for the multiclass task and a 15,000-curve final set, none of it hand-masked. The finding is that weak supervision and synthesis answer two different shapes of label scarcity. Weak supervision is the right tool when ground truth is expensive but writable as rules over rich features; synthesis is the right tool when the data is renderable and the labels can be manufactured alongside the pixels. Our problem is the second shape, and the instrument below makes the divergence between the two routes visible by sweeping labeling-function coverage against the synthetic ceiling.

Background: from data programming to a usable abstraction

The premise of programmatic weak supervision is older than the name. Practitioners had always written heuristics to bootstrap labels; what was missing was a principled way to combine many noisy heuristics without trusting any one of them. Data programming supplied it by treating each heuristic as a labeling function with an unknown accuracy and modelling the labels as latent variables a generative model could recover, so that disagreement among functions became information rather than noise (Ratner et al., 2016). The contribution that turned the formulation into a tool was Snorkel, which gave the labeling function a concrete interface, ran the generative label model over the function outputs, and trained a downstream discriminative model on the resulting probabilistic labels, all in one loop a domain expert could iterate (Ratner et al., 2018).
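To make the labeling-function interface concrete, the following is a minimal sketch in plain Python rather than the Snorkel API. The toy spam task, the three function names, and the convention that -1 means abstain are all assumptions introduced for illustration. Each function votes or abstains on an example, and the votes are stacked into the label matrix that a label model would consume.

```python
from typing import Callable, List

# Assumed vote encoding for this sketch: -1 abstains, 0 and 1 are the classes.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical labeling functions for a toy spam task (not from the source).
def lf_contains_offer(text: str) -> int:
    return POSITIVE if "offer" in text.lower() else ABSTAIN

def lf_short_reply(text: str) -> int:
    return NEGATIVE if len(text.split()) <= 3 else ABSTAIN

def lf_many_exclamations(text: str) -> int:
    return POSITIVE if text.count("!") >= 2 else ABSTAIN

LabelingFunction = Callable[[str], int]

def apply_lfs(lfs: List[LabelingFunction], texts: List[str]) -> List[List[int]]:
    """Return the label matrix: one row per example, one column per function."""
    return [[lf(t) for lf in lfs] for t in texts]

def coverage(label_matrix: List[List[int]]) -> List[float]:
    """Fraction of examples on which each function does not abstain."""
    n = len(label_matrix)
    m = len(label_matrix[0])
    return [sum(row[j] != ABSTAIN for row in label_matrix) / n for j in range(m)]

texts = [
    "Limited offer!! Click now",
    "ok thanks",
    "Special OFFER inside",
    "See you at the meeting tomorrow morning",
]
lfs = [lf_contains_offer, lf_short_reply, lf_many_exclamations]
L = apply_lfs(lfs, texts)
cov = coverage(L)
```

The third example draws a positive vote from one function and a negative vote from another. That disagreement is exactly what the label model is meant to treat as information about the functions' accuracies.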

Two design commitments in that work shaped everything after it. The first is that the unit of human effort moves from the example to the function: an expert who would have labelled a thousand rows instead writes a dozen functions, each covering a slice of the data. The second is that the functions are allowed to be wrong, correlated, and incomplete, because the label model exists precisely to estimate their accuracies and dependencies and weight them accordingly. Both commitments are what make the route scale, and both are what make it fragile when the data has no slices a rule can name.

How the label model learned to trust noisy votes

The early generative label model assumed the labeling functions were conditionally independent given the true label, which is convenient and usually false. Two lines of work moved the label model beyond that baseline. One relaxed the assumption by learning the dependency structure among functions directly from their agreement patterns, so that two functions keying off the same underlying signal would not be double-counted (Bach et al., 2017), and a later refinement recovered those dependencies more robustly as the number of functions grew (Varma et al., 2019). The other generalised the label model to the multi-task setting, where functions vote on related label schemas at once and the model exploits the structure between tasks, which is the version that scaled to organisational deployments (Ratner et al., 2019).
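The double-counting problem can be seen in a few lines. The sketch below is a naive-Bayes style vote combination under the conditional-independence assumption, with accuracies treated as known. It is an illustration, not the label model from any of the cited papers. The function name, the 0.7 accuracy, and the choice to ignore abstentions are assumptions.

```python
import math
from typing import Sequence

def posterior_positive(votes: Sequence[int],
                       accuracies: Sequence[float],
                       prior: float = 0.5) -> float:
    """P(y = +1 | votes) if votes are conditionally independent given y.

    votes are in {+1, -1, 0}, where 0 means abstain. accuracies[i] is the
    assumed probability that function i is correct when it votes.
    Abstentions are ignored, which assumes they carry no label information.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for v, acc in zip(votes, accuracies):
        if v == 0:
            continue
        # A +1 vote adds the function's log-odds of being right; -1 subtracts it.
        log_odds += v * math.log(acc / (1.0 - acc))
    return 1.0 / (1.0 + math.exp(-log_odds))

# One function with 70% accuracy votes positive.
single = posterior_positive([+1], [0.7])

# The same signal wrapped in two functions: the independence assumption
# counts the evidence twice and overstates the confidence.
duplicated = posterior_positive([+1, +1], [0.7, 0.7])
```

With these inputs the single vote gives a posterior of 0.7, while the duplicated vote gives 49/58, about 0.845, even though no new evidence was added. Modelling the dependency between the two functions is what prevents that inflation.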

The most practically consequential change was about speed rather than expressiveness. Estimating the label model by iterative methods is slow when there are many functions and many examples, and a closed-form estimator based on triplet methods made the label-model fit close to instantaneous, which removed the inner-loop latency that had made rapid function iteration painful (Fu et al., 2020). Taken together these advances meant that by 2020 the label model was no longer the bottleneck or the source of doubt; the open question had moved upstream, to where the labeling functions themselves come from.
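The closed-form idea can be sketched on simulated votes. The code below illustrates only the moment identity behind triplet estimation, assuming three conditionally independent functions that never abstain, labels in {-1, +1}, symmetric error rates, and accuracies above chance. It is not the FlyingSquid implementation, and the helper names and simulation settings are hypothetical. Under those assumptions E[l_i l_j] = a_i a_j with a_i = E[l_i y], so each a_i follows from three observable agreement rates without any iteration.

```python
import math
import random

def simulate_votes(accuracies, n, seed=0):
    """Draw n latent labels and one vote per function (no abstains)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        rows.append([y if rng.random() < p else -y for p in accuracies])
    return rows

def second_moment(rows, i, j):
    """Empirical E[l_i * l_j], observable without any true labels."""
    return sum(r[i] * r[j] for r in rows) / len(rows)

def triplet_accuracy(rows, i, j, k):
    """Estimate |E[l_i * y]| from the pairwise moments of functions i, j, k."""
    m_ij = second_moment(rows, i, j)
    m_ik = second_moment(rows, i, k)
    m_jk = second_moment(rows, j, k)
    return math.sqrt(abs(m_ij * m_ik / m_jk))

true_acc = [0.9, 0.8, 0.7]  # illustrative values, used only to simulate votes
rows = simulate_votes(true_acc, n=50_000)

a_hat = [
    triplet_accuracy(rows, 0, 1, 2),
    triplet_accuracy(rows, 1, 0, 2),
    triplet_accuracy(rows, 2, 0, 1),
]
# Convert E[l_i * y] = 2p - 1 back to an accuracy p, taking the positive root
# (the better-than-chance assumption resolves the sign ambiguity).
p_hat = [(a + 1) / 2 for a in a_hat]
```

The estimator never sees the latent labels: it uses only how often pairs of functions agree. The cost is one pass over the votes, which is why the fit is fast compared with iterative estimation.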

When the functions write themselves, and when they cannot

Writing labeling functions by hand still requires an expert who knows which weak rules carry signal. Three responses tried to reduce that burden. Snuba synthesised labeling functions automatically from a small labelled set, generating candidate heuristics over the feature representation and selecting those that improved coverage and accuracy, which turns a handful of labelled examples into a much larger weakly labelled set without a human writing each rule (Varma and Re, 2018). A complementary direction folded the rules into the loss rather than the label, learning jointly from a few labelled exemplars and the rules that generalise them so that a rule and the examples it implies reinforce each other during training (Awasthi et al., 2020). A further step combined deep representations with probabilistic logic so the system could propose its own weak supervision and refine it, narrowing the human role to seeding the process (Lang and Poon, 2021).
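The generate-then-select idea can be sketched as follows. This is a deliberately reduced illustration, assuming a single scalar feature, threshold heuristics with an abstain band, and selection by accuracy on the labelled seed alone. The published Snuba system is more elaborate, and every name, threshold, and data point here is hypothetical.

```python
from typing import Callable, List, Tuple

ABSTAIN = 0  # assumed encoding for this sketch: votes are +1, -1, or 0

def make_stump(threshold: float, margin: float) -> Callable[[float], int]:
    """Candidate heuristic: vote by threshold, abstain inside the margin band."""
    def lf(x: float) -> int:
        if x >= threshold + margin:
            return 1
        if x <= threshold - margin:
            return -1
        return ABSTAIN
    return lf

def score(lf: Callable[[float], int],
          seed: List[Tuple[float, int]]) -> Tuple[float, float]:
    """Accuracy over the points the heuristic votes on, and its coverage."""
    votes = [(lf(x), y) for x, y in seed]
    voted = [(v, y) for v, y in votes if v != ABSTAIN]
    if not voted:
        return 0.0, 0.0
    accuracy = sum(v == y for v, y in voted) / len(voted)
    return accuracy, len(voted) / len(seed)

def synthesise(seed, thresholds, margin, min_accuracy):
    """Keep the candidates that are accurate enough on the labelled seed."""
    kept = []
    for t in thresholds:
        accuracy, cov = score(make_stump(t, margin), seed)
        if accuracy >= min_accuracy:
            kept.append((t, accuracy, cov))
    return kept

# Tiny hypothetical labelled seed: (feature value, label).
seed = [(0.1, -1), (0.2, -1), (0.4, -1), (0.5, 1), (0.7, 1), (0.9, 1)]
kept = synthesise(seed, thresholds=[0.25, 0.45, 0.75], margin=0.06, min_accuracy=0.9)
```

Only the threshold at 0.45 survives selection, and it abstains on the two seed points nearest the boundary. The sketch works because a scalar feature with a usable threshold already exists, which is the precondition these methods depend on.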

Every one of these automatic methods shares a precondition that is easy to miss and decisive for us. They synthesise functions over a feature representation in which weak rules exist and can be discovered, whether keyword patterns in text, attribute thresholds in tabular records, or distances in an embedding. The method finds rules; it does not invent the feature space in which rules are findable. For a dense per-pixel segmentation task over a printed image, there is no obvious feature on which a labeling function votes for the right curve, because the answer is the spatial mask itself rather than a class derivable from named attributes. This is the seam where the survey stops being a catalogue and becomes a decision about our own data.

Method: locating our regime on the weak-supervision map

We did not deploy a Snorkel pipeline on the well-log task and measure it against synthesis; that would be a controlled study, and this is a survey with a positional claim. What we did is place our true label economics on the same axis the weak-supervision literature implicitly optimises, which is the multiplier between human effort and labelled volume, and reason about which routes have support there. The relevant figures are the engagement's own. The hand-labelled floor is a 2,000-instance binary segmentation set, each instance a mask a person drew. The route that actually produced training data at scale was procedural synthesis: a renderer generated 20,000 logs for the multiclass setting and the final training set settled at 15,000 synthetic curves, every one of them carrying a pixel-exact mask for free because the generator emitted the image and the label together.
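The phrase "emitted the image and the label together" can be illustrated with a toy renderer. The sketch below is hypothetical and far simpler than any production generator. The grid size, the sine-shaped trace, and the grid-line clutter are assumptions chosen only to show that the mask is written in the same step as the pixel.

```python
import math
import random
from typing import List, Tuple

Grid = List[List[int]]

def render_log(width: int, height: int, seed: int) -> Tuple[Grid, Grid]:
    """Render one synthetic curve and its pixel-exact mask in the same pass.

    Hypothetical toy renderer: depth runs down the rows, the curve value
    sets the column, and horizontal grid lines act as clutter.
    """
    rng = random.Random(seed)
    image = [[0] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]

    # Clutter: grid lines are drawn into the image but never into the mask.
    for row in range(0, height, 4):
        for col in range(width):
            image[row][col] = 1

    amplitude = rng.uniform(0.2, 0.4) * width
    phase = rng.uniform(0.0, 2.0 * math.pi)
    for row in range(height):
        col = int(width / 2 + amplitude * math.sin(phase + row / 3.0))
        col = min(max(col, 0), width - 1)
        image[row][col] = 1  # the pixel the model sees
        mask[row][col] = 1   # the label, written at the same moment

    return image, mask

# Five synthetic (image, mask) pairs with no hand annotation involved.
dataset = [render_log(32, 24, seed=s) for s in range(5)]
```

Scaling the range in the last line is the only cost of producing more labelled instances, which is the sense in which the masks come for free.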

The weak-supervision route, had we taken it, would multiply the human seed by labeling-function coverage rather than draw masks one by one. That is the mechanic the instrument models: a fixed human seed of the same order as our 2,000-instance floor, scaled by the average number of unlabelled instances each function reaches. Holding the synthetic figures fixed as sourced anchors, we sweep that coverage to see where weak supervision would land relative to both the hand floor and the 15,000-curve synthetic ceiling. The point is not to award a winner on a number we did not measure, but to show the shape of the two routes side by side, so the reader can see that they diverge because they solve different problems and not because one is uniformly better.
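The seed-times-coverage mechanic reduces to one multiplication, sketched below. The three anchor counts are the sourced figures from the text. The seed value of 2,000 is an assumption, chosen only because the text describes the seed as being of the same order as the hand floor. The function names and sweep points are likewise illustrative.

```python
HAND_FLOOR = 2_000        # sourced: hand-labelled binary segmentation set
SYNTHETIC_LOGS = 20_000   # sourced: generated logs for the multiclass task
SYNTHETIC_FINAL = 15_000  # sourced: final synthetic curve set

# Assumption: a seed of the same order as the hand floor, as the text describes.
ASSUMED_SEED = 2_000

def weak_supervision_yield(seed: int, coverage: float) -> float:
    """Illustrative yield model: human seed multiplied by average coverage."""
    return seed * coverage

def coverage_to_match(target: int, seed: int) -> float:
    """Coverage the model would need for the yield to equal a target volume."""
    return target / seed

# Sweep the coverage lever and record where the modelled yield lands.
sweep = {c: weak_supervision_yield(ASSUMED_SEED, c) for c in (0.5, 1, 2, 5, 7.5, 10)}

needed_for_final = coverage_to_match(SYNTHETIC_FINAL, ASSUMED_SEED)
needed_for_logs = coverage_to_match(SYNTHETIC_LOGS, ASSUMED_SEED)
```

Under the assumed seed, a coverage of 7.5 matches the 15,000-curve set and a coverage of 10 matches the 20,000 generated logs. The arithmetic says nothing about whether functions with that coverage can be written for a dense segmentation target.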

Results: where labeling functions land against synthesis

The instrument is a label-yield stack that contrasts how three sourcing routes convert human effort into usable training labels. Hand annotation is a flat floor: the binary segmentation set is 2,000 instances drawn one mask at a time, and no lever moves it. Weak supervision in the labeling-function lineage multiplies a small human seed by the average coverage of those functions, so the labelled volume rises as the analyst broadens functions rather than drawing masks. Procedural synthesis is the route this engagement actually used, a renderer that manufactures images and their masks together, which is why the multiclass route reached 20,000 generated logs and a 15,000-curve final training set with zero hand masks. Drag the coverage lever to sweep how many unlabelled instances each labeling function reaches and watch where weak supervision lands between the hand floor and the synthetic ceiling. The 2,000-instance hand floor, the 20,000 synthetic logs, and the 15,000 final curves are sourced figures; the weak-supervision yield is an illustrative seed-times-coverage model and is flagged as such.

The stack carries one argument, and the lever lets the reader pressure it. Hand annotation is the flat bar that no control moves: the binary set is 2,000 instances and stays 2,000, because the only way to grow it is to draw more masks. Procedural synthesis is the tall bar fixed at the sourced figures, 20,000 generated logs with the 15,000-curve final set marked as a reference line, and it is tall precisely because the renderer pays no per-label cost. Weak supervision is the bar the coverage lever drives. At low coverage, where each labeling function reaches only a few instances, the route barely clears the hand floor and the human is better off drawing masks. As coverage rises, the bar climbs quickly, which is the genuine appeal of the method: a single good function can label thousands of instances, and that is how Snorkel-style pipelines turn a dozen functions into a large training set in domains where such functions exist.

Read against our two anchors, the divergence is the result. Weak supervision can in principle reach the synthetic ceiling, but only by assuming a coverage that presupposes the very thing our data lacks, namely functions that vote correctly on raw log pixels. The lever makes the assumption visible: pushing coverage high enough to match the 15,000-curve set is arithmetically easy and semantically unjustified for a dense segmentation target, whereas synthesis reaches the same volume by construction and with exact masks. The honest reading of the chart is not that weak supervision is weak; it is that the multiplier it depends on is large in text and tabular settings and undefined in ours.

Discussion: two answers to two different scarcities

The cleanest way to summarise a decade of programmatic weak supervision is that it industrialised a specific trade: spend expert time writing noisy rules instead of drawing exact labels, and let a label model convert the noise into usable targets. The field's progress was real and is well credited above, and the WRENCH benchmark deserves particular mention for finally giving the area a common yardstick across datasets and label models, which is what let the survey literature speak about the field as a whole rather than method by method (Zhang et al., 2021). The synthesising survey that consolidated the subfield is the right entry point for anyone choosing among these methods, and it frames the same families this piece walks through (Zhang et al., 2022).

What that body of work also clarifies, by where its benchmarks concentrate, is the boundary of its applicability. Weak supervision shines where labels are a classification over features a human can describe in code: relation extraction, document classification, entity tagging, the industrial text and log settings the deployment papers report (Bach et al., 2019). Our task is not that shape. The label is a dense mask, the signal lives in pixel geometry rather than in named attributes, and the data is renderable, which means the cheapest way to a correct label is to generate the label with the image. Our own work therefore sits at a corner of this map that the weak-supervision literature does not target: a problem with only a small hand-drawn ground-truth set, no feature space for weak rules, and a generator that makes labels free. The two routes are complementary rather than competing, and choosing between them is a question about the shape of your scarcity, not about which method is more advanced.

Limitations

This is a survey with a positional argument, and three limits bound it. First, we did not run a labeling-function pipeline on the well-log task, so the weak-supervision bar in the instrument is an illustrative seed-times-coverage model rather than a measured yield; the only measured figures are our own, the 2,000-instance hand floor and the 20,000 and 15,000 synthetic counts. Second, the single axis the instrument uses, the multiplier between human effort and labelled volume, deliberately compresses a richer comparison: label quality, the downstream model's tolerance for noisy targets, and the cost of writing good functions all matter and are not on the chart. Third, the survey is period-bounded to mid-2023 and to the programmatic-weak-supervision lineage that had stabilised by then; it does not cover the prompted and model-generated labeling that began to blur the line between a labeling function and a model around this time, which a later reading would have to fold in.

References

[1] A. Ratner, C. De Sa, S. Wu, D. Selsam, C. Re. Data Programming: Creating Large Training Sets, Quickly. NeurIPS 2016. https://arxiv.org/abs/1605.07723

[2] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Re. Snorkel: Rapid Training Data Creation with Weak Supervision. VLDB 2018. https://arxiv.org/abs/1711.10160

[3] P. Varma, C. Re. Snuba: Automating Weak Supervision to Label Training Data. VLDB 2018. https://www.vldb.org/pvldb/vol12/p223-varma.pdf

[4] A. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey, C. Re. Training Complex Models with Multi-Task Weak Supervision (Snorkel MeTaL). AAAI 2019. https://arxiv.org/abs/1810.02840

[5] S. H. Bach, D. Rodriguez, Y. Liu, et al. Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale. SIGMOD 2019. https://arxiv.org/abs/1812.00417

[6] D. Fu, M. Chen, F. Sala, S. Hooper, K. Fatahalian, C. Re. Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods (FlyingSquid). ICML 2020. https://arxiv.org/abs/2002.11955

[7] A. Awasthi, S. Ghosh, R. Goyal, S. Sarawagi. Learning from Rules Generalizing Labeled Exemplars (ImplyLoss). ICLR 2020. https://arxiv.org/abs/2004.06025

[8] P. Varma, F. Sala, A. He, A. Ratner, C. Re. Learning Dependency Structures for Weak Supervision Models. ICML 2019. https://arxiv.org/abs/1903.05844

[9] J. Zhang, Y. Yu, Y. Li, Y. Wang, Y. Yang, M. Yang, A. Ratner. WRENCH: A Comprehensive Benchmark for Weak Supervision. NeurIPS 2021 Datasets and Benchmarks. https://arxiv.org/abs/2109.11377

[10] J. Zhang, C.-Y. Hsieh, Y. Yu, C. Zhang, A. Ratner. A Survey on Programmatic Weak Supervision. arXiv preprint 2022. https://arxiv.org/abs/2202.05433

[11] H. Lang, H. Poon. Self-Supervised Self-Supervision by Combining Deep Learning and Probabilistic Logic. AAAI 2021. https://arxiv.org/abs/2103.12930

[12] S. H. Bach, B. He, A. Ratner, C. Re. Learning the Structure of Generative Models without Labeled Data. ICML 2017. https://arxiv.org/abs/1703.00854