Transformer Architectures for Dense Prediction on Document Images

Abstract

This survey asks how transformer and attention-based backbones for dense prediction came to differ from the convolutional U-Net lineage, and which of those differences actually matter when the image is a document rather than a photograph. We organise the public literature available through the third quarter of 2023 into four moves: the import of the self-attention mechanism and the plain Vision Transformer into vision, the first segmentation transformers that treated a mask as a sequence or a set of queries, the pyramid and window variants that put back the multi-scale structure dense prediction depends on, and the dense-prediction and mask-classification heads that turned a sequence of tokens into a per-pixel output. We credit each contribution to its authors and characterise what it changed about the receptive field, the inductive bias, and the data appetite of the backbone. We then place our own configuration on that map. It is not a transformer: it is a residual convolutional encoder and decoder of five stages each, with exactly two self-attention refinement layers sitting on a 128-dimensional bottleneck, taking a single grayscale channel and emitting a three-class mask. We argue that this is a deliberate, defensible position rather than a timid one, because the property that makes a full transformer win on natural images, an appetite for scale that converts a weak inductive bias into a learned one, is the property a single-channel scarce-label document raster cannot pay for. Our raster well-log digitisation task is the test case, with synthetic segmentation sets of 2,000 binary and 15,000 multiclass instances on an 80/20 train and validation split, and a 0.51 peak multiclass IoU that we read as a label-budget ceiling rather than a missing-attention one. 
The short finding is that the lineage's most useful gift to a document-image problem is not the transformer backbone but the attention operator, used sparingly and placed at the bottleneck where a global receptive field is cheapest to buy.

Background and the two demands a document image makes

The convolutional encoder-decoder is the backbone every transformer alternative is measured against, so it is worth being precise about what it gives and what it withholds. The U-Net is a symmetric stack: a contracting path that strides down through the image building semantic depth, an expanding path that strides back up recovering resolution, and copy-and-concatenate skip connections that route high-resolution features around the bottleneck so the decoder can place a label exactly (Ronneberger et al., 2015). Two inductive biases are baked into that design. Locality says that a pixel's label depends mostly on its neighbourhood, which is why a small convolution kernel is enough. Hierarchy says that meaning is built by composing local features over many strides, which is why the receptive field grows only gradually, stage by stage. Both biases are usually correct and they are why a U-Net learns from few labels: the architecture already knows a great deal about images before it sees a single one.

The catch is the receptive field. A pure-convolution path widens its view one stride at a time, so a feature in the upper part of a tall input cannot influence a feature far below it until enough downsampling stages have stacked their fields together. On a document image that catch becomes acute, because document images press two demands harder than natural photographs do. The first is long thin structure: the information lives in features one or two pixels wide that run the length of the image, and relating a fragment at the top to its continuation at the bottom is exactly the long-range link a local kernel resolves last. The second is a scarce label budget: annotations on scientific documents are expensive, so the architecture cannot afford the data appetite that depth alone would demand. The attention operator speaks directly to the first demand, because self-attention relates every position to every other in one layer regardless of distance (Vaswani et al., 2017). The open question this survey turns on is whether speaking to the first demand is worth what it costs against the second.

Method: how we read the lineage onto our regime

This is a literature survey with a positional argument, not a controlled benchmark, and we are explicit about that boundary. We selected backbone families public on or before the survey quarter that recur in dense-prediction work, characterised each by what it changed about receptive field, inductive bias, and data appetite, and placed our own configuration on the same axes. The only measured quantities here are our own: a residual encoder of five striding stages, a decoder of five upsampling stages, two self-attention refinement layers on a 128-dimensional bottleneck, a single grayscale input channel, and a three-class output, trained on synthetic sets of 2,000 binary and 15,000 multiclass instances at an 80/20 split, reaching a 0.51 peak multiclass IoU. Everything we say about the comparative behaviour of the surveyed families is illustrative of how the literature characterises them, not a re-run on a shared dataset.
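The label budget implied by those figures is simple arithmetic, sketched below. The dataset sizes and the 80/20 ratio come from the text; the helper name `split_counts` is hypothetical, and nothing here says how the split was actually drawn.

```python
def split_counts(n_instances: int, train_fraction: float = 0.8) -> tuple[int, int]:
    """Return (train, validation) counts for a simple fractional split."""
    n_train = round(n_instances * train_fraction)
    return n_train, n_instances - n_train


# 2,000 binary and 15,000 multiclass instances at 80/20, as stated in the text.
binary_train, binary_val = split_counts(2_000)
multi_train, multi_val = split_counts(15_000)
print(binary_train, binary_val)  # 1600 400
print(multi_train, multi_val)    # 12000 3000
```

On these numbers the multiclass model sees 12,000 training masks, which is the budget any added attention layer has to be supervised from.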

We read the lineage as four moves, and credit each in turn. The first move imported the machinery. Self-attention arrived from sequence modelling as a mechanism that computes, for every token, a weighted sum over all tokens in the sequence, itself included, giving a one-layer global receptive field with no built-in notion of locality (Vaswani et al., 2017). The Vision Transformer then showed that cutting an image into a grid of patches, embedding each as a token, and feeding the sequence to a plain transformer encoder could match convolutional classifiers, provided the training set was large enough to teach the network the spatial priors that a convolution gets for free (Dosovitskiy et al., 2021). That proviso is the whole tension of this survey stated in one clause: the transformer trades a built-in inductive bias for an appetite for data.
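A minimal sketch of that operator may make the one-layer global field concrete. It is single-head scaled dot-product self-attention in plain Python, and it assumes identity query, key and value projections (a real transformer layer learns these); the function names and the toy tokens are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections). Returns (outputs, weights).
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # One score per token in the sequence, whatever its distance.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted sum of all value vectors.
        outputs.append(
            [sum(row[j] * tokens[j][c] for j in range(len(tokens))) for c in range(dim)]
        )
    return outputs, weights


# Four toy 2-dimensional tokens (hypothetical values).
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
toy_outputs, toy_weights = self_attention(toy_tokens)
```

Every row of `toy_weights` is strictly positive and sums to one, so each output mixes information from every position in a single layer. Nothing in the computation depends on where a token sits, which is the missing locality prior the paragraph describes.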

The second move made the transformer predict masks rather than labels. The detection transformer reframed object detection as set prediction, attaching a transformer encoder-decoder with learned object queries to a convolutional backbone and matching predictions to ground truth with a bipartite matching loss, which removed hand-designed anchors and showed a decoder of queries could carry spatial structure (Carion et al., 2020). SETR then took the most literal route to segmentation, treating it as sequence-to-sequence over a pure transformer encoder with a simple upsampling head, demonstrating that the convolutional encoder could be removed entirely (Zheng et al., 2021). Segmenter folded the query idea back into segmentation, decoding class masks with a small mask transformer over a ViT encoder (Strudel et al., 2021).
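The set-prediction idea rests on a one-to-one assignment between predictions and ground-truth items that minimises a total cost. The sketch below finds that assignment by exhaustive search over permutations, which is only practical for tiny examples; it is a stand-in for the Hungarian-style matching used in practice, not a reproduction of any paper's loss. The cost matrix is made up for illustration.

```python
from itertools import permutations


def match_predictions(cost):
    """Minimum-cost one-to-one assignment by brute force.

    cost[i][j] is the cost of assigning prediction i to ground-truth item j
    (square matrix). Returns (assignment, total) where assignment[i] is the
    ground-truth index matched to prediction i.
    """
    n = len(cost)
    best, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best, best_total = perm, total
    return list(best), best_total


# Hypothetical 3x3 matching costs (lower means a better match).
toy_cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
]
assignment, total = match_predictions(toy_cost)
print(assignment)  # [1, 0, 2]
```

Because each ground-truth item is claimed by exactly one prediction, duplicate predictions are penalised by the loss itself rather than by hand-designed post-processing.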

The third move noticed what the plain transformer had thrown away and put it back. A flat sequence of same-size patches has no pyramid, and dense prediction lives on a pyramid, so the next variants rebuilt multi-scale structure inside the transformer. Swin computed attention inside local windows that shift between layers, which restores locality, keeps cost linear in image area, and produces a hierarchy of resolutions like a convolutional backbone does (Liu et al., 2021). The pyramid vision transformer reached the same end by progressively shrinking the token grid so a transformer could serve as a drop-in dense-prediction backbone (Wang et al., 2021). SegFormer paired a hierarchical transformer encoder with a deliberately lightweight all-MLP decoder, landing at the efficient corner of the lineage and making clear that much of the gain was in the hierarchical encoder, not an elaborate head (Xie et al., 2021).
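The linear-versus-quadratic claim for windowed attention can be checked by counting query-key pairs. The sketch below assumes a window of 7 by 7 tokens and a token grid that divides evenly into windows; the grid sizes are hypothetical, and the count ignores projections and the shifting step.

```python
def global_pairs(height, width):
    """Query-key pairs when every token attends to every token."""
    n_tokens = height * width
    return n_tokens * n_tokens


def windowed_pairs(height, width, window=7):
    """Query-key pairs when each token attends only inside its own window.

    Assumption: the token grid divides evenly into window x window tiles.
    """
    if height % window or width % window:
        raise ValueError("grid must divide evenly into windows")
    return height * width * window * window


small = (56, 56)    # hypothetical token grid
large = (112, 112)  # same grid with both sides doubled (4x the area)

print(global_pairs(*small), global_pairs(*large))      # 9834496 157351936
print(windowed_pairs(*small), windowed_pairs(*large))  # 153664 614656
```

Quadrupling the area multiplies the global count by 16 and the windowed count by 4, which is the sense in which window attention stays linear in image area.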

The fourth move settled how the tokens become pixels. The dense-prediction transformer reassembled tokens from several transformer stages into image-like feature maps at multiple resolutions and fused them with a convolutional head, which is the cleanest statement of why a transformer can be a dense predictor at all and why a convolutional decoder is still useful at the end of one (Ranftl et al., 2021). In parallel, mask classification argued that per-pixel labelling is the wrong output abstraction and that predicting a set of binary masks each with a class is better, unifying semantic and instance segmentation under one query decoder (Cheng et al., 2021). Its successor sharpened the attention so each query attends only within its predicted region, which became the strong general segmentation transformer of the period (Cheng et al., 2022).
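How a set of class-labelled masks becomes a per-pixel map can be sketched in a few lines. The sketch follows the general recipe of weighting each query's mask by its class probabilities and taking the best class per pixel; it omits the "no object" class and any learned component, and all probabilities below are invented for illustration.

```python
def semantic_from_queries(class_probs, mask_probs):
    """Combine per-query class and mask probabilities into a label map.

    class_probs[q][c]: probability that query q carries class c.
    mask_probs[q][y][x]: probability that pixel (y, x) belongs to query q.
    Each pixel gets argmax over c of sum_q class_probs[q][c] * mask_probs[q][y][x].
    """
    n_queries = len(class_probs)
    n_classes = len(class_probs[0])
    height, width = len(mask_probs[0]), len(mask_probs[0][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            scores = [
                sum(class_probs[q][c] * mask_probs[q][y][x] for q in range(n_queries))
                for c in range(n_classes)
            ]
            row.append(max(range(n_classes), key=scores.__getitem__))
        labels.append(row)
    return labels


# Two hypothetical queries over a 2x3 image with three classes.
toy_class_probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
toy_mask_probs = [
    [[0.9, 0.9, 0.1], [0.9, 0.9, 0.1]],  # query 0 covers the left two columns
    [[0.1, 0.1, 0.9], [0.1, 0.1, 0.9]],  # query 1 covers the right column
]
print(semantic_from_queries(toy_class_probs, toy_mask_probs))  # [[0, 0, 1], [0, 0, 1]]
```

The class decision is made once per query and the spatial decision once per mask, which is what lets one decoder serve both semantic and instance outputs.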

Results: where two attention layers land on the map

The lineage above separates cleanly along one axis, the receptive field, and that separation is the result a document-image reader should carry away. A pure-convolution path grows its field locally and gradually; the attention operator collapses distance to a single layer. The instrument below makes that contrast walkable rather than asserted: a scanline sweeps across a five-stage encoder, a bottleneck holding two attention layers, and a five-stage decoder, reading out the effective receptive field for a convolution-only path against the same path once the bottleneck attention is allowed to act.

Interactive figure: a scanline explorer over an encoder-decoder dense-prediction backbone. A vertical scanline moves left to right across 5 striding encoder stages, a bottleneck carrying the 2 transformer attention-refinement layers at 128-dim feature depth, and 5 upsampling decoder stages. Two effective-receptive-field read-outs move with it: teal is a pure-convolution path whose field grows only locally, stage by stage, and orange is the same path once the bottleneck attention is allowed to act, which jumps the field to global the moment the scanline enters the attention band. The gap between the two curves at and after that band is the whole argument: attention buys a global receptive field at the bottleneck that a convolution stack would need many more striding stages to reach. The sourced quantities are the 5 encoder and 5 decoder stages, the 2 transformer attention layers, the 128 feature depth, and the single grayscale input / three-class mask framing; the receptive-field figures are computed from a standard stride-2 doubling rule and are illustrative of the mechanism, not measured kernel coverage from a trained run.
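One common reading of a stride-2 doubling rule is the standard receptive-field recurrence for a stack of strided convolutions, sketched below. The exact rule behind the figure is not spelled out, so the 3-wide kernel, one convolution per stage, and the 1,024-pixel image length are all assumptions made for illustration.

```python
def conv_receptive_fields(n_stages=5, kernel=3, stride=2):
    """Theoretical receptive field, in input pixels along one axis, after each
    stage of a stack of strided convolutions.

    Recurrence: field += (kernel - 1) * jump, then jump *= stride.
    Assumption: one kernel x kernel convolution per stage.
    """
    field, jump, fields = 1, 1, []
    for _ in range(n_stages):
        field += (kernel - 1) * jump
        jump *= stride
        fields.append(field)
    return fields


def with_bottleneck_attention(fields, image_length):
    """Same encoder fields, followed by a global field once attention acts."""
    return fields + [image_length]


image_length = 1024  # hypothetical input length in pixels
teal = conv_receptive_fields()
orange = with_bottleneck_attention(teal, image_length)
print(teal)    # [3, 7, 15, 31, 63]
print(orange)  # [3, 7, 15, 31, 63, 1024]
```

Under these assumptions the convolution-only field roughly doubles per stage and reaches 63 pixels after five stages, far short of the image length, while the attention band lifts it to the full length in one step.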

Two things are visible the moment the scanline crosses the bottleneck. Up to it, the two read-outs agree: both paths have only the local field that the five striding stages have accumulated, and that field is a fraction of the image, large enough for short links but short of the full length of a thin trace. At the bottleneck the orange path jumps to global while the teal path does not, and it stays global through the decoder because the upsampling path carries the globally informed bottleneck features back up to full resolution, where the skip connections add high-resolution detail to them. That single jump is the entire reason to spend any attention at all on a convolutional backbone, and it is the reason we spend it exactly there. The bottleneck is where the feature map is smallest, so a global all-to-all operation is cheapest in tokens, and it is where the convolutional field, although at its widest, still falls short of global. Two layers are enough to perform the relation; more would add parameters that a 2,000-to-15,000-instance training budget cannot supervise.
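The "cheapest in tokens" claim can be made concrete with a rough count. The sketch below assumes a hypothetical 512 by 512 input and stride-2 stages, and approximates attention cost as tokens squared times feature depth (the score-matrix products only); the 128 depth and the five stages are the figures given in the text.

```python
def attention_cost(height, width, n_downsamples, dim=128, stride=2):
    """Token count and a rough score-matrix cost for global self-attention
    applied to the feature map after n_downsamples strided stages.

    Cost is approximated as tokens * tokens * dim; projections are ignored.
    """
    scale = stride ** n_downsamples
    tokens = (height // scale) * (width // scale)
    return tokens, tokens * tokens * dim


# Hypothetical 512x512 input raster.
bottleneck_tokens, bottleneck_cost = attention_cost(512, 512, n_downsamples=5)
full_tokens, full_cost = attention_cost(512, 512, n_downsamples=0)

print(bottleneck_tokens, full_tokens)  # 256 262144
print(full_cost // bottleneck_cost)    # 1048576
```

Under these assumptions, attention over the bottleneck's 256 tokens is about a million times cheaper than the same operation at input resolution, which is why the global link is bought there and nowhere else.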

Read against the 0.51 peak multiclass IoU, the per-class breakdown supports reading the ceiling as a data limit rather than an attention one. The network learns the background mask almost perfectly and struggles on the thin curve classes, which is the signature of a feature extractor that has enough global context to know where the curves roughly are but too few labelled examples to pin their exact one-pixel extent. A larger or fuller transformer would not change the size of the labelled set; it would only enlarge the appetite the set already fails to feed.
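A toy calculation shows why thin classes are so sensitive to exact placement. The sketch below computes per-class IoU on a made-up three-class mask in which one single-pixel-wide trace is predicted one column off; the masks and helper names are hypothetical and do not reproduce the reported 0.51.

```python
def per_class_iou(pred, target, n_classes=3):
    """Intersection-over-union for each class between two integer label maps."""
    ious = []
    for cls in range(n_classes):
        inter = union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter += int(p == cls and t == cls)
                union += int(p == cls or t == cls)
        ious.append(inter / union if union else float("nan"))
    return ious


def column_mask(height, width, columns):
    """Background (class 0) mask with one-pixel-wide vertical traces.

    columns maps a class id to the column index of its trace.
    """
    mask = [[0] * width for _ in range(height)]
    for cls, col in columns.items():
        for row in mask:
            row[col] = cls
    return mask


# Hypothetical 8x64 masks: class 1 is predicted one column to the right.
target = column_mask(8, 64, {1: 20, 2: 40})
pred = column_mask(8, 64, {1: 21, 2: 40})
background_iou, shifted_iou, exact_iou = per_class_iou(pred, target)
print(round(background_iou, 3), shifted_iou, exact_iou)  # 0.968 0.0 1.0
```

The shifted trace is located almost correctly yet scores zero, while the background barely moves. That asymmetry is the pattern described above: knowing roughly where a curve is does not earn IoU on a one-pixel class.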

Discussion: the right amount of attention for a document raster

The honest synthesis is that the transformer lineage's progress on dense prediction has been, to a large degree, progress at converting scale into the spatial priors a convolution simply assumes. The plain Vision Transformer needs a great deal of data precisely because it discarded locality and hierarchy and must relearn them; the pyramid and window variants win efficiency and multi-scale structure back by smuggling those same convolutional priors into the attention, which is itself an admission of how useful the priors are when data is finite (Liu et al., 2021). For natural images at internet scale the trade pays. For a single-channel, scarce-label, document-style raster it runs the wrong way: there is no scale to convert, so a backbone that throws away the priors starts behind and stays there.

That is why our configuration is a convolutional encoder-decoder with attention placed at the bottleneck rather than a transformer with convolution bolted on. We keep the inductive biases that earn their keep on scarce data and borrow only the one thing the convolution withholds, the long-range link, and we borrow it at the cheapest point in the network. This is not a new architecture and we do not claim it as one. The self-attention operator is credited to its origin (Vaswani et al., 2017), the encoder-decoder body to the convolutional lineage (Ronneberger et al., 2015), and the idea that a convolutional head is still worth keeping at the end of an attention pipeline to the dense-prediction transformer (Ranftl et al., 2021). What this work contributes is the sizing decision, two attention layers at a 128-dimensional bottleneck, and the synthetic pipeline that makes it trainable, both of which are choices about how much of the lineage to import rather than additions to it. Knowing the map is what makes the choice defensible: it tells you that on this data the transformer's strength is a liability and its single transferable component is the operator, so you take the operator and leave the backbone.

Limitations

The survey carries the limitations of its form, and naming them is more honest than letting the instrument imply more than it can support. The receptive-field figures the scanline reads out are computed from a standard stride-2 doubling rule, not measured from a trained network's effective field; they illustrate the mechanism the literature describes rather than a coverage statistic from any of our runs, and the only measured quantities here are our stage counts, the two attention layers, the 128-dimensional depth, the dataset sizes, the split, and the 0.51 peak IoU. The contrast we draw is deliberately one-dimensional: receptive field is the axis on which attention and convolution differ most sharply, but it is not the only axis that decides a backbone, and parameter count, memory footprint, throughput on long inputs, and the availability of pretraining all move the choice independently and are not shown. The family characterisations are qualitative, drawn from how each paper reports its method, not from a re-run on a single shared dataset with a fixed split and metric, so a coordinate on our map should not be read as a benchmark number. Finally, the survey is period-bounded to the third quarter of 2023 and to the families that had stabilised by then; the dense-prediction transformer literature moved quickly, and a later reading would have to fold in methods that postdate this quarter.

What this survey maps

The convolutional U-Net wins on scarce-label document images because its locality and hierarchy priors mean it already knows a lot about images before training; a plain transformer discards those priors and must relearn them from scale it does not have here.
The transformer lineage's real differentiator is the receptive field: self-attention relates every position to every other in one layer, while a convolution path grows its field only locally, stage by stage, and resolves long-range links last.
The pyramid and window variants (Swin, PVT, SegFormer) win back efficiency and multi-scale structure by smuggling convolutional priors back into attention, which is itself evidence of how valuable those priors are when data is finite.
Our configuration is not a transformer: a residual encoder-decoder of five stages each with just two self-attention layers at a 128-dimensional bottleneck, placed where a global field is cheapest and where the convolutional field, though at its widest, still falls short of global.
The 0.51 peak multiclass IoU reads as a label-budget ceiling, not a missing-attention one: the network locates the curves but lacks the labelled examples to pin their one-pixel extent, and a fuller transformer would enlarge the appetite, not the dataset.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, et al. Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762

[2] O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. https://arxiv.org/abs/1505.04597

[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). ICLR 2021. https://arxiv.org/abs/2010.11929

[4] N. Carion, F. Massa, G. Synnaeve, et al. End-to-End Object Detection with Transformers (DETR). ECCV 2020. https://arxiv.org/abs/2005.12872

[5] S. Zheng, J. Lu, H. Zhao, et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers (SETR). CVPR 2021. https://arxiv.org/abs/2012.15840

[6] R. Strudel, R. Garcia, I. Laptev, C. Schmid. Segmenter: Transformer for Semantic Segmentation. ICCV 2021. https://arxiv.org/abs/2105.05633

[7] Z. Liu, Y. Lin, Y. Cao, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021. https://arxiv.org/abs/2103.14030

[8] W. Wang, E. Xie, X. Li, et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. ICCV 2021. https://arxiv.org/abs/2102.12122

[9] E. Xie, W. Wang, Z. Yu, et al. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. NeurIPS 2021. https://arxiv.org/abs/2105.15203

[10] R. Ranftl, A. Bochkovskiy, V. Koltun. Vision Transformers for Dense Prediction (DPT). ICCV 2021. https://arxiv.org/abs/2103.13413

[11] B. Cheng, A. G. Schwing, A. Kirillov. Per-Pixel Classification Is Not All You Need for Semantic Segmentation (MaskFormer). NeurIPS 2021. https://arxiv.org/abs/2107.06278

[12] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, R. Girdhar. Masked-attention Mask Transformer for Universal Image Segmentation (Mask2Former). CVPR 2022. https://arxiv.org/abs/2112.01527
