Lightweight Attention for Single-Curve Digitisation with CBAM

Attention has a reputation for being expensive, and on a small U-Net trained to lift a single curve off a scanned log that reputation is enough to scare people off it entirely. It should not. The Convolutional Block Attention Module of Woo et al. bolts channel and spatial attention onto any feature map for almost nothing: a two-layer MLP bottleneck on the channels and a single convolution on a pooled spatial map. At a 128-dim feature depth the whole block adds a few thousand parameters, a rounding error against the backbone, while giving the network a way to decide which channels and which pixels matter before it commits to a mask. This is a practitioner's primer on how the two knobs work, what they cost, and how to set them when your target is one thin curve.

by Tarry Singh, Narendra Patwardhan · 14 min read
EarthScan insight

There is a quiet assumption baked into a lot of segmentation work, which is that attention is a luxury you buy once you have already paid for a large model. Transformers made attention synonymous with scale, and the mental accounting that came with them said that letting a network reweight its own features is something you do when you have compute to burn. On a small convolutional backbone trained to recover a single curve from a scanned well log, that assumption quietly steers people away from one of the cheapest, most useful tools available to them. The Convolutional Block Attention Module, introduced by Woo and colleagues in 2018, is attention that costs almost nothing, and it slots into a U-Net without rearchitecting anything [1]. This piece is a primer on how it works, what its two knobs actually buy, and how to reason about its cost when the thing you care about is one thin trace on a noisy raster.

The reweighting a plain convolution cannot do

Start with what a convolution does and does not do on its own. A convolution mixes a local neighbourhood of pixels across all the input channels and produces a stack of feature maps. Every channel is treated identically by the next layer in the sense that the network has not, at that point, expressed any opinion about which channels are informative for this particular image and which are noise. A feature map that lights up on grid lines is weighted the same as one that lights up on the curve, until later layers slowly learn to discount the first. Spatially, the same flatness holds: the convolution has no built-in way to say "the interesting content is in this band of the image and the rest is margin." It treats every location with the same kernel and lets the loss sort it out over many epochs.

On a clean photograph that is fine, because the signal is dense and roughly uniform. On a scanned log it is wasteful. The image is mostly background, the foreground is a single one-pixel-wide trace winding through a field of grid lines, ink bleed, and scanner noise, and the channels that matter for finding that trace are a small subset of the ones the backbone produces. What you want is a mechanism that lets the network, cheaply and early, say "these channels are the ones carrying the curve" and "this strip of the image is where the curve lives," so the later layers spend their capacity refining a mask rather than rediscovering where to look. That is precisely what attention provides, and CBAM provides it in the smallest form that is still useful.

Two modules, applied in sequence

CBAM is two small attention modules stacked one after the other, and the order matters less than the fact that each answers a different question [1]. The first is channel attention. It asks, for a given feature map of shape channels by height by width, which channels deserve to be kept and which suppressed. It does this by pooling each channel down to a single number, once with global average pooling and once with global max pooling, so each channel is summarised by its average response and its strongest response. Those two summary vectors are passed through a tiny shared multilayer perceptron, added together, and squashed by a sigmoid into a per-channel gate between zero and one. The original feature map is then multiplied channel-wise by that gate. Because the gate never exceeds one, nothing is literally amplified: channels the network has learned are useful pass through close to full strength, and channels it has learned are noise are scaled toward zero. The whole thing is a soft, learned, content-dependent feature selector.
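The channel branch described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `channel_attention`, the random weights, the 16-by-16 feature map, and the omission of bias terms are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Gate the channels of x.

    x  : feature map of shape (C, H, W)
    w1 : squeeze weights of shape (C // r, C)   (hypothetical, bias-free)
    w2 : expand weights of shape (C, C // r)    (hypothetical, bias-free)
    Returns the gated feature map and the per-channel gate.
    """
    avg = x.mean(axis=(1, 2))   # (C,) average response of each channel
    mx = x.max(axis=(1, 2))     # (C,) strongest response of each channel

    def mlp(v):
        # Shared two-layer bottleneck with a ReLU in the middle.
        return w2 @ np.maximum(w1 @ v, 0.0)

    gate = sigmoid(mlp(avg) + mlp(mx))   # (C,) values between 0 and 1
    return x * gate[:, None, None], gate

# Illustrative shapes: 128 channels, reduction ratio 8, random weights.
rng = np.random.default_rng(0)
C, H, W, r = 128, 16, 16, 8
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

y, gate = channel_attention(x, w1, w2)
print(y.shape, gate.shape, w1.size + w2.size)  # (128, 16, 16) (128,) 4096
```

The same two weight matrices are applied to both pooled vectors, which is what "shared" means here: the max-pooling path adds information without adding parameters.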

The second module is spatial attention, and it asks the complementary question: within the channel-reweighted feature map, which locations matter. It collapses the feature map along the channel axis, again with both average and max pooling, to produce two single-channel maps that say "how much is going on here, on average" and "how much is going on here, at most." Those two maps are stacked into a two-channel image and passed through a single convolution with a square kernel, producing one spatial gate, again a sigmoid output between zero and one, that is broadcast across all channels and multiplied in. Locations the network considers important are kept close to full strength; the margins get dimmed. Channel attention picks the what; spatial attention picks the where. Applied in sequence, they let a small network concentrate before it commits.
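A comparable sketch of the spatial branch, again in NumPy and again under stated assumptions: the name `spatial_attention`, the zero padding that keeps the gate the same size as the input, the explicit loops (chosen for readability over speed), and the random kernel are all illustrative choices rather than details taken from the source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(x, kernel):
    """Gate the locations of x.

    x      : feature map of shape (C, H, W)
    kernel : convolution weights of shape (2, k, k), k odd (hypothetical, bias-free)
    Returns the gated feature map and the (H, W) spatial gate.
    """
    k = kernel.shape[-1]
    pad = k // 2
    # Collapse the channel axis twice: average response and strongest response.
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])            # (2, H, W)
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))     # zero padding
    H, W = x.shape[1:]
    logits = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # One k x k window over both pooled maps gives one output value.
            logits[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    gate = sigmoid(logits)                                        # (H, W) in [0, 1]
    return x * gate[None, :, :], gate

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 12, 12))
kernel = rng.standard_normal((2, 7, 7)) * 0.1

y, gate = spatial_attention(x, kernel)
print(y.shape, gate.shape, kernel.size)  # (128, 12, 12) (12, 12) 98
```

Note that the kernel shape never mentions the 128 channels: the pooling step removes the channel axis before the convolution sees the data.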

It is worth being precise that neither of these is the quadratic, all-pairs attention of a transformer [3]. There is no token-to-token interaction, no attention matrix that grows with the square of the sequence length, no positional encoding. CBAM is attention in the looser, older sense of a learned gating that reweights a representation. That is exactly why it is cheap, and the cheapness is the whole point of using it on a small model.

The first knob: the channel reduction ratio

The shared MLP inside channel attention is a bottleneck, and the width of that bottleneck is the first knob. If the feature depth is 128 channels, the MLP squeezes those 128 down to a hidden layer of 128 divided by the reduction ratio, applies a non-linearity, then expands back to 128. A reduction ratio of 8, the default used throughout this piece, means the bottleneck is 16 units wide; the original work runs its experiments at a ratio of 16, which would halve that to 8 units [1]. The reduction ratio is therefore a direct lever on how much capacity the channel attention has to model interactions between channels: a small ratio gives a wide, expressive bottleneck, a large ratio gives a narrow, lean one.

This is the same idea, and almost the same module, as the squeeze-and-excitation block that Hu and colleagues introduced a year earlier, which also recalibrates channels through a reduction-ratio bottleneck after global pooling [2]. CBAM's channel branch is essentially squeeze-and-excitation with max pooling added alongside average pooling, and then spatial attention bolted on after it. If you have used squeeze-and-excitation, you already understand the channel half of CBAM and the meaning of its reduction ratio.

The cost of this branch is easy to write down exactly. Two fully connected layers, one squeezing and one expanding, between 128 and 128 divided by the ratio, is two times 128 times the hidden width, counting weights only. At the default ratio of 8 that is two times 128 times 16, which is 4096 parameters. Drop the ratio to 2 for a richer bottleneck and it becomes two times 128 times 64, which is 16384 parameters. Push it to 16 for a leaner one and it falls to two times 128 times 8, which is 2048. The entire dynamic range of this knob, from very lean to very rich, runs from about two thousand to about sixteen thousand parameters at a 128-dim depth, and at its richest it matches one one-by-one convolution at that depth. That is the headline a practitioner needs: the channel attention is essentially free, and the reduction ratio trades a small amount of parameter cost for a small amount of channel-modelling capacity.
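The arithmetic in this paragraph is easy to check in a few lines of Python. The helper name `channel_mlp_params` is hypothetical, and the count follows the text in covering weights only, with no bias terms.

```python
def channel_mlp_params(channels: int, ratio: int) -> int:
    """Weights in the shared squeeze-then-expand MLP (biases not counted)."""
    hidden = channels // ratio        # bottleneck width
    return 2 * channels * hidden      # squeeze layer + expand layer

for ratio in (2, 4, 8, 16):
    print(ratio, 128 // ratio, channel_mlp_params(128, ratio))
# 2 64 16384
# 4 32 8192
# 8 16 4096
# 16 8 2048
```

Each halving of the ratio doubles the cost, so the knob is linear in the bottleneck width rather than anything steeper.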

[Interactive figure: CBAM attention on a small backbone. A single scanned curve is drawn with its attention band, controlled by two sliders: the channel reduction ratio r, from 2 (rich) to 16 (lean), and the spatial attention kernel k, from 3 to 11. Ledger at r = 8, k = 7: channel MLP 4,096 parameters (2 x 128 x (128/8)), spatial conv 98 (2 x 7 x 7), CBAM total 4,194 per attention layer, which is 25.6% of one 1x1 conv at 128-dim (16,384).]
An interactive primer on the two CBAM knobs. Drag the channel reduction ratio r and the spatial attention kernel size k and watch the orange attention band tighten on a single scanned curve, while the ledger on the right tallies the parameters each module adds. Channel attention is a 128 to 128/r to 128 MLP bottleneck, so its cost is exactly 2 x 128 x (128/r); spatial attention is one k x k convolution over a 2-channel pooled map, costing 2 x k x k. At the defaults the whole block adds about a quarter of a single 1x1 convolution at the 128-dim feature depth it refines, and even at the richest settings it costs about as much as one such convolution. The sourced constants are the 128 embedding dimension, the 2 transformer attention layers, the single grayscale input channel, and the reduction-ratio 8 / kernel-7 defaults; the drawn trace, the attention halo, and the focus score are illustrative geometry built to explain the mechanism, not measured outputs.

The exhibit above is the cost intuition made draggable. The left panel shows a single scanned curve with the attention band drawn around it; the two knobs are the reduction ratio and the spatial kernel size, and the ledger on the right tallies the parameters each module adds as you move them. The arithmetic is exact, computed live from the module shapes rather than scripted. The thing to watch is the bottom number, which is the whole block's added parameters expressed as a percentage of a single one-by-one convolution at the 128-dim feature depth. Drag the knobs through their entire range and that percentage runs from about 13% at the leanest setting (r = 16, k = 3), through 25.6% at the defaults, to just over 100% at the richest (r = 2, k = 11). The attention band visibly tightens on the trace as you enrich the channel bottleneck and widen the spatial kernel, while even the richest setting costs about one extra one-by-one convolution, which is small next to the three-by-three blocks that make up the backbone. That gap, large effect on focus, small effect on size, is the case for using CBAM on a small model in one picture.

The second knob: the spatial kernel size

The spatial branch has exactly one hyperparameter worth tuning, which is the size of the convolution kernel that turns the two-channel pooled map into the spatial gate. The default is seven by seven [1]. A larger kernel means each location's importance is decided from a wider neighbourhood, which produces a smoother, more context-aware spatial gate; a smaller kernel makes the gate more local and more responsive to fine detail. On a thin curve there is a real tension here. Too small a kernel and the spatial attention can chase individual noisy pixels; too large and it blurs the band of interest until it stops being selective. Seven is a sensible middle for a structure that is locally thin but globally extended, which a log curve is.

The cost of this branch is even smaller than the channel branch, and it does not depend on the feature depth at all. The convolution runs over a two-channel input and produces a single-channel output with a kernel of size k by k, so its weight count is two times k times k. At the default seven that is two times forty-nine, which is 98 parameters. At a wide eleven-by-eleven kernel it is two times one hundred twenty-one, which is 242. At a tight three-by-three it is 18. The spatial branch is not even a rounding error; it is a rounding error on a rounding error. Whatever you do with the spatial kernel, you are not paying for it in model size. You are only trading off how local versus how smooth the spatial gate should be, which is a modelling decision, not a budget decision.
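The spatial cost and the modelling trade-off can be put side by side in a short Python sketch. The helper names are hypothetical, and the restriction to odd kernel sizes is an assumption made so that the kernel has a centre pixel and a symmetric reach; the count covers weights only.

```python
def spatial_conv_params(kernel_size: int) -> int:
    """Weights in a 2-in, 1-out k x k convolution (bias not counted)."""
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    return 2 * kernel_size * kernel_size

def reach(kernel_size: int) -> int:
    """How many pixels to each side of a location the gate can see."""
    return kernel_size // 2

for k in (3, 5, 7, 9, 11):
    print(k, spatial_conv_params(k), reach(k))
# 3 18 1
# 5 50 2
# 7 98 3
# 9 162 4
# 11 242 5
```

The function takes no channel argument at all, which is the depth-independence the paragraph describes. The third column is the real decision: whether the gate at a pixel should look one pixel or five pixels to either side.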

Where it goes in a small U-Net, and why the depth matters

The natural home for a CBAM block in a segmentation network is right after a convolution block, before the feature map is passed on or skipped across to the decoder [4]. In a U-Net the most valuable place to spend it is at the bottleneck and on the skip connections, because those are the points where the network has its richest, most semantically loaded features and the most to gain from deciding which of them to forward. Inserting the block is genuinely a two-line change in most frameworks: compute the channel gate, multiply, compute the spatial gate, multiply, return. Nothing about the surrounding architecture has to move.
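The wiring itself, gate then multiply, twice, can be sketched as follows. To keep the example short and self-contained, the two gates here are parameter-free stand-ins (a sigmoid over pooled statistics), not the learned MLP and convolution; `apply_cbam`, `toy_channel_gate`, and `toy_spatial_gate` are hypothetical names, and the random array stands in for the output of a convolution block.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def toy_channel_gate(x):
    # Stand-in for the learned channel MLP: one value per channel, shape (C,).
    return sigmoid(x.mean(axis=(1, 2)) + x.max(axis=(1, 2)))

def toy_spatial_gate(x):
    # Stand-in for the learned spatial convolution: one value per pixel, shape (H, W).
    return sigmoid(x.mean(axis=0) + x.max(axis=0))

def apply_cbam(features, channel_gate, spatial_gate):
    """Channel gate first, then spatial gate on the already-reweighted map."""
    refined = features * channel_gate(features)[:, None, None]   # broadcast over H, W
    refined = refined * spatial_gate(refined)[None, :, :]        # broadcast over C
    return refined

# Pretend this is the output of an encoder conv block at the 128-dim depth.
rng = np.random.default_rng(0)
conv_out = rng.standard_normal((128, 8, 8))

# The refined map has the same shape, so it can be forwarded or used as a skip
# connection exactly where conv_out would have gone.
skip = apply_cbam(conv_out, toy_channel_gate, toy_spatial_gate)
print(conv_out.shape == skip.shape)  # True
```

Because the output shape equals the input shape, the block can be dropped in after any convolution block without touching the layers on either side.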

The reason the cost stays negligible is the feature depth the attention sits on, and it is worth making this explicit because it is the crux of the whole argument. At a 128-dim depth a single ordinary one-by-one convolution that mixes channels carries 128 times 128, which is 16384 weights, and the network is full of convolutions far larger than that. The entire CBAM block, channel MLP plus spatial conv at the defaults, comes to roughly 4096 plus 98, a little over four thousand parameters. That is about a quarter of one one-by-one convolution at that depth, and a far smaller fraction of any real three-by-three block. You are adding a learned, content-dependent attention mechanism for less than the price of a single small layer. When people say attention is expensive they are describing transformers paying quadratic costs in sequence length; CBAM pays a fixed, depth-scaled, additive cost that is invisible next to the backbone it improves.
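The comparison in this paragraph can be reproduced directly. This is a sketch using hypothetical helper names and weights-only counts; the grid of ratios and kernel sizes swept at the end is an assumed set of settings, chosen to span the lean-to-rich range discussed in the text.

```python
def cbam_params(channels: int, ratio: int = 8, kernel_size: int = 7) -> int:
    """Channel MLP weights plus spatial conv weights (biases not counted)."""
    return 2 * channels * (channels // ratio) + 2 * kernel_size ** 2

def conv_params(channels: int, kernel_size: int) -> int:
    """Weights in a channels-in, channels-out k x k convolution."""
    return channels * channels * kernel_size ** 2

C = 128
block = cbam_params(C)
print(block)                                  # 4194
print(round(block / conv_params(C, 1), 3))    # 0.256  (share of one 1x1 conv)
print(round(block / conv_params(C, 3), 3))    # 0.028  (share of one 3x3 conv)

# Sweep an assumed grid of knob settings and find the cheapest and dearest.
RATIOS = (2, 4, 8, 16)
KERNELS = (3, 5, 7, 9, 11)
shares = {
    (r, k): cbam_params(C, r, k) / conv_params(C, 1)
    for r in RATIOS
    for k in KERNELS
}
leanest = min(shares, key=shares.get)
richest = max(shares, key=shares.get)
print(leanest, richest)                       # (16, 3) (2, 11)
```

Measured against a three-by-three convolution at the same depth, which is the more typical backbone layer, the default block is under three percent of a single layer.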

Setting the knobs when the target is one thin curve

The practical advice that falls out of all this is short, and it is shaped by the specific problem of lifting a single trace off a noisy scan. Start at the defaults used throughout this piece, a reduction ratio of 8 and the seven-by-seven spatial kernel of the original work, because they are sensible starting points and the cost difference from any other setting is too small to be worth agonising over up front [1]. If the curve is getting lost among confusing channels, with grid-line responses bleeding into curve responses, that is a channel problem, and the lever is the reduction ratio: lower it toward 4 or 2 to give the channel attention a wider bottleneck and more room to separate the useful channels from the distractors, at a cost of roughly four to twelve thousand extra parameters you will not feel. If instead the mask is spatially noisy, flickering on speckle away from the trace, that is a spatial problem, and the lever is the kernel: widen it toward nine or eleven so the spatial gate considers more context before deciding a pixel matters, at a cost of a hundred or two parameters that is not worth thinking about as a budget item.

The deeper point for anyone working on a small model with a hard, thin target is that attention does not have to be the heavy thing you reach for last. The channel-and-spatial form that CBAM packages is light enough to be the thing you reach for first, a default insertion at the bottleneck of any U-Net you build, tuned with two knobs whose entire cost range runs from about two thousand to under seventeen thousand parameters. The credit for the design belongs to Woo and colleagues, building on the channel recalibration of squeeze-and-excitation and the broader idea that a network should be allowed to weigh its own features [1][2]. What our own work on raster-log digitisation took from it is simpler than a result: it is the habit of asking, before adding any module, what it costs in parameters and what it buys in focus, and CBAM is the rare case where the second number is large and the first is almost zero.

Key takeaways

  1. CBAM is two small attention modules in sequence: channel attention reweights which feature channels matter using a pooled, MLP-gated recalibration, then spatial attention reweights which locations matter using a single convolution over a pooled two-channel map. It lets a small network decide what and where to focus before it commits to a mask.
  2. This is not transformer attention. There is no token-to-token interaction and no quadratic cost in sequence length. CBAM is learned gating that reweights a representation, which is exactly why it is cheap enough to use on a small U-Net by default.
  3. The channel reduction ratio is the first knob. At a 128-dim depth its cost is two times 128 times the 128-over-ratio hidden width: about 4096 parameters at the default ratio of 8, ranging from roughly 2048 (lean) to 16384 (rich). A smaller ratio buys a wider channel bottleneck for a few thousand parameters.
  4. The spatial kernel size is the second knob, and its cost is two times k times k, independent of feature depth: 98 parameters at the default seven, 18 at three, 242 at eleven. It trades local responsiveness against smoother context, never model budget.
  5. The whole block adds on the order of a few thousand parameters at the defaults, around a quarter of a single one-by-one convolution at the 128-dim depth it refines. Start at ratio 8 and kernel 7; lower the ratio for channel confusion, widen the kernel for spatial speckle, and stop worrying about the cost.

Stop pricing attention as a single expensive thing, and the question of what it can do comes apart cleanly from the question of what it costs. A transformer buys you long-range, all-pairs reasoning and charges accordingly. CBAM buys you a much narrower thing, a learned reweighting of channels and locations, and charges almost nothing for it. On a single thin curve against a noisy background, that narrow thing is often exactly what the network was missing, and the fact that it fits inside a parameter budget you can barely measure means there is rarely a good reason not to try it.

References

[1] Woo, S., Park, J., Lee, J.-Y., and Kweon, I. S. CBAM: Convolutional Block Attention Module. ECCV (2018). Introduces the sequential channel-then-spatial attention module, its average-and-max pooling design, the channel reduction ratio, and the spatial kernel size, and shows consistent gains for negligible parameter cost. https://arxiv.org/abs/1807.06521

[2] Hu, J., Shen, L., and Sun, G. Squeeze-and-Excitation Networks. CVPR (2018). The channel-recalibration precursor: global pooling followed by a reduction-ratio MLP bottleneck that gates channels, which CBAM's channel branch extends with max pooling. https://arxiv.org/abs/1709.01507

[3] Vaswani, A., Shazeer, N., Parmar, N., et al. Attention Is All You Need. NeurIPS (2017). The transformer and its quadratic all-pairs self-attention, the expensive form of attention that CBAM is explicitly not. https://arxiv.org/abs/1706.03762

[4] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The encoder-decoder-with-skips backbone whose bottleneck and skip connections are the natural insertion points for a lightweight attention block. https://arxiv.org/abs/1505.04597

© 2026 Copyright. Earthscan