Edge and Quantized Inference for Field Operations: CPU Versus GPU Serving

Abstract

Heavy vision models are usually trained and benchmarked where a graphics accelerator is plentiful, and then deployed where it is not. In subsurface work that mismatch is not an inconvenience but a hard constraint: the imagery is the operator's proprietary record, it cannot leave the network it was created on, and the network it was created on frequently has no GPU in it at all. This study asks a narrow, cross-industry question with a specific worked example. For a heavy encoder decoder that must run inside that perimeter, when is quantized inference on a CPU host an acceptable substitute for an accelerated GPU host, and when is it not? We survey the published quantization literature that made low-bit CPU inference accurate, the runtimes that execute it, and the empirical studies that measured what it costs in latency and footprint, and we credit each line of work in its period-correct form. We then read the evidence against the one variable that dominates this particular task, the input width, because a raster well log is a very wide and variable image rather than a fixed tile. The finding is that the CPU-versus-GPU decision is not a verdict but a frontier: on narrow inputs a quantized CPU server stays interactive and far cheaper, while on the widest inputs the per-image latency on a CPU climbs into many seconds and the accelerator earns its hourly rate. The encoder decoder we use to make this concrete, a wide-input segmentation network for digitising scanned well logs, is our own VeerNet system, and it is the only component here we claim as ours.

Why the accelerator is often somewhere else

It is easy to forget, from inside a research environment, how unusual it is to have a GPU exactly where the data is. The default mental model of model serving assumes a fleet of accelerated hosts that requests are routed to. A large class of real deployments does not look like that. The host is a workstation or a rack server inside a controlled network, provisioned for ordinary enterprise compute, with no accelerator and no path to add one without a procurement cycle and a security review. The landscape survey of inference accelerators makes the structural point plainly: an accelerator is a specialised, separately provisioned device, and most general-purpose compute is not built around one ^[8]. The question of how to serve a model on a host that has none is therefore not an edge case; it is the common case at the edge.

In subsurface work the constraint is sharper still, and it is worth being precise about why. The training and inference corpus is a collection of high-resolution scanned logs that, by contract and by competitive instinct, cannot traverse a network the operator does not control. The Texas regulatory archive is a public illustration of the image-first source material that exists in this form ^[9]; the proprietary equivalents inside an operator's vault carry the same shape and a much stronger prohibition on movement. When the data cannot move to the accelerator, two options remain: move an accelerator to the data, which means provisioning GPU hardware inside the perimeter at a fixed recurring cost, or serve the model without one, which means making the model cheap enough to run on the CPU that is already there. This study is about the evidence behind the second option and the precise conditions under which it holds.

The reason serving a heavy model on a CPU is even thinkable is a decade of quantization research, and the honest way to assess CPU serving is to credit that research first. The foundational move was to stop treating a network as a floating-point object at inference time. Deep Compression showed that pruning and trained quantization together could shrink a network's footprint by roughly an order of magnitude with little accuracy loss, which reframed the deployment problem from one of raw compute to one of representation ^[5]. The result that made this practical on commodity CPUs specifically was the integer-arithmetic-only scheme, which replaced floating-point operations with eight-bit integer arithmetic and paired it with quantization-aware training so the accuracy cost stayed small ^[1]. Integer arithmetic matters on a CPU for a concrete hardware reason the surveys spell out: low-bit integer operations are cheaper in both energy and cycles than floating-point, and they make far better use of the vector units a general-purpose CPU does have ^[4].
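To make the phrase "eight-bit integer arithmetic" concrete, the following is a minimal sketch of the affine mapping between real values and int8 codes, r ≈ S(q − Z), in the spirit of the integer-arithmetic-only scheme ^[1]. The function names and the sample weights are illustrative assumptions, not taken from any particular runtime or from the served network.

```python
def affine_params(r_min, r_max, q_min=-128, q_max=127):
    """Scale and zero-point that map [r_min, r_max] onto the int8 range."""
    # Widen the range to include 0.0 so that real zero is exactly representable.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (q_max - q_min) or 1.0
    zero_point = round(q_min - r_min / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, q_min=-128, q_max=127):
    """Real values -> int8 codes, clamped to the representable range."""
    return [max(q_min, min(q_max, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """int8 codes -> approximate real values."""
    return [scale * (q - zero_point) for q in codes]


weights = [-0.82, -0.10, 0.0, 0.37, 1.25]  # hypothetical weights, for illustration only
scale, zero_point = affine_params(min(weights), max(weights))
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In this sketch the round-trip error of any value inside the calibrated range is bounded by half the scale. That bound is the representational cost that quantization-aware training is meant to let the network absorb.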

From there the field consolidated into a small set of well-credited references that a practitioner actually uses. The early practitioner whitepaper laid out the menu of post-training quantization and quantization-aware training and the accuracy cost of each at eight and four bits, including the per-channel granularity that lets convolutional networks survive low-bit weights ^[2]. A later unified white paper folded the accumulated experience into a single framework and, crucially, into guidance on which models tolerate plain eight-bit post-training quantization and which need quantization-aware training or finer granularity to hold their accuracy ^[3]. The empirical anchor for all of this is the large cross-network study of integer quantization, which established, on real hardware rather than in principle, the conditions under which eight-bit integer inference matches floating-point accuracy and the latency it returns ^[6]. Together with the integer-arithmetic-only scheme ^[1], these references are the spine of any serious claim that a quantized model can be served well on a CPU.
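The per-channel granularity mentioned above can be illustrated with a small sketch. It compares one shared scale for a whole weight tensor against one scale per output channel, using symmetric int8 quantization. The two channels and their values are hypothetical, chosen only so that their ranges differ widely, and the helper names are assumptions made for this example.

```python
def symmetric_scale(values):
    """Scale that maps the largest magnitude in `values` onto the int8 code 127."""
    return max(abs(v) for v in values) / 127 or 1.0


def quant_error(values, scale):
    """Worst-case round-trip error of symmetric int8 quantization at `scale`."""
    worst = 0.0
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        worst = max(worst, abs(v - q * scale))
    return worst


# Two hypothetical output channels with very different weight ranges.
channels = {
    "wide": [-6.0, 2.5, 4.0, -1.0],
    "narrow": [0.011, -0.024, 0.017, -0.003],
}

# Per-tensor: a single scale shared by every channel.
shared_scale = symmetric_scale([v for ch in channels.values() for v in ch])
per_tensor_err = {name: quant_error(ch, shared_scale) for name, ch in channels.items()}

# Per-channel: each channel is quantized with its own scale.
per_channel_err = {
    name: quant_error(ch, symmetric_scale(ch)) for name, ch in channels.items()
}
```

With a shared scale, the narrow channel's weights are all smaller than one quantization step and round to 0 or ±1, so its worst-case error is as large as its largest weight. With its own scale the error stays within half a step of a much finer grid, which is the intuition behind the per-channel recommendation for convolutional weights.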

The last piece of related work is the unglamorous but decisive matter of tooling. Quantizing a stock classifier is one thing; quantizing a custom encoder decoder with a transformer bottleneck and variable-size inputs is another, and it only became routine once the framework gained reliable graph capture. The program-capture machinery that underpins the practical quantization workflow in modern PyTorch is what makes it tractable to trace a bespoke network, insert observers, and lower it to an integer graph without rewriting the model by hand ^[7]. Without that, CPU serving of a non-standard architecture is a research project; with it, it is a configuration choice.

Method

This is an assessment rather than a fresh benchmark, so the method is one of structured comparison anchored on a single concrete architecture, with the sourced numbers held separate from the illustrative ones.

The fixed object across the comparison is the served network. It is a wide-input encoder decoder with five encoder stages, five decoder stages, two transformer attention layers that refine the bottleneck, and a 128-dimension embedding, operating on single-channel grayscale log images. We hold that architecture constant and vary only the input width, which for synthetic and real logs alike runs from 3200 pixels to 12800 pixels while the height stays in a narrow band. Width is the right axis to sweep because it, not depth or class count, dominates the compute of a fully convolutional dense-prediction pass: the work scales with the number of pixels the network must convolve, and on a log that number is set almost entirely by how wide the scan is.
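The width argument reduces to simple arithmetic, sketched here under stated assumptions: the height is fixed at a hypothetical 512 pixels (the text gives only "a narrow band"), the intermediate widths are assumed sample points, and the work per pixel is treated as constant across widths.

```python
HEIGHT_PX = 512  # hypothetical; the source gives no specific height
WIDTHS_PX = [3200, 6400, 9600, 12800]  # endpoints from the text, midpoints assumed


def relative_work(width_px, base_width_px=3200, height_px=HEIGHT_PX):
    """Pixel-count ratio, used as a proxy for convolutional work at fixed height.

    This ignores the attention layers at the bottleneck, whose cost may grow
    faster than linearly with the number of bottleneck positions.
    """
    return (width_px * height_px) / (base_width_px * height_px)


work = {w: relative_work(w) for w in WIDTHS_PX}
```

Under these assumptions the widest log carries four times the pixels of the narrowest, and therefore roughly four times the convolutional work, whatever the fixed height is.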

The two serving postures we compare are the ones a field deployment actually chooses between. The first is the same network served unquantized on a host with a graphics accelerator, at the GPU host rates the engagement archive records, 750 EUR and 1800 EUR per month for the high-end and advanced tiers. The second is the network quantized to integer arithmetic and served on a host with no accelerator at all, the CPU that is already inside the perimeter. For the comparison we read two quantities at each width: the per-image latency of each server, and the cost of the host it runs on, with monthly rates expressed as an hourly equivalent. We then place both on one latency-versus-cost frontier so the tradeoff is visible rather than asserted.
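Because the GPU rates are quoted per month while the frontier is read per hour, a conversion is needed. The sketch below assumes a host billed around the clock and an average month of 730 hours; both are assumptions of this example, not figures from the engagement archive.

```python
HOURS_PER_MONTH = 730  # assumption: 365 days * 24 h / 12 months, billed continuously

GPU_MONTHLY_EUR = [750, 1800]  # the two GPU host rates quoted in the text


def hourly_rate(monthly_eur, hours_per_month=HOURS_PER_MONTH):
    """Convert a flat monthly host rate into an hourly equivalent."""
    return monthly_eur / hours_per_month


gpu_hourly_eur = [round(hourly_rate(m), 2) for m in GPU_MONTHLY_EUR]
# gpu_hourly_eur == [1.03, 2.47]
```

A host that is billed only while running, or that sits idle for much of the month, would have a different effective hourly cost per served image, so the divisor should be adjusted to the deployment's actual utilisation.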

We are deliberate about which numbers are sourced and which are illustrative, because the most common way a serving comparison misleads is by presenting a modelled latency curve as a measurement. The architecture, the 3200-to-12800-pixel width range, and the GPU host rates are sourced from the engagement archive. The per-width latency curves for each server, the CPU host rate, and the quantization speed penalty are an illustrative model. That model is calibrated to the qualitative shape the literature reports (integer CPU inference is usable but markedly slower than accelerated floating-point), and its values are flagged as illustrative on the instrument itself.

Results

The frontier is below. Read it as a map of the tradeoff, not as a measured benchmark of two specific hosts.

The same encoder decoder served two ways: quantized on a CPU host with no accelerator, and unquantized on a GPU host. Each operating point plots a single served width on a latency versus host-cost frontier, and dragging the ruler sweeps the synthetic log width from 3200 px to 12800 px so both points slide along their own curve. The GPU point sits low on latency but far right on cost; the quantized CPU point hugs the cost axis but climbs steeply on latency as the logs widen. The served architecture (5 encoder stages, 5 decoder stages, 2 transformer attention layers, a 128-dimension embedding), the 3200 to 12800 px width range, and the GPU host rates are sourced from the engagement archive; the per-width latency curves, the CPU host rate, the quantization penalty, and the resulting frontier positions are illustrative.

Three features of the frontier survive the illustrative-model caveat and match what the cited evidence predicts. The first is that the two servers occupy opposite corners. The GPU host sits low on latency and far to the right on cost, because the accelerator buys throughput at a fixed recurring rate ^[8]. The quantized CPU host hugs the cost axis, because the host is already provisioned and integer arithmetic uses it efficiently ^[1]^[4], but it climbs steeply on latency. The second is that the gap is not constant. On the narrowest logs the per-image latency on a CPU is well within an interactive budget, and the cheaper host wins outright; on the widest logs the same host turns a sub-second call into many seconds, one image at a time, and the accelerator's hourly rate starts to look cheap relative to a geoscientist waiting. Drag the width ruler and the crossover reveals itself: there is a width below which CPU serving is simply the better deal and above which it is a liability.
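The crossover can be sketched with a deliberately simple latency model in which per-image latency is proportional to pixel count. Every constant below (the height, the latency budget, and the seconds-per-megapixel figures for each server) is hypothetical and chosen only to reproduce the qualitative shape described above. None of them is a measurement, and none is the model behind the instrument.

```python
HEIGHT_PX = 512  # hypothetical log height
BUDGET_S = 2.0   # hypothetical interactive latency budget, in seconds

# Hypothetical per-megapixel latencies; not measured on any host.
SEC_PER_MEGAPIXEL = {"cpu_int8": 0.60, "gpu_float": 0.05}


def latency_s(server, width_px, height_px=HEIGHT_PX):
    """Linear model: seconds per megapixel times the image size in megapixels."""
    return SEC_PER_MEGAPIXEL[server] * width_px * height_px / 1e6


def max_width_within_budget(server, budget_s=BUDGET_S, height_px=HEIGHT_PX):
    """Widest input, in pixels, whose modelled latency stays within the budget."""
    return int(budget_s * 1e6 / (SEC_PER_MEGAPIXEL[server] * height_px))


crossover_px = max_width_within_budget("cpu_int8")
cpu_narrow_s = latency_s("cpu_int8", 3200)
cpu_wide_s = latency_s("cpu_int8", 12800)
gpu_wide_s = latency_s("gpu_float", 12800)
```

With these made-up constants the CPU server meets the budget on the narrowest logs, misses it on the widest, and crosses over somewhere in between, while the GPU server stays sub-second across the range. On real hardware the crossover would have to be measured rather than modelled.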

The third feature is the one most relevant to the air-gapped reality. The CPU server's curve is the same regardless of whether a GPU exists elsewhere in the organisation, because the constraint is not the absence of accelerators in the company but the absence of one inside the perimeter where the data is allowed to live. The frontier therefore answers a more useful question than "CPU or GPU?" It answers "given that the inference must run on this host, how wide an input stays acceptable, and where must we either narrow the input, batch differently, or pay to put an accelerator inside the perimeter?"

Discussion

The cross-industry reading of this is that the quantization literature solved the accuracy problem of CPU serving and left the latency problem to architecture and input size. For a fixed-size classifier the cited evidence is unambiguous: eight-bit integer inference on a CPU recovers floating-point accuracy under well-understood conditions and runs at a latency that is perfectly acceptable for the input sizes those studies used ^[6]^[3]. The reason subsurface log digitisation cannot simply inherit that conclusion is the input. A classifier reads a small tile; our encoder decoder reads a scan that can be 12800 pixels wide, and a dense-prediction pass touches every one of those pixels. The latency that is invisible on a 224-pixel tile becomes the dominant cost on a wide log, and no amount of quantization changes the fact that there are far more pixels to process. Quantization makes the per-operation cost on a CPU tolerable; it does not make the number of operations small.

That is why the honest output of this assessment is a frontier and not a recommendation. Where our work sits relative to the field is specific. We did not invent quantization or integer CPU inference, and none of the methods on the spine of this survey are ours; we credit them as the prior art that makes CPU serving viable at all. What is ours is the wide-input encoder decoder used as the worked example, VeerNet, and the operational reading of the tradeoff for its particular input regime. The practical guidance that falls out is concrete. If the logs in a given deployment are narrow and the host is air-gapped, quantize and serve on the CPU; the cited evidence supports the accuracy, and the frontier shows the latency is fine. If the logs are wide and interactivity matters, the realistic choices are to provision an accelerator inside the perimeter despite its recurring cost, to tile the input and pay a stitching cost instead of a latency cost, or to accept batch rather than interactive turnaround. The decision is set by where on the width axis a deployment actually lives, which is exactly what the instrument is built to expose.
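The tiling option can be sketched as a function that splits a wide log into fixed-width, overlapping column spans. The tile width and overlap below are hypothetical parameters, and the function is an illustration of the idea rather than VeerNet's own preprocessing; how overlapping predictions are merged (the stitching step) is left out.

```python
def tile_spans(width_px, tile_px, overlap_px):
    """Return (start, end) column spans covering [0, width_px) with overlap."""
    if not 0 <= overlap_px < tile_px:
        raise ValueError("overlap must be non-negative and smaller than the tile")
    if width_px <= tile_px:
        return [(0, width_px)]
    stride = tile_px - overlap_px
    spans = []
    start = 0
    while start + tile_px < width_px:
        spans.append((start, start + tile_px))
        start += stride
    # Align the final tile to the right edge so every tile has the same width.
    spans.append((width_px - tile_px, width_px))
    return spans


# Hypothetical settings: 3200 px tiles with a 256 px overlap on the widest log.
spans = tile_spans(width_px=12800, tile_px=3200, overlap_px=256)
```

Each tile then costs about as much as a narrow log. The price is that the overlapping columns are predicted more than once and the seams must be reconciled, which is the stitching cost referred to above.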

Limitations

This assessment carries the limitations of a survey crossed with an illustrative model, and they should bound how far the frontier is read. The latency curves on the instrument are a calibrated model, not a measurement of two named hosts; they reproduce the qualitative shape the cited literature reports, accelerated floating-point being fast and quantized CPU inference being usable but much slower, but the exact crossover width on real hardware depends on the specific CPU, its vector-instruction support, the runtime, the memory bandwidth, and the quantization scheme, none of which a single curve can capture. The CPU host rate and the quantization speed penalty are illustrative; only the architecture, the width range, and the GPU host rates are sourced. The accuracy side of the tradeoff is taken from the cited studies rather than re-measured here, and those studies largely evaluated fixed-size classifiers, so their accuracy guarantees transfer to a wide-input dense-prediction encoder decoder only by analogy and should be validated per-deployment before a quantized CPU server is trusted in production. Finally, the comparison treats the two serving postures as the realistic alternatives for an air-gapped deployment and does not survey the full space of accelerators between them, such as integrated graphics, low-power inference cards, or on-CPU matrix extensions, any of which can move the crossover and which a complete hardware study would include.

What the assessment establishes

The CPU-versus-GPU serving decision for a heavy segmentation model is not a verdict but a frontier. We survey the quantization literature that made low-bit CPU inference accurate and read it against the air-gapped reality of subsurface work, where the data and often the inference must stay inside a network with no accelerator.

A decade of credited prior art solved the accuracy problem of CPU serving: integer-arithmetic-only inference with quantization-aware training, per-channel post-training quantization, and the empirical studies that fixed the conditions under which eight-bit integer inference matches floating-point.

Quantization makes the per-operation cost on a CPU tolerable; it does not reduce the number of operations. For a wide-input encoder decoder a dense-prediction pass touches every pixel, so latency that is invisible on a small tile dominates on a 12800-pixel log.

The frontier has a crossover that moves with input width. On narrow logs a quantized CPU host stays interactive and far cheaper; on the widest logs it turns a sub-second call into many seconds, and provisioning an accelerator inside the perimeter, tiling the input, or accepting batch turnaround become the real options.

The quantization and CPU-serving methods surveyed here are credited prior art, not ours. What is ours is the wide-input encoder decoder used as the worked example, VeerNet, and the operational reading of where on the width axis a given deployment can afford to serve without an accelerator.

References

[1] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR (2018). The integer-arithmetic-only scheme and quantization-aware training that made eight-bit CPU inference accurate enough to deploy, with measured latency-accuracy tradeoffs on commodity CPUs. https://arxiv.org/abs/1712.05877

[2] Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv:1806.08342 (2018). A practitioner reference on per-channel and per-tensor post-training quantization and quantization-aware training, with the accuracy cost of each at four and eight bits. https://arxiv.org/abs/1806.08342

[3] Nagel, M., Fournarakis, M., Amjad, R. A., Bondarenko, Y., van Baalen, M., and Blankevoort, T. A White Paper on Neural Network Quantization. arXiv:2106.08295 (2021). A unified survey of post-training quantization and quantization-aware training that codifies which models tolerate eight-bit and which need finer-grained or learned schemes. https://arxiv.org/abs/2106.08295

[4] Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and Keutzer, K. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv:2103.13630 (2021). A taxonomy of quantization methods and the hardware-cost reasons low-bit integer arithmetic runs faster and cheaper on CPUs than floating-point. https://arxiv.org/abs/2103.13630

[5] Han, S., Mao, H., and Dally, W. J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR (2016). The foundational result that combined pruning and quantization to shrink model footprint by an order of magnitude with little accuracy loss, enabling models to fit constrained hosts. https://arxiv.org/abs/1510.00149

[6] Wu, H., Judd, P., Zhang, X., Isaev, M., and Micikevicius, P. Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation. arXiv:2004.09602 (2020). An empirical study across many networks giving the conditions under which eight-bit integer inference matches floating-point accuracy and the latency it returns on real hardware. https://arxiv.org/abs/2004.09602

[7] Reed, J., DeVito, Z., He, H., Ussery, A., and Ansel, J. Torch.fx: Practical Program Capture and Transformation for Deep Learning in Python. MLSys (2022). The graph-capture machinery underneath the practical PyTorch quantization workflow, which makes post-training and quantization-aware quantization of a custom encoder decoder tractable. https://arxiv.org/abs/2112.08429

[8] Reuther, A., Michaleas, P., Jones, M., Gadepally, V., Samsi, S., and Kepner, J. Survey of Machine Learning Accelerators. IEEE HPEC (2020). A survey of inference accelerators that frames why a GPU host buys throughput at a fixed hourly cost and why a host without one is the common case at the edge. https://arxiv.org/abs/2009.00993

[9] Railroad Commission of Texas. Well log and digital records, public well data. Texas RRC (accessed 2025). The state regulatory archive of scanned raster well logs, an example of the image-first, on-network source material whose digitisation must often run where no accelerator is provisioned. https://www.rrc.texas.gov/resource-center/research/data-sets-available-for-download/
