The DGX A100 Stack: Standing Up an On-Prem Research Bench for a Confidential Programme

Most infrastructure decisions in an ML programme are made by engineers reading a workload. Ours was made by a lawyer reading a clause. Before we trained a single model for a mid-sized Middle East carbonate operator, a mutual non-disclosure agreement fixed one term that quietly decided the entire hardware plan: confidential information was to be stored only on local machines at the registered office, LAN-accessible but not internet-facing, password-protected, hardware kept in locked cabinets when idle, and explicitly "not stored on a remote server of any kind." Read literally, and it was meant literally, that sentence deletes the public cloud from the option set. Not "prefer on-prem." Not "encrypt in transit." No remote server, of any kind. Everything that touched the operator's borehole image logs had to sit on iron we could point to in a room.

This is the account of the bench we built under that constraint: what we bought, in what order, sized to what peak, and what it cost to run a full research programme without renting a single cloud GPU.

The clause is the spec

It is tempting to treat data-residency language as a compliance footnote and design the compute the way you always would. On this engagement that instinct would have produced an architecture we could never have deployed. The NDA was signed in July 2020, roughly six weeks before the proposal that turned it into paid work, so the residency rule predated the workload it would govern. By the time we knew we needed a fleet of GPUs, where those GPUs could live was already law.

Three consequences fell straight out of the clause. Compute had to be on-prem, in a private datacenter, on a local network. The versioned store the training runs read from had to be on-prem too, because a remote object store is exactly the "remote server of any kind" the clause forbids. And even the throwaway experiment sandboxes, spun up to test a preprocessing idea and torn down an hour later, had to live inside a private, isolated network rather than a public region. The confidentiality line did not just choose the compute. It chose the storage and the scratch space as well.
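One way to make the three consequences checkable is a small guard that refuses any endpoint outside a private address range. This is a minimal sketch, not the programme's tooling: the component names and IP addresses are hypothetical, and a private address is only a necessary condition, since it does not prove the machine sits in the registered office.

```python
import ipaddress
from typing import Dict, List

# Hypothetical endpoints for the three pieces the clause pulled on-prem.
# The addresses are invented for illustration only.
ENDPOINTS: Dict[str, str] = {
    "compute": "10.0.1.10",
    "versioned_store": "10.0.1.20",
    "sandbox": "192.168.50.5",
}


def is_lan_only(host_ip: str) -> bool:
    """Return True if the address falls in a private, non-internet-routable range."""
    return ipaddress.ip_address(host_ip).is_private


def residency_violations(endpoints: Dict[str, str]) -> List[str]:
    """Return the sorted names of components whose address is publicly routable."""
    return sorted(name for name, ip in endpoints.items() if not is_lan_only(ip))


if __name__ == "__main__":
    print(residency_violations(ENDPOINTS))
```

A check like this would only catch a misconfigured endpoint; the locked cabinets and the physical location remain a matter for people, not code.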

Figure (interactive instrument): The research bench a confidentiality clause compelled. One NDA line, that confidential data is not stored on a remote server of any kind, LAN-accessible only, with hardware in locked cabinets when idle, is the constraint that forces every component onto on-prem iron: three compute tiers ordered by parallelism (a 1080Ti stack at 8 GB per machine for sequential per-well work; a DGX A100 with 4-8 A100 GPUs, up to 640 GB of GPU memory and 2.5-5 petaFLOPS of AI compute for the parallel multi-well sweep; and an optional SuperPod of 5-10 nodes at 25-50 PFLOPS, 3-6 TB of GPU memory and 200 Gb HDR InfiniBand under a negotiated one-week 24x7 window), plus a private DataOps server on a 1 TB network backbone with 4 TB of redundant SSD and a private-cloud VPC Build-and-Tear lab for isolated experiments. Across the sourced 60-to-90 peak-concurrency band, the sequential stack tops out, the DGX A100 carries the peak, and the SuperPod is the negotiated escape hatch above it. The tier specs, the DataOps and VPC servers, the 60-90 concurrent peak with full-model training in minutes to hours, and the NDA clause are sourced from the engagement archive; the exact tier a given peak lands on is an illustrative reading of the sequential-to-parallel tier design, not a measured scheduler trace, and none of it is a benchmark.

Three compute tiers, ordered by parallelism

The compute was not one machine but a graded set, and the ordering that mattered was not raw FLOPS. It was how many runs a tier could carry at once.

The entry tier was a stack of 1080Ti machines, 8 GB of GPU memory per machine, suited to sequential per-well pipelines: preprocessing, quick sanity trainings, the work where you run one well through and look at it before the next. It was cheap, it was already partly on hand, and it did exactly one job at a time well.

The workhorse was a DGX A100 node: 4 to 8 A100 GPUs, up to 640 GB of total GPU memory, delivering 2.5 to 5 petaFLOPS of AI compute. This is the tier built for the parallel multi-well sweep, where a research phase spawns dozens of variants of the same training job, each differing by a hyperparameter or a data split, and every one of them wants to run now. The DGX is where a full-model training that used to be an overnight wait collapses to minutes-to-hours, and where the concurrent load the programme generated actually landed.

Above it sat a negotiated escape hatch rather than owned iron: an optional SuperPod of 5 to 10 nodes, 25 to 50 PFLOPS, 3 to 6 TB of GPU memory, lashed together over 200 Gb HDR InfiniBand, available under a one-week 24x7 access window arranged with the vendor. We did not stand this up permanently. We kept the option so that a burst larger than the DGX could absorb had a home that still sat inside the residency perimeter, not on a rented public endpoint.
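The SuperPod figures can be sanity-checked against the DGX figures quoted above. The sketch below assumes, and this is an assumption rather than something the engagement records state, that each SuperPod node matches the fully populated DGX A100 numbers (5 petaFLOPS and 640 GB of GPU memory per node).

```python
from typing import Tuple

# Upper figures quoted for the DGX A100 tier. Treating every SuperPod node
# as one such fully populated node is an assumption made for this check.
NODE_PFLOPS = 5.0
NODE_GPU_MEMORY_GB = 640


def superpod_aggregate(nodes: int) -> Tuple[float, float]:
    """Return (petaFLOPS, GPU memory in TB) for the given number of nodes."""
    return nodes * NODE_PFLOPS, nodes * NODE_GPU_MEMORY_GB / 1000


if __name__ == "__main__":
    for nodes in (5, 10):
        pflops, memory_tb = superpod_aggregate(nodes)
        print(f"{nodes} nodes: {pflops:.0f} PFLOPS, {memory_tb:.1f} TB GPU memory")
```

Under that assumption, 5 and 10 nodes give 25 and 50 PFLOPS with 3.2 and 6.4 TB of GPU memory, which agrees with the quoted 25 to 50 PFLOPS and the rounded 3 to 6 TB.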

Sized for the peak, not the average

The number that drove the sizing was concurrency, not throughput. A single training run on a raster well-log segmenter takes hours, and the parallel supervised-plus-unsupervised research design meant the bench had to hold 60 to 90 concurrent runs at the peak of a phase, with full-model training turning around in minutes to hours rather than days. That peak is what the DGX tier was chosen against. The 1080Ti stack could never carry it; the SuperPod was overkill for the steady state and reserved for the rare burst above it. How we packed that concurrency onto the cards themselves, by slicing one A100 into many hardware-isolated instances, is a separate story we have told in "MIG-Partitioning an A100 to Run 60 Experiments at Once"; here the point is only that the tier was procured to a peak, and the peak was set by how many experiments a research phase runs at the same time.

Stated as the sizing rule the procurement followed:

\[
\text{tier}_{\text{workhorse}} = \min\{\, \text{tier} : \text{ceiling}(\text{tier}) \ge \text{peak}_{\text{concurrent}} \,\}, \qquad \text{peak}_{\text{concurrent}} \in [60,\, 90]
\]

You buy the smallest tier whose concurrent ceiling clears the peak the research design will actually generate, and you keep a negotiated window above it for the tail.
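The rule reduces to a few lines of code. In the sketch below the tier names come from this article, but the concurrent ceilings are placeholders invented for illustration: the engagement records give the 60-to-90 peak, not a measured ceiling per tier.

```python
from typing import List, Optional, Tuple

# (tier name, concurrent-run ceiling). The ceilings are PLACEHOLDERS chosen
# only to illustrate the rule; they are not sourced or measured values.
TIERS: List[Tuple[str, int]] = [
    ("1080Ti stack", 8),
    ("DGX A100", 96),
    ("SuperPod", 480),
]


def smallest_tier_clearing(peak: int,
                           tiers: List[Tuple[str, int]] = TIERS) -> Optional[str]:
    """Return the lowest-ceiling tier whose ceiling is at least `peak`.

    Returns None when no tier clears the peak.
    """
    for name, ceiling in sorted(tiers, key=lambda tier: tier[1]):
        if ceiling >= peak:
            return name
    return None


if __name__ == "__main__":
    for peak in (60, 90):
        print(peak, "->", smallest_tier_clearing(peak))
```

Sizing against the top of the band (90) rather than the bottom is the conservative reading, since a tier that clears 90 also clears everything below it.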

The private DataOps server and the tear-down labs

Compute was the visible half. The residency clause forced two less obvious pieces on-prem as well.

The first was a private DataOps server for the versioned data the training runs consumed: a 1 TB network backbone with 4 TB of redundant SSD, clusterable behind a load balancer as the corpus grew. A cloud object store would have been the natural home for large image-log datasets, and it was precisely the option the NDA removed. So the versioned store lived on the same private network as the compute, which is what let a training run pin an exact data revision without any of that data ever leaving the LAN. The reproducibility machinery on top of this server, the data versioning and experiment tracking and model registry, is its own milestone we have written up separately; the infrastructure point here is that the store underneath it had to be iron we owned.
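Pinning an exact data revision can be illustrated with generic content addressing. The sketch below is not the programme's versioning tooling, which is described elsewhere; the function name, the file names, and the in-memory byte strings standing in for image-log files are all hypothetical.

```python
import hashlib
from typing import Dict


def revision_id(manifest: Dict[str, bytes]) -> str:
    """Derive a revision ID from file names and contents.

    Entries are hashed in sorted name order, so the same data always
    yields the same ID regardless of the order it was listed in.
    """
    digest = hashlib.sha256()
    for path in sorted(manifest):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator between a name and its content hash
        digest.update(hashlib.sha256(manifest[path]).digest())
    return digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical stand-ins for files held on the private store.
    revision = {"well_a.img": b"pixels-a", "well_b.img": b"pixels-b"}
    print(revision_id(revision))
```

A training run that records this ID alongside its results can later be matched to the exact bytes it read, and nothing in the scheme requires the data to leave the LAN.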

The second was the experiment sandbox. Not every piece of scratch work wants a slot on the DGX, and engineers need somewhere to stand up an isolated environment, test an idea, and tear it down. We ran these as VPC-isolated "Build and Tear" labs on a private cloud, each a disposable, network-isolated environment for a single experiment. The framing matters: even the throwaway lab was inside a private, isolated perimeter, not a public region, because the clause draws no distinction between a production database and a two-hour test box. Both hold, or can hold, confidential pixels; both stay behind the perimeter.

What it cost to own instead of rent

An on-prem bench trades a monthly cloud bill for a capital position, and the ledger shows both the discipline and the strain of that trade.

The rate the programme worked to reflected owning the hardware rather than renting it: an offered blended rate near USD 17.22 per GPU-hour against an industry baseline of about USD 30, a discount that only makes sense when the iron is bought, not leased. The moat behind that number was a multi-year hardware partnership locked in 2020, before the research started, plus a local datacenter arrangement that gave full cost visibility. When the 2022 energy shock pushed the rental market for DGX-class hosting from roughly USD 12,000 to 15,000 a month toward USD 20,000 and beyond, owning the hardware is what kept the programme off that spike.
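The size of that discount and of the rental spike follows from the quoted figures. The sketch below uses only the numbers in the paragraph above; the function names are illustrative, and USD 20,000 is treated as the lower bound of the post-shock price since the text says "and beyond".

```python
OFFERED_RATE_USD = 17.22   # blended rate per GPU-hour
BASELINE_RATE_USD = 30.00  # approximate industry baseline per GPU-hour


def discount_fraction(offered: float, baseline: float) -> float:
    """Return the discount of `offered` relative to `baseline` as a fraction."""
    return 1 - offered / baseline


def rental_increase(before: float, after: float) -> float:
    """Return the fractional price increase from `before` to `after`."""
    return after / before - 1


if __name__ == "__main__":
    print(f"discount: {discount_fraction(OFFERED_RATE_USD, BASELINE_RATE_USD):.1%}")
    for before in (12_000, 15_000):
        print(f"{before} -> 20000: +{rental_increase(before, 20_000):.1%}")
```

On these figures the offered rate sits about 42.6% below the baseline, and the rental market moved up by at least a third (from USD 15,000) to two thirds (from USD 12,000).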

Owning is not the same as being under budget. The research phases ran hot: runtime came in near 11,300 GPU-hours against a budget of 8,200, and the supervised phase alone consumed about 2,600 hours against 1,200 budgeted, a 2.2x overrun from running two research tracks in parallel on this bench. The clause fixed where the compute lived. It did not fix how much of it a hard research problem would burn.
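The overrun arithmetic, worked from the approximate figures quoted above (the names are illustrative, and the inputs are the rounded numbers from the text rather than ledger entries):

```python
from typing import Tuple

# (budgeted GPU-hours, actual GPU-hours), as quoted in the text.
PROGRAMME = (8_200, 11_300)
SUPERVISED_PHASE = (1_200, 2_600)


def overrun(budgeted: int, actual: int) -> Tuple[float, int]:
    """Return (actual / budgeted ratio, excess GPU-hours over budget)."""
    return actual / budgeted, actual - budgeted


if __name__ == "__main__":
    for label, (budgeted, actual) in (("programme", PROGRAMME),
                                      ("supervised phase", SUPERVISED_PHASE)):
        ratio, excess = overrun(budgeted, actual)
        print(f"{label}: {ratio:.2f}x budget, {excess} GPU-hours over")
```

The programme as a whole ran at about 1.38x budget (3,100 GPU-hours over), while the supervised phase ran at about 2.17x (1,400 GPU-hours over), so that one phase accounts for roughly 45% of the total excess.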

Read the contract before the datasheet

The transferable asset is not a parts list. It is the order of operations: read the data-residency clause first, and let it delete options before you draw an architecture. For an operator whose confidential asset is the subsurface data itself, a clause that forbids remote storage is not a preference to be satisfied with encryption; it is a hard boundary that removes the public cloud, the managed object store, and even the public development sandbox from the menu. What remains is an on-prem bench: compute tiers ordered by the concurrency a research phase generates, a private versioned store on the same LAN, and isolated tear-down labs inside the perimeter. Size to the peak number of simultaneous experiments, keep a negotiated burst window above it, and own the hardware if the economics and a pre-committed vendor partnership allow. The specifics of the cards will age. The posture, that the contract is the first line of the infrastructure spec, does not.

Limitations

These figures are from a single confidential engagement and are engineering and procurement numbers, not a benchmark. The compute-tier specifications, the 60-to-90 concurrent-run peak, the private DataOps server sizing, and the residency clause are drawn from the programme's own board decks, cost ledger, and signed NDA; the exact tier a given concurrency peak lands on in the instrument is an illustrative reading of the sourced sequential-to-parallel tier design, not a measured scheduler trace. The cost structure reflects a hardware partnership and academic-rate discount specific to this programme and the 2020-2022 market, and should not be read as a general price for on-prem GPU compute. The SuperPod tier was a negotiated option, not permanently provisioned. None of the numbers here should be quoted against a different vendor stack or a different confidentiality regime.

References

On-prem compute-tier architecture (1080Ti stack, DGX A100, optional SuperPod), the private DataOps server (1 TB network / 4 TB redundant SSD), and the VPC-isolated Build-and-Tear experiment labs, from an internal steering deck for the engagement; specifications withheld under operator confidentiality.
Mutual non-disclosure agreement establishing the data-residency terms (confidential information "not stored on a remote server of any kind," LAN-accessible only, hardware in locked cabinets when idle), dated July 2020; contents withheld under confidentiality.
Programme compute-economics ledger (budgeted vs actual GPU-hours per phase, blended per-GPU-hour rate) and the 60-to-90 concurrent-run peak sizing, from internal infrastructure cost and additional-budget records; figures withheld under confidentiality.
