BluTrain: A C++/CUDA Framework for AI Systems

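Adhitya Charan; Adwaid Suresh; Anuj Kumar; Aparna A; Dhanakumar K; Dharun M S; Dinesh G; Goutham Kumar Reddy K; Harshini V M; Jenifa D; Jona Delcy C A; Kathirvel S; Killi Uma Maheswara Rao; Kiruthik Kanna M; Kurra Vishnu Sai; Madhumithaa G K; Navin Kumar V; Ram Charan Golla; Revathi T; Rishikkanth R; Sanjay Krishna M V; Surendra Vendra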

arxiv: 2606.24780 · v1 · pith:AOBUGPP5new · submitted 2026-06-23 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-25 23:21 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords C++ · CUDA · training framework · GPT-2 · throughput · memory efficiency · autograd · distributed execution

The pith

A native C++ and CUDA training framework sustains higher throughput and lower memory use than PyTorch for a 124M GPT-2 model while preserving numerical fidelity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BluTrain as a framework built from first principles in standard C++ and the core CUDA model to give direct control over hardware execution of deep learning models. It implements every component natively, including tensors with reverse-mode autograd, linear algebra routines, a caching allocator, distributed execution, and an MLIR compiler, to remove repetitive orchestration while allowing tuning. Evaluations training a 124M-parameter GPT-2 model in FP32 on an 8-GPU system show 407K tokens per second versus PyTorch's 395K, up to 22% smaller memory footprint, identical numerical results, and a slightly lower final validation loss. A sympathetic reader would care because the work claims that systems engineering at the framework level can improve training efficiency at scale without changes to model architecture.

Core claim

BluTrain is a lightweight, architecture-general training framework in C++ and CUDA with a typed tensor module that includes reverse-mode autograd, a linear-algebra library, a caching allocator, a multi-mode distributed-execution module, and an MLIR-based deep-learning compiler. Every layer is implemented natively to achieve absolute control over hardware expression while abstracting systems complexity. In formal evaluations on a 124M-parameter GPT-2 baseline in FP32 on an 8-GPU RTX 6000 Ada system, it sustains an average of 407K tokens/s versus PyTorch's 395K tokens/s, achieves up to a 22% footprint reduction, preserves numerical fidelity, and converges to a marginally lower final validation loss.

What carries the argument

The native C++/CUDA implementation of every layer, including the typed tensor module with reverse-mode autograd and the multi-mode distributed-execution module, which carries the argument by enabling direct hardware control and tuning.
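For readers unfamiliar with what a reverse-mode autograd layer has to do, the following is a minimal scalar sketch in standard C++. It is not BluTrain's tensor module: the names `Node`, `Var`, `make_var`, `add`, `mul`, and `backward` are hypothetical, and the sketch works on single doubles rather than typed tensors. It only illustrates the general technique of recording a graph in the forward pass and replaying it in reverse topological order.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical scalar autograd node (an illustration, not BluTrain's Tensor).
struct Node {
    double value;
    double grad;
    std::vector<std::shared_ptr<Node>> parents;  // inputs that produced this node
    std::function<void(Node&)> backward_fn;      // adds this node's grad into its parents
    explicit Node(double v) : value(v), grad(0.0) {}
};
using Var = std::shared_ptr<Node>;

Var make_var(double v) { return std::make_shared<Node>(v); }

Var add(const Var& a, const Var& b) {
    Var out = make_var(a->value + b->value);
    out->parents = {a, b};
    out->backward_fn = [](Node& self) {
        // d(a + b)/da = 1 and d(a + b)/db = 1
        self.parents[0]->grad += self.grad;
        self.parents[1]->grad += self.grad;
    };
    return out;
}

Var mul(const Var& a, const Var& b) {
    Var out = make_var(a->value * b->value);
    out->parents = {a, b};
    out->backward_fn = [](Node& self) {
        // d(a * b)/da = b and d(a * b)/db = a
        self.parents[0]->grad += self.parents[1]->value * self.grad;
        self.parents[1]->grad += self.parents[0]->value * self.grad;
    };
    return out;
}

// Post-order depth-first traversal: every node appears after all of its inputs.
void topo_sort(const Var& n, std::unordered_set<Node*>& seen, std::vector<Var>& order) {
    if (!seen.insert(n.get()).second) return;  // already visited
    for (const Var& p : n->parents) topo_sort(p, seen, order);
    order.push_back(n);
}

// Reverse-mode pass: seed d(root)/d(root) = 1, then walk outputs before inputs.
void backward(const Var& root) {
    assert(root != nullptr);
    std::unordered_set<Node*> seen;
    std::vector<Var> order;
    topo_sort(root, seen, order);
    root->grad = 1.0;
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        if ((*it)->backward_fn) (*it)->backward_fn(**it);
    }
}
```

Building `z = add(mul(x, y), x)` and calling `backward(z)` accumulates dz/dx = y + 1 into `x->grad` and dz/dy = x into `y->grad`. A production tensor autograd would additionally track shapes, devices, and in-place mutation, which this sketch deliberately leaves out.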

If this is right

Training throughput for models like GPT-2 can reach 407K tokens per second on 8-GPU systems through native framework design.
Memory footprint during training can be reduced by up to 22% compared with industry-standard frameworks.
Numerical fidelity remains strictly preserved and validation loss can reach a marginally lower value.
The performance ceiling for training becomes the framework's own to raise through explicit native tuning of every layer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams training large models on specific hardware clusters could adopt native frameworks to lower per-epoch wall-clock time and hardware requirements.
Open native code bases might allow targeted optimizations for new accelerator generations that closed frameworks cannot expose.
The same approach could be tested on other model families such as vision transformers or diffusion models to check for similar gains.
Production pipelines might integrate such frameworks to reduce overall training energy use when the reported memory savings scale.

Load-bearing premise

The reported throughput and memory numbers were obtained under identical experimental conditions, model configurations, and optimization settings as the PyTorch baseline.

What would settle it

Re-running the 124M-parameter GPT-2 training experiment on the same 8-GPU system with all data loading, precision handling, and kernel parameters fully disclosed and identical to the baseline, then checking whether the 407K versus 395K tokens/s difference and 22% memory reduction still appear.
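The size of the gap at stake can be made concrete: (407 − 395) / 395 ≈ 0.030, so the headline throughput advantage is about 3.0%. The sketch below, in standard C++, shows one simple way a re-run could compare that delta against run-to-run spread. The helper names (`relative_delta`, `median`, `relative_spread`, `delta_exceeds_spread`) and the max-minus-min spread criterion are assumptions made for illustration; the paper does not describe its statistical procedure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Relative improvement of `candidate` over `baseline`, e.g. tokens/s.
double relative_delta(double candidate, double baseline) {
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline;
}

// Median of a set of repeated measurements (takes a copy so it can sort).
double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Run-to-run spread of repeated measurements, as (max - min) / median.
double relative_spread(const std::vector<double>& runs) {
    assert(!runs.empty());
    const auto mm = std::minmax_element(runs.begin(), runs.end());
    return (*mm.second - *mm.first) / median(runs);
}

// Crude check (an assumed criterion): does the candidate's gain over the
// baseline median exceed the baseline's own run-to-run spread?
bool delta_exceeds_spread(double candidate, const std::vector<double>& baseline_runs) {
    return relative_delta(candidate, median(baseline_runs)) > relative_spread(baseline_runs);
}
```

If repeated baseline runs on the same machine already vary by more than about 3%, a single 407K versus 395K comparison would not separate the two frameworks, which is why the repeated, fully disclosed re-run matters.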

Figures

Figures reproduced from arXiv: 2606.24780 by Adhitya Charan, Adwaid Suresh, Anuj Kumar, Aparna A, Dhanakumar K, Dharun M S, Dinesh G, Goutham Kumar Reddy K, Harshini V M, Jenifa D, Jona Delcy C A, Kathirvel S, Killi Uma Maheswara Rao, Kiruthik Kanna M, Kurra Vishnu Sai, Madhumithaa G K, Navin Kumar V, Ram Charan Golla, Revathi T, Rishikkanth R, Sanjay Krishna M V, Surendra Vendra.

**Figure 1.** The modular architecture of the BluTrain framework.

2.1 Tensor & Ops Module (adjacent text): The foundation of the framework is the Tensor & Ops module, which serves as the primary execution environment. A Tensor encapsulates its shape, stride, and view offset configurations alongside rich runtime metadata (including datatype, device placement, version counters for in-place mutation tracking, and lazy autograd states), all…

**Figure 2.** The GPT-2 decoder-only Transformer architecture featuring 12 pre-normalized decoder blocks, each combining masked multi-head self-attention with a GELU feed-forward sublayer through residual connections, and weight-tied embeddings.

**Figure 3.** Training and validation loss over the full 19,073-step 124M GPT-2 run. Panels (a)–(c) show the raw per-run train/val curves. All configurations converge identically to a final validation loss of ≈ 3.07, confirming that BluTrain preserves numerical fidelity against both PyTorch eager and compile.

6.2 Throughput (adjacent text): Across the identical 124M, 8-GPU, 19,073-step workload, BluTrain achieves an aggregate throughpu…

**Figure 4.** [caption not extracted]

**Figure 5.** Forward Attention median latency (Causal=0). Lower is better.

**Figure 6.** Forward Attention median latency (Causal=1). Lower is better.

**Figure 7.** Forward Attention median latency (T=2048 configs). Lower is better.

**Figure 8.** Forward Attention TFLOPS (Causal=0). Higher is better.

**Figure 9.** Forward Attention TFLOPS (Causal=1). Higher is better.

**Figure 10.** Forward Attention TFLOPS (T=2048 configs). Higher is better.

B.2 Attention Backward (BluBridge, PyTorch)

**Figure 11.** Backward Attention median latency (Causal=0). Lower is better.

**Figure 12.** Backward Attention median latency (Causal=1). Lower is better.

**Figure 13.** Backward Attention median latency (T=2048 configs). Lower is better.

**Figure 14.** Backward Attention TFLOPS (Causal=0). Higher is better.

**Figure 15.** Backward Attention TFLOPS (Causal=1). Higher is better.

**Figure 16.** Backward Attention TFLOPS (T=2048 configs). Higher is better.

B.3 GELU Forward (BluBridge, PyTorch)

**Figure 17.** Forward GELU median latency (T=1024). Lower is better.

**Figure 18.** Forward GELU memory bandwidth (T=1024). Higher is better.

**Figure 19.** Forward GELU performance (T=2048). Left: Median Latency (Lower is better). Right: Memory Bandwidth (Higher is better).

B.4 GELU Backward (BluBridge, PyTorch)

**Figure 20.** Backward GELU median latency (T=1024). Lower is better.

**Figure 21.** Backward GELU memory bandwidth (T=1024). Higher is better.

**Figure 22.** Backward GELU performance (T=2048). Left: Median Latency. Right: Bandwidth.

B.5 LayerNorm Forward (BluBridge, PyTorch)

**Figure 23.** Forward LayerNorm median latency. Lower is better.

**Figure 24.** Forward LayerNorm memory bandwidth. Higher is better.

B.6 LayerNorm Backward (BluBridge, PyTorch)

**Figure 25.** Backward LayerNorm median latency. Lower is better.

**Figure 26.** Backward LayerNorm memory bandwidth. Higher is better.

**Figure 27.** Forward Loss median latency. Lower is better.

**Figure 28.** Forward Loss memory bandwidth. Higher is better.

**Figure 29.** Backward Loss median latency. Lower is better.

**Figure 30.** Backward Loss memory bandwidth. Higher is better.

**Figure 31.** Reduce Sum median latency. Lower is better.

**Figure 32.** Reduce Sum memory bandwidth. Higher is better.

**Figure 33.** AdamW Optimizer median latency. Lower is better.

**Figure 34.** AdamW Optimizer memory bandwidth. Higher is better.

B.11 Distributed Data Parallel Execution: The following distributed data-parallel benchmarks were evaluated on an 8x NVIDIA RTX 6000 Ada Lovelace (48 GB) multi-GPU hardware cluster, training a 124M-parameter GPT-2 configuration.

**Figure 35.** Distributed Data Parallel (DDP) AllReduce latency and throughput scaling against varying bucket sizes (5MB to 100MB) during the training of a 124M-parameter GPT-2 configuration.

B.12 Tensor Parallelism Benchmarks: The following benchmarks evaluate the exact execution characteristics of the tensor-parallel runtime. All empirical measurements were conducted during the training of a 124M-parameter GPT-2 confi…

**Figure 38.** Per-Stage Latency Breakdown: This visualizes exactly where the AsyncTP protocol recovers idle time. By hiding the synchronous AllReduce network transfers strictly behind the matmul compute kernels, it selectively collapses the backward pass latency without affecting the forward pass.

B.13 Context Parallelism Benchmarks: The following benchmarks evaluate the context-parallel runtime across its three ring-ro…

**Figure 39.** Context Parallelism on 44M GPT-2, dual RTX 5070: the CP backends preserve a throughput lead over PyTorch alongside an equal-or-better convergence.

**Figure 40.** Context Parallelism on 163M GPT-2, dual RTX 6000 Ada: higher end-to-end throughput than PyTorch at an equal-or-better validation loss. Across both testbeds the context-parallel runtime delivers higher end-to-end throughput than the PyTorch baseline while matching or improving the final validation loss, with the largest margin on the RTX 6000 Ada testbed

**Figure 41.** Single-file checkpoint latency, BluTrain vs PyTorch (median, FP32, integrity-verified). Synchronous save and load are directly comparable across systems; the asynchronous bars report only the staging stall the training loop actually waits on, for which PyTorch has no equivalent. The full asynchronous save completes in 177 ms for 44M (20 ms GPU stall + 157 ms background disk write, hidden) and 503 ms for 1…

Original abstract

Progress in deep learning is, at scale, more a matter of systems engineering than of modelling: the behaviour of a model in training (its throughput, its memory footprint, and the numerical fidelity of the result) is determined less by the architecture itself than by how that architecture is expressed on the hardware. To achieve absolute control over this hardware expression while abstracting away systems complexity to make modelling seamless and eliminating the need for repetitive orchestration logic, BluTrain was architected from first principles as a robust, lightweight, and architecture-general training framework in standard C++ and the core CUDA programming model. Every layer is implemented natively: a typed tensor module with reverse-mode autograd, a linear-algebra library, a caching allocator, a multi-mode distributed-execution module, and an MLIR-based deep-learning compiler. In formal evaluations training a 124M-parameter GPT-2 baseline in FP32 on an 8-GPU 6000 Ada system, BluTrain outperforms industry-standard baselines in both throughput (sustaining an average of 407K tokens/s versus PyTorch's 395K tokens/s) and memory efficiency (achieving up to a 22% footprint reduction), while strictly preserving numerical fidelity and converging to a marginally lower final validation loss. With every layer explicitly open to native tuning, the performance ceiling is the framework's own to raise.
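Of the components the abstract lists, the caching allocator is the one most directly tied to the memory-footprint claim, so a small illustration of the general idea may help. The sketch below is written in standard C++ under stated assumptions: `CachingAllocator`, its 512-byte rounding granule, and its counters are hypothetical, and `std::malloc`/`std::free` stand in for the device allocation calls a GPU allocator would use. It shows only the core technique of returning freed blocks to a size-keyed cache instead of the backend; it is not a description of BluTrain's allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical host-side caching allocator. std::malloc/std::free are
// stand-ins (an assumption) for expensive device allocation calls.
class CachingAllocator {
public:
    CachingAllocator() : backend_allocs_(0), cache_hits_(0) {}
    CachingAllocator(const CachingAllocator&) = delete;
    CachingAllocator& operator=(const CachingAllocator&) = delete;

    ~CachingAllocator() {
        // Return cached and still-live blocks to the backend.
        for (auto& kv : free_blocks_)
            for (void* p : kv.second) std::free(p);
        for (auto& kv : live_) std::free(kv.first);
    }

    void* allocate(std::size_t bytes) {
        const std::size_t size = round_up(bytes);
        auto it = free_blocks_.find(size);
        if (it != free_blocks_.end() && !it->second.empty()) {
            void* p = it->second.back();  // reuse a cached block of this size class
            it->second.pop_back();
            live_[p] = size;
            ++cache_hits_;
            return p;
        }
        void* p = std::malloc(size);      // cache miss: go to the backend
        if (p == nullptr) return nullptr;
        live_[p] = size;
        ++backend_allocs_;
        return p;
    }

    // Releasing a block parks it in the cache rather than freeing it.
    void release(void* p) {
        auto it = live_.find(p);
        assert(it != live_.end());
        free_blocks_[it->second].push_back(p);
        live_.erase(it);
    }

    std::size_t backend_allocs() const { return backend_allocs_; }
    std::size_t cache_hits() const { return cache_hits_; }

private:
    // Round requests up to a fixed granule so nearby sizes share a size class.
    static std::size_t round_up(std::size_t bytes) {
        constexpr std::size_t kGranule = 512;
        if (bytes == 0) bytes = 1;
        return ((bytes + kGranule - 1) / kGranule) * kGranule;
    }

    std::unordered_map<std::size_t, std::vector<void*>> free_blocks_;
    std::unordered_map<void*, std::size_t> live_;
    std::size_t backend_allocs_;
    std::size_t cache_hits_;
};
```

In a training loop the same tensor shapes recur every step, so after the first iteration most requests are served from the cache. The rounding granule trades a little internal fragmentation for a higher hit rate, which is one of the allocator choices that can move a reported memory footprint.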

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BluTrain describes a full native C++/CUDA training stack but its performance claims over PyTorch rest on missing experimental details that make the 3% throughput edge impossible to assess.

The one thing to take away is that this is an engineering write-up of a from-scratch C++/CUDA framework with its own tensor module, autograd, allocator, and MLIR compiler, and the only quantitative support is a 124M GPT-2 run showing 407K vs 395K tokens/s and up to 22% less memory. Those numbers are the entire case for superiority.

The work does ship a complete native implementation rather than wrappers, which is real effort. Every component is written in standard C++ and CUDA, and the authors report that numerical results match while the final validation loss is marginally better. That level of self-contained systems work is uncommon in short papers.

The soft spot is the complete absence of any experimental protocol. The abstract gives no batch size, sequence length, data loader details, optimizer settings, whether the PyTorch baseline used stock modules or any custom kernels, or how memory was measured. A 3% throughput difference sits well inside the variation that can come from prefetching, allocator behavior, or launch configuration alone. Without those controls the claim cannot be checked.

This paper is for the small group of systems researchers who want to inspect or extend a full native stack themselves. Most readers will find little actionable content because the central result is not reproducible from the text.

I would not bring it to reading group and would not cite it. It does not deserve peer review in its current form; the evidence for the main claim is simply not present.

Referee Report

1 major / 0 minor

Summary. The paper presents BluTrain, a C++/CUDA training framework built from first principles with native typed tensors and reverse-mode autograd, a linear-algebra library, a caching allocator, multi-mode distributed execution, and an MLIR-based compiler. It claims that training a 124M-parameter GPT-2 model in FP32 on an 8-GPU RTX 6000 Ada system yields 407K tokens/s throughput (vs. PyTorch's 395K) and up to 22% lower memory footprint while preserving numerical fidelity and achieving a marginally lower validation loss.

Significance. If the performance claims can be substantiated under controlled conditions, the work would offer a fully open, natively tunable alternative to dominant frameworks, with the potential to raise performance ceilings through direct kernel access; however, the absence of verifiable experimental controls currently prevents this assessment.

major comments (1)

[Abstract] The headline claims of 407K vs 395K tokens/s throughput and 22% memory reduction are presented with no description of batch size, sequence length, data-loader implementation, gradient-accumulation steps, optimizer-state placement, kernel-launch configuration, or whether the PyTorch baseline used stock nn.Linear/F.scaled_dot_product_attention or any custom extensions; without these controls the 3% throughput delta cannot be attributed to the framework.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The concern regarding insufficient experimental controls in the abstract is valid, and we will revise the manuscript to address it directly.

Referee: [Abstract] The headline claims of 407K vs 395K tokens/s throughput and 22% memory reduction are presented with no description of batch size, sequence length, data-loader implementation, gradient-accumulation steps, optimizer-state placement, kernel-launch configuration, or whether the PyTorch baseline used stock nn.Linear/F.scaled_dot_product_attention or any custom extensions; without these controls the 3% throughput delta cannot be attributed to the framework.

Authors: We agree that the abstract, in its current form, omits key experimental parameters required to interpret the reported deltas. In the revised manuscript we will expand the abstract to specify the batch size, sequence length, data-loader implementation, number of gradient-accumulation steps, optimizer-state placement (CPU vs. GPU), kernel-launch configuration, and explicit confirmation that the PyTorch baseline used only stock nn.Linear and F.scaled_dot_product_attention with no custom extensions. These additions will make the 3% throughput and 22% memory claims attributable to BluTrain under controlled conditions.

Revision: yes
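The parameters named in this exchange determine how many tokens each optimizer step processes, so they need to be recorded alongside any tokens/s figure. The following C++ sketch shows one way such a record could look. The struct `RunConfig`, its field names, and the helpers `tokens_per_step`, `tokens_per_second`, and `comparable` are hypothetical and carry no values from the paper; the token-count formula assumes plain data parallelism, where each GPU processes its own micro-batches.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical record of the controls the referee asks for.
struct RunConfig {
    std::string framework;        // label only, e.g. "BluTrain" or "PyTorch"
    int num_gpus;                 // data-parallel replicas
    int micro_batch_size;         // sequences per GPU per micro-step
    int sequence_length;          // tokens per sequence
    int grad_accum_steps;         // micro-steps per optimizer step
    bool optimizer_state_on_gpu;  // false would mean CPU placement
    bool stock_kernels_only;      // no custom extensions in the run
};

// Tokens consumed by one optimizer step under plain data parallelism.
std::int64_t tokens_per_step(const RunConfig& c) {
    assert(c.num_gpus > 0 && c.micro_batch_size > 0 &&
           c.sequence_length > 0 && c.grad_accum_steps > 0);
    return static_cast<std::int64_t>(c.num_gpus) * c.micro_batch_size *
           c.sequence_length * c.grad_accum_steps;
}

// Throughput implied by a measured wall-clock time per optimizer step.
double tokens_per_second(const RunConfig& c, double step_seconds) {
    assert(step_seconds > 0.0);
    return static_cast<double>(tokens_per_step(c)) / step_seconds;
}

// Two runs are comparable only if every control except the framework matches.
bool comparable(const RunConfig& a, const RunConfig& b) {
    return a.num_gpus == b.num_gpus &&
           a.micro_batch_size == b.micro_batch_size &&
           a.sequence_length == b.sequence_length &&
           a.grad_accum_steps == b.grad_accum_steps &&
           a.optimizer_state_on_gpu == b.optimizer_state_on_gpu &&
           a.stock_kernels_only == b.stock_kernels_only;
}
```

A record like this would not by itself cover data-loader or kernel-launch details, but publishing it for both runs would let a reader confirm that the two throughput numbers describe the same workload.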

Circularity Check

0 steps flagged

No circularity; paper is a systems description with no derivations or fitted predictions

The manuscript describes the architecture and implementation of a C++/CUDA training framework, including native layers, autograd, allocator, and compiler components. Performance numbers (407K vs 395K tokens/s, 22% memory reduction) are presented as direct empirical measurements on a GPT-2 baseline. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on benchmark results rather than any mathematical reduction to inputs. This matches the default expectation of no significant circularity for non-derivational systems papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the framework is presented as an engineering artifact built on the standard CUDA programming model and MLIR infrastructure.

axioms (1)

Domain assumption: The CUDA programming model and MLIR infrastructure behave as documented by NVIDIA and the MLIR project. All native layer implementations rest on the correctness and performance characteristics of these external systems.

Reference graph

Works this paper leans on

36 extracted references · 1 canonical work pages

[1] C. Lattner et al. MLIR: A Compiler Infrastructure for the End of Moore's Law. arXiv:2002.11654, 2020.
[2] S. Li et al. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. VLDB, 2020.
[3] A. Paszke et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS, 2019.
[4] M. Shoeybi et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053, 2019.
[5] H. He et al. Introducing Async Tensor Parallelism in PyTorch (TorchTitan). PyTorch Dev Discuss, 2024.
[6] X. Wu et al. Breaking Barriers: Training Long Context LLMs with 1M Sequence Length in PyTorch Using Context Parallel. PyTorch Dev Discuss, 2025.
[7] T. Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS, 2022.
[8] H. Liu et al. Ring Attention with Blockwise Transformers for Near-Infinite Context. ICLR, 2024.
[9] Insujang. Introducing Context Parallelism. https://insujang.github.io/2024-09-20/introducing-context-parallelism/, 2024.
[10] D. Narayanan et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. SC, 2021.
[11] A. Eisenman et al. Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models. NSDI, 2022.
[12] B. Nie et al. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center GPUs: A Field Study. 2018.
[13] Y. Zhao et al. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. VLDB, 2023.
[14] BluBridge Team. BluBLAS: Hand-Tuned GEMM Kernels for Ada Lovelace Tensor Cores. BluBridge Technologies, technical report, 2026.
[15] A. Radford et al. Language Models are Unsupervised Multitask Learners. OpenAI, 2019.
[16] G. Penedo et al. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557, 2024.
[17] I. Loshchilov, F. Hutter. Decoupled Weight Decay Regularization. ICLR, 2019.
[18] A. Ansel et al. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. ASPLOS, 2024.
[19] M. Abadi et al. TensorFlow: A System for Large-Scale Machine Learning. OSDI, 2016.
[20] PyTorch Team. TorchDynamo. https://docs.pytorch.org/docs/2.12/user_guide/torch_compiler/torch.compiler_dynamo_overview.html, 2022.
[21] A. Sergeev, M. Del Balso. Horovod: Fast and Easy Distributed Deep Learning in TensorFlow. arXiv:1802.05799, 2018.
[22] S. Rajbhandari et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC, 2020.
[23] J. Ren et al. ZeRO-Offload: Democratizing Billion-Scale Model Training. USENIX ATC, 2021.
[24] S. Rajbhandari et al. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. SC, 2021.
[25] J. Rasley et al. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. KDD, 2020.
[26] N. Shazeer et al. Mesh-TensorFlow: Deep Learning for Supercomputers. NeurIPS, 2018.
[27] S. Jacobs et al. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. arXiv:2309.14509, 2023.
[28] P. Tillet et al. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. MAPL, 2019.
[29] P. Tillet. Introducing Triton: Open-Source GPU Programming for Neural Networks. OpenAI Blog, 2021.
[30] T. Chen et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. OSDI, 2018.
[31] C. Leary et al. XLA: TensorFlow, Compiled. TensorFlow Dev Summit, 2017.
[32] T. Chen et al. Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174, 2016.
[33] X. Peng et al. Capuchin: Tensor-based GPU Memory Management for Deep Learning. ASPLOS, 2020.
[34] NVIDIA. CUTLASS: Fast Linear Algebra in CUDA C++. https://github.com/NVIDIA/cutlass
[35] M. Milakov, N. Gimelshein. Online Normalizer Calculation for Softmax. arXiv:1805.02867, 2018.
[36] B. P. Welford. Note on a Method for Calculating Corrected Sums of Squares and Products. Technometrics, 4(3):419–420, 1962. doi:10.1080/00401706.1962.10490022.

Appendix A, Hardware and Benchmarking Methodology (excerpt): All empirical kernel benchmarks and performance metrics presented in Appendix B.1 through B.10 were executed on a single NVIDIA RTX 6000 Ada Lovelace (48GB...
