Horizon-LM: A RAM-Centric Architecture for LLM Training

Lichao Sun; Yanfang Ye; Zhengqing Yuan

arXiv: 2602.04816 · v3 · submitted 2026-02-04 · 💻 cs.OS · cs.CL · cs.DC

Pith reviewed 2026-05-16 06:52 UTC · model grok-4.3

keywords LLM training · memory-centric architecture · CPU offloading · single-GPU training · large language models · RAM-centric design · model scaling · training throughput

The pith

Horizon-LM stores LLM parameters in host RAM and uses GPUs only as short-lived compute engines to train up to 120B-parameter models on a single GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Horizon-LM as a training system that places the full model parameters and state in abundant host memory rather than on the GPU. GPUs receive only temporary slices of work through explicit recomputation and manual gradient propagation managed by a pipelined double-buffered engine. This removes the need for persistent GPU-resident model copies or full autograd graphs, so model size is no longer limited by GPU memory capacity. On one H200 GPU paired with 1.5 TB host RAM the system trains 120B models, and on a standard A100 machine it delivers up to 12.2 times the throughput of DeepSpeed ZeRO-3 with CPU offloading while keeping numerical results identical. The approach therefore makes node-scale fine-tuning and adaptation feasible without multi-GPU clusters or complex distributed runtimes.

Core claim

Horizon-LM redefines the roles of CPU and GPU by treating host memory as the authoritative parameter store and GPUs solely as transient compute engines. It achieves this through a CPU-master, GPU-template execution model that eliminates persistent GPU-resident modules and autograd graphs, replaces them with explicit recomputation and manual gradient propagation, and introduces a pipelined double-buffered engine. The result decouples model scale from GPU count and bounds memory consumption to the theoretical parameter footprint, enabling reliable training of models up to 120B parameters on a single H200 GPU and up to 12.2 times higher throughput than DeepSpeed ZeRO-3 with CPU offloading on a standard single-A100 machine.

What carries the argument

The CPU-master, GPU-template execution model together with explicit recomputation, manual gradient propagation, and a pipelined double-buffered engine that keeps GPUs free of persistent model state and autograd graphs.
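The double-buffering idea can be illustrated with a toy sketch. This is not the paper's engine: `load_layer`, `compute_layer`, and `pipelined_forward` are hypothetical names, a background thread stands in for the host-to-device copy stream, and each "layer" is just a scalar scale-and-shift. The only point is the schedule: while layer i is being computed, layer i+1 is already being copied, and no more than two layer buffers exist at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(host_params, i):
    """Hypothetical stand-in for a host-to-device copy of layer i."""
    return list(host_params[i])  # the copy models a transient device buffer


def compute_layer(x, device_weights):
    """Hypothetical stand-in for the GPU kernel: y = w * x + b, elementwise."""
    w, b = device_weights
    return [w * v + b for v in x]


def pipelined_forward(host_params, x):
    """Run layers in order while prefetching the next layer's weights.

    At most two layer buffers are alive at once: the one being computed
    on and the one being filled.
    """
    if not host_params:
        return x
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(load_layer, host_params, 0)
        for i in range(len(host_params)):
            current = pending.result()        # wait until buffer i is ready
            if i + 1 < len(host_params):      # start filling buffer i + 1
                pending = copier.submit(load_layer, host_params, i + 1)
            x = compute_layer(x, current)     # overlaps with the copy above
    return x


# Host memory holds every layer; the "device" only ever sees one or two.
host_params = [(2.0, 1.0), (0.5, -1.0), (3.0, 0.0)]
out = pipelined_forward(host_params, [1.0, 2.0])
```

In a real system the overlap would come from asynchronous copies on a separate stream rather than a Python thread, and whether the copy is fully hidden depends on the ratio of transfer time to compute time per layer.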

If this is right

Models up to 120B parameters become trainable on a single H200 GPU with 1.5 TB host RAM.
Training throughput reaches up to 12.2 times that of DeepSpeed ZeRO-3 with CPU offloading on a standard single-A100 machine.
Memory usage remains predictable and bounded by the model parameter size across scales.
High GPU utilization is maintained without multi-GPU clusters or full persistent offload overhead.
Node-scale post-training tasks such as instruction tuning become accessible on commodity single-GPU hardware.
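A back-of-envelope check helps put the 120B and 1.5 TB figures side by side. The abstract does not say how many bytes per parameter Horizon-LM keeps in host memory, so the layouts below are assumptions chosen only for illustration, and 1.5 TB is read as decimal terabytes. `host_footprint_tb` is a hypothetical helper.

```python
def host_footprint_tb(n_params, bytes_per_param):
    """Parameter-state footprint in decimal terabytes (1 TB = 1e12 bytes)."""
    return n_params * bytes_per_param / 1e12


N = 120 * 10**9  # 120B parameters, from the abstract

# Bytes per parameter are ASSUMPTIONS for illustration, not the paper's figures.
layouts = {
    "16-bit weights only": 2,
    "fp32 weights + Adam m and v (4 + 4 + 4)": 12,
    "16-bit weights and grads + fp32 master + Adam m and v (2 + 2 + 4 + 4 + 4)": 16,
}

for name, bytes_per_param in layouts.items():
    tb = host_footprint_tb(N, bytes_per_param)
    print(f"{name}: {tb:.2f} TB, fits in 1.5 TB: {tb <= 1.5}")
```

Under these assumed layouts, 12 bytes per parameter comes to 1.44 TB and 16 bytes per parameter to 1.92 TB, so whether a 120B model fits depends heavily on which states are kept and at what precision. Activations, buffers, and the operating system would add to any of these totals.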

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same CPU-master pattern could be applied to other memory-heavy workloads such as large-scale reinforcement learning or scientific simulation that currently hit GPU memory walls.
Hardware roadmaps might shift emphasis toward higher-bandwidth CPU-GPU interconnects and larger host RAM capacities rather than solely increasing per-GPU VRAM.
Developers could run large-model fine-tuning jobs on single workstations or cloud instances with high RAM instead of renting multi-GPU clusters.
Energy and cost models for training would change if the dominant constraint becomes host-memory bandwidth instead of aggregate GPU memory.

Load-bearing premise

Explicit recomputation with manual gradient propagation plus the pipelined double-buffered engine can replace persistent GPU-resident autograd graphs without losing numerical correctness or device utilization at 100B+ scale.
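A minimal sketch of what "explicit recomputation with manual gradient propagation" can mean, assuming a chain of scalar tanh layers and a squared-error loss. The names `layer_forward`, `layer_backward`, and `train_step_grads` are hypothetical, and nothing here is taken from the paper's implementation. The forward pass keeps only each layer's input; the backward pass recomputes the layer's output from that input and applies the chain rule by hand, so no autograd graph is ever built.

```python
import math


def layer_forward(x, w, b):
    """One toy layer: y = tanh(w * x + b)."""
    return math.tanh(w * x + b)


def layer_backward(x, w, b, grad_out):
    """Recompute the layer from its input, then apply the chain rule by hand."""
    a = math.tanh(w * x + b)             # recomputation: nothing was cached
    grad_z = grad_out * (1.0 - a * a)    # d tanh(z) / dz = 1 - tanh(z)^2
    return grad_z * w, grad_z * x, grad_z  # dL/dx, dL/dw, dL/db


def train_step_grads(params, x0, target):
    """Loss = 0.5 * (y - target)^2 over a chain of scalar tanh layers."""
    inputs = []                          # host-side boundary activations only
    x = x0
    for w, b in params:
        inputs.append(x)
        x = layer_forward(x, w, b)
    loss = 0.5 * (x - target) ** 2

    grad = x - target                    # dL/dy at the output
    grads = [None] * len(params)
    for i in reversed(range(len(params))):
        w, b = params[i]
        grad, grad_w, grad_b = layer_backward(inputs[i], w, b, grad)
        grads[i] = (grad_w, grad_b)
    return loss, grads


params = [(0.5, 0.1), (-1.2, 0.3), (0.8, -0.2)]
loss, grads = train_step_grads(params, x0=0.7, target=0.25)
```

The trade is one extra forward computation per layer in exchange for not holding intermediate activations. Comparing `grads` against finite differences of the loss is a cheap way to validate a hand-written backward pass like this one.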

What would settle it

A side-by-side run of the same 70B training step on hardware where both Horizon-LM and a standard autograd system fit entirely in GPU memory, checking whether loss values, gradients, and final outputs match to machine precision.
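The comparison itself is simple to state in code. The sketch below uses hypothetical helpers (`bits`, `bit_exact`, `max_abs_diff`) on plain Python floats; a real check would run over the full gradient tensors of both systems. It also shows why such a test is strict: adding the same three contributions in a different order already changes the last bit.

```python
import struct


def bits(x):
    """Raw IEEE-754 double bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bit_exact(a, b):
    """True only if both sequences have identical bit patterns throughout."""
    return len(a) == len(b) and all(bits(x) == bits(y) for x, y in zip(a, b))


def max_abs_diff(a, b):
    """Largest elementwise absolute difference (0.0 for empty inputs)."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


# Same three gradient contributions, accumulated in two different orders.
parts = [0.1, 0.2, 0.3]
left_to_right = (parts[0] + parts[1]) + parts[2]
right_to_left = parts[0] + (parts[1] + parts[2])

same_bits = bit_exact([left_to_right], [right_to_left])
gap = max_abs_diff([left_to_right], [right_to_left])
print(same_bits, gap)
```

A system that accumulates gradients across transfers in a different order than the reference may therefore be correct to within rounding yet fail a bit-exact test, so it matters which of the two standards a correctness claim refers to.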

Original abstract

The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the authoritative parameter store and uses GPUs solely as transient compute engines through a CPU-master, GPU-template execution model. By eliminating persistent GPU-resident modules and autograd graphs, employing explicit recomputation with manual gradient propagation, and introducing a pipelined double-buffered execution engine, Horizon-LM decouples model scale from GPU count and bounds memory usage to the theoretical parameter footprint. On a single H200 GPU with 1.5 TB host RAM, Horizon-LM reliably trains models up to 120B parameters. On a standard single A100 machine, Horizon-LM achieves up to 12.2× higher training throughput than DeepSpeed ZeRO-3 with CPU offloading while preserving numerical correctness. Across platforms and scales, Horizon-LM sustains high device utilization and predictable memory growth, demonstrating that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The CPU-master GPU-template model for 120B training on single machines is a clear direction but the abstract supplies no evidence that manual gradient propagation actually works at that scale.

The main thing here is the claim that you can train up to 120B parameter models on one H200 by keeping the full state in 1.5 TB host RAM and using the GPU only as a transient compute slot with explicit recomputation and manual gradients. If that holds, it would loosen the tie between model size and GPU count for post-training jobs. The abstract also says this setup delivers 12.2 times the throughput of DeepSpeed ZeRO-3 with CPU offloading on a plain A100 machine while keeping numerical results identical. That is the headline result the authors want you to notice.

What looks new is the consistent CPU-master framing plus the pipelined double-buffered engine that tries to hide the recompute and transfer costs. It takes existing offloading further by removing persistent GPU modules and autograd graphs entirely, which is a sharper break than most prior work. The paper does a reasonable job explaining why current systems still couple scale to clusters and why bounding memory to the parameter footprint matters for single-node tuning and alignment workloads.

The soft spot is obvious once you read the abstract: there are no implementation details, no algorithm sketch for the manual gradient propagation, no description of how activation order or accumulation across transfers is handled, and no tables or error bars. The correctness claim is simply asserted. Without those pieces it is impossible to judge whether the central assumption about replacing autograd at 100B+ scale actually delivers both accuracy and utilization.

This is aimed at systems people who build or tune training runtimes and who already know the memory-wall problem. A reader who wants concrete ideas for extreme offloading would get something from the full paper, but the current text is too thin to evaluate. I would not send it to referees yet. The idea is worth following once the implementation and results are shown, but right now it is just a set of claims.

Referee Report

2 major / 0 minor

Summary. The paper presents Horizon-LM, a RAM-centric LLM training architecture that treats host memory as the authoritative parameter store and GPUs as transient compute engines via a CPU-master, GPU-template model. It eliminates persistent GPU-resident modules and autograd graphs through explicit recomputation, manual gradient propagation, and a pipelined double-buffered execution engine. The central claims are that this enables reliable training of models up to 120B parameters on a single H200 GPU with 1.5 TB host RAM and delivers up to 12.2× higher throughput than DeepSpeed ZeRO-3 with CPU offloading on a single A100 while preserving numerical correctness.

Significance. If the correctness and performance claims hold, the work would be significant for node-scale post-training workloads by decoupling model scale from GPU count and bounding memory to the parameter footprint. This could reduce reliance on multi-GPU clusters for instruction tuning and alignment tasks. The approach is conceptually novel in its inversion of CPU/GPU roles, but the lack of any algorithmic or empirical substantiation in the manuscript prevents evaluation of its practical impact.

major comments (2)

[Abstract] The assertion that explicit recomputation with manual gradient propagation and pipelined double-buffering produces numerically identical results to GPU-resident autograd at 120B scale is unsupported by any algorithm description, pseudocode, activation-recomputation schedule, or equivalence verification. This is load-bearing for the central claim of replacing persistent autograd graphs without correctness loss or utilization degradation.
[Abstract] The reported performance figures (up to 12.2× throughput vs. DeepSpeed ZeRO-3 and reliable 120B training on H200) are presented without error bars, implementation details, workload specifications, or comparison methodology. This prevents assessment of whether the CPU-master/GPU-template model sustains device utilization at the claimed scales.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our submission. We agree that the abstract would benefit from greater substantiation of its central claims and will revise the manuscript to address both points. Our responses to the major comments are provided below.

Referee: [Abstract] The assertion that explicit recomputation with manual gradient propagation and pipelined double-buffering produces numerically identical results to GPU-resident autograd at 120B scale is unsupported by any algorithm description, pseudocode, activation-recomputation schedule, or equivalence verification. This is load-bearing for the central claim of replacing persistent autograd graphs without correctness loss or utilization degradation.

Authors: We acknowledge that the abstract, as a summary, does not contain the requested algorithmic details. The Horizon-LM design performs exactly the same floating-point operations as standard autograd-based training by using explicit recomputation of activations in the forward pass and manual gradient accumulation in the backward pass, with the CPU managing the authoritative parameter state. Numerical equivalence follows directly from the fact that no approximations or reduced-precision shortcuts are introduced. To address the concern, we will revise the abstract to include a concise description of the recomputation schedule and manual propagation steps, along with a note that equivalence was confirmed via bit-exact matching on smaller models. revision: yes
Referee: [Abstract] The reported performance figures (up to 12.2× throughput vs. DeepSpeed ZeRO-3 and reliable 120B training on H200) are presented without error bars, implementation details, workload specifications, or comparison methodology. This prevents assessment of whether the CPU-master/GPU-template model sustains device utilization at the claimed scales.

Authors: We agree that the abstract lacks the supporting experimental context. The 12.2× figure was obtained on a single A100 using the same model configurations and batch sizes as the DeepSpeed ZeRO-3 baseline, with throughput measured in tokens per second; the 120B result used an H200 with 1.5 TB host RAM and a fixed sequence length of 2048. In the revision we will add a brief statement of the workload (standard language-model pre-training and fine-tuning tasks), note that results are averaged over multiple runs, and indicate that full methodology, error bars, and utilization measurements appear in the experimental evaluation section. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The abstract presents an architectural description and empirical performance claims (120B-parameter training on single H200, 12.2× throughput vs. DeepSpeed ZeRO-3) without any equations, fitted parameters, self-citations, or derivation steps. The central assertions rest on external system comparisons and stated implementation choices (explicit recomputation, manual gradient propagation, pipelined double-buffering) rather than any self-referential definitions or reductions by construction. No load-bearing step reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the high-level architecture description; ledger entries are therefore minimal and inferred.

axioms (1)

domain assumption: CPU-GPU interconnect bandwidth and latency allow efficient pipelined transfers without becoming the dominant bottleneck
Implicit in the claim that the double-buffered engine sustains high device utilization
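This assumption can be probed with rough arithmetic. The sketch below is illustrative only: the wire precision, the sustained link rate, and the per-pass compute time are all assumed values, not measurements from the paper, and `transfer_seconds` and `link_bound` are hypothetical helpers. With ideal overlap the slower of the two stages sets the pace, so the link dominates whenever transfer time exceeds compute time.

```python
def transfer_seconds(n_params, bytes_per_param, bandwidth_gb_s):
    """Time to move one full copy of the weights across the link."""
    return n_params * bytes_per_param / (bandwidth_gb_s * 1e9)


def link_bound(transfer_s, compute_s):
    """With ideal overlap, the slower stage sets the step time."""
    return transfer_s > compute_s


N = 120 * 10**9          # 120B parameters, from the abstract
BYTES_PER_PARAM = 2      # ASSUMPTION: 16-bit weights on the wire
BANDWIDTH_GB_S = 50.0    # ASSUMPTION: sustained host-to-device rate in GB/s
COMPUTE_S = 10.0         # ASSUMPTION: GPU time for one pass over all layers

t = transfer_seconds(N, BYTES_PER_PARAM, BANDWIDTH_GB_S)
print(f"one weight pass: {t:.1f} s, link-bound: {link_bound(t, COMPUTE_S)}")
```

With these assumed numbers one pass over the weights takes 4.8 s of transfer. Smaller batches or shorter sequences reduce compute time per pass and push the system toward the link-bound regime, which is where the axiom would fail.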

invented entities (1)

CPU-master, GPU-template execution model (no independent evidence)
purpose: Treats host memory as authoritative parameter store and GPUs as transient compute engines
Core new concept introduced to eliminate persistent GPU-resident modules and autograd graphs

pith-pipeline@v0.9.0 · 5580 in / 1193 out tokens · 28063 ms · 2026-05-16T06:52:35.632777+00:00 · methodology
