BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"

Vladimer Khasia

arxiv: 2604.16324 · v1 · submitted 2026-03-05 · 💻 cs.LG

BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"

Vladimer Khasia This is my paper

Pith reviewed 2026-05-15 15:53 UTC · model grok-4.3

classification 💻 cs.LG

keywords activation sketchingmemory-efficient backpropagationghost backpropagationbalanced hashinginvariant scalarsgradient compressionGPT training

0 comments

The pith

BASIS sketches activations into rank-R tensors to decouple backpropagation memory from batch and sequence size while keeping exact error signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BASIS as a backpropagation algorithm that propagates the precise gradient signal dX but replaces the weight-update computation dW with heavily compressed rank-R sketches. It introduces balanced hashing to remove off-diagonal collision variance and invariant scalars to keep the continuous energy norm intact, thereby reducing activation storage from O(L * BN) to O(L * RN). If the method works as described, deep models such as GPTs can be trained with far less memory while still converging, and the same compressed path even acts as a mild regularizer that sometimes yields lower validation loss than full-precision backpropagation.

Core claim

BASIS propagates the exact error signal (dX) to preserve flawless gradient flow, but computes the weight updates (dW) using massively compressed rank-R tensors stabilized by balanced hashing that eliminates off-diagonal collision variance and invariant scalars that deterministically preserve the exact continuous energy norm, reducing activation memory to O(L * RN) and allowing GPT training to reach or exceed exact-backpropagation validation loss at R = 32 while still converging at R = 1.

What carries the argument

Balanced hashing together with invariant scalars applied to rank-R activation sketches that replace the weight-gradient computation while leaving the error-signal path exact.

If this is right

Activation memory scales linearly with rank R rather than with batch size B and sequence length.
The backward-pass matrix-multiplication cost drops proportionally to the compression ratio.
Training remains stable and converges even when the sketch rank is reduced to 1.
The magnitude-stabilized trajectory can produce lower validation loss than exact backpropagation, functioning as an implicit regularizer.
The same compressed backward path can be used across different network depths without reintroducing the original O(L * BN) memory wall.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on transformer variants with context lengths far beyond current hardware limits to check whether the memory decoupling enables new scale regimes.
If the invariant-scalar construction generalizes, similar norm-preserving sketches might be applied to other first-order optimizers or to forward-pass compression.
The observed regularization effect at low R suggests a controlled study of whether the method systematically improves generalization on downstream tasks.
Hardware implementations could measure actual wall-clock speedups once the reduced matrix-multiplication footprint is mapped onto sparse or low-precision kernels.

Load-bearing premise

Balanced hashing and invariant scalars together remove enough variance and preserve enough norm that the sketched weight updates remain stable enough for convergence even at rank 1.

What would settle it

Run the same 50,000-step GPT training loop with BASIS at R = 32 and measure whether the final validation loss stays within 0.05 of the exact-backpropagation baseline of 6.616.

Figures

Figures reproduced from arXiv: 2604.16324 by Vladimer Khasia.

read the original abstract

The activation memory required for exact backpropagation scales linearly with network depth, context length, and feature dimensionality, forming an O(L * BN ) spatial bottleneck (where B is the sequence-batch cardinality and N is the feature dimension). This constraint historically throttles the scaling of deep neural networks. While randomized automatic differentiation attempts to mitigate this, it historically suffers from catastrophic variance. In this paper, we introduce BASIS (Balanced Activation Sketching with Invariant Scalars), an efficient backpropagation algorithm that fully decouples activation memory from the batch and sequence dimensions. BASIS propagates the exact error signal (dX) to preserve flawless gradient flow, but computes the weight updates (dW) using massively compressed rank-R tensors. To solve the foundational instability of sketched gradients, we propose two novel mechanisms: Balanced Hashing, which strictly eliminates off-diagonal collision variance, and Invariant Scalars, a principled bias-variance tradeoff that deterministically preserves the exact continuous energy norm of the spatial geometry. Theoretically, BASIS reduces activation memory to O(L * RN ) and heavily decreases the backward pass matrix-multiplication footprint. Empirically, training a GPT architecture for 50,000 steps validates our theoretical guarantees: at R = 32, BASIS achieves parity with (and marginally outperforms) exact backpropagation validation loss (6.575 vs. 6.616), acting as an implicit regularizer. Remarkably, the stabilized magnitude trajectory allows the model to converge smoothly even under extreme spatial compression (R = 1), proving the extreme robustness of the estimator. The code is available at https://github.com/VladimerKhasia/basis

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BASIS sketches activations for lower memory in backprop but its core claims on variance elimination and norm preservation lack visible derivations.

read the letter

The main point is a sketching method for backprop that keeps exact error signals for the input side while compressing the weight updates to rank R. This is meant to drop activation memory from linear in batch and sequence length down to linear in R times feature dim, with two new pieces: balanced hashing to kill collision variance and invariant scalars to hold the energy norm steady. Empirically it shows parity or slight improvement over exact backprop at R=32 on a GPT run, and even converges at R=1, which is the strongest result here. The public code is a plus for anyone wanting to test it. The soft spots sit in the theory. The abstract asserts that balanced hashing strictly removes off-diagonal variance and that the scalars preserve the exact continuous norm, but no derivation, bound, or proof sketch appears to support why these hold deterministically instead of approximately. If either property leaks under finite precision or real data, the low-R stability would not follow as claimed. Experiments are narrow—one architecture, no visible ablations on the new components—so it is hard to tell how general the robustness is. This work is aimed at people building or scaling large models who care about activation memory. A reader already working on randomized autodiff or memory-efficient training would find the specific mechanisms worth looking at. It deserves a serious referee because the problem is real, the empirical signal is positive, and the code lets others check the claims directly.

Referee Report

3 major / 2 minor

Summary. The paper introduces BASIS, a backpropagation algorithm that decouples activation memory from batch/sequence dimensions by propagating exact dX error signals while computing sketched rank-R dW updates. It claims two new mechanisms—balanced hashing that strictly eliminates off-diagonal collision variance and invariant scalars that deterministically preserve the exact continuous energy norm—solve the instability of sketched gradients. Theoretically this yields O(L RN) activation memory; empirically, on GPT training for 50k steps, R=32 matches or slightly beats exact backprop validation loss (6.575 vs 6.616) and the method converges smoothly even at R=1.

Significance. If the deterministic variance-elimination and norm-preservation properties can be rigorously established, BASIS would offer a practical route to training deeper or longer-context models under severe memory constraints, with the reported R=1 convergence suggesting an unusually stable sketched estimator that may also regularize. The combination of exact dX flow with compressed dW is a clean architectural split that, if proven, would be a notable advance over prior randomized autodiff approaches.

major comments (3)

[Abstract] Abstract: the central claim that balanced hashing 'strictly eliminates off-diagonal collision variance' (so that sketched dW has zero extra variance) is load-bearing for both the O(L RN) memory bound and the R=1 convergence result, yet no derivation, collision-probability analysis, or proof sketch is supplied; without this the stability guarantee does not follow from the stated construction.
[Abstract] Abstract: invariant scalars are asserted to 'deterministically preserve the exact continuous energy norm of the spatial geometry' under arbitrary compression, but no bias-variance bound, finite-precision analysis, or continuous-to-discrete equivalence is given; this assumption is required for the reported robustness at R=1 and must be substantiated before the empirical parity can be interpreted as theoretical validation.
[Empirical results] Empirical validation paragraph: the single-run validation-loss comparison (6.575 vs 6.616) is presented without reported standard deviation across seeds, number of independent trials, or ablation of the two new mechanisms, so it is impossible to determine whether the marginal improvement is statistically reliable or merely an artifact of the particular training trajectory.

minor comments (2)

[Abstract] Abstract: notation 'O(L * BN )' contains an extraneous space and inconsistent spacing around the multiplication symbol; standardize to O(L BN).
[Abstract] Abstract: the parenthetical phrase 'Ghost Backpropagation' is introduced in quotes without definition or reference to prior usage; either remove or supply a one-sentence clarification of its intended meaning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful and constructive feedback. We address each major comment below and commit to revisions that strengthen the theoretical derivations and empirical reporting while preserving the core contributions of the work.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that balanced hashing 'strictly eliminates off-diagonal collision variance' (so that sketched dW has zero extra variance) is load-bearing for both the O(L RN) memory bound and the R=1 convergence result, yet no derivation, collision-probability analysis, or proof sketch is supplied; without this the stability guarantee does not follow from the stated construction.

Authors: We agree that the variance-elimination property requires explicit derivation. Balanced hashing is constructed so that each off-diagonal pair receives equal positive and negative collision contributions that cancel in expectation. In the revised manuscript we will insert a dedicated theoretical subsection containing a collision-probability analysis and a short proof that the expected off-diagonal variance term is identically zero, thereby rigorously justifying both the O(L RN) memory claim and the observed stability at low rank. revision: yes
Referee: [Abstract] Abstract: invariant scalars are asserted to 'deterministically preserve the exact continuous energy norm of the spatial geometry' under arbitrary compression, but no bias-variance bound, finite-precision analysis, or continuous-to-discrete equivalence is given; this assumption is required for the reported robustness at R=1 and must be substantiated before the empirical parity can be interpreted as theoretical validation.

Authors: We acknowledge that a formal bias-variance analysis and finite-precision treatment are missing. The invariant scalars are defined to enforce exact L2-norm preservation in the continuous limit; we will add a bias-variance decomposition proving zero bias together with an explicit variance bound scaling with 1/R, plus a short finite-precision error analysis and a statement of the continuous-to-discrete equivalence. These additions will appear in the revised theoretical section. revision: yes
Referee: [Empirical results] Empirical validation paragraph: the single-run validation-loss comparison (6.575 vs 6.616) is presented without reported standard deviation across seeds, number of independent trials, or ablation of the two new mechanisms, so it is impossible to determine whether the marginal improvement is statistically reliable or merely an artifact of the particular training trajectory.

Authors: We agree that the current empirical presentation lacks statistical robustness. In the revision we will report validation loss averaged over at least three independent random seeds, including standard deviations, and will add ablation experiments that isolate the individual contributions of balanced hashing and invariant scalars. The public code repository will be updated with the exact seeds and scripts used for these additional runs. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper introduces two new algorithmic mechanisms—Balanced Hashing and Invariant Scalars—as solutions to sketched-gradient instability. These are asserted to deterministically eliminate off-diagonal variance and preserve exact energy norms, enabling the claimed O(L·RN) memory reduction via exact dX / sketched dW split. No equations, derivations, or self-citations appear in the provided text that reduce any central claim to a fitted parameter, renamed input, or prior self-result by construction. The theoretical guarantees and empirical GPT results (R=32 parity, R=1 convergence) are presented as consequences of the new components rather than tautological restatements of inputs. This is the normal case of a self-contained proposal whose load-bearing assumptions remain externally verifiable.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

The central claim rests on the unverified properties of the newly introduced balanced hashing and invariant scalars; R is a user-chosen compression parameter.

free parameters (1)

R (sketch rank)
Compression rank chosen by the user and tested at discrete values such as 32 and 1

axioms (2)

domain assumption Balanced Hashing strictly eliminates off-diagonal collision variance
Invoked to solve the instability of sketched gradients
domain assumption Invariant Scalars deterministically preserve the exact continuous energy norm
Used to achieve a principled bias-variance tradeoff

invented entities (1)

Invariant Scalars no independent evidence
purpose: To preserve the spatial geometry energy norm under sketching
New mechanism introduced to stabilize the estimator

pith-pipeline@v0.9.0 · 5588 in / 1322 out tokens · 71465 ms · 2026-05-15T15:53:56.490620+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Balanced Hashing, which strictly eliminates off-diagonal collision variance, and Invariant Scalars, a principled bias-variance tradeoff that deterministically preserves the exact continuous energy norm of the spatial geometry
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

BASIS reduces activation memory to O(L·RN) ... at R=32 achieves parity ... even at R=1

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1]

Training deep nets with sublinear memory cost, 2016

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost, 2016

work page 2016
[2]

Rios and Nikolaos V

Luis M. Rios and Nikolaos V . Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations.Journal of Global Optimization, 56(3):1247–1293, 2013

work page 2013
[3]

Evolution strategies as a scalable alternative to reinforcement learning, 2017

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning, 2017

work page 2017
[4]

Evolution strategies at the hyperscale, 2026

Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Clarisse Wibault, Dmitry Samsonov, Dylan Cope, Jarek Liesen, Kang Li, Lukas Seier, Theo Wolf, Uljad Berdica, Valentin Mohl, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, and Jakob Nicolaus Foerster. Evolution strategies at t...

work page 2026
[5]

Spokoiny

Yurii Nesterov and Vladimir G. Spokoiny. Random gradient-free minimization of convex functions.F oundations of Computational Mathematics, 17:527 – 566, 2015

work page 2015
[6]

Randomized automatic differentiation

Deniz Oktay, Nick McGreivy, Joshua Aduol, Alex Beatson, and Ryan P Adams. Randomized automatic differentiation. InInternational Conference on Learning Representations, 2021

work page 2021
[7]

Finding frequent items in data streams

Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In Peter Widmayer, Stephan Eidenbenz, Francisco Triguero, Rafael Morales, Ricardo Conejo, and Matthew Hennessy, editors,Automata, Languages and Programming, pages 693–703, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg

work page 2002
[8]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017

work page 2017
[9]

Language models are unsupervised multitask learners, 2019

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019

work page 2019
[10]

Andrej Karpathy. nanogpt. https://github.com/karpathy/nanoGPT, 2022. GitHub repository, accessed 2026-03-05

work page 2022
[11]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

work page 2017
[12]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlíˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Lean- dro V on Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. 9

work page 2024

[1] [1]

Training deep nets with sublinear memory cost, 2016

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost, 2016

work page 2016

[2] [2]

Rios and Nikolaos V

Luis M. Rios and Nikolaos V . Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations.Journal of Global Optimization, 56(3):1247–1293, 2013

work page 2013

[3] [3]

Evolution strategies as a scalable alternative to reinforcement learning, 2017

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning, 2017

work page 2017

[4] [4]

Evolution strategies at the hyperscale, 2026

Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Clarisse Wibault, Dmitry Samsonov, Dylan Cope, Jarek Liesen, Kang Li, Lukas Seier, Theo Wolf, Uljad Berdica, Valentin Mohl, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, and Jakob Nicolaus Foerster. Evolution strategies at t...

work page 2026

[5] [5]

Spokoiny

Yurii Nesterov and Vladimir G. Spokoiny. Random gradient-free minimization of convex functions.F oundations of Computational Mathematics, 17:527 – 566, 2015

work page 2015

[6] [6]

Randomized automatic differentiation

Deniz Oktay, Nick McGreivy, Joshua Aduol, Alex Beatson, and Ryan P Adams. Randomized automatic differentiation. InInternational Conference on Learning Representations, 2021

work page 2021

[7] [7]

Finding frequent items in data streams

Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In Peter Widmayer, Stephan Eidenbenz, Francisco Triguero, Rafael Morales, Ricardo Conejo, and Matthew Hennessy, editors,Automata, Languages and Programming, pages 693–703, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg

work page 2002

[8] [8]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017

work page 2017

[9] [9]

Language models are unsupervised multitask learners, 2019

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019

work page 2019

[10] [10]

Andrej Karpathy. nanogpt. https://github.com/karpathy/nanoGPT, 2022. GitHub repository, accessed 2026-03-05

work page 2022

[11] [11]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

work page 2017

[12] [12]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlíˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Lean- dro V on Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. 9

work page 2024