BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"
Pith reviewed 2026-05-15 15:53 UTC · model grok-4.3
The pith
BASIS sketches activations into rank-R tensors to decouple backpropagation memory from batch and sequence size while keeping exact error signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BASIS propagates the exact error signal (dX) to preserve flawless gradient flow, but computes the weight updates (dW) using massively compressed rank-R tensors stabilized by balanced hashing that eliminates off-diagonal collision variance and invariant scalars that deterministically preserve the exact continuous energy norm, reducing activation memory to O(L * RN) and allowing GPT training to reach or exceed exact-backpropagation validation loss at R = 32 while still converging at R = 1.
What carries the argument
Balanced hashing together with invariant scalars applied to rank-R activation sketches that replace the weight-gradient computation while leaving the error-signal path exact.
If this is right
- Activation memory scales linearly with rank R rather than with batch size B and sequence length.
- The backward-pass matrix-multiplication cost drops proportionally to the compression ratio.
- Training remains stable and converges even when the sketch rank is reduced to 1.
- The magnitude-stabilized trajectory can produce lower validation loss than exact backpropagation, functioning as an implicit regularizer.
- The same compressed backward path can be used across different network depths without reintroducing the original O(L * BN) memory wall.
Where Pith is reading between the lines
- The approach could be tested on transformer variants with context lengths far beyond current hardware limits to check whether the memory decoupling enables new scale regimes.
- If the invariant-scalar construction generalizes, similar norm-preserving sketches might be applied to other first-order optimizers or to forward-pass compression.
- The observed regularization effect at low R suggests a controlled study of whether the method systematically improves generalization on downstream tasks.
- Hardware implementations could measure actual wall-clock speedups once the reduced matrix-multiplication footprint is mapped onto sparse or low-precision kernels.
Load-bearing premise
Balanced hashing and invariant scalars together remove enough variance and preserve enough norm that the sketched weight updates remain stable enough for convergence even at rank 1.
What would settle it
Run the same 50,000-step GPT training loop with BASIS at R = 32 and measure whether the final validation loss stays within 0.05 of the exact-backpropagation baseline of 6.616.
Figures
read the original abstract
The activation memory required for exact backpropagation scales linearly with network depth, context length, and feature dimensionality, forming an O(L * BN ) spatial bottleneck (where B is the sequence-batch cardinality and N is the feature dimension). This constraint historically throttles the scaling of deep neural networks. While randomized automatic differentiation attempts to mitigate this, it historically suffers from catastrophic variance. In this paper, we introduce BASIS (Balanced Activation Sketching with Invariant Scalars), an efficient backpropagation algorithm that fully decouples activation memory from the batch and sequence dimensions. BASIS propagates the exact error signal (dX) to preserve flawless gradient flow, but computes the weight updates (dW) using massively compressed rank-R tensors. To solve the foundational instability of sketched gradients, we propose two novel mechanisms: Balanced Hashing, which strictly eliminates off-diagonal collision variance, and Invariant Scalars, a principled bias-variance tradeoff that deterministically preserves the exact continuous energy norm of the spatial geometry. Theoretically, BASIS reduces activation memory to O(L * RN ) and heavily decreases the backward pass matrix-multiplication footprint. Empirically, training a GPT architecture for 50,000 steps validates our theoretical guarantees: at R = 32, BASIS achieves parity with (and marginally outperforms) exact backpropagation validation loss (6.575 vs. 6.616), acting as an implicit regularizer. Remarkably, the stabilized magnitude trajectory allows the model to converge smoothly even under extreme spatial compression (R = 1), proving the extreme robustness of the estimator. The code is available at https://github.com/VladimerKhasia/basis
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BASIS, a backpropagation algorithm that decouples activation memory from batch/sequence dimensions by propagating exact dX error signals while computing sketched rank-R dW updates. It claims two new mechanisms—balanced hashing that strictly eliminates off-diagonal collision variance and invariant scalars that deterministically preserve the exact continuous energy norm—solve the instability of sketched gradients. Theoretically this yields O(L RN) activation memory; empirically, on GPT training for 50k steps, R=32 matches or slightly beats exact backprop validation loss (6.575 vs 6.616) and the method converges smoothly even at R=1.
Significance. If the deterministic variance-elimination and norm-preservation properties can be rigorously established, BASIS would offer a practical route to training deeper or longer-context models under severe memory constraints, with the reported R=1 convergence suggesting an unusually stable sketched estimator that may also regularize. The combination of exact dX flow with compressed dW is a clean architectural split that, if proven, would be a notable advance over prior randomized autodiff approaches.
major comments (3)
- [Abstract] Abstract: the central claim that balanced hashing 'strictly eliminates off-diagonal collision variance' (so that sketched dW has zero extra variance) is load-bearing for both the O(L RN) memory bound and the R=1 convergence result, yet no derivation, collision-probability analysis, or proof sketch is supplied; without this the stability guarantee does not follow from the stated construction.
- [Abstract] Abstract: invariant scalars are asserted to 'deterministically preserve the exact continuous energy norm of the spatial geometry' under arbitrary compression, but no bias-variance bound, finite-precision analysis, or continuous-to-discrete equivalence is given; this assumption is required for the reported robustness at R=1 and must be substantiated before the empirical parity can be interpreted as theoretical validation.
- [Empirical results] Empirical validation paragraph: the single-run validation-loss comparison (6.575 vs 6.616) is presented without reported standard deviation across seeds, number of independent trials, or ablation of the two new mechanisms, so it is impossible to determine whether the marginal improvement is statistically reliable or merely an artifact of the particular training trajectory.
minor comments (2)
- [Abstract] Abstract: notation 'O(L * BN )' contains an extraneous space and inconsistent spacing around the multiplication symbol; standardize to O(L BN).
- [Abstract] Abstract: the parenthetical phrase 'Ghost Backpropagation' is introduced in quotes without definition or reference to prior usage; either remove or supply a one-sentence clarification of its intended meaning.
Simulated Author's Rebuttal
We thank the referee for the insightful and constructive feedback. We address each major comment below and commit to revisions that strengthen the theoretical derivations and empirical reporting while preserving the core contributions of the work.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that balanced hashing 'strictly eliminates off-diagonal collision variance' (so that sketched dW has zero extra variance) is load-bearing for both the O(L RN) memory bound and the R=1 convergence result, yet no derivation, collision-probability analysis, or proof sketch is supplied; without this the stability guarantee does not follow from the stated construction.
Authors: We agree that the variance-elimination property requires explicit derivation. Balanced hashing is constructed so that each off-diagonal pair receives equal positive and negative collision contributions that cancel in expectation. In the revised manuscript we will insert a dedicated theoretical subsection containing a collision-probability analysis and a short proof that the expected off-diagonal variance term is identically zero, thereby rigorously justifying both the O(L RN) memory claim and the observed stability at low rank. revision: yes
-
Referee: [Abstract] Abstract: invariant scalars are asserted to 'deterministically preserve the exact continuous energy norm of the spatial geometry' under arbitrary compression, but no bias-variance bound, finite-precision analysis, or continuous-to-discrete equivalence is given; this assumption is required for the reported robustness at R=1 and must be substantiated before the empirical parity can be interpreted as theoretical validation.
Authors: We acknowledge that a formal bias-variance analysis and finite-precision treatment are missing. The invariant scalars are defined to enforce exact L2-norm preservation in the continuous limit; we will add a bias-variance decomposition proving zero bias together with an explicit variance bound scaling with 1/R, plus a short finite-precision error analysis and a statement of the continuous-to-discrete equivalence. These additions will appear in the revised theoretical section. revision: yes
-
Referee: [Empirical results] Empirical validation paragraph: the single-run validation-loss comparison (6.575 vs 6.616) is presented without reported standard deviation across seeds, number of independent trials, or ablation of the two new mechanisms, so it is impossible to determine whether the marginal improvement is statistically reliable or merely an artifact of the particular training trajectory.
Authors: We agree that the current empirical presentation lacks statistical robustness. In the revision we will report validation loss averaged over at least three independent random seeds, including standard deviations, and will add ablation experiments that isolate the individual contributions of balanced hashing and invariant scalars. The public code repository will be updated with the exact seeds and scripts used for these additional runs. revision: yes
Circularity Check
No significant circularity; derivation self-contained
full rationale
The paper introduces two new algorithmic mechanisms—Balanced Hashing and Invariant Scalars—as solutions to sketched-gradient instability. These are asserted to deterministically eliminate off-diagonal variance and preserve exact energy norms, enabling the claimed O(L·RN) memory reduction via exact dX / sketched dW split. No equations, derivations, or self-citations appear in the provided text that reduce any central claim to a fitted parameter, renamed input, or prior self-result by construction. The theoretical guarantees and empirical GPT results (R=32 parity, R=1 convergence) are presented as consequences of the new components rather than tautological restatements of inputs. This is the normal case of a self-contained proposal whose load-bearing assumptions remain externally verifiable.
Axiom & Free-Parameter Ledger
free parameters (1)
- R (sketch rank)
axioms (2)
- domain assumption Balanced Hashing strictly eliminates off-diagonal collision variance
- domain assumption Invariant Scalars deterministically preserve the exact continuous energy norm
invented entities (1)
-
Invariant Scalars
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Balanced Hashing, which strictly eliminates off-diagonal collision variance, and Invariant Scalars, a principled bias-variance tradeoff that deterministically preserves the exact continuous energy norm of the spatial geometry
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
BASIS reduces activation memory to O(L·RN) ... at R=32 achieves parity ... even at R=1
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Training deep nets with sublinear memory cost, 2016
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost, 2016
work page 2016
-
[2]
Luis M. Rios and Nikolaos V . Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations.Journal of Global Optimization, 56(3):1247–1293, 2013
work page 2013
-
[3]
Evolution strategies as a scalable alternative to reinforcement learning, 2017
Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning, 2017
work page 2017
-
[4]
Evolution strategies at the hyperscale, 2026
Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Clarisse Wibault, Dmitry Samsonov, Dylan Cope, Jarek Liesen, Kang Li, Lukas Seier, Theo Wolf, Uljad Berdica, Valentin Mohl, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, and Jakob Nicolaus Foerster. Evolution strategies at t...
work page 2026
- [5]
-
[6]
Randomized automatic differentiation
Deniz Oktay, Nick McGreivy, Joshua Aduol, Alex Beatson, and Ryan P Adams. Randomized automatic differentiation. InInternational Conference on Learning Representations, 2021
work page 2021
-
[7]
Finding frequent items in data streams
Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In Peter Widmayer, Stephan Eidenbenz, Francisco Triguero, Rafael Morales, Ricardo Conejo, and Matthew Hennessy, editors,Automata, Languages and Programming, pages 693–703, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg
work page 2002
-
[8]
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017
work page 2017
-
[9]
Language models are unsupervised multitask learners, 2019
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019
work page 2019
-
[10]
Andrej Karpathy. nanogpt. https://github.com/karpathy/nanoGPT, 2022. GitHub repository, accessed 2026-03-05
work page 2022
-
[11]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017
work page 2017
-
[12]
The fineweb datasets: Decanting the web for the finest text data at scale
Guilherme Penedo, Hynek Kydlíˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Lean- dro V on Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. 9
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.