A Held-Out Transition-Pair Falsifier for Long-Horizon Non-Abelian State Tracking

Jeonghoon Lee

arxiv: 2606.07254 · v1 · pith:64MXXD53new · submitted 2026-06-05 · 💻 cs.LG · cs.FL


Pith reviewed 2026-06-27 22:46 UTC · model grok-4.3

classification 💻 cs.LG cs.FL

keywords: non-Abelian state tracking, held-out transition pairs, sequence models, finite groups, inductive bias, recurrent models, long-horizon prediction, S3 group


The pith

A projected recurrent state model tracks non-Abelian group states perfectly over million-token horizons when trained only on length-8 sequences under a held-out transition-pair protocol.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a held-out transition-pair falsifier that withholds specific ordered generator pairs from training data while requiring those same pairs during evaluation to test long-horizon state tracking in non-Abelian groups. In the controlled S3 × S3 benchmark, a projected recurrent state model achieves perfect final-state predictions across all tested horizons up to 1,048,576 tokens, while GRUs, structured SSMs, and bag baselines remain near chance even when equipped with similar readouts. The protocol includes clean-split audits confirming zero overlap in reduced words or structural templates between partitions. Results show that hard projection onto group elements correlates with low homomorphism error and commutator separation, whereas softened projection leads to collapsed accuracy. The evidence applies specifically to this finite-group setup rather than general architecture comparisons.

Core claim

The held-out transition-pair falsifier blocks selected ordered generator pairs during training and requires the same local patterns during evaluation. In an S3 × S3 benchmark, a projected recurrent state model trained only on length-8 sequences produces error-free final-state predictions through evaluation horizons up to 1,048,576 tokens across five seeds, while matched native-readout baselines remain near floor and projection-matched baselines also fail.

What carries the argument

The held-out transition-pair falsifier, which forbids selected ordered generator pairs from training while mandating them in evaluation to isolate non-local state composition.
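The split described above can be sketched in a few lines. This is a minimal illustration, not the paper's generator: the token names (`s1`, `r1`, `s2`, `r2`), the choice of held-out pairs, the evaluation length of 64, and the rejection-sampling scheme are all assumptions made for the example.

```python
import random

# Hypothetical generator alphabet for S3 x S3: one transposition (s) and one
# 3-cycle (r) per factor. The paper's actual generator set is not given here.
GENERATORS = ["s1", "r1", "s2", "r2"]

# Hypothetical held-out ordered pairs (adjacent tokens a followed by b).
HELD_OUT = {("s1", "r2"), ("r1", "s2")}


def contains_pair(word, held_out):
    """True if any adjacent ordered pair of tokens in `word` is held out."""
    return any(pair in held_out for pair in zip(word, word[1:]))


def sample_word(length, rng):
    """Draw a uniformly random word over the generator alphabet."""
    return tuple(rng.choice(GENERATORS) for _ in range(length))


def make_split(held_out, train_len=8, eval_len=64, n=200, seed=0):
    """Rejection-sample a train set that never shows a held-out pair and an
    evaluation set in which every word contains at least one."""
    rng = random.Random(seed)
    train, evaluation = [], []
    while len(train) < n:
        word = sample_word(train_len, rng)
        if not contains_pair(word, held_out):
            train.append(word)
    while len(evaluation) < n:
        word = sample_word(eval_len, rng)
        if contains_pair(word, held_out):
            evaluation.append(word)
    return train, evaluation


train_words, eval_words = make_split(HELD_OUT)
```

Filtering on adjacent ordered pairs is what makes the split directional in this sketch: `("s1", "r2")` can be withheld while `("r2", "s1")` stays visible in training.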

If this is right

Hard projection onto finite-group elements is necessary, since softening the projection causes final-state accuracy to collapse.
The successful model exhibits low homomorphism error, low state-consistency drift, and non-trivial commutator separation.
Explicit projected non-commutative state composition supplies an inductive bias that supports long-horizon hidden-state tracking in this regime.
Clean audits confirm zero verbatim reduced-word overlap and zero structural-template overlap between the training and evaluation partitions.
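One way to picture the overlap audit in the last item is the sketch below. The definitions are assumptions: "reduced word" is taken to mean a word after cancelling `s s` and `r r r` runs (a partial normal form that ignores the other group relations), and "structural template" is taken to mean the pattern of first occurrences of tokens. The paper's audit scripts may define both differently.

```python
def reduce_word(word):
    """Cancel adjacent s*s (order 2) and r*r*r (order 3) runs using a stack.
    Assumed convention: tokens starting with 's' are involutions, tokens
    starting with 'r' have order 3."""
    stack = []
    for tok in word:
        stack.append(tok)
        order = 2 if tok.startswith("s") else 3
        if len(stack) >= order and len(set(stack[-order:])) == 1:
            del stack[-order:]
    return tuple(stack)


def template(word):
    """Replace each token by the index of its first appearance,
    e.g. ('s1', 'r1', 'r1') -> (0, 1, 1)."""
    ids = {}
    return tuple(ids.setdefault(tok, len(ids)) for tok in word)


def audit(train, evaluation):
    """Count reduced words and templates shared by the two partitions."""
    train_words = {reduce_word(w) for w in train}
    eval_words = {reduce_word(w) for w in evaluation}
    train_templates = {template(w) for w in train_words}
    eval_templates = {template(w) for w in eval_words}
    return {
        "reduced_word_overlap": len(train_words & eval_words),
        "template_overlap": len(train_templates & eval_templates),
    }


clean = audit([("s1", "r1", "r1")], [("r1", "s1", "s2", "s2")])
leaky = audit([("s1", "r1", "r1")], [("s1", "s2", "s2", "r1", "r1")])
```

In the `leaky` case the evaluation word looks different verbatim but reduces to the training word, which is the kind of leak a reduced-word audit is meant to catch.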

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The falsifier protocol could be extended to larger finite groups or other algebraic structures to test whether the projection bias scales.
Success on this controlled benchmark raises the question of whether similar explicit composition mechanisms would improve performance on permutation-based planning tasks.
The gap between projected and native-readout models suggests that architectural incorporation of group structure may matter more than raw capacity in non-commutative tracking.

Load-bearing premise

Withholding the selected transition pairs during training while requiring them at evaluation fully blocks direct local-transition memorization without leaving other memorization routes open.

What would settle it

Observing even one error in the 250/250 final-state predictions on a held-out transition pair at a long horizon, or observing a baseline model succeeding under the identical split, would falsify the reported advantage.
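To put a number on how much a 250/250 result constrains the true accuracy, the sketch below computes a one-sided exact (Clopper-Pearson) lower bound, which has a closed form when every trial succeeds. It treats the 250 predictions as independent trials, which is an assumption, and it is not necessarily the interval method the paper uses.

```python
def exact_lower_bound(successes, trials, alpha=0.05):
    """One-sided Clopper-Pearson lower bound on the success probability.
    Only the all-success case is handled, where the bound solves
    p ** trials == alpha."""
    if successes != trials:
        raise ValueError("closed form only covers successes == trials")
    return alpha ** (1.0 / trials)


bound = exact_lower_bound(250, 250)
print(f"95% one-sided lower bound: {bound:.4f}")  # prints 0.9881
```

Under these assumptions, a perfect 250/250 is still compatible with a true per-sequence accuracy of about 98.8%, so a single observed error would be informative but not surprising at that level.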

Figures

Figures reproduced from arXiv: 2606.07254 by Jeonghoon Lee.

**Figure 1.** Gate A baseline competence under matched protocol. The dotted line marks the 1/6 chance level for the separate 6-class diagnostic control. Error bars indicate 95% bootstrap intervals over five seeds.

**Figure 2.** Gate B held-out-pair falsifier. Panel (a) shows the short-horizon supplement. Panel (b) shows the million-token expanded test with projection-matched baselines (GRU, bag, structured SSM with prototype-projection readout) and the retained native-readout pilot reference. The dotted line marks chance accuracy for S3 × S3, 1/36 ≈ 0.0278. Curves are shown only for model/horizon combinations implemented in the …

**Figure 3.** Projection-matched baselines under the held-out-pair protocol at L_eval ∈ {524,288, 1,048,576}. All baseline cells are far below the hard-projected 250/250 result (all ≤ 15/250); the largest, bag at 524,288 (15/250), is modestly above the 1/36 chance line (dotted), while the rest sit at or near it.

**Figure 4.**

**Figure 5.** Same-factor held-out robustness on S3 × S3. Across both the first-factor and second-factor in-factor held-out splits, the hard-projected model is error-free at every evaluation horizon while the projection-matched GRU, structured SSM, and bag baselines remain near the 1/36 chance line.

**Figure 6.** Preliminary S5 non-solvable stress test. The hard-projected model remains exact across the executed horizons, while native-readout GRU remains near chance. Chance line at 1/120 ≈ 0.0083.

**Figure 7.** Projection-matched baselines on S5 under the held-out-pair protocol. Prototype-projection GRU, structured SSM, and bag baselines all remain near the 1/120 ≈ 0.0083 chance line, mirroring the native-readout GRU and supporting that the hard-projected S5 result is not explained by the tested projection-readout artifact hypothesis.

Original abstract

State tracking exposes a sharp limitation of sequence models: the relevant signal is often not a summary of observed tokens, but an ordered latent state that evolves through non-commutative transformations. We introduce a held-out transition-pair falsifier for finite non-Abelian group tracking. The protocol forbids selected ordered generator pairs during training and requires the same local patterns during evaluation, blocking one direct local-transition memorization pathway. In a controlled $S_3 \times S_3$ benchmark, a projected recurrent state model trained only on length-8 sequences produces error-free final-state predictions (perfect 250/250 per horizon) through evaluation horizons up to 1,048,576 tokens across five seeds. Matched native-readout baselines, including bag, GRU, and a single-configuration structured state-space model, remain near floor under the same protocol. Projection-matched GRU, structured SSM, and bag baselines equipped with analogous finite-group prototype readouts also remain near chance under the same split. Mechanism diagnostics show that hard projection coincides with low homomorphism error, low state-consistency drift, and non-trivial commutator separation, while softened projection collapses final-state accuracy. Clean-split audits verify zero verbatim reduced-word overlap and zero structural-template overlap between training and evaluation partitions. The evidence is scoped to this controlled finite-group falsifier rather than to a general architecture ranking. Within that regime, explicit projected non-commutative state composition acts as a useful inductive bias for long-horizon hidden-state tracking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The held-out pair protocol is a concrete new falsifier for non-Abelian tracking with striking S3 × S3 results, but finite-group closure leaves open whether models truly avoid indirect inference of missing transitions.


The main takeaway is a held-out transition-pair protocol that withholds specific ordered generator pairs from short training sequences while requiring them at long evaluation horizons, plus clean results on S3 × S3, where a projected recurrent model reaches perfect accuracy out to more than a million tokens. Baselines with matched readouts stay near chance. The clean-split audits and the mechanism diagnostics tying hard projection to low homomorphism error and commutator separation are useful additions.

What the paper does well is scope the claim narrowly to this controlled finite-group case and supply explicit checks for verbatim and template overlap. The fact that softening the projection collapses performance gives some evidence the inductive bias matters. The long-horizon scale with short training is a clear strength of the benchmark design.

The soft spot is the one the stress-test note flags. S3 x S3 has only 36 elements, so a model that recovers the group operation from the observed pairs can use associativity, inverses, and closure to compute the withheld transitions without local exposure. The audits rule out direct copying but do not test whether the learned map is a full homomorphism on the missing generators. If that route remains open, the perfect scores show the architecture can represent the group once the table is known, not that the falsifier blocked every memorization pathway. The abstract does not report such a check.
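The closure argument can be made concrete with a small sketch. It builds S3 × S3 from permutation tuples and shows that the final state reached by a withheld adjacent pair is also reached by a word that never contains that adjacency. The generator names and the two example pairs are assumptions for illustration, not the paper's held-out set.

```python
E, S, R = (0, 1, 2), (1, 0, 2), (1, 2, 0)  # identity, transposition, 3-cycle


def compose(p, q):
    """Permutation product: apply q first, then p."""
    return tuple(p[q[i]] for i in range(3))


# Hypothetical generators: (first-factor action, second-factor action).
GEN = {"s1": (S, E), "r1": (R, E), "s2": (E, S), "r2": (E, R)}


def run(word, start=(E, E)):
    """Final group state after applying the tokens of `word` left to right."""
    a, b = start
    for tok in word:
        ga, gb = GEN[tok]
        a, b = compose(ga, a), compose(gb, b)
    return a, b


def closure():
    """All states reachable from the identity (the whole group)."""
    seen, frontier = {(E, E)}, [(E, E)]
    while frontier:
        state = frontier.pop()
        for tok in GEN:
            nxt = run((tok,), start=state)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


GROUP = closure()
group_size = len(GROUP)

# Same-factor case: "s1 then r1" ends where "r1, r1, s1" ends, from any state.
same_factor_rewrite = all(
    run(("s1", "r1"), g) == run(("r1", "r1", "s1"), g) for g in GROUP
)
# Cross-factor case: the two factors commute, so the reversed pair suffices.
cross_factor_rewrite = all(
    run(("s1", "r2"), g) == run(("r2", "s1"), g) for g in GROUP
)
```

Neither rewrite contains the withheld adjacency, so a learner that has recovered the group operation from visible pairs can reach the correct state without ever having seen the withheld pair.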

This is for researchers building or using benchmarks for compositional state tracking in sequence models. A reader working on inductive biases for non-commutative composition would find the protocol worth examining. The work shows clear thinking on its own terms and the central empirical claim is falsifiable, so it deserves a serious referee even if the closure issue needs direct testing in revision. I would send it out.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a held-out transition-pair falsifier protocol for finite non-Abelian group state tracking. In an S3×S3 benchmark, a projected recurrent state model trained exclusively on length-8 sequences achieves perfect final-state prediction accuracy (250/250 per horizon) on held-out transitions for evaluation horizons up to 1,048,576 tokens across five seeds. Matched baselines (bag, GRU, structured SSM) remain near chance under the same protocol. The protocol is supported by mechanism diagnostics (homomorphism error, state-consistency drift, commutator separation) and clean-split audits showing zero verbatim or structural-template overlap between partitions. The scope is limited to this controlled falsifier rather than general architecture ranking.

Significance. If the falsifier protocol demonstrably prevents all memorization routes (direct and indirect), the result supplies a concrete empirical demonstration that explicit projection onto finite-group prototypes supplies a useful inductive bias for accurate long-horizon non-commutative state composition. The extreme horizon lengths and perfect per-seed accuracy constitute a strong positive signal for the architecture class under the stated conditions; the clean-split design and mechanism diagnostics are positive features of the evaluation.
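The difference between hard and softened projection onto prototypes can be illustrated generically. The sketch below is not the paper's model, whose carrier is not described here: the one-hot prototypes, the squared-distance metric, and the softmax temperature are all assumptions chosen to show the mechanism.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hard_project(h, prototypes):
    """Snap h to its nearest prototype, so the state is always a valid element."""
    return min(prototypes, key=lambda p: sq_dist(h, p))


def soft_project(h, prototypes, temperature):
    """Softmax-weighted blend of prototypes; tends to the hard projection as
    the temperature goes to zero, but otherwise leaves the prototype set."""
    dists = [sq_dist(h, p) for p in prototypes]
    shift = min(dists)  # subtract the minimum for numerical stability
    weights = [math.exp(-(d - shift) / temperature) for d in dists]
    total = sum(weights)
    return tuple(
        sum(w * p[k] for w, p in zip(weights, prototypes)) / total
        for k in range(len(h))
    )


# Hypothetical prototypes: one-hot codes for three group elements.
PROTOTYPES = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
noisy_state = (0.6, 0.3, 0.1)

hard = hard_project(noisy_state, PROTOTYPES)
soft = soft_project(noisy_state, PROTOTYPES, temperature=1.0)
nearly_hard = soft_project(noisy_state, PROTOTYPES, temperature=0.01)
```

In this toy setting the hard projection removes the perturbation at every step, whereas the soft blend stays off the prototype set, which is one plausible reading of why softened projection could accumulate error over long horizons.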

major comments (2)

[Abstract / protocol section] Abstract and protocol description: the claim that withholding selected ordered generator pairs 'blocks one direct local-transition memorization pathway' is load-bearing for the interpretation of the perfect 250/250 scores. In the finite S3×S3 group (36 elements), a model that recovers a faithful homomorphism from the observed pairs can algebraically deduce the withheld multiplications via closure, associativity, and inverses. The clean-split audits verify zero verbatim and template overlap but do not test whether the learned map is a full homomorphism on the withheld generators; therefore the results do not yet rule out that the model simply reconstructs the complete multiplication table rather than performing genuine held-out state tracking.
[Mechanism diagnostics] Mechanism diagnostics paragraph: the reported 'low homomorphism error' under hard projection is not stated to have been evaluated on the held-out pairs themselves. If the metric is computed only on observed transitions, it does not address whether the model has inferred the withheld transitions via group structure, weakening the link between low homomorphism error and the claimed blocking of memorization routes.
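The second comment turns on where a composition diagnostic is measured. The sketch below uses a simple pair-composition error as a stand-in for the paper's homomorphism error (whose exact definition is not given here) and a deliberately broken toy model. The held-out pairs, the metric, and `memorizing_model` are all hypothetical.

```python
import itertools

E, S, R = (0, 1, 2), (1, 0, 2), (1, 2, 0)  # identity, transposition, 3-cycle


def compose(p, q):
    """Permutation product: apply q first, then p."""
    return tuple(p[q[i]] for i in range(3))


GEN = {"s1": (S, E), "r1": (R, E), "s2": (E, S), "r2": (E, R)}


def truth(word):
    """Ground-truth final state in S3 x S3 for a token sequence."""
    a, b = E, E
    for tok in word:
        ga, gb = GEN[tok]
        a, b = compose(ga, a), compose(gb, b)
    return a, b


HELD_OUT = {("s1", "r1"), ("r2", "s2")}          # hypothetical held-out pairs
ALL_PAIRS = set(itertools.product(GEN, repeat=2))
OBSERVED = ALL_PAIRS - HELD_OUT


def memorizing_model(word):
    """Toy stand-in for a model that only learned visible local transitions:
    it is exact unless a held-out adjacency occurs, then it returns identity."""
    if any(pair in HELD_OUT for pair in zip(word, word[1:])):
        return (E, E)
    return truth(word)


def pair_error(model, pairs):
    """Fraction of two-token words whose predicted final state is wrong."""
    return sum(model(p) != truth(p) for p in pairs) / len(pairs)


observed_error = pair_error(memorizing_model, OBSERVED)
held_out_error = pair_error(memorizing_model, HELD_OUT)
```

The toy model scores zero error on observed pairs and total error on held-out pairs, so a diagnostic reported only on observed transitions cannot distinguish it from a model that composes correctly everywhere.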

minor comments (1)

[Abstract] The abstract states 'perfect 250/250 per horizon' across five seeds; the main text should report per-seed variance or confirm that every seed achieved exactly 250/250 rather than an aggregate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments on the held-out transition-pair falsifier. We respond point-by-point to the major comments below and will revise the manuscript accordingly.


Referee: [Abstract / protocol section] Abstract and protocol description: the claim that withholding selected ordered generator pairs 'blocks one direct local-transition memorization pathway' is load-bearing for the interpretation of the perfect 250/250 scores. In the finite S3×S3 group (36 elements), a model that recovers a faithful homomorphism from the observed pairs can algebraically deduce the withheld multiplications via closure, associativity, and inverses. The clean-split audits verify zero verbatim and template overlap but do not test whether the learned map is a full homomorphism on the withheld generators; therefore the results do not yet rule out that the model simply reconstructs the complete multiplication table rather than performing genuine held-out state tracking.

Authors: We agree that the clean-split audits confirm absence of verbatim and template overlap but do not test whether the learned map constitutes a full homomorphism on the withheld generators. The protocol is explicitly scoped to blocking one direct local-transition memorization pathway (exposure to the specific ordered pairs), not to excluding all algebraic reconstruction routes via group closure. The perfect accuracy of the projected model versus matched baselines (including projection-equipped variants) under this split provides evidence that the projection supplies a useful inductive bias within the protocol's constraints. To address the concern, we will add an explicit evaluation of homomorphism error on the held-out pairs and report the results in the revised mechanism diagnostics section. revision: yes
Referee: [Mechanism diagnostics] Mechanism diagnostics paragraph: the reported 'low homomorphism error' under hard projection is not stated to have been evaluated on the held-out pairs themselves. If the metric is computed only on observed transitions, it does not address whether the model has inferred the withheld transitions via group structure, weakening the link between low homomorphism error and the claimed blocking of memorization routes.

Authors: The referee correctly notes that the manuscript does not state the homomorphism error was evaluated on held-out pairs; the reported values were computed on observed transitions. We will revise the mechanism diagnostics paragraph to separately report homomorphism error on both observed and held-out pairs, clarifying the scope of the diagnostic and its relation to the falsifier protocol. revision: yes

Circularity Check

0 steps flagged

Empirical held-out falsifier benchmark; no derivation reduces accuracy to fitted input by construction


The paper reports an empirical result: a projected recurrent model trained on length-8 sequences with selected transition pairs withheld achieves perfect final-state accuracy on long-horizon evaluation sequences that require those pairs. The abstract and protocol description contain no equations, no fitted parameters renamed as predictions, and no self-citations that bear the central claim. The held-out split and clean-split audits are presented as direct measurements on held-out data, not tautological re-expressions of training statistics. The result is therefore scoped as an empirical benchmark rather than a first-principles derivation that collapses to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the protocol relies on standard group theory assumptions not detailed here.



Reference graph

Works this paper leans on

9 extracted references · 8 canonical work pages · 2 internal anchors

[1, 2] David A. Barrington. Bounded-Width Polynomial-Size Branching Programs Recognize Exactly Those Languages in NC^1. Journal of Computer and System Sciences, 38(1):150–164, doi:10.1016/0022-0000(89)90037-8

[3] Kenneth Krohn and John Rhodes. Algebraic Theory of Machines. I. Prime Decomposition Theorem for Finite Semigroups and Machines. Transactions of the American Mathematical Society, 116:450–464, 1965. doi:10.2307/1994127

[4] William Merrill, Jackson Petty, and Ashish Sabharwal. The Illusion of State in State-Space Models. ICML 2024; arXiv:2404.08819, 2024

[5] Mehran Shakerinava, Behnoush Khavari, Siamak Ravanbakhsh, and Sarath Chandar. The Expressive Limits of Diagonal SSMs for State-Tracking. arXiv:2603.01959, 2026

[6] M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, and Roland Memisevic. On the “Induction Bias” in Sequence Models. arXiv:2602.18333, 2026

[7] Aleksandar Terzić, Nicolas Menet, Michael Hersche, Thomas Hofmann, and Abbas Rahimi. Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models. NeurIPS 2025 Spotlight; arXiv:2509.22284, 2025

[8] Mayank Mishra, Shawn Tan, Ion Stoica, Joseph Gonzalez, and Tri Dao. M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling. arXiv:2603.14360, 2026

[9] Ilmo Sung. Robust Reasoning as a Symmetry-Protected Topological Phase. arXiv:2601.05240, 2026.

Code and Data Availability. The public release package includes benchmark generation code, held-out split construction, overlap-audit scripts, Gate E specificity checks, result CSVs, figure scripts, evaluation-set hashes, and projection-matched baseline configur...
