pith. machine review for the scientific record.

arxiv: 2604.09870 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

Relational Preference Encoding in Looped Transformer Internal States

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords relational encoding · looped transformers · preference alignment · internal states · pairwise differences · evaluator heads · consistency probe

The pith

Looped transformers encode human preferences relationally through comparisons in their internal states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how a looped transformer organizes human preference information inside the hidden states produced by each iteration of refinement. It establishes that these states hold preference data mainly in relational form, so that differences between paired response representations allow accurate prediction while single-response states yield weak or inverted results. A sympathetic reader would care because this structure suggests the model maintains consistent orderings between options as a core part of its learned value system rather than assigning fixed scores to each option alone. The work also shows how training choices like data swapping and learning-rate scheduling shape the apparent performance of such probes.

Core claim

The loop states encode preference predominantly relationally. A linear probe on pairwise differences achieves 84.5% test accuracy. The best nonlinear independent evaluator reaches only 65% test accuracy, and linear independent classification scores 21.75%, below chance and with inverted polarity. Interpreted precisely, the evaluator functions as a model-internal consistency probe measuring how stably the model's own learned value system organizes its representations rather than how well it predicts noisy human annotations.

What carries the argument

Pairwise difference probes applied to the hidden states of each loop iteration, which extract relational comparisons between response representations.
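
The mechanism can be illustrated with a toy sketch. Nothing below comes from the paper: the dimensions, the shared-offset construction, and the logistic probes are invented assumptions chosen to show how a purely relational signal produces exactly this accuracy pattern, with the pairwise-difference probe succeeding where single-state classification stays near chance.

```python
# Sketch only (not the paper's code): when each pair of response states shares
# a large context-dependent offset and only their *difference* carries the
# preference label, a probe on differences recovers the signal while a probe
# on single states cannot. All shapes and scales here are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 64
w_true = rng.normal(size=d)                  # hypothetical preference direction

ctx = rng.normal(scale=5.0, size=(n, d))     # per-pair nuisance component
delta = rng.normal(size=(n, d))
y = (delta @ w_true > 0).astype(int)         # 1 if response A is preferred
h_a = ctx + 0.5 * delta                      # "hidden state" of response A
h_b = ctx - 0.5 * delta                      # "hidden state" of response B
split = n // 2

# Pairwise-difference probe: classify h_a - h_b (the shared offset cancels).
pair = LogisticRegression(max_iter=1000).fit((h_a - h_b)[:split], y[:split])
acc_pair = pair.score((h_a - h_b)[split:], y[split:])

# Independent probe: classify each response state on its own.
X_tr = np.vstack([h_a[:split], h_b[:split]])
y_tr = np.concatenate([y[:split], 1 - y[:split]])
ind = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
X_te = np.vstack([h_a[split:], h_b[split:]])
y_te = np.concatenate([y[split:], 1 - y[split:]])
acc_ind = ind.score(X_te, y_te)

print(f"pairwise-difference probe accuracy: {acc_pair:.2f}")
print(f"independent single-state accuracy:  {acc_ind:.2f}")
```

In this construction the gap is an artifact of where the signal lives, not of probe capacity, which is precisely the ambiguity the paper's architecture search and swap-protocol analysis try to rule out.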

If this is right

  • The 50% argument-swap protocol is required to prevent degenerate solutions in pairwise training.
  • Independent scoring faces a genuine performance ceiling near 70% even after systematic architecture search.
  • The cosine learning-rate dead zone at epoch 2 functions as early stopping that preserves the generalization peak.
  • Flip-test analysis on sign consistency serves as a diagnostic for bias in pairwise preference evaluators.
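
The first point can be made concrete with a sketch of how a 50% argument-swap batch might be built. This is a reconstruction from the protocol's name, not the authors' code; `swap_batch` and its signature are invented for illustration.

```python
# Hedged sketch of a 50% argument-swap batch constructor. Without the swap,
# a pairwise head that always receives (chosen, rejected) in fixed slots can
# learn the degenerate rule "always pick the first slot".
import numpy as np

def swap_batch(h_chosen, h_rejected, rng, swap_prob=0.5):
    """Randomize slot order per pair; label is 1 when the first slot
    holds the chosen response, 0 when it holds the rejected one."""
    n = h_chosen.shape[0]
    swap = rng.random(n) < swap_prob
    first = np.where(swap[:, None], h_rejected, h_chosen)
    second = np.where(swap[:, None], h_chosen, h_rejected)
    label = (~swap).astype(int)
    return first, second, label

rng = np.random.default_rng(1)
first, second, label = swap_batch(np.ones((8, 4)), np.zeros((8, 4)), rng)
```

An evaluator trained on (first, second, label) must read the content of both slots, since slot position alone is now uninformative; a position-only strategy scores 50% in expectation, which is also why raw training metrics under this protocol can look deflated.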

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If relational encoding is primary, then alignment methods that explicitly reward consistent pairwise orderings across iterations could strengthen the model's internal value system.
  • The pattern might be tested in non-looped models trained with comparative losses to determine whether iteration itself is necessary for the relational structure.
  • The below-chance linear independent results suggest the model may lack a stable absolute preference scale that generalizes across examples.

Load-bearing premise

The assumption that the large gap between pairwise and independent probe accuracies reflects genuine relational encoding in the model's value system rather than an artifact of the 50% argument-swap protocol or the accidental early stopping.

What would settle it

Training independent evaluator heads on the same states but without the 50% argument swap and without the cosine learning-rate dead zone, then measuring whether test accuracy exceeds the documented 70% ceiling for independent scoring.

Figures

Figures reproduced from arXiv: 2604.09870 by Jan Kirin.

Figure 1. Preference prediction accuracy by access pattern and method. The gap between pairwise and …
Figure 2. Cross-epoch flip test analysis. Left: antisymmetry correlation (ρ) is stable at −0.92 to −0.97 across all five epochs. Centre: strict sign flip rate ranges from 25% (epoch 2, peak accuracy) to 96% (epochs 4–5, overfit), inversely tracking accuracy. Right: scorer bias (mean sum of normal and flipped scores) peaks at +2.51 at epoch 2 and dissipates to −0.37 by epoch 5, confirming that sign flip rate measure…
Figure 3. Swap-protocol metric inversion across five epochs. The deflated training metric (blue) climbs …
Original abstract

We investigate how looped transformers encode human preference in their internal iteration states. Using Ouro-2.6B-Thinking, a 2.6B-parameter looped transformer with iterative refinement, we extract hidden states from each loop iteration and train lightweight evaluator heads (~5M parameters) to predict human preference on the Anthropic HH-RLHF dataset. Our pairwise evaluator achieves 95.2% test accuracy on 8,552 unseen examples, surpassing a full-batch L-BFGS probe (84.5%) while the base model remains completely frozen. Our central finding is that loop states encode preference predominantly relationally: a linear probe on pairwise differences achieves 84.5%, the best nonlinear independent evaluator reaches only 65% test accuracy, and linear independent classification scores 21.75%, below chance and with inverted polarity. Interpreted precisely, the evaluator functions as a model-internal consistency probe, measuring how stably Ouro's own learned value system organizes its representations rather than how well it predicts noisy human annotations. We also document a systematic architecture search that established a genuine 70% ceiling for independent scoring, and show how the 50% argument-swap protocol required to prevent degenerate pairwise solutions deflated pairwise training metrics by about 31 points at peak, creating the false appearance that pairwise and pointwise evaluators shared the same ceiling. Finally, we show that a cosine learning-rate dead zone at epoch 2 accidentally acted as early stopping, preserving the generalization peak before overfitting degraded test accuracy from 95.2% to 62.4% by epoch 5. Cross-epoch flip-test analysis shows that antisymmetry correlation remains stable while strict sign-flip rate mainly tracks scorer bias. We propose the flip test as a mandatory diagnostic for pairwise preference evaluators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that looped transformer internal states (from Ouro-2.6B-Thinking on Anthropic HH-RLHF) encode human preferences predominantly in relational form. This is shown by a linear probe on pairwise state differences reaching 84.5% test accuracy on 8,552 examples, while independent linear classification reaches only 21.75% (below chance, inverted polarity) and the best nonlinear independent evaluator reaches 65%. Lightweight (~5M-param) heads are trained on frozen loop states; the work includes a systematic architecture search establishing a 70% independent ceiling, quantifies a ~31-point deflation from the 50% argument-swap protocol, and identifies an accidental cosine LR dead-zone as early stopping that preserved the 95.2% peak. The evaluators are interpreted as model-internal consistency probes rather than direct predictors of human labels, and a flip-test diagnostic is proposed.

Significance. If the results hold, the work advances understanding of how iterative refinement organizes preference representations relationally inside looped transformers, providing a concrete internal-consistency probe for model value systems. Credit is due for the systematic architecture search that bounds the independent ceiling, the explicit quantification of protocol artifacts (argument swap and early stopping), and the introduction of the flip test as a diagnostic. These elements make the relational-encoding claim more falsifiable and reproducible than typical post-hoc probing studies.

minor comments (3)
  1. [Methods] The methods section should explicitly state the train/validation/test splits for both the base model fine-tuning and the subsequent evaluator-head training, including whether the 8,552 test examples were strictly held out from any hyperparameter tuning.
  2. [Results] The flip-test analysis (antisymmetry correlation and sign-flip rate) is introduced as a mandatory diagnostic; adding pseudocode or a short algorithm box would improve reproducibility without lengthening the main text.
  3. Table or figure captions for the architecture-search results should list the exact hyperparameter ranges explored and the validation metric used to select the 70% independent ceiling, to allow direct replication.
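
Comment 2's request can be sketched in a few lines. The three metrics below are reconstructed from their descriptions in the abstract (antisymmetry correlation, strict sign-flip rate, scorer bias); the function is illustrative, not the authors' implementation.

```python
# Flip-test diagnostic, reconstructed from the metric names: run the scorer
# on each pair in both argument orders and summarize how antisymmetric it is.
import numpy as np

def flip_test(score, a, b):
    """score(x, y) -> real-valued margins favoring x over y, one per pair."""
    s_fwd = score(a, b)
    s_rev = score(b, a)
    rho = np.corrcoef(s_fwd, s_rev)[0, 1]         # -1 for perfect antisymmetry
    flip_rate = np.mean(np.sign(s_fwd) != np.sign(s_rev))
    bias = np.mean(s_fwd + s_rev)                 # 0 for an unbiased scorer
    return rho, flip_rate, bias
```

A perfectly antisymmetric scorer gives ρ = −1, flip rate 1.0, and zero bias. A large positive bias pushes both orderings' scores positive, so the sign stops flipping even while the correlation stays near −1; that is the dissociation the cross-epoch analysis reports.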

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed summary of our work, the positive assessment of its significance, and the recommendation for minor revision. No specific major comments or criticisms were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's claims rest on empirical measurements: lightweight probes trained on frozen Ouro-2.6B-Thinking hidden states to predict held-out human preference labels from the Anthropic HH-RLHF dataset. Pairwise difference probes, independent classifiers, and architecture-search ceilings are reported as direct test accuracies (e.g., 84.5% pairwise linear, 65% best nonlinear independent). The paper explicitly measures and reports the deflationary effect of the 50% argument-swap protocol, the early-stopping effect of the cosine LR dead-zone, and the 70% independent ceiling from systematic search. No equation or result is shown to reduce by construction to its own inputs, no self-citation supplies a load-bearing uniqueness theorem, and the interpretation as 'model-internal consistency probe' is presented as a reframing of the observed accuracy gap rather than a definitional tautology. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the assumption that the lightweight heads are faithful probes of the frozen model's representations and that the HH-RLHF labels provide a stable external signal against which to measure internal consistency. No new physical entities are postulated.

free parameters (1)
  • evaluator head architecture and learning rate schedule
    The ~5M-parameter heads and the cosine schedule with dead zone at epoch 2 are chosen and tuned; the dead zone is later interpreted as beneficial early stopping.
axioms (2)
  • domain assumption The looped transformer's hidden states after each iteration contain stable, extractable preference information.
    Invoked when the authors treat the extracted states as the substrate for preference encoding.
  • domain assumption Pairwise accuracy above independent accuracy indicates relational rather than absolute encoding.
    The mapping from the observed accuracy gap to the 'relational' interpretation is taken as given.

pith-pipeline@v0.9.0 · 5614 in / 1662 out tokens · 41051 ms · 2026-05-10T16:53:36.083854+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 3 canonical work pages · 3 internal anchors

  1. Anthropic. Emotion concepts and their function in a large language model. Transformer Circuits Thread, 2026. https://transformer-circuits.pub/2025/emotion-features/index.html
  2. Bai, S., Kolter, J. Z., and Koltun, V. Deep Equilibrium Models. In NeurIPS, 2019.
  3. Bai, Y., et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862, 2022.
  4. Christiano, P., et al. Deep Reinforcement Learning from Human Preferences. In NeurIPS, 2017.
  5. Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, Ł. Universal Transformers. In ICLR, 2019.
  6. Graves, A. Adaptive Computation Time for Recurrent Neural Networks. arXiv:1603.08983, 2016.
  7. Maheswaran, A. and Desarkar, M. S. A Unified View on Emotion Representation in Large Language Models. In EACL, 2026.
  8. Ouyang, L., et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022.
  9. Templeton, A., et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.
  10. Zhu, R., Wang, Z., Hua, K., Zhang, T., et al. Scaling Latent Reasoning via Looped Language Models. arXiv:2510.25741, 2025.