
arxiv: 2605.18838 · v3 · pith:GJWMKKG2new · submitted 2026-05-13 · 💻 cs.LG · cs.AI · cs.CL

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

Pith reviewed 2026-05-20 22:03 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords language model scaling · capability coupling · phase transition · reasoning · truthfulness · critical scale · alignment · benchmark correlation

The pith

Language models switch from anticorrelated to cooperative reasoning and truthfulness above a critical scale of about 3.5 billion parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how reasoning and truthfulness interact across dozens of language models of different sizes and families. It finds that below a certain parameter threshold specific to each model family, gains in one capability tend to reduce performance in the other. Once models surpass this critical scale, the two capabilities start to improve together. This regime change is not visible in standard loss scaling curves but appears clearly through correlations in public benchmark scores. The result indicates that apparent alignment trade-offs can resolve naturally with scale, though architecture, data curation, and training choices can shift the transition point independently.

Core claim

We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale Nc, capabilities anticorrelate; above it, they cooperate. Nc ≈ 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI). Architecture, data curation, and training recipe each shift Nc independently. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out models at 5.6% error. The cooperative regime extends to the frontier with r = +0.72.

What carries the argument

The coupling between reasoning and truthfulness as measured by the sign and strength of correlation in public benchmark scores across a model family, which detects the phase transition at the critical scale Nc and indicates whether capabilities compete or reinforce.
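
As a rough illustration of the diagnostic described above, the sketch below computes a sliding-window Pearson correlation between two benchmark scores over models sorted by parameter count, then interpolates the scale at which the correlation first turns from negative to non-negative. The function names, the window size, and the numbers are all hypothetical and synthetic; the paper's exact coupling estimator is not given on this page, so this is one plausible reading rather than the authors' method.

```python
import numpy as np

def running_coupling(params, reasoning, truthfulness, window=3):
    """Pearson r between two scores in a sliding window over size-sorted models.

    Returns (centers, rs): the geometric-mean size of each window and its r.
    A window in which either score is constant would yield NaN.
    """
    order = np.argsort(params)
    n = np.asarray(params, dtype=float)[order]
    x = np.asarray(reasoning, dtype=float)[order]
    y = np.asarray(truthfulness, dtype=float)[order]
    centers, rs = [], []
    for i in range(len(n) - window + 1):
        sl = slice(i, i + window)
        rs.append(np.corrcoef(x[sl], y[sl])[0, 1])
        centers.append(np.exp(np.mean(np.log(n[sl]))))
    return np.array(centers), np.array(rs)

def first_sign_change(centers, rs):
    """Scale where r first goes from negative to non-negative, or None.

    Interpolates linearly in log(size) between the two bracketing windows.
    """
    for i in range(len(rs) - 1):
        if rs[i] < 0 <= rs[i + 1]:
            t = -rs[i] / (rs[i + 1] - rs[i])
            log_nc = np.log(centers[i]) + t * (np.log(centers[i + 1]) - np.log(centers[i]))
            return float(np.exp(log_nc))
    return None

# Synthetic scores for one hypothetical family (not data from the paper):
# reasoning rises steadily, truthfulness dips and then recovers.
params = np.array([0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8]) * 1e9
reasoning = np.array([0.30, 0.34, 0.38, 0.42, 0.46, 0.50, 0.54, 0.58])
truthfulness = np.array([0.40, 0.37, 0.34, 0.31, 0.31, 0.34, 0.37, 0.40])

centers, rs = running_coupling(params, reasoning, truthfulness, window=3)
nc = first_sign_change(centers, rs)
print(rs.round(3), nc)
```

The window size trades resolution against noise: small windows localize the sign change but make each correlation estimate unstable, which is one reason a wide confidence interval on Nc would not be surprising.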

If this is right

  • Curated training data can eliminate the coupling dip and raise correlation at smaller scales, as observed between Qwen generations.
  • Architectural innovations and distillation allow models like Gemma-4 at 4B to reach coupling levels typical of 13B+ standard models.
  • Data curation alone enables small models like Phi at 1B to match coupling of web-trained models at 10B.
  • Width normalization removes anticorrelation for all tested families, consistent with an output-projection bottleneck.
  • The phase diagnostic and ODE predictions require only public benchmark scores and work without access to model internals.
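
The width-normalization step mentioned in the list above can be sketched as follows, assuming (as the Figure 4 caption states) that it means dividing each benchmark score by the model width d_model before recomputing the correlation. The widths and scores are synthetic placeholders, and `coupling` is a hypothetical helper name.

```python
import numpy as np

def coupling(x, y):
    """Pearson correlation between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic sub-critical family (illustrative values, not from the paper).
d_model = np.array([512.0, 1024.0, 2048.0, 4096.0])
reasoning = np.array([0.30, 0.40, 0.50, 0.60])
truthfulness = np.array([0.40, 0.38, 0.36, 0.34])

raw = coupling(reasoning, truthfulness)
normed = coupling(reasoning / d_model, truthfulness / d_model)
print(raw, normed)
```

In this toy data the sign flip follows arithmetically from dividing both columns by the same growing width, so the example illustrates the operation only and says nothing about whether the paper's observed flip reflects an output-projection bottleneck.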

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the phase transition applies to other capability pairs, multiple hidden transitions may exist during scaling that affect alignment strategies differently at each stage.
  • Benchmark design could incorporate scale-aware rotation to avoid conflating phase-dependent effects with true capability gaps.
  • Focusing interventions on width or curation may accelerate entry into the cooperative regime more efficiently than uniform scaling.
  • The transition suggests that safety and alignment interventions for sub-critical models may need to address competition between capabilities directly.

Load-bearing premise

The measured correlation between public benchmark scores for reasoning and truthfulness reflects a genuine internal coupling mechanism rather than artifacts of benchmark construction, data overlap, or independent scaling trends.

What would settle it

Recomputing the correlations on a fresh set of benchmarks with no training data overlap and finding that the sign change at Nc disappears or that anticorrelation persists uniformly across all scales and families.
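
One way to judge whether a recomputed correlation is distinguishable from chance is a permutation test, the kind of nonparametric test the abstract reports. The sketch below is a generic two-sided version; the function name, the number of permutations, and the data are assumptions for illustration, not the paper's procedure or scores.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=5000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y.

    Shuffles y to break any pairing with x, and counts how often the
    shuffled |r| is at least as large as the observed |r|.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_perm):
        r = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r) >= abs(observed) - 1e-12:
            hits += 1
    # The +1 terms count the observed arrangement itself, so p is never exactly 0.
    return float(observed), (hits + 1) / (n_perm + 1)

# Synthetic, strongly anticorrelated scores for eight hypothetical models.
x = np.arange(8, dtype=float)
y = np.array([-0.0, -1.1, -1.9, -3.2, -3.9, -5.1, -6.0, -6.8])

r_obs, p_val = permutation_pvalue(x, y)
print(r_obs, p_val)
```

With only a handful of models per family, the smallest attainable p-value is bounded by the number of distinct permutations, which is worth keeping in mind when comparing p-values across families of different sizes.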

Figures

Figures reproduced from arXiv: 2605.18838 by Adil Amin.

Figure 1: Capability coupling phase transition across 63 models and 16 families. (a) Phase diagram: HellaSwag vs. TruthfulQA across families, showing the U-shaped trajectory. (b) Running coupling γ12(N) for six families, with architecture-specific Nc marked. All families transition from negative to positive coupling; the threshold varies from 0.12B (OPT) to 7B (Falcon). (c) OLMo confirmation: γ12 = 0.000 at 1B param…
Figure 2: Loss is exact; the transition lives in the coupling. (a) N^α(L − E) = 154 ± 2 (CV = 0.8%) across all 8 Pythia models: loss follows a single power law with no visible transition. (b) Boosting chain: the independent-parameter gradient prediction (L1) makes the error 142× worse, the strongest single diagnostic that parameters are collectively coupled. The collective correction (L2) restores agreement. (c) Holdo…
Figure 3: ODE reproduces benchmark trajectories and cross-predicts held-out family. Sparse regression discovers a dynamical system that simultaneously fits five Pythia benchmarks (HellaSwag, TruthfulQA, ARC, WinoGrande, MMLU) at 2.6% mean error. Cross-prediction on held-out Llama-2 achieves 5.6% MAE, approximately twice the accuracy of polynomial baselines. … before the Nc2 crash: the ODE is predictive within a phase b…
Figure 4: The alignment tax is a design choice. (a) Qwen2.5 at 1.5B shows a coupling dip (3% cooperative, net = 0.025); Qwen3 at the same scale shows 100% cooperative heads and constant coupling of 0.830. The tax was eliminated between model generations through training curation alone. (b) Width normalization: dividing benchmark scores by model width (d_model) flips the correlation from negative to positive for all …
Figure 5: Internal coupling: zero competing heads across 40 models. Bars show the percentage of cooperative attention heads per family (averaged across sizes). 38 of 40 individual models show 100% cooperative heads. The two exceptions are both Qwen2.5: at 1.5B, only 3% of heads are cooperative (the remaining 97% compete, the known dip), and at 7B, 99.7% are cooperative (mild last-layer dip). These pull the Qwen2.5 family…
Figure 6: Output projection bottleneck is scale-specific. At Pythia-410M (tax) and Pythia-2.8B (bonus), the projection increases coupling. At Pythia-1B (Nc), coupling drops from 0.725 (hidden) to 0.639 (output), a 12% compression loss. A wider projection recovers coupling to 0.805. The bottleneck is dimensional: it appears only at the transition scale. This co…
Figure 7: The critical scale is a training parameter, not a size barrier. (a) In coupling–dimensionality space, PLE architecture trades per-dimension coupling for representational axes (Gemma-3→Gemma-4, dashed arrow), and RLHF restores coupling while preserving the extra dimensions (solid red arrow). All three models are 4B parameters. (b) Data curation eliminates the alignment tax: Qwen2.5 at 1.5B has coupling 0.…
Original abstract

Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale N_c, capabilities anticorrelate (r = -0.989, p = 4 × 10^{-5}, nonparametric permutation test); above it, they cooperate. N_c ≈ 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift N_c independently: curated training eliminated the coupling dip between Qwen generations (0.025 to 0.830 at matched scale); Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation; and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals, only public benchmark scores across a model family. The cooperative regime extends to the frontier (r = +0.72, 34 models, 10 labs). A proof-of-concept intervention confirms the bottleneck is exploitable: adding a single truth-direction vector at the identified layer corrects 60% of misaligned outputs in the tax phase with zero retraining, a surgical, per-inference correction that requires no weight modification. Code, data, an open-source steering CLI for any open-weight model, and an interactive dashboard for phase diagnosis are released: https://zehenlabs.com/cape/.
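
The intervention described at the end of the abstract, adding a single direction vector to activations at one layer, can be sketched in a few lines. The version below uses plain NumPy arrays in place of a real model's hidden states, and it estimates the direction as a difference of class means; that estimator, the function names, and the scale `alpha` are assumptions for illustration, since this page does not say how the paper derives its truth direction or which layer it targets.

```python
import numpy as np

def truth_direction(h_true, h_false):
    """Unit vector from the mean 'false' activation to the mean 'true' one.

    Difference of means is an assumed, commonly used choice, not
    necessarily the paper's method.
    """
    d = h_true.mean(axis=0) - h_false.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha):
    """Shift every hidden state by alpha along the given unit direction."""
    return hidden + alpha * direction

# Random stand-ins for layer activations (rows = examples or tokens).
rng = np.random.default_rng(0)
d_model = 16
h_true = rng.normal(size=(32, d_model)) + 1.0   # hypothetical truthful examples
h_false = rng.normal(size=(32, d_model))        # hypothetical untruthful examples

v = truth_direction(h_true, h_false)
h = rng.normal(size=(4, d_model))
h_steered = steer(h, v, alpha=2.0)

# Each row's projection onto v moves by exactly alpha; weights are untouched.
print(((h_steered - h) @ v).round(6))
```

In a real open-weight model the same addition would typically be applied inside a forward hook on the chosen layer, which is what makes the correction per-inference rather than a weight change.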

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that scaling reveals a hidden phase transition in capability coupling: across 63 base models from 16 families, reasoning and truthfulness benchmarks anticorrelate below a family-dependent critical scale Nc ≈ 3.5B parameters (bootstrap 95% CI [2.9B, 13.4B]) and positively correlate above it. This transition is invisible to loss curves, depends on architecture/data/training factors (e.g., curation shifts coupling from 0.025 to 0.830), is supported by width normalization eliminating anticorrelation and zero competing heads in 38/40 models, and is modeled by a sparse-regression ODE that cross-predicts held-out Llama-2 at 5.6% error. The diagnostic uses only public benchmark scores; code, data, and a dashboard are released.

Significance. If the result holds, the work would be significant for extending scaling laws beyond loss to capability interactions, with implications for alignment and training interventions. Credit is due for the scale of empirical measurements (63 models, 16 families), bootstrap intervals, ODE cross-prediction on held-out models, reported interventions that shift Nc, and open release of code/data/dashboard for reproducibility and further testing.

major comments (3)
  1. [Abstract (coupling measurement)] Abstract, paragraph describing coupling measurement across 63 models: Nc is estimated per family from the same benchmark data used to define the coupling metric, and the ODE is a sparse regression fit; while cross-prediction on held-out models adds some independence, the core quantities remain derived from fitted parameters on the observed correlations, risking circularity in establishing the phase transition.
  2. [Abstract (interventions and internal mechanisms)] Abstract, paragraph on interventions (curated training, Gemma-4, Phi, width normalization): interventions are reported to shift coupling (e.g., 0.025 → 0.830 at matched scale) and support an output-projection bottleneck, but without explicit controls for benchmark overlap, data contamination between reasoning/truthfulness suites, or matched training data ablations, these could alter surface performance rather than reveal an internal mechanism.
  3. [Abstract (ODE and cross-prediction)] Abstract, ODE cross-prediction description: the sparse-regression ODE achieves 5.6% error on Llama-2, but this does not directly test whether the anticorrelation-to-cooperation switch reflects a genuine internal coupling (e.g., zero competing heads) or independent power-law scaling trends that cross at ~3.5B.
minor comments (2)
  1. [Abstract] Abstract: the wide bootstrap CI [2.9B, 13.4B] for Nc should be discussed with respect to the sharpness of the claimed transition and sensitivity to benchmark selection.
  2. [Abstract] Abstract: the frontier cooperative regime (r = +0.72, 34 models, 10 labs) would benefit from explicit listing of the exact models or families included to allow replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the scale of our empirical analysis and the open release of code and data. We address each of the major comments in turn, providing clarifications and noting revisions where appropriate to improve the manuscript.

point-by-point responses
  1. Referee: Abstract, paragraph describing coupling measurement across 63 models: Nc is estimated per family from the same benchmark data used to define the coupling metric, and the ODE is a sparse regression fit; while cross-prediction on held-out models adds some independence, the core quantities remain derived from fitted parameters on the observed correlations, risking circularity in establishing the phase transition.

    Authors: We agree that both the coupling metric and the estimation of Nc are derived from the same set of public benchmark scores, which could raise questions about circularity. However, the coupling is computed as the correlation coefficient across models of varying sizes within a family, and Nc is the scale at which this correlation changes sign, identified through bootstrap resampling for uncertainty quantification. This is an observational finding from the data rather than a circular definition. The ODE provides a dynamical model that is validated through cross-prediction on held-out models like Llama-2. To mitigate concerns, we will revise the abstract and add a dedicated subsection in the methods clarifying the separation between metric definition and transition point estimation, along with additional robustness checks. revision: partial

  2. Referee: Abstract, paragraph on interventions (curated training, Gemma-4, Phi, width normalization): interventions are reported to shift coupling (e.g., 0.025 → 0.830 at matched scale) and support an output-projection bottleneck, but without explicit controls for benchmark overlap, data contamination between reasoning/truthfulness suites, or matched training data ablations, these could alter surface performance rather than reveal an internal mechanism.

    Authors: This is a valid point regarding potential confounds. Our interventions demonstrate consistent shifts in the observed coupling at matched model scales across different families and training approaches. For instance, data curation in Qwen shifts the coupling significantly, and width normalization removes the anticorrelation entirely. While we do not have full access to proprietary training datasets for exhaustive contamination checks, the internal analysis of attention heads (zero competing heads in most models) provides supporting evidence for a mechanistic basis. We will add a new limitations paragraph discussing benchmark overlap and contamination risks, and emphasize that the results are based on public benchmarks. revision: yes

  3. Referee: Abstract, ODE cross-prediction description: the sparse-regression ODE achieves 5.6% error on Llama-2, but this does not directly test whether the anticorrelation-to-cooperation switch reflects a genuine internal coupling (e.g., zero competing heads) or independent power-law scaling trends that cross at ~3.5B.

    Authors: We clarify that the ODE is intended as a phenomenological model of the capability scaling trajectories, not as direct proof of internal mechanisms. The 5.6% cross-prediction error validates its ability to capture the transition dynamics on unseen models. The evidence for internal coupling comes from complementary analyses: the absence of competing attention heads in 38/40 models and the effect of width normalization on eliminating anticorrelation. We will update the abstract to better distinguish the ODE's role in modeling scaling from the internal diagnostics. revision: yes
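
For readers unfamiliar with sparse-regression ODE discovery, the sketch below shows sequentially thresholded least squares, a common technique for this kind of fit. Whether the paper uses this particular algorithm is not stated on this page, so treat it as an assumed stand-in. The candidate library (1, x, x²), the threshold, and the synthetic trajectory are all illustrative choices.

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares.

    Solves theta @ xi ≈ dxdt, repeatedly zeroing coefficients with
    magnitude below `threshold` and refitting the surviving terms.
    """
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if not big.any():
            break
        xi[big] = np.linalg.lstsq(theta[:, big], dxdt, rcond=None)[0]
    return xi

# Synthetic trajectory obeying dx/ds = -0.5 * x (s could stand for log scale).
s = np.linspace(0.0, 5.0, 200)
x = np.exp(-0.5 * s)
dxds = -0.5 * x  # analytic derivative; real data would need a numerical estimate

# Candidate terms: constant, linear, quadratic.
theta = np.column_stack([np.ones_like(s), x, x**2])
xi = stlsq(theta, dxds)
print(xi)
```

A low error on a held-out family, as reported for Llama-2, shows that the recovered terms generalize as a description of score trajectories; as the authors' response notes, that is separate from evidence about internal mechanisms.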

Circularity Check

0 steps flagged

Empirical correlation analysis across model families identifies phase transition without self-referential derivation

full rationale

The paper computes coupling directly as the correlation between reasoning and truthfulness benchmark scores on 63 models from 16 families, locates the sign-change point Nc per family via bootstrap on those same observed correlations, and fits a sparse-regression ODE whose cross-prediction error is reported on held-out models (Llama-2 at 5.6%). These steps are standard data-driven estimation and out-of-sample validation rather than any reduction of a claimed result to its own fitted inputs by construction. No equations are shown to equate a prediction to a prior fit, no uniqueness theorem is imported from self-citation, and no ansatz is smuggled; the derivation remains self-contained against the external benchmark data.
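
The bootstrap step in the rationale above can be illustrated with a percentile interval around a sign-change scale. The estimator here, a straight-line fit of coupling against log10(N) whose zero crossing is taken as Nc, is a hypothetical simplification chosen to keep the example short; the coupling values are synthetic and the paper's actual estimator may differ.

```python
import numpy as np

def zero_crossing_scale(params, coupling):
    """Fit coupling = a + b * log10(N); return (N at zero crossing, slope b)."""
    b, a = np.polyfit(np.log10(params), coupling, 1)
    return 10 ** (-a / b), b

def bootstrap_nc(params, coupling, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the zero-crossing scale.

    Resamples (size, coupling) pairs with replacement; skips resamples that
    cannot support a line fit or whose fitted slope is not positive.
    """
    rng = np.random.default_rng(seed)
    params = np.asarray(params, dtype=float)
    coupling = np.asarray(coupling, dtype=float)
    n = len(params)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if np.unique(params[idx]).size < 2:
            continue
        nc, slope = zero_crossing_scale(params[idx], coupling[idx])
        if slope > 0:
            estimates.append(nc)
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(estimates, [tail, 100 - tail])
    point, _ = zero_crossing_scale(params, coupling)
    return float(point), float(lo), float(hi)

# Synthetic coupling values at six hypothetical model sizes.
params = np.array([0.4, 1.0, 1.4, 2.8, 6.9, 12.0]) * 1e9
coupling = np.array([-0.6, -0.3, -0.2, 0.1, 0.45, 0.7])

point, lo, hi = bootstrap_nc(params, coupling)
print(f"Nc ~ {point:.3g}  CI [{lo:.3g}, {hi:.3g}]")
```

Because the interval is computed on a log-scale crossing, it is naturally asymmetric in parameter count, in the same way the reported [2.9B, 13.4B] interval is asymmetric around 3.5B.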

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on treating benchmark scores as faithful proxies for internal capabilities and on estimating a critical scale from observed correlations; no new physical entities are introduced.

free parameters (1)
  • critical scale Nc = 3.5B
    Family-dependent tipping point estimated from benchmark correlations with bootstrap confidence interval.
axioms (1)
  • domain assumption: Public benchmark scores for reasoning and truthfulness accurately reflect the underlying capabilities whose coupling is being measured.
    All coupling statistics and phase identification are computed directly from these scores.

pith-pipeline@v0.9.0 · 5852 in / 1396 out tokens · 49815 ms · 2026-05-20T22:03:43.996957+00:00 · methodology

