Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling
Pith reviewed 2026-05-20 22:03 UTC · model grok-4.3
The pith
Reasoning and truthfulness in language models switch from anticorrelated to cooperative above a critical scale of about 3.5 billion parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale Nc, capabilities anticorrelate; above it, they cooperate. Nc ≈ 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI). Architecture, data curation, and training recipe each shift Nc independently. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out models at 5.6% error. The cooperative regime extends to the frontier with r = +0.72.
What carries the argument
The coupling between reasoning and truthfulness, measured as the sign and strength of the correlation between public benchmark scores across a model family. Its sign change marks the phase transition at the critical scale Nc and shows whether the two capabilities compete or reinforce each other.
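The diagnostic described above can be sketched in a few lines. This is a minimal illustration, assuming the coupling is a plain Pearson correlation between two benchmark scores within one family; the scores are invented, and `coupling` and `regime` are hypothetical helpers, not the paper's released code.

```python
import numpy as np

def coupling(reasoning, truthfulness):
    """Pearson correlation between two benchmark score series for one family."""
    r = np.asarray(reasoning, dtype=float)
    t = np.asarray(truthfulness, dtype=float)
    if r.shape != t.shape or r.size < 3:
        raise ValueError("need at least three models with paired scores")
    return float(np.corrcoef(r, t)[0, 1])

def regime(r_value):
    """Label the regime by the sign of the coupling."""
    return "cooperative" if r_value > 0 else "anticorrelated"

# Invented scores for a hypothetical family, ordered by parameter count.
small_reasoning    = [30.0, 35.0, 41.0, 48.0]
small_truthfulness = [44.0, 42.5, 40.0, 38.0]   # falls as reasoning rises
large_reasoning    = [60.0, 68.0, 75.0, 81.0]
large_truthfulness = [41.0, 46.0, 52.0, 57.0]   # rises with reasoning

r_small = coupling(small_reasoning, small_truthfulness)
r_large = coupling(large_reasoning, large_truthfulness)
print(regime(r_small), regime(r_large))
```

Only the sign of the correlation is used to label the regime here; how the paper windows models by scale before correlating is not specified on this page.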
If this is right
- Curated training data can eliminate the coupling dip and raise correlation at smaller scales, as observed between Qwen generations.
- Architectural innovations and distillation allow models like Gemma-4 at 4B to reach coupling levels typical of 13B+ standard models.
- Data curation alone enables small models like Phi at 1B to match coupling of web-trained models at 10B.
- Width normalization removes anticorrelation for all tested families, consistent with an output-projection bottleneck.
- The phase diagnostic and ODE predictions require only public benchmark scores and work without access to model internals.
Where Pith is reading between the lines
- If the phase transition applies to other capability pairs, multiple hidden transitions may exist during scaling that affect alignment strategies differently at each stage.
- Benchmark design could incorporate scale-aware rotation to avoid conflating phase-dependent effects with true capability gaps.
- Focusing interventions on width or curation may accelerate entry into the cooperative regime more efficiently than uniform scaling.
- The transition suggests that safety and alignment interventions for sub-critical models may need to address competition between capabilities directly.
Load-bearing premise
The measured correlation between public benchmark scores for reasoning and truthfulness reflects a genuine internal coupling mechanism rather than artifacts of benchmark construction, data overlap, or independent scaling trends.
What would settle it
Recomputing the correlations on a fresh set of benchmarks with no training data overlap and finding that the sign change at Nc disappears or that anticorrelation persists uniformly across all scales and families.
Original abstract
Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale N_c, capabilities anticorrelate (r = -0.989, p = 4 x 10^{-5} nonparametric permutation test); above it, they cooperate. N_c ~ 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift N_c independently: curated training eliminated the coupling dip between Qwen generations (0.025 to 0.830 at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals -- only public benchmark scores across a model family. The cooperative regime extends to the frontier (r = +0.72, 34 models, 10 labs). A proof-of-concept intervention confirms the bottleneck is exploitable: adding a single truth-direction vector at the identified layer corrects 60% of misaligned outputs in the tax phase with zero retraining -- a surgical, per-inference correction that requires no weight modification. Code, data, an open-source steering CLI for any open-weight model, and an interactive dashboard for phase diagnosis are released: https://zehenlabs.com/cape/.
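The abstract reports its sub-critical correlation with a nonparametric permutation test. The sketch below shows one common form of such a test (a two-sided Monte Carlo permutation test on the Pearson correlation). It is an assumption that the paper's test takes this shape; the scores are invented and `permutation_pvalue` is a hypothetical helper.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a Pearson correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    r_obs = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_perm):
        # Shuffling y breaks any pairing with x, which simulates the null.
        r_perm = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r_perm) >= abs(r_obs) - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return float(r_obs), (hits + 1) / (n_perm + 1)

# Invented sub-critical scores: truthfulness drops as reasoning improves.
reasoning    = [30.0, 33.0, 37.0, 42.0, 46.0, 51.0]
truthfulness = [45.0, 44.1, 42.6, 41.2, 39.9, 38.0]
r_obs, p_value = permutation_pvalue(reasoning, truthfulness)
print(f"r = {r_obs:.3f}, p = {p_value:.4f}")
```

With only six models there are 720 distinct orderings, so an exact test cannot resolve p-values below 1/720; small p-values require correspondingly more models.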
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that scaling reveals a hidden phase transition in capability coupling: across 63 base models from 16 families, reasoning and truthfulness benchmarks anticorrelate below a family-dependent critical scale Nc ≈ 3.5B parameters (bootstrap 95% CI [2.9B, 13.4B]) and positively correlate above it. This transition is invisible to loss curves, depends on architecture/data/training factors (e.g., curation shifts coupling from 0.025 to 0.830), is supported by width normalization eliminating anticorrelation and zero competing heads in 38/40 models, and is modeled by a sparse-regression ODE that cross-predicts held-out Llama-2 at 5.6% error. The diagnostic uses only public benchmark scores; code, data, and a dashboard are released.
Significance. If the result holds, the work would be significant for extending scaling laws beyond loss to capability interactions, with implications for alignment and training interventions. Credit is due for the scale of empirical measurements (63 models, 16 families), bootstrap intervals, ODE cross-prediction on held-out models, reported interventions that shift Nc, and open release of code/data/dashboard for reproducibility and further testing.
major comments (3)
- [Abstract (coupling measurement)] Abstract, paragraph describing coupling measurement across 63 models: Nc is estimated per family from the same benchmark data used to define the coupling metric, and the ODE is a sparse regression fit; while cross-prediction on held-out models adds some independence, the core quantities remain derived from fitted parameters on the observed correlations, risking circularity in establishing the phase transition.
- [Abstract (interventions and internal mechanisms)] Abstract, paragraph on interventions (curated training, Gemma-4, Phi, width normalization): interventions are reported to shift coupling (e.g., 0.025 → 0.830 at matched scale) and support an output-projection bottleneck, but without explicit controls for benchmark overlap, data contamination between reasoning/truthfulness suites, or matched training data ablations, these could alter surface performance rather than reveal an internal mechanism.
- [Abstract (ODE and cross-prediction)] Abstract, ODE cross-prediction description: the sparse-regression ODE achieves 5.6% error on Llama-2, but this does not directly test whether the anticorrelation-to-cooperation switch reflects a genuine internal coupling (e.g., zero competing heads) or independent power-law scaling trends that cross at ~3.5B.
minor comments (2)
- [Abstract] Abstract: the wide bootstrap CI [2.9B, 13.4B] for Nc should be discussed with respect to the sharpness of the claimed transition and sensitivity to benchmark selection.
- [Abstract] Abstract: the frontier cooperative regime (r = +0.72, 34 models, 10 labs) would benefit from explicit listing of the exact models or families included to allow replication.
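The sensitivity of the Nc interval raised in the minor comments can be explored with a percentile bootstrap. The sketch below assumes, purely for illustration, that coupling rises roughly linearly in log10(N) near the transition, so that Nc is the zero crossing of a straight-line fit; the paper's actual estimator is not described on this page. The data points are invented, and `critical_scale` and `bootstrap_ci` are hypothetical names.

```python
import numpy as np

def critical_scale(log10_n, coupling):
    """Zero crossing of a straight-line fit of coupling against log10(N)."""
    slope, intercept = np.polyfit(log10_n, coupling, 1)
    return 10.0 ** (-intercept / slope)

def bootstrap_ci(log10_n, coupling, n_boot=2000, seed=0, level=0.95):
    """Percentile bootstrap interval for the critical scale."""
    x = np.asarray(log10_n, dtype=float)
    y = np.asarray(coupling, dtype=float)
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, x.size, size=x.size)  # resample models with replacement
        if np.unique(x[idx]).size < 2:
            continue  # a line needs two distinct scales
        slope, intercept = np.polyfit(x[idx], y[idx], 1)
        if slope > 0:  # keep only resamples where coupling rises with scale
            estimates.append(10.0 ** (-intercept / slope))
    tail = 100.0 * (1.0 - level) / 2.0
    lo, hi = np.percentile(estimates, [tail, 100.0 - tail])
    return float(lo), float(hi)

# Invented (log10 N, coupling) pairs for illustration only.
log10_n  = [8.7, 9.0, 9.3, 9.5, 9.8, 10.1, 10.5, 10.8]
coupling = [-0.55, -0.30, -0.20, 0.05, 0.10, 0.40, 0.55, 0.80]

point = critical_scale(log10_n, coupling)
lo, hi = bootstrap_ci(log10_n, coupling)
print(f"Nc ~ {point:.2e}, 95% CI [{lo:.2e}, {hi:.2e}]")
```

Rerunning this with a different benchmark pair or a different subset of families is one way to see how much of the interval's width comes from benchmark selection rather than from sampling noise.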
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the scale of our empirical analysis and the open release of code and data. We address each of the major comments in turn, providing clarifications and noting revisions where appropriate to improve the manuscript.
Point-by-point responses
- Referee: Abstract, paragraph describing coupling measurement across 63 models: Nc is estimated per family from the same benchmark data used to define the coupling metric, and the ODE is a sparse regression fit; while cross-prediction on held-out models adds some independence, the core quantities remain derived from fitted parameters on the observed correlations, risking circularity in establishing the phase transition.
  Authors: We agree that both the coupling metric and the estimation of Nc are derived from the same set of public benchmark scores, which could raise questions about circularity. However, the coupling is computed as the correlation coefficient across models of varying sizes within a family, and Nc is the scale at which this correlation changes sign, identified through bootstrap resampling for uncertainty quantification. This is an observational finding from the data rather than a circular definition. The ODE provides a dynamical model that is validated through cross-prediction on held-out models like Llama-2. To mitigate concerns, we will revise the abstract and add a dedicated subsection in the methods clarifying the separation between metric definition and transition point estimation, along with additional robustness checks.
  Revision: partial
- Referee: Abstract, paragraph on interventions (curated training, Gemma-4, Phi, width normalization): interventions are reported to shift coupling (e.g., 0.025 → 0.830 at matched scale) and support an output-projection bottleneck, but without explicit controls for benchmark overlap, data contamination between reasoning/truthfulness suites, or matched training data ablations, these could alter surface performance rather than reveal an internal mechanism.
  Authors: This is a valid point regarding potential confounds. Our interventions demonstrate consistent shifts in the observed coupling at matched model scales across different families and training approaches. For instance, data curation in Qwen shifts the coupling significantly, and width normalization removes the anticorrelation entirely. While we do not have full access to proprietary training datasets for exhaustive contamination checks, the internal analysis of attention heads (zero competing heads in most models) provides supporting evidence for a mechanistic basis. We will add a new limitations paragraph discussing benchmark overlap and contamination risks, and emphasize that the results are based on public benchmarks.
  Revision: yes
- Referee: Abstract, ODE cross-prediction description: the sparse-regression ODE achieves 5.6% error on Llama-2, but this does not directly test whether the anticorrelation-to-cooperation switch reflects a genuine internal coupling (e.g., zero competing heads) or independent power-law scaling trends that cross at ~3.5B.
  Authors: We clarify that the ODE is intended as a phenomenological model of the capability scaling trajectories, not as direct proof of internal mechanisms. The 5.6% cross-prediction error validates its ability to capture the transition dynamics on unseen models. The evidence for internal coupling comes from complementary analyses: the absence of competing attention heads in 38/40 models and the effect of width normalization on eliminating anticorrelation. We will update the abstract to better distinguish the ODE's role in modeling scaling from the internal diagnostics.
  Revision: yes
Circularity Check
Empirical correlation analysis across model families identifies phase transition without self-referential derivation
The paper computes coupling directly as the correlation between reasoning and truthfulness benchmark scores on 63 models from 16 families, locates the sign-change point Nc per family via bootstrap on those same observed correlations, and fits a sparse-regression ODE whose cross-prediction error is reported on held-out models (Llama-2 at 5.6%). These steps are standard data-driven estimation and out-of-sample validation rather than any reduction of a claimed result to its own fitted inputs by construction. No equations are shown to equate a prediction to a prior fit, no uniqueness theorem is imported from self-citation, and no ansatz is smuggled; the derivation remains self-contained against the external benchmark data.
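To make the "sparse-regression ODE" step concrete, the sketch below fits a model with linear and quadratic terms in two scores, with log10(N) as the independent variable, using sequentially thresholded least squares. This is a stand-in technique in the style of SINDy; whether the paper uses this exact solver is an assumption. The trajectory is a synthetic damped spiral chosen only because its true coefficients are known; it is not meant to resemble real benchmark curves, and `library` and `stlsq` are hypothetical names.

```python
import numpy as np

def library(B):
    """Candidate terms for two scores: B1, B2, B1^2, B1*B2, B2^2."""
    b1, b2 = B[:, 0], B[:, 1]
    return np.column_stack([b1, b2, b1 * b1, b1 * b2, b2 * b2])

def stlsq(theta, dB, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    xi = np.linalg.lstsq(theta, dB, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for k in range(dB.shape[1]):
            keep = ~small[:, k]
            if keep.any():
                # Refit only the surviving terms for state k.
                xi[keep, k] = np.linalg.lstsq(theta[:, keep], dB[:, k], rcond=None)[0]
    return xi

# Synthetic trajectory with known linear dynamics in s = log10(N):
#   dB1/ds = -0.1*B1 + 2*B2,   dB2/ds = -2*B1 - 0.1*B2
s = np.linspace(0.0, 10.0, 2001)
B = np.column_stack([np.exp(-0.1 * s) * np.cos(2 * s),
                     -np.exp(-0.1 * s) * np.sin(2 * s)])
dB = np.gradient(B, s, axis=0, edge_order=2)  # finite-difference derivative
xi = stlsq(library(B), dB)
print(np.round(xi, 3))  # rows: library terms, columns: dB1/ds, dB2/ds
```

A cross-prediction check in the spirit of the one reported would fit `xi` on some families and integrate the resulting ODE forward for a held-out family; that step is not shown here.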
Axiom & Free-Parameter Ledger
free parameters (1)
- critical scale Nc = 3.5B
axioms (1)
- domain assumption: Public benchmark scores for reasoning and truthfulness accurately reflect the underlying capabilities whose coupling is being measured.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We measure the coupling between reasoning and truthfulness across 63 base models... γ_{12}(N) ≡ ΔTQA/ΔHS... ODE dB_i/d log_{10} N = ∑ c_{ij} B_j + ∑ d_{ijk} B_j B_k
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: 38 of 40 models show zero competing attention heads... output-projection bottleneck
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.