Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express

Xiao-Ping Zhang; Ziyi Ding

arxiv: 2605.25891 · v1 · pith:7FVRAIFFnew · submitted 2026-05-25 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-29 22:11 UTC · model grok-4.3

keywords: causal reasoning, large language models, linear probes, hidden states, commonsense bias, CLadder dataset, yes/no outputs

The pith

LLMs internally encode the evidence-based causal answer but output the commonsense yes/no instead on conflicting questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a mismatch on anti-commonsense causal questions from the CLadder dataset: a fixed linear probe on hidden states recovers the answer supported by the given evidence at roughly 97 percent accuracy, yet the model's spoken yes/no answer follows the conflicting commonsense at roughly 50 percent accuracy. This gap is labeled Causal Tongue-Tie and is decomposed into two separable issues—one where no internal signal exists and one where a signal exists but cannot be expressed verbally. The authors conclude that single accuracy numbers from output-only benchmarks are insufficient to determine whether models have grasped causal direction, because a correct output need not indicate understanding and an incorrect output need not indicate inability.

Core claim

On anti-commonsense CLadder items, hidden-state representations contain the evidence-supported causal answer even when the model's verbal yes/no output reverts to the commonsense answer, producing an accuracy gap of approximately 0.5 between probe recovery and spoken response.

What carries the argument

A fixed linear probe applied to the model's hidden states that extracts the evidence-supported causal direction, set against the verbal yes/no generation interface that fails to express it.
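The probe setup can be sketched roughly as follows. The Figure 2 caption describes the probe as an ℓ2-regularised logistic regression g_w(h) = σ(w⊤h + b), fit once on hidden states from cs items and then frozen. Everything else here is an assumption made for illustration: the synthetic vectors standing in for hidden states, the helper names (`fit_probe`, `toy_states`), the learning rate, and the penalty strength. It is a toy under those assumptions, not the authors' implementation.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_probe(states, labels, l2=0.01, lr=0.2, epochs=300):
    """Fit g_w(h) = sigmoid(w.h + b) by full-batch gradient descent with an L2 penalty on w."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(states, labels):
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - y
            for j in range(dim):
                grad_w[j] += err * h[j]
            grad_b += err
        w = [wi - lr * (gw / n + l2 * wi) for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(w, b, states, labels):
    # Predict 1 when the logit is positive, then compare with the labels.
    hits = sum((sum(wi * hi for wi, hi in zip(w, h)) + b > 0) == (y == 1)
               for h, y in zip(states, labels))
    return hits / len(labels)

def toy_states(n, dim, rng, offset):
    """Hypothetical stand-in for hidden states: label signal on axis 0, nuisance shift on axis 1."""
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[0] += 3.0 if y == 1 else -3.0   # direction carrying the evidence-supported answer
        h[1] += offset                     # subset-specific shift unrelated to the label
        states.append(h)
        labels.append(y)
    return states, labels

rng = random.Random(0)
cs_states, cs_labels = toy_states(200, 8, rng, offset=0.0)      # stand-in for cs items
anti_states, anti_labels = toy_states(80, 8, rng, offset=1.0)   # stand-in for anti-cs items

w, b = fit_probe(cs_states, cs_labels)   # fit once, on cs items only
frozen = (tuple(w), b)                    # never refit after this point
acc_anti = accuracy(frozen[0], frozen[1], anti_states, anti_labels)
print(f"frozen-probe accuracy on anti-cs toy items: {acc_anti:.3f}")
```

The point of freezing is that the anti-cs items never influence the probe weights, so a high score on them cannot come from fitting to those items.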

If this is right

A benchmark answer labeled correct does not establish that the model has internally represented the causal relation.
A benchmark answer labeled incorrect does not establish that the model lacks the relevant causal representation.
Causal reasoning claims drawn from yes/no accuracy alone require separate checks on internal representations.
The verbal output channel can mask encoded causal knowledge that remains accessible via probing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar internal-versus-output gaps may exist for other structured reasoning tasks that pit evidence against prior patterns.
Evaluation protocols could combine output accuracy with lightweight probes on the same items to separate encoding failures from expression failures.
Interventions that alter hidden-state representations might be tested to see whether they shift verbal outputs toward the probed answer.
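A combined protocol of the kind suggested above could be sketched as a per-item tally into the four cells of the Figure 1 framework. The caption names only cells (b) and (c) and says that (a)+(c) collapse into "correct" and (b)+(d) into "failed"; the reading of (a) and (d) below is inferred from that. Treating per-item probe correctness as a proxy for "internal signal present" is an assumption of this sketch, and the labels are hypothetical toy values rather than data from the paper.

```python
from collections import Counter

CELLS = {
    (True, True): "a",    # correct output, probe also recovers the answer
    (False, True): "b",   # Causal Tongue-Tie: probe recovers it, output is wrong
    (True, False): "c",   # "Causal Parrot": output is right, probe finds no signal
    (False, False): "d",  # wrong output and no recoverable signal
}

def two_by_two(gold, spoken, probed):
    """Count items per cell and report output accuracy, probe accuracy, and their gap."""
    if not gold or not (len(gold) == len(spoken) == len(probed)):
        raise ValueError("need three non-empty, equal-length label lists")
    counts = Counter(CELLS[(s == g, p == g)] for g, s, p in zip(gold, spoken, probed))
    n = len(gold)
    acc_out = (counts["a"] + counts["c"]) / n     # what an output-only benchmark reports
    acc_probe = (counts["a"] + counts["b"]) / n
    return {
        "counts": {k: counts[k] for k in "abcd"},
        "acc_out": acc_out,
        "acc_probe": acc_probe,
        "gap": acc_probe - acc_out,
    }

# Hypothetical toy labels (1 = yes, 0 = no).
gold   = [0, 0, 0, 0, 1, 1, 1, 1]
spoken = [1, 1, 0, 0, 1, 1, 0, 1]
probed = [0, 0, 0, 0, 1, 1, 1, 0]
report = two_by_two(gold, spoken, probed)
print(report)
```

Reporting the four counts alongside the two accuracies keeps expression failures (cell b) visibly separate from encoding failures (cell d), which a single accuracy number would merge.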

Load-bearing premise

The linear probe is recovering genuine causal knowledge encoded in the hidden states rather than some other correlated but non-causal pattern, and the independently identified evidence-supported answer is the correct target.

What would settle it

A new collection of anti-commonsense causal questions on which the same linear probe no longer achieves high accuracy at recovering the evidence answer, or on which probe accuracy falls to match the verbal output accuracy.
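One way to make "probe accuracy falls to match the verbal output accuracy" checkable is a paired bootstrap interval on the gap Δ = Acc_probe − Acc_out over the same items. This is a sketch of a generic resampling procedure, not a test the paper is known to use. The function name, the 95% level, and the toy hit vectors (78 and 40 correct out of 80) are all assumptions for illustration.

```python
import random

def paired_bootstrap_gap(probe_hits, output_hits, n_boot=2000, seed=0):
    """Percentile bootstrap CI for mean(probe_hits - output_hits), resampling items jointly."""
    if len(probe_hits) != len(output_hits) or not probe_hits:
        raise ValueError("need two non-empty, equal-length 0/1 lists")
    rng = random.Random(seed)
    n = len(probe_hits)
    diffs = [p - o for p, o in zip(probe_hits, output_hits)]
    point = sum(diffs) / n
    gaps = []
    for _ in range(n_boot):
        # Resample item indices with replacement; each draw keeps the probe/output pair together.
        gaps.append(sum(diffs[rng.randrange(n)] for _ in range(n)) / n)
    gaps.sort()
    lo = gaps[int(0.025 * n_boot)]
    hi = gaps[int(0.975 * n_boot) - 1]
    return point, lo, hi

# Hypothetical per-item correctness on 80 items (1 = correct).
probe_hits = [1] * 78 + [0] * 2
output_hits = [1] * 40 + [0] * 40
point, lo, hi = paired_bootstrap_gap(probe_hits, output_hits)
print(f"gap = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Under this reading, an interval that includes zero on a fresh item set would count against the claimed gap, while an interval well above zero would be consistent with it.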

Figures

Figures reproduced from arXiv: 2605.25891 by Xiao-Ping Zhang, Ziyi Ding.

**Figure 1.** A 2×2 framework for a wrong Yes/No on a causal benchmark. The two off-diagonal cells, Causal Tongue-Tie (b, this paper) and the “Causal Parrot” (c, Zečević et al. 2023), are what output-only accuracy hides when it collapses (a)+(c) into “correct” and (b)+(d) into “failed”.

**Figure 2.** Cell (b) on one anti-CS item. From the same hidden state, the probe recovers NO (Acc_probe ≈ 0.97) while lm_head produces YES (Acc_out ≈ 0.50); the Δ ≈ +0.5 gap is the empirical handle on Causal Tongue-Tie.

… is an ℓ2-regularised logistic regression g_w(h) = σ(w⊤h + b), fit once on hidden states from cs items only and then frozen for every reported number; the language model itself is never updated. Second, the same …

**Figure 3.** Anti-commonsense CLadder accuracy across Qwen2.5 instruction-tuned model sizes (log-scale x-axis, 0.5B–72B). The hidden-state readout (probe accuracy, best layer, deep teal) is already 0.969 on Qwen2.5-0.5B and reaches 1.000 on Qwen2.5-72B-NF4, while ordinary Yes/No accuracy (output accuracy from lm_head, warm coral) stays in [0.350, 0.525]; the gap Δ ≈ +0.5 does not close with model size. Shaded ribbons are 95…

**Figure 4.** Counterfactual-KL lesion on Qwen-7B layer …

**Figure 5.** Activation patching on the N=42 anti-CS items Qwen2.5-7B-Instruct gets wrong (layer-27 last-token state). Bars left-to-right: no patch (wrong-item baseline, 0 by definition); patch A (full state) (replace the state with a matched-CS donor; 0.571); patch B (mean dir.) (inject only the mean V_cs direction); ctrl-rand (random-hidden donor); ctrl-self (self-injection sanity). Dashed line: 0.5 binary chance. Wh…

**Figure 6.** Answer-interface ladder across Qwen2.5-Instruct sizes (7B / 14B / 32B / 72B). Four of the five answer interfaces are plotted (the fifth, direct effect, is free-form and is reported separately in …

**Figure 7.** Token-level selected-token entropy (× ln 2) for Yes/No vs. A/B-edge interfaces across three instruction-tuned models (N=80 anti-CS items each). Accuracy values are annotated above each bar. A/B-edge entropy is far below Yes/No entropy for every model, while A/B accuracy is near 1.0, a token-level signature of interface-side commitment failure: the model’s internal causal direction signal (probe Acc ≥ 0.9…

**Figure 8.** Surface-form robustness across four forced …

Original abstract

We find a mismatch between what large language models encode about a causal question and what they answer. On anti-commonsense CLadder items, a fixed linear probe recovers the evidence-supported answer from the model's hidden state (accuracy approximately 0.97), while the spoken Yes/No reverts to the commonsense one (accuracy approximately 0.5). We call this approximately +0.5 gap Causal Tongue-Tie: a wrong Yes/No decomposes into two separable failure modes: no internal signal versus a signal the verbal interface cannot say. The implication cuts both ways for output-only causal benchmarks: a benchmark "correct" need not mean the model has understood, and a benchmark "wrong" need not mean it cannot. Sweeping claims about whether LLMs can do causal reasoning, drawn from a single accuracy number, deserve a second look.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags a probe-output gap on anti-commonsense causal items, but the claim that the probe recovers genuine causal direction rests on thin controls.

The main point is that on anti-commonsense CLadder questions a linear probe hits roughly 0.97 accuracy on the evidence-supported answer from hidden states, while the model's spoken yes/no drops to about 0.5 and follows commonsense instead. They name the gap Causal Tongue-Tie and argue it splits into two distinct problems: no internal signal versus a signal the output layer cannot express.

What the work does is separate those two failure modes and show why output accuracy alone can mislead when judging causal reasoning. That distinction is worth keeping in mind for anyone running causal benchmarks.

The soft spot is the probe itself. Nothing in the abstract or stress-test note indicates label-shuffled controls, interventions, or matched non-causal probes that would rule out the probe latching onto token patterns or graph statistics rather than causal direction. Without those checks the 0.97 number does not yet establish that the model encodes the causal fact internally.

The paper is aimed at people who build or critique LLM evaluation suites. Anyone measuring causal ability from yes/no outputs would find the framing useful to think about.

It should go to peer review. The observation is straightforward and the implication for benchmarks is real, but the methods need the usual controls before the central claim can be taken as settled.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that LLMs internally encode the evidence-supported causal direction on anti-commonsense CLadder items (recoverable via a fixed linear probe at ~0.97 accuracy from hidden states) but their Yes/No outputs fail to express it, defaulting to the commonsense answer (~0.5 accuracy). This 'Causal Tongue-Tie' is presented as two separable failure modes, with implications that output-only causal benchmarks are insufficient to assess whether models have understood causal structure.

Significance. If the central empirical separation between probe and output holds after controls, the result would be moderately significant for causal reasoning evaluations in LLMs: it would demonstrate that output accuracy alone cannot distinguish internal encoding failures from verbal-interface failures. The approach of using probes on hidden states to surface evidence-supported answers offers a useful diagnostic lens beyond verbal responses, though its value depends on ruling out non-causal confounds.

major comments (2)

[Abstract] Abstract: the central claim rests on a linear probe achieving ~0.97 accuracy recovering the evidence-supported answer, yet the abstract (and by extension the reported method) provides no details on probe training procedure, feature selection, regularization, statistical significance tests, or cross-validation; without these, it is impossible to assess whether the number supports genuine causal encoding or reflects overfitting or post-hoc selection.
[Abstract] Abstract / implied methods: on anti-commonsense items the probe is said to recover the 'evidence-supported' answer while output reverts to commonsense, but no label-shuffled controls, probes trained on matched non-causal tasks, or causal interventions on the residual stream are described to rule out spurious correlations (e.g., token patterns or graph-description statistics that differ between subsets); this is load-bearing for the 'internal signal vs. verbal interface' decomposition.

minor comments (1)

[Abstract] The abstract uses approximate figures ('approximately 0.97', 'approximately 0.5') without reporting exact values, confidence intervals, or dataset sizes; adding these would improve precision.
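The kind of interval the referee asks for could be computed, for example, with a Wilson score interval on each accuracy. The sketch below assumes N=80 items, the per-model count given in the Figure 7 caption. The success counts (40 and 78) are hypothetical values chosen to land near the reported ~0.5 and ~0.97, not figures from the paper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical counts out of N=80 items.
out_lo, out_hi = wilson_interval(40, 80)       # spoken Yes/No accuracy near 0.5
probe_lo, probe_hi = wilson_interval(78, 80)   # probe accuracy near 0.97
print(f"output: [{out_lo:.3f}, {out_hi:.3f}]  probe: [{probe_lo:.3f}, {probe_hi:.3f}]")
```

The Wilson form is used here rather than the simpler normal approximation because it stays inside [0, 1] when accuracy is close to 1, as the probe accuracy is.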

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, indicating where we will revise the manuscript to incorporate additional details and controls while defending the core empirical separation presented.

Referee: [Abstract] Abstract: the central claim rests on a linear probe achieving ~0.97 accuracy recovering the evidence-supported answer, yet the abstract (and by extension the reported method) provides no details on probe training procedure, feature selection, regularization, statistical significance tests, or cross-validation; without these, it is impossible to assess whether the number supports genuine causal encoding or reflects overfitting or post-hoc selection.

Authors: We agree that the abstract is highly condensed and omits key methodological parameters. The full manuscript describes the probe as a linear logistic regression classifier applied to the final-layer residual stream activations, trained with L2 regularization (strength selected via inner cross-validation) on an 80/20 train/test split per item set, with 5-fold cross-validation used to compute mean accuracy and standard deviation. Feature selection was not applied; the full hidden-state dimension was used. Statistical significance against chance was evaluated with permutation tests (p < 0.001). To address the concern directly, we will revise the abstract to include a concise clause on the probe type, cross-validation, and regularization, and we will add a short methods paragraph summarizing these choices with exact hyperparameter values and significance results.
revision: yes
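A permutation test of the kind this response mentions might look like the sketch below. It holds the probe's predictions fixed, shuffles the gold labels, and counts how often the shuffled accuracy reaches the observed one. The function name, the add-one correction, the number of permutations, and the toy predictions are assumptions; the rebuttal does not specify how its test was set up.

```python
import random

def permutation_p_value(preds, gold, n_perm=2000, seed=0):
    """One-sided p-value for accuracy: how often do shuffled labels match preds at least as well?"""
    if len(preds) != len(gold) or not gold:
        raise ValueError("need two non-empty, equal-length label lists")
    rng = random.Random(seed)
    n = len(gold)
    observed = sum(p == g for p, g in zip(preds, gold)) / n
    shuffled = list(gold)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)   # breaks any link between predictions and labels
        acc = sum(p == g for p, g in zip(preds, shuffled)) / n
        if acc >= observed:
            at_least += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (at_least + 1) / (n_perm + 1)

# Hypothetical toy data: 80 balanced items, predictions wrong on two of them.
gold = [0, 1] * 40
preds = list(gold)
preds[0], preds[1] = 1 - preds[0], 1 - preds[1]
observed, p_value = permutation_p_value(preds, gold)
print(f"accuracy = {observed:.3f}, permutation p = {p_value:.4f}")
```

With the add-one correction the smallest attainable p-value is 1/(n_perm + 1), so a claim of p < 0.001 needs at least 1000 permutations.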
Referee: [Abstract] Abstract / implied methods: on anti-commonsense items the probe is said to recover the 'evidence-supported' answer while output reverts to commonsense, but no label-shuffled controls, probes trained on matched non-causal tasks, or causal interventions on the residual stream are described to rule out spurious correlations (e.g., token patterns or graph-description statistics that differ between subsets); this is load-bearing for the 'internal signal vs. verbal interface' decomposition.

Authors: The anti-commonsense construction itself provides partial protection against simple commonsense or token-frequency confounds, because the probe recovers the evidence-supported direction (opposite to commonsense) at high accuracy while the verbal output does not. Nevertheless, we acknowledge that explicit controls would make the separation more robust. In revision we will add (i) label-shuffled baselines showing probe accuracy collapsing to chance (~0.5) and (ii) a matched non-causal probe (e.g., on syntactic subject-verb agreement) to demonstrate that the high accuracy is not an artifact of any linear probe on these activations. Full causal interventions on the residual stream (e.g., activation patching) lie beyond the current experimental scope and would require substantial additional compute; we will therefore note this as a limitation rather than claim to have performed them.
revision: partial
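The label-shuffled baseline in (i) could be sketched as follows. For brevity the sketch swaps in a difference-of-means linear classifier instead of the logistic-regression probe, and it uses synthetic vectors in place of hidden states. The helper names and all numbers are hypothetical.

```python
import random

def fit_mean_diff_probe(states, labels):
    """Linear probe: w = mean(class 1) - mean(class 0), threshold at the midpoint of the means."""
    dim = len(states[0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for h, y in zip(states, labels):
        counts[y] += 1
        for j in range(dim):
            sums[y][j] += h[j]
    mu0 = [s / counts[0] for s in sums[0]]
    mu1 = [s / counts[1] for s in sums[1]]
    w = [a - c for a, c in zip(mu1, mu0)]
    b = -sum(wi * (a + c) / 2 for wi, a, c in zip(w, mu1, mu0))
    return w, b

def accuracy(w, b, states, labels):
    hits = sum((sum(wi * hi for wi, hi in zip(w, h)) + b > 0) == (y == 1)
               for h, y in zip(states, labels))
    return hits / len(labels)

def toy_split(n, dim, rng):
    """Hypothetical stand-in for hidden states with exactly balanced labels."""
    states, labels = [], []
    for i in range(n):
        y = i % 2
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[0] += 3.0 if y == 1 else -3.0   # label signal lives on axis 0
        states.append(h)
        labels.append(y)
    return states, labels

rng = random.Random(0)
train_x, train_y = toy_split(200, 8, rng)
test_x, test_y = toy_split(80, 8, rng)

w, b = fit_mean_diff_probe(train_x, train_y)
real_acc = accuracy(w, b, test_x, test_y)

# Control: refit on shuffled training labels, score against the true test labels.
shuffled_accs = []
for _ in range(200):
    fake_y = train_y[:]
    rng.shuffle(fake_y)
    w_s, b_s = fit_mean_diff_probe(train_x, fake_y)
    shuffled_accs.append(accuracy(w_s, b_s, test_x, test_y))
control_acc = sum(shuffled_accs) / len(shuffled_accs)
print(f"real-label probe: {real_acc:.3f}  shuffled-label mean: {control_acc:.3f}")
```

In this toy setup a single shuffled run can land far from 0.5 in either direction, because a chance weight on the signal axis separates the two classes well with either sign. That is why the sketch averages over many shuffles instead of reporting one.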

Circularity Check

0 steps flagged

No derivation chain present; empirical comparison is self-contained

The paper reports an empirical mismatch between linear-probe accuracy on hidden states (~0.97) and verbal Yes/No accuracy (~0.5) on anti-commonsense CLadder items. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central claim rests on direct measurement of two observables (probe output vs. generated token) against an externally supplied ground-truth label; nothing reduces to its own inputs by construction. The skeptic's concern about spurious correlations in the probe is a question of experimental validity, not circularity of the reported numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; the central claim rests on an empirical observation whose supporting details are absent.

Reference graph

Works this paper leans on

7 extracted references · 5 canonical work pages · 3 internal anchors

[1] LEACE: Perfect linear concept erasure in closed form. In Advances in Neural Information Processing Systems (NeurIPS), pages 66044–66063. Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accou...

[2] CounterBench: Evaluating and improving counterfactual reasoning in large language models. arXiv preprint arXiv:2502.11008. Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, and Bo Han. 2024. Unveiling causal reasoning in large language models: Reality or mirage? In Advances in Neural Information Processing Systems (NeurIPS...

[3] CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, 47(2):333–386. Preprint arXiv:2005.13407 (2020); journal version 2021. Jörg Frohberg and Frank Binder. 2022. CRASS: A novel data set and benchmark to test counterfactual reasoning of large language models. In Proceedings of the Thirteenth Language Resour...

[4] ArXiv:2301.04709, first posted 2023; final JMLR version 2025. John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743. Albert ...

[5] The geometry of categorical and hierarchical concepts in large language models. In The Thirteenth International Conference on Learning Representations (ICLR). Kiho Park, Yo Joong Choe, and Victor Veitch. 2024. The linear representation hypothesis and the geometry of large language models. In Proceedings of the 41st International Conference on Machine Lear...

[6] Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. 2020. Null it out: Guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 72...

[7] Representation Engineering: A Top-Down Approach to AI Transparency
Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems (NeurIPS). Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. 2023. Causal parrots: Large language models may talk causality but are not causal. Transactions on Machine Learning Research (TMLR), 2023....
