Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation

Rohan N. Pradhan; Steve Goley

arxiv: 2606.05403 · v1 · pith:CN6KMEYVnew · submitted 2026-06-03 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-28 06:58 UTC · model grok-4.3


keywords: LLM source evaluation · fabricated statistics · epistemic blind spots · multi-source synthesis · methodology register · numeric validity · behavioral dissociation


The pith

Language models detect fabricated statistics when checking sources alone but ignore those checks during multi-source synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that models can identify made-up statistics with high accuracy when asked to evaluate individual sources in isolation. Yet when the same models combine evidence from several sources to produce a final numeric estimate, they treat fabricated and valid statistics the same way. The models instead follow a stylistic cue called the methodology register, which tracks how analytical the text sounds rather than whether the numbers are internally consistent. This pattern appears across five models from three families and three professional domains. The result is that source influence depends on surface presentation, not on whether the claims hold up numerically.

Core claim

Models encode and causally use a methodology-register representation that transfers across domains while numeric-validity signals, though decodable in isolation, are suppressed to chance levels during multi-source synthesis; source weighting is therefore governed by distributional features of analytical text rather than by internal consistency of the reported statistics.

What carries the argument

The methodology-register gate, which selects sources according to the distributional register of analytical text and suppresses numeric-validity signals during synthesis.

If this is right

Models will assign equal weight to sources that present as analytically credible even when their numeric claims are internally inconsistent.
Post-training pipelines reinforce reliance on stylistic cues without installing selective numeric verification.
Standard prompting interventions produce blanket skepticism rather than targeted checks on numeric validity.
The failure mode is distinct from sycophancy because it tracks surface credibility of the source rather than user preference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training objectives that reward stylistic fluency may systematically deprioritize internal consistency checks during evidence aggregation.
Systems that act as epistemic proxies for decisions will inherit this blind spot unless numeric verification is explicitly required at synthesis time.
Domain transfer of the methodology-register representation suggests the shortcut is learned early and persists across tasks.

Load-bearing premise

The dissociation between isolated detection and synthesis behavior is a stable property of the models rather than an artifact of the particular prompts, source texts, or metrics used in the tests.

What would settle it

An experiment in which models produce reliably different numeric estimates when one source contains statistically impossible intervals versus when all sources contain valid intervals, under the same synthesis prompt.
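One way to analyze such an experiment is a paired comparison: for each scenario, record the model's numeric estimate once with all sources valid and once with one source carrying an impossible interval, then test whether the paired differences are distinguishable from zero. The sketch below uses a sign-flip permutation test. The function name, the pairing scheme, and the choice of test are illustrative assumptions, not the paper's analysis.

```python
import random
from statistics import mean


def paired_permutation_test(valid, fabricated, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test on paired estimates.

    valid[i] and fabricated[i] are hypothetical model estimates for the
    same scenario under the all-valid and one-fabricated-source prompts.
    Returns (mean difference, p-value).
    """
    if len(valid) != len(fabricated):
        raise ValueError("paired inputs must have equal length")
    diffs = [f - v for v, f in zip(valid, fabricated)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null, the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return mean(diffs), (hits + 1) / (n_perm + 1)
```

A small p-value with a nonzero mean difference would indicate that the fabricated source changes the estimate, which is the pattern the paper reports as absent. A large p-value alone does not establish equivalence; that would need an equivalence bound chosen in advance.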

Figures

Figures reproduced from arXiv: 2606.05403 by Rohan N. Pradhan, Steve Goley.

**Figure 2.** Mean SPI when the focal source is the sole dissenter, pooled across three domains. Valid…

**Figure 3.** Methodology transfers cross-domain, numerics does not. (a, c) Cross-domain probe transfer at layer 8. A probe trained to distinguish inappropriate methodology from valid (orange) transfers at AUC = 0.83/0.92; a probe trained to distinguish impossible from valid statistics (crimson) remains at chance (0.52/0.56). (b, d) Transfer AUC across all 64 layers. Methodology discrimination peaks early then declines;…

**Figure 4.** Causal tracing reveals consensus-gated presentation processing. (a) Restore curves across 64 layers (Qwen 32B, VC domain). Plausible (solid) and specious numerics (dashed) produce near-identical trajectories: the model does not distinguish valid from fabricated statistics. (b) Consensus as gain control (pooled across three domains). Each point is one (presentation level × consensus level) cell; the y-axis…

**Figure 5.** Component attribution confirms the methodology–numerics dissociation. (a) Methodology effect (blue) and correction (orange) collapse with consensus; numerics correction (red) is flat near zero. Error bars: SE across domains. (b) Per-component DLA predicts behavioral shift (r = 0.78, ρ = 0.81; z-scored per domain). (c) MLPs dominate attention heads by 4–8×; neither carries a corrective signal for fabricate…

**Figure 6.** Think vs no-think: SPI by presentation level at…

**Figure 7.** Cross-domain transfer matrix at layer 8 for methodology (top) and numerics (bottom).

**Figure 8.** Per-condition restore curves across three domains (Qwen 32B, 224 conditions per panel).

**Figure 9.** Causal tracing restore curves for OLMo 3.1 32B Think (VC domain, 192 conditions,…

**Figure 10.** Component attribution: VC domain.

**Figure 11.** Component attribution: marketing domain.

**Figure 12.** Component attribution: public health domain.

Original abstract

Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions. Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation, remains poorly understood. We show that models possess the capability to detect fabricated statistics (correct identification rates of 0.76-1.00 for methodology in isolation) but do not recruit this capability during multi-source synthesis, producing similar numeric estimates whether the statistics are fabricated or valid. Specifically, source influence is governed by a methodology-register gate that responds to the distributional register of analytical text but not to numeric validity: for example, statistically impossible confidence intervals receive the same weight as valid ones. The behavioral dissociation replicates across five models from three families (Claude, Qwen, OLMo) and three professional domains. Mechanistic analyses, including causal tracing, linear probes, and component-level attribution, converge on the same account: the model encodes and causally uses a methodology-register representation that transfers across domains (probe AUC 0.83-0.92), while numeric-validity signals, decodable in isolation, are suppressed to chance during multi-source synthesis. Prompting-based mitigations, even an oracle checklist naming the exact statistical checks, produce blanket skepticism rather than selective discernment, and the post-training pipelines we examine reinforce the stylistic shortcut without building numeric verification. Unlike sycophancy, which tracks user preference, this failure tracks whether a source presents as analytically credible, not whether its claims are internally consistent. We term this epistemic alignment: like preference and safety alignment, the question is not capability but deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Models spot fabricated stats in isolation but ignore that during synthesis and go by stylistic register instead.


The core result is that LLMs flag fabricated statistics at high rates when sources are evaluated in isolation (correct identification rates of 0.76-1.00) but produce similar numeric estimates in multi-source tasks whether the numbers are valid or fabricated. Source weighting tracks whether the text reads like proper analytical methodology, not whether the claims are internally consistent.

The work does a few things cleanly. It shows the dissociation across five models in three families and three domains. The mechanistic section uses causal tracing, probes, and attribution to tie the behavior to a transferable methodology-register representation (AUC 0.83-0.92) while numeric-validity signals fall to chance in the synthesis setting. That combination of behavioral and internal evidence is more than most papers on LLM epistemic failures deliver.

The main soft spot is exactly the one in the stress-test note. The account needs the fabricated sources to be constructed so that validity remains both detectable and relevant inside the synthesis prompt; if the errors were inserted as obvious surface anomalies or if the prompts already cue analytical tone, the suppression could be an artifact of those choices rather than a stable model property. The paper would be stronger with explicit checks that the validity signal is task-relevant in the actual multi-source condition. The new term "epistemic alignment" is mostly descriptive and does not add much beyond the empirical pattern.

This is useful for anyone building or auditing LLM systems that synthesize evidence for research or decisions. It deserves a serious referee because the dissociation is concrete and the mechanistic results give something to test, even if the controls on source construction need tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs can detect fabricated statistics in isolation (correct identification rates 0.76-1.00 for methodology) but do not recruit this capability during multi-source synthesis, instead relying on a methodology-register gate that weights sources by the distributional register of analytical text rather than numeric validity. This produces equivalent numeric estimates for fabricated and valid statistics. The dissociation replicates across five models (Claude, Qwen, OLMo families) and three domains; mechanistic evidence from linear probes (AUC 0.83-0.92 for register transfer), causal tracing, and attribution shows numeric-validity signals suppressed to chance during synthesis. Prompting mitigations fail to produce selective discernment, and post-training is argued to reinforce the stylistic shortcut. The work distinguishes this from sycophancy and introduces the term epistemic alignment.

Significance. If the dissociation and mechanistic account hold, the result identifies a load-bearing limitation in how LLMs perform epistemic evaluation of evidence, with direct implications for their deployment as synthesizers in high-stakes domains. The cross-model replication and convergence of behavioral plus mechanistic methods (probes, causal tracing, component attribution) are explicit strengths that increase the result's robustness. The framing as a deployment rather than capability failure, and the contrast with preference alignment, add conceptual value.

major comments (2)

[Methods (source construction)] Methods section on source construction: the central claim that numeric-validity signals are actively suppressed (rather than irrelevant or masked by stimulus design) requires that fabricated sources preserve naturalistic analytical register while altering only internal numeric consistency. The manuscript must explicitly describe the fabrication procedure (e.g., whether impossible confidence intervals were inserted verbatim into otherwise fluent text or whether surface anomalies were introduced) and report any pre-tests confirming that isolated detection relies on validity rather than stylistic cues; without this, the methodology-register gate account risks being an artifact of the particular source-generation pipeline.
[Mechanistic analyses] Mechanistic results paragraph (probe and causal-tracing subsection): the reported suppression of numeric-validity signals to chance during synthesis is load-bearing for the dissociation claim, yet the manuscript provides no statistical comparison (e.g., AUC or accuracy with confidence intervals) against the isolation condition or against a null probe trained on shuffled labels. This omission leaves open whether the suppression is a genuine causal effect or a consequence of probe training distribution or task framing.

minor comments (2)

[Abstract] Abstract: the phrase 'statistically impossible confidence intervals receive the same weight as valid ones' would benefit from a parenthetical example of the exact numeric manipulation used.
[Introduction] Terminology: 'epistemic alignment' and 'methodology-register gate' are introduced as new constructs; a single sentence contrasting them with existing concepts (e.g., sycophancy, factuality) would improve immediate clarity.
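To make the first minor comment concrete, the sketch below shows what "statistically impossible" can mean for a reported confidence interval. The three checks (bounds out of order, point estimate outside its own interval, a proportion interval leaving [0, 1]) are illustrative assumptions; the manuscript as summarized here does not state which manipulation it used, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReportedInterval:
    """A point estimate with its reported confidence bounds (hypothetical schema)."""
    estimate: float
    lower: float
    upper: float
    is_proportion: bool = False


def interval_problems(r: ReportedInterval) -> list[str]:
    """Return the internal-consistency violations found in a reported interval."""
    problems = []
    if r.lower > r.upper:
        problems.append("lower bound exceeds upper bound")
    if not (r.lower <= r.estimate <= r.upper):
        problems.append("point estimate lies outside the interval")
    if r.is_proportion and (r.lower < 0.0 or r.upper > 1.0):
        problems.append("proportion interval leaves [0, 1]")
    return problems
```

For instance, a source reporting a conversion rate of 0.42 with a 95% CI of [0.51, 0.63] fails the second check, while 0.42 with [0.38, 0.46] passes all three. Checks of this kind need no domain knowledge, which is why a failure to apply them during synthesis is notable.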

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify opportunities to strengthen the methodological transparency and statistical rigor of the manuscript. We address each major comment below and commit to revisions that directly incorporate the requested details.


Referee: Methods section on source construction: the central claim that numeric-validity signals are actively suppressed (rather than irrelevant or masked by stimulus design) requires that fabricated sources preserve naturalistic analytical register while altering only internal numeric consistency. The manuscript must explicitly describe the fabrication procedure (e.g., whether impossible confidence intervals were inserted verbatim into otherwise fluent text or whether surface anomalies were introduced) and report any pre-tests confirming that isolated detection relies on validity rather than stylistic cues; without this, the methodology-register gate account risks being an artifact of the particular source-generation pipeline.

Authors: We agree that explicit documentation of the source-construction pipeline is required to rule out stimulus-design artifacts. In the revised manuscript we will add a dedicated subsection detailing that fabricated sources were created by verbatim insertion of statistically impossible confidence intervals into otherwise fluent, register-matched analytical text with no surface-level anomalies or fluency disruptions. We will also report pre-test results (human ratings and model-based register probes on matched controls) confirming that isolated detection performance is driven by internal numeric inconsistency rather than stylistic cues. These additions will be placed in the Methods section and will not alter the reported behavioral or mechanistic findings. Revision: yes.
Referee: Mechanistic results paragraph (probe and causal-tracing subsection): the reported suppression of numeric-validity signals to chance during synthesis is load-bearing for the dissociation claim, yet the manuscript provides no statistical comparison (e.g., AUC or accuracy with confidence intervals) against the isolation condition or against a null probe trained on shuffled labels. This omission leaves open whether the suppression is a genuine causal effect or a consequence of probe training distribution or task framing.

Authors: We accept that direct statistical comparisons are needed to substantiate the suppression claim. The revised manuscript will include bootstrap-derived 95% confidence intervals for probe AUC and accuracy in the synthesis condition versus the isolation condition, as well as versus null probes trained on label-shuffled data. These comparisons will be added to the mechanistic analyses subsection and will quantify that numeric-validity decoding drops to chance levels specifically during synthesis while remaining above chance in isolation. Revision: yes.
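The comparison promised in the second response can be sketched as follows, assuming probe scores and binary validity labels are already available for a held-out set. The code computes a rank-based AUC, a percentile bootstrap interval, and a label-permutation null. Note the simplification: the referee asks for a null probe retrained on shuffled labels, whereas this sketch only re-scores a fixed probe against shuffled labels, which is a cheaper stand-in. All function names are hypothetical.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: P(positive score > negative score), ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC over resampled examples."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost a class
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def shuffled_null_aucs(scores, labels, n_perm=2000, seed=0):
    """AUCs of the same scores against randomly permuted labels."""
    rng = random.Random(seed)
    ys = list(labels)
    out = []
    for _ in range(n_perm):
        rng.shuffle(ys)
        out.append(auc(scores, ys))
    return out
```

Under the paper's account, the isolation-condition interval should sit clearly above 0.5 while the synthesis-condition interval should overlap the bulk of the shuffled-label distribution. Running both conditions through the same functions keeps the comparison like for like.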

Circularity Check

0 steps flagged

No circularity: empirical behavioral and mechanistic measurements


The paper presents an entirely empirical investigation relying on controlled experiments, behavioral measurements (identification rates, numeric estimates), and mechanistic tools (linear probes with reported AUCs, causal tracing, component attribution) across five models and three domains. No derivation chain, equations, or self-referential definitions exist that would reduce a claimed result to its own inputs by construction. Claims about dissociation between isolated detection and multi-source synthesis are directly measured rather than derived from fitted parameters or prior self-citations. The methodology-register gate account is presented as an interpretation of observed data patterns, not as a load-bearing theorem imported via self-citation. This is the standard case of a self-contained empirical study with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard assumptions of behavioral LLM evaluation and introduces descriptive constructs without additional fitted parameters.

axioms (1)

Domain assumption: Models respond consistently to the experimental prompts used for isolation and synthesis tasks.
Required for interpreting detection rates and synthesis behavior as stable model properties.

invented entities (2)

methodology-register gate (no independent evidence)
purpose: Descriptive mechanism explaining source weighting by stylistic register rather than validity.
Introduced to account for the observed dissociation; no independent falsifiable prediction provided.
epistemic alignment (no independent evidence)
purpose: Term for the training-induced preference for stylistic credibility over verification.
Coined to frame the failure mode analogously to preference and safety alignment.



