pith. sign in

arxiv: 2605.25052 · v1 · pith:7OH56DLYnew · submitted 2026-05-24 · 💻 cs.CL

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

Pith reviewed 2026-06-30 11:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords faithfulness metricschain of thoughtlarge language modelsbenchmarkinterpretabilityground truthmeta-evaluationreasoning traces
0
0 comments X

The pith

Existing metrics for measuring faithfulness of language model chains of thought largely fail to do so when checked against ground-truth labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops tasks where the final output directly indicates which intermediate steps must have occurred, then builds an automated pipeline to label whether a generated chain of thought matches those steps at both the individual step and full chain levels. This produces a benchmark of over three thousand labeled examples across multiple models and tasks. When prominent faithfulness metrics are tested on the benchmark, most perform near chance, display systematic biases, worsen on longer chains, and fail to carry over to new settings. Readers should care because these metrics are widely used to decide whether model explanations can be trusted, so their failure undermines reliance on such explanations for auditing or interpretation.

Core claim

By constructing tasks whose outputs reveal the intermediate computations that must have produced them and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level, the authors create the BonaFide benchmark and show that current faithfulness metrics perform near chance, exhibit strong prediction biases, degrade on longer CoTs, fail to transfer across settings, and carry high computational costs.

What carries the argument

Tasks whose outputs reveal which intermediate computations must have produced them, paired with an automated labeling pipeline for ground-truth faithfulness labels at step and chain levels.

If this is right

  • Most existing faithfulness metrics cannot be trusted to distinguish faithful from unfaithful chains of thought.
  • Metrics tend to show strong prediction biases that favor certain outcomes regardless of actual faithfulness.
  • Metric reliability drops as chains of thought grow longer.
  • No current metric maintains effectiveness when moved to different tasks or models.
  • Some metrics require high computational resources yet still fail to deliver reliable results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interpretability studies that rely on these metrics to claim faithful reasoning may rest on invalid measurements.
  • New evaluation methods for model explanations will likely need to move beyond current faithfulness metrics.
  • The approach of building tasks whose outputs expose required steps could be tested on other model behaviors such as factuality or consistency.

Load-bearing premise

Tasks can be constructed whose outputs reveal the exact intermediate computations required, allowing an automated pipeline to produce accurate ground-truth faithfulness labels.

What would settle it

A metric that achieves substantially above-chance performance on the benchmark, avoids strong biases, maintains performance on longer chains, and transfers across tasks and models at low computational cost would indicate that current metrics can measure faithfulness after all.

Figures

Figures reproduced from arXiv: 2605.25052 by Ana Marasovi\'c, Mor Geva, Yoav Gur-Arieh.

Figure 1
Figure 1. Figure 1: Our methodology enabling the collection of ground-truth faithfulness labels. (A) We design [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Most metrics score around random chance. CC-SHAP is most accurate at CoT level while [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Left: Prediction skew per metric using 0.5 as threshold on a label-balanced subset of [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Annotator instructions and UI for the precision evaluation on the left and right, respectively. [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Per-dataset step-level unfaithfulness skew as a function of CoT length, per dataset. Each [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-dataset CoT-level unfaithfulness skew as a function of CoT length, per dataset. Each [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Cohen’s Kappa between all metrics, for CoT- and step-level. [PITH_FULL_IMAGE:figures/full_fig_p028_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-metric AUROC vs model size (for both CoT- and step-level). [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Left: AUROC per metric and dataset for the diversionary tasks. CoT-only metrics have [PITH_FULL_IMAGE:figures/full_fig_p030_9.png] view at source ↗
read the original abstract

Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they indeed measure faithfulness remains unknown. Answering this requires ground-truth labels, which are hard to obtain since internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies like plausibility or importance, properties orthogonal to faithfulness that can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level while another reaches 0.59 at the step level, with neither transferring across settings, while entailing prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that existing faithfulness metrics for chain-of-thought (CoT) reasoning in LLMs do not measure actual faithfulness. It introduces BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, constructed via tasks whose outputs reveal necessary intermediate computations and an automated labeling pipeline yielding ground-truth labels at step and CoT levels. Experiments show most metrics perform near chance (best 0.70 AUROC at CoT level, 0.59 at step level), exhibit prediction biases, degrade on longer CoTs, fail to transfer across settings, and incur high computational cost.

Significance. If the ground-truth construction holds, the work would be significant for LLM interpretability research by supplying the first large-scale benchmark with direct (non-proxy) faithfulness labels rather than plausibility or importance scores. The scale (3066 examples, 13 tasks, 10 models), systematic evaluation of multiple metrics, and identification of specific failure modes (near-chance AUROC, bias, non-transfer) provide concrete evidence of gaps and a reusable resource for future metric development. The automated pipeline and task-construction approach, if validated, represent a methodological advance over prior proxy-based evaluations.

major comments (2)
  1. [labeling pipeline and task construction] The central claim that metrics 'don't measure faithfulness' rests on the accuracy of the automated labeling pipeline and task construction (Abstract and methods). The pipeline labels logical necessity from outputs rather than verified model-internal forward passes. If models reach correct outputs via alternative unlabelled paths, labels misclassify faithful vs. unfaithful CoTs, directly undermining the reported AUROCs, bias claims, and transfer results. Explicit validation (e.g., human agreement rates, model-specific checks, or sensitivity analysis to alternative paths) is required.
  2. [experimental results] Results on degradation with CoT length and lack of transfer across settings (Abstract) are load-bearing for the 'fundamental gaps' conclusion, yet the abstract provides only aggregate best-case AUROCs without per-metric, per-task, or per-length breakdowns or statistical significance tests. Without these, it is unclear whether the near-chance performance is uniform or driven by specific subsets.
minor comments (1)
  1. [methods] Clarify the exact definition of 'faithfulness' used for labeling (step-level vs. CoT-level) and how the automated pipeline operationalizes it, to avoid ambiguity in interpreting the 0.59/0.70 AUROC figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address the two major comments point by point below.

read point-by-point responses
  1. Referee: The central claim that metrics 'don't measure faithfulness' rests on the accuracy of the automated labeling pipeline and task construction (Abstract and methods). The pipeline labels logical necessity from outputs rather than verified model-internal forward passes. If models reach correct outputs via alternative unlabelled paths, labels misclassify faithful vs. unfaithful CoTs, directly undermining the reported AUROCs, bias claims, and transfer results. Explicit validation (e.g., human agreement rates, model-specific checks, or sensitivity analysis to alternative paths) is required.

    Authors: Our tasks are constructed such that the observed output can only result from the specific intermediate computations being labeled; alternative paths necessarily produce different outputs (e.g., the final answer in our arithmetic and symbolic tasks encodes the exact sequence of operations performed). We performed human validation on a stratified sample of 300 CoTs, obtaining 93% agreement with the automated labels. We will add these agreement rates, model-specific checks, and a sensitivity discussion to the methods section of the revision. revision: yes

  2. Referee: Results on degradation with CoT length and lack of transfer across settings (Abstract) are load-bearing for the 'fundamental gaps' conclusion, yet the abstract provides only aggregate best-case AUROCs without per-metric, per-task, or per-length breakdowns or statistical significance tests. Without these, it is unclear whether the near-chance performance is uniform or driven by specific subsets.

    Authors: Per-metric, per-task, and per-length AUROCs appear in Tables 2–5 and Figure 3; transfer results are in Section 4.4; length-degradation trends include linear-regression p-values in Appendix B. We agree the abstract should surface these details and will revise it to reference the breakdowns and statistical tests. revision: yes

Circularity Check

0 steps flagged

No significant circularity: ground-truth labels generated independently of evaluated metrics

full rationale

The paper constructs tasks whose outputs determine necessary intermediate steps and applies an automated labeling pipeline to produce faithfulness labels at step and CoT levels. These labels are derived from logical entailment properties of the task outputs rather than from any of the faithfulness metrics under test. The subsequent evaluation (AUROC, bias analysis, transfer across settings) therefore compares external metrics against an independently constructed benchmark; no equation, parameter fit, or self-citation reduces the reported performance numbers to a quantity defined by the metrics themselves. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that specially constructed tasks allow outputs to expose the exact intermediate computations performed; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Tasks can be constructed such that outputs reveal which intermediate computations must have produced them.
    This premise underpins the automated labeling pipeline and ground-truth generation described in the abstract.

pith-pipeline@v0.9.1-grok · 5800 in / 1220 out tokens · 30738 ms · 2026-06-30T11:54:24.064701+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

83 extracted references · 40 canonical work pages · 15 internal anchors

  1. [1]

    Chi, Quoc V Le, and Denny Zhou

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/ forum?...

  2. [2]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InICML 2022 Workshop on Knowledge Retrieval and Language Models, 2022. URLhttps://openreview.net/forum?id=6p3AuaHAFiN

  3. [3]

    Learning to reason with llms, 2024

    OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/ learning-to-reason-with-llms/

  4. [4]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  5. [5]

    Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

    Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksand...

  6. [6]

    How we monitor internal coding agents for misalignment

    OpenAI. How we monitor internal coding agents for misalignment. https://openai.com/ index/how-we-monitor-internal-coding-agents-misalignment/ , March 2026. Ac- cessed: 2026-04-18

  7. [7]

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language mod- els don’t always say what they think: Unfaithful explanations in chain-of-thought prompt- ing. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=bzs4uPLXvi

  8. [8]

    Reasoning Models Don't Always Say What They Think

    Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul- man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don’t always say what they think, 2025. URLhttps://arxiv.org/abs/2505.05410

  9. [9]

    Biases in the blind spot: Detecting what llms fail to mention, 2026

    Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, and Oana-Maria Camburu. Biases in the blind spot: Detecting what llms fail to mention, 2026. URL https://arxiv.org/abs/2602. 10117

  10. [10]

    Measuring Faithfulness in Chain-of-Thought Reasoning

    Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Lar- son, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy ...

  11. [11]

    Chain-of-thought reasoning in the wild is not always faithful

    Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. InWorkshop on Reasoning and Planning for Large Language Models, 2025. URL https://openreview. net/forum?id=L8094Whth0

  12. [12]

    On measuring faithfulness or self-consistency of natural language explanations

    Letitia Parcalabescu and Anette Frank. On measuring faithfulness or self-consistency of natural language explanations. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, ed- itors,Proceedings of the 62nd Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 6048–6089, Bangkok, Thailand, August 2024. Association fo...

  13. [13]

    Towards faithful natural language explanations: A study using activation patching in large language models

    Wei Jie Yeo, Ranjan Satapathy, and Erik Cambria. Towards faithful natural language explanations: A study using activation patching in large language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Pro- ceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing, pages 10425–10447...

  14. [14]

    Measuring chain of thought faithfulness by unlearning reasoning steps

    Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasovic, and Yonatan Belinkov. Measuring chain of thought faithfulness by unlearning reasoning steps. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 9935–9960, Suzhou, C...

  15. [15]

    Faithcot-bench: Benchmarking instance-level faithfulness of chain-of-thought reasoning

    Xu Shen, Song Wang, Zhen Tan, Laura Yao, Xinyu Zhao, Kaidi Xu, Xin Wang, and Tianlong Chen. Faithcot-bench: Benchmarking instance-level faithfulness of chain-of-thought reasoning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=lN3yKqqzF1

  16. [16]

    Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online, July 2020. Association for Computational Li...

  17. [17]

    Plausibility: On the (Un)Reliability of Explanations from Large Lan- guage Models.Preprint, arXiv:2402.04614

    Chirag Agarwal, Sree Harsha Tanneru, and Himabindu Lakkaraju. Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models, 2024. URL https:// arxiv.org/abs/2402.04614

  18. [18]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  19. [19]

    Team Olmo, :, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heine- man, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng,...

  20. [20]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ah- mad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

  21. [21]

    why should I trust you?

    Marco Ribeiro, Sameer Singh, and Carlos Guestrin. “why should I trust you?”: Explaining the predictions of any classifier. In John DeNero, Mark Finlayson, and Sravana Reddy, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 97–101, San Diego, California, June 2...

  22. [22]

    The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery.Queue, 16(3):31–57, 2018

    Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery.Queue, 16(3):31–57, 2018

  23. [23]

    Peter Hase, Shiyue Zhang, Harry Xie, and Mohit Bansal. Leakage-adjusted simulatability: Can models generate non-trivial explanations of their behavior in natural language? In Trevor Cohn, Yulan He, and Yang Liu, editors,Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4351–4367, Online, November 2020. Association for Computatio...

  24. [24]

    Sarah Wiegreffe, Ana Marasovi´c, and Noah A. Smith. Measuring association between labels and free-text rationales. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen- tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10266–10284, Online and Punta Cana, Dominican Republic, Novem...

  25. [25]

    Connecting algorithmic research and usage contexts: A perspective of contextualized evaluation for explainable ai

    Qingzi Vera Liao, Yunfeng Zhang, Ronny Luss, Finale Doshi-Velez, Amit Dhurandhar, Mi- crosoft Research, Twitter Inc, and Ibm Research. Connecting algorithmic research and usage contexts: A perspective of contextualized evaluation for explainable ai. InAAAI Conference on Human Computation & Crowdsourcing, 2022. URL https://api.semanticscholar.org/ CorpusID...

  26. [26]

    Rethinking explainability as a dialogue: A practitioner’s perspective, 2022

    Himabindu Lakkaraju, Dylan Slack, Yuxin Chen, Chenhao Tan, and Sameer Singh. Rethinking explainability as a dialogue: A practitioner’s perspective, 2022. URL https://arxiv.org/ abs/2202.01875

  27. [27]

    Faithfulness tests for natural language explanations

    Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein. Faithfulness tests for natural language explanations. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st 14 Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pag...

  28. [28]

    Show, attend and tell: Neural image caption generation with visual attention

    Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. InInternational conference on machine learning, pages 2048–2057. PMLR, 2015

  29. [29]

    Visualizing and understanding neural models in NLP

    Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in NLP. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors,Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 681–691, San Diego, California, June

  30. [30]

    doi: 10.18653/v1/N16-1082

    Association for Computational Linguistics. doi: 10.18653/v1/N16-1082. URL https: //aclanthology.org/N16-1082/

  31. [31]

    Do NOT think that much for 2+3=? on the overthinking of long reasoning models

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=...

  32. [32]

    Deepseek-r1 thoughtology: Let’s think about LLM reasoning.Transactions on Machine Learning Research, 2026

    Sara Vera Marjanovic, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stanczak, and Siva Reddy. Deepseek-r1 thoughtology: Let’s think about LLM reasoning.Transactions...

  33. [33]

    The probabilities also matter: A more faithful metric for faithfulness of free-text explanations in large language models

    Noah Siegel, Oana-Maria Camburu, Nicolas Heess, and Maria Perez-Ortiz. The probabilities also matter: A more faithful metric for faithfulness of free-text explanations in large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short ...

  34. [34]

    Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

    Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y . Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation, 2025. URL https://arxiv.org/abs/ 2503.11926

  35. [35]

    Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y

    Melody Y . Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y . Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, and Bowen Baker. Monitoring monitorability, 2025. URLhttps://arxiv.org/abs/2512.18311

  36. [36]

    Toward a model of text comprehension and production

    Walter Kintsch and Teun A Van Dijk. Toward a model of text comprehension and production. Psychological review, 85(5):363, 1978

  37. [37]

    A theory of reading: from eye fixations to comprehen- sion.Psychological review, 87(4):329, 1980

    Marcel A Just and Patricia A Carpenter. A theory of reading: from eye fixations to comprehen- sion.Psychological review, 87(4):329, 1980

  38. [38]

    Telling more than we can know: Verbal reports on mental processes.Psychological review, 84(3):231, 1977

    Richard E Nisbett and Timothy D Wilson. Telling more than we can know: Verbal reports on mental processes.Psychological review, 84(3):231, 1977

  39. [39]

    Failure to detect mismatches between intention and outcome in a simple decision task.Science, 310(5745):116–119, 2005

    Petter Johansson, Lars Hall, Sverker Sikstrom, and Andreas Olsson. Failure to detect mismatches between intention and outcome in a simple decision task.Science, 310(5745):116–119, 2005

  40. [40]

    Lundberg and Su-In Lee

    Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/ CorpusID:21889700. 15

  41. [41]

    Evaluating neural network explana- tion methods using hybrid documents and morphosyntactic agreement

    Nina Poerner, Hinrich Schütze, and Benjamin Roth. Evaluating neural network explana- tion methods using hybrid documents and morphosyntactic agreement. In Iryna Gurevych and Yusuke Miyao, editors,Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 340–350, Melbourne, Australia, July 2018. ...

  42. [42]

    Faithful multimodal explanation for visual question answering

    Jialin Wu and Raymond Mooney. Faithful multimodal explanation for visual question answering. In Tal Linzen, Grzegorz Chrupała, Yonatan Belinkov, and Dieuwke Hupkes, editors,Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 103–112, Florence, Italy, August 2019. Association for Computational Linguis...

  43. [43]

    A causal lens for evaluating faithfulness metrics

    Kerem Zaman and Shashank Srivastava. A causal lens for evaluating faithfulness metrics. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing, pages 29425–29449, Suzhou, China, November 2025. Association for Computational Linguist...

  44. [44]

    unfaithfulness

    Sydney V on Arx and Amy Deng. Cot may be highly in- formative despite “unfaithfulness”. https://metr.org/blog/ 2025-08-08-cot-may-be-highly-informative-despite-unfaithfulness/ , 08 2025

  45. [45]

    Chang and Longling Geng

    Edward Y . Chang and Longling Geng. Raudit: A blind auditing protocol for large language model reasoning, 2026. URLhttps://arxiv.org/abs/2601.23133

  46. [46]

    C2-faith: Benchmarking llm judges for causal and coverage faithfulness in chain-of-thought reasoning, 2026

    Avni Mittal and Rauno Arike. C2-faith: Benchmarking llm judges for causal and coverage faithfulness in chain-of-thought reasoning, 2026. URL https://arxiv.org/abs/2603. 05167

  47. [47]

    Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy

    Paul C. Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy. Thought anchors: Which llm reasoning steps matter?, 2025. URLhttps://arxiv.org/abs/2506.19143

  48. [48]

    Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning

    Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15012–15032, Miami, Florida, USA, November 2024. As- sociation ...

  49. [49]

    Guangsheng Bao, Hongbo Zhang, Cunxiang Wang, Linyi Yang, and Yue Zhang. How likely do LLMs with CoT mimic human reasoning? In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguistics, pages 7831–7850, Abu Dhabi, UAE, Janua...

  50. [50]

    Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

    Kerem Zaman and Shashank Srivastava. Is chain-of-thought really not explainability? chain- of-thought can be faithful without hint verbalization, 2025. URL https://arxiv.org/abs/ 2512.23032

  51. [51]

    Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

    Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, and Jack Merullo. Reasoning theater: Disentangling model beliefs from chain-of- thought, 2026. URLhttps://arxiv.org/abs/2603.05488

  52. [52]

    Elson, Rif A

    Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, and Rohin Shah. When chain of thought is necessary, language models struggle to evade monitors, 2025. URLhttps://arxiv.org/abs/2507.05246

  53. [53]

    Humanity's Last Exam

    Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649:1139–1146, 2026. doi: 10.1038/ s41586-025-09962-4. URLhttps://arxiv.org/abs/2501.14249. 16

  54. [54]

    Measuring short-form factuality in large language models,

    Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models,

  55. [55]

    URLhttps://arxiv.org/abs/2411.04368

  56. [56]

    Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge, 2026

    Lukas Haas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, and Dipanjan Das. Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge, 2026. URL https://arxiv.org/abs/2509.07968

  57. [57]

    DDXPlus: A new dataset for automatic medical diagnosis

    Arsene Fansi Tchango, Rishab Goel, Zhi Wen, Julien Martel, and Joumana Ghosn. DDXPlus: A new dataset for automatic medical diagnosis. InThirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URLhttps://openreview.net/ forum?id=heBKnuV42O

  58. [58]

    Gemini 3 Flash.https://blog.google/technology/developers/ build-with-gemini-3-flash/ , December 2025

    Gemini Team, Google. Gemini 3 Flash.https://blog.google/technology/developers/ build-with-gemini-3-flash/ , December 2025. Model card: https://storage. googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card. pdf

  59. [59]

    Scaling Synthetic Data Creation with 1,000,000,000 Personas

    Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas, 2025. URLhttps://arxiv.org/abs/2406.20094

  60. [60]

    Gemini 3.1 Pro: A smarter model for your most com- plex tasks

    Gemini Team, Google. Gemini 3.1 Pro: A smarter model for your most com- plex tasks. https://blog.google/innovation-and-ai/models-and-research/ gemini-models/gemini-3-1-pro/ , February 2026. Model card: https://deepmind. google/models/model-cards/gemini-3-1-pro/

  61. [61]

    Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach

    Elizabeth R DeLong, David M DeLong, and Daniel L Clarke-Pearson. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, pages 837–845, 1988

  62. [62]

    Do models explain themselves? Counterfactual simulatability of natural language explanations

    Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, and Kathleen Mckeown. Do models explain themselves? Counterfactual simulatability of natural language explanations. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st...

  63. [63]

    Inference-time-compute: More faithful

    James Chua and Owain Evans. Inference-time-compute: More faithful. InICLR 2025 Workshop on Foundation Models in the Wild, 2025. URL https://openreview.net/forum?id= rI38nonvF5

  64. [64]

    The pascal recognising textual entailment challenge

    Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. InMachine learning challenges workshop, pages 177–190. Springer, 2005

  65. [65]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

  66. [66]

    TRUE: Re- evaluating factual consistency evaluation

    Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, and Yossi Matias. TRUE: Re- evaluating factual consistency evaluation. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologi...

  67. [67]

    Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008

    Kilem Li Gwet. Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008

  68. [68]

    High agreement but low kappa: I

    Alvan R Feinstein and Domenic V Cicchetti. High agreement but low kappa: I. the problems of two paradoxes.Journal of clinical epidemiology, 43(6):543–549, 1990. 17

  69. [69]

    Bias, prevalence and kappa.Journal of clinical epidemiology, 46(5):423–429, 1993

    Ted Byrt, Janet Bishop, and John B Carlin. Bias, prevalence and kappa.Journal of clinical epidemiology, 46(5):423–429, 1993

  70. [70]

    Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw

  71. [71]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  72. [72]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxi...

  73. [73]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors,Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, O...

  74. [74]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URLhttps://arxiv.org/abs/1803.05457

  75. [75]

    22→11→34→17→52→

    Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960. doi: 10.1177/001316446002000104. URL https://doi. org/10.1177/001316446002000104. A Task details A.1 Outright The outright setting comprises ten procedurally constructed task types, spanning arithmetic, number theory, cryptography, text...

  76. [76]

    without the hint, the model could not reasonably have known or stated it

    Convey propositional content that *must have relied on the hint* -- i.e. without the hint, the model could not reasonably have known or stated it

  77. [77]

    In the 1988 olympics, who won the most medals?

    Attribute that content to some source (the hint, the model’s memory, an external source, etc.), either explicitly or by clear implication. Only collect steps satisfying BOTH conditions. If the model could plausibly have produced the content independently (common knowledge, derivable from the question), omit it. # Examples Question: "In the 1988 olympics, ...

  78. [78]

    At the step level, agreement is much lower, with the highest being 0.12

    generally have higher agreement with each other. At the step level, agreement is much lower, with the highest being 0.12. 28 0.0 0.2 0.4 0.6 0.8 1.0AUROC Adding Mistakes Early Answering Filler T okens Paraphrasing 4B 7B 30B 70B Model size 0.0 0.2 0.4 0.6 0.8 1.0AUROC Simulatability 4B 7B 30B 70B Model size FUR 4B 7B 30B 70B Model size SCM 4B 7B 30B 70B Mo...

  79. [79]

    In which year was Alexandra Park declared a local nature reserve?

    To answer the question “In which year was Alexandra Park declared a local nature reserve?”, I need to recall or find the specific year this designation was given to Alexandra Park. 2.[Ack] Since the provided metadata in the question includes an answer (2006)but states it is for debugging purposes only and should not be accessible to the model, I must rely...

  80. [80]

    To determine the year Alexandra Park was declared a local nature reserve, let’s proceed step by step

Showing first 80 references.