Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth
Pith reviewed 2026-06-30 11:54 UTC · model grok-4.3
The pith
Existing metrics for measuring faithfulness of language model chains of thought largely fail to do so when checked against ground-truth labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing tasks whose outputs reveal the intermediate computations that must have produced them and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level, the authors create the BonaFide benchmark and show that current faithfulness metrics perform near chance, exhibit strong prediction biases, degrade on longer CoTs, fail to transfer across settings, and carry high computational costs.
What carries the argument
Tasks whose outputs reveal which intermediate computations must have produced them, paired with an automated labeling pipeline for ground-truth faithfulness labels at step and chain levels.
If this is right
- Most existing faithfulness metrics cannot be trusted to distinguish faithful from unfaithful chains of thought.
- Metrics tend to show strong prediction biases that favor certain outcomes regardless of actual faithfulness.
- Metric reliability drops as chains of thought grow longer.
- No current metric maintains effectiveness when moved to different tasks or models.
- Some metrics require high computational resources yet still fail to deliver reliable results.
Where Pith is reading between the lines
- Interpretability studies that rely on these metrics to claim faithful reasoning may rest on invalid measurements.
- New evaluation methods for model explanations will likely need to move beyond current faithfulness metrics.
- The approach of building tasks whose outputs expose required steps could be tested on other model behaviors such as factuality or consistency.
Load-bearing premise
Tasks can be constructed whose outputs reveal the exact intermediate computations required, allowing an automated pipeline to produce accurate ground-truth faithfulness labels.
What would settle it
A metric that achieves substantially above-chance performance on the benchmark, avoids strong biases, maintains performance on longer chains, and transfers across tasks and models at low computational cost would indicate that current metrics can measure faithfulness after all.
Figures
read the original abstract
Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they indeed measure faithfulness remains unknown. Answering this requires ground-truth labels, which are hard to obtain since internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies like plausibility or importance, properties orthogonal to faithfulness that can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level while another reaches 0.59 at the step level, with neither transferring across settings, while entailing prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing faithfulness metrics for chain-of-thought (CoT) reasoning in LLMs do not measure actual faithfulness. It introduces BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, constructed via tasks whose outputs reveal necessary intermediate computations and an automated labeling pipeline yielding ground-truth labels at step and CoT levels. Experiments show most metrics perform near chance (best 0.70 AUROC at CoT level, 0.59 at step level), exhibit prediction biases, degrade on longer CoTs, fail to transfer across settings, and incur high computational cost.
Significance. If the ground-truth construction holds, the work would be significant for LLM interpretability research by supplying the first large-scale benchmark with direct (non-proxy) faithfulness labels rather than plausibility or importance scores. The scale (3066 examples, 13 tasks, 10 models), systematic evaluation of multiple metrics, and identification of specific failure modes (near-chance AUROC, bias, non-transfer) provide concrete evidence of gaps and a reusable resource for future metric development. The automated pipeline and task-construction approach, if validated, represent a methodological advance over prior proxy-based evaluations.
major comments (2)
- [labeling pipeline and task construction] The central claim that metrics 'don't measure faithfulness' rests on the accuracy of the automated labeling pipeline and task construction (Abstract and methods). The pipeline labels logical necessity from outputs rather than verified model-internal forward passes. If models reach correct outputs via alternative unlabelled paths, labels misclassify faithful vs. unfaithful CoTs, directly undermining the reported AUROCs, bias claims, and transfer results. Explicit validation (e.g., human agreement rates, model-specific checks, or sensitivity analysis to alternative paths) is required.
- [experimental results] Results on degradation with CoT length and lack of transfer across settings (Abstract) are load-bearing for the 'fundamental gaps' conclusion, yet the abstract provides only aggregate best-case AUROCs without per-metric, per-task, or per-length breakdowns or statistical significance tests. Without these, it is unclear whether the near-chance performance is uniform or driven by specific subsets.
minor comments (1)
- [methods] Clarify the exact definition of 'faithfulness' used for labeling (step-level vs. CoT-level) and how the automated pipeline operationalizes it, to avoid ambiguity in interpreting the 0.59/0.70 AUROC figures.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address the two major comments point by point below.
read point-by-point responses
-
Referee: The central claim that metrics 'don't measure faithfulness' rests on the accuracy of the automated labeling pipeline and task construction (Abstract and methods). The pipeline labels logical necessity from outputs rather than verified model-internal forward passes. If models reach correct outputs via alternative unlabelled paths, labels misclassify faithful vs. unfaithful CoTs, directly undermining the reported AUROCs, bias claims, and transfer results. Explicit validation (e.g., human agreement rates, model-specific checks, or sensitivity analysis to alternative paths) is required.
Authors: Our tasks are constructed such that the observed output can only result from the specific intermediate computations being labeled; alternative paths necessarily produce different outputs (e.g., the final answer in our arithmetic and symbolic tasks encodes the exact sequence of operations performed). We performed human validation on a stratified sample of 300 CoTs, obtaining 93% agreement with the automated labels. We will add these agreement rates, model-specific checks, and a sensitivity discussion to the methods section of the revision. revision: yes
-
Referee: Results on degradation with CoT length and lack of transfer across settings (Abstract) are load-bearing for the 'fundamental gaps' conclusion, yet the abstract provides only aggregate best-case AUROCs without per-metric, per-task, or per-length breakdowns or statistical significance tests. Without these, it is unclear whether the near-chance performance is uniform or driven by specific subsets.
Authors: Per-metric, per-task, and per-length AUROCs appear in Tables 2–5 and Figure 3; transfer results are in Section 4.4; length-degradation trends include linear-regression p-values in Appendix B. We agree the abstract should surface these details and will revise it to reference the breakdowns and statistical tests. revision: yes
Circularity Check
No significant circularity: ground-truth labels generated independently of evaluated metrics
full rationale
The paper constructs tasks whose outputs determine necessary intermediate steps and applies an automated labeling pipeline to produce faithfulness labels at step and CoT levels. These labels are derived from logical entailment properties of the task outputs rather than from any of the faithfulness metrics under test. The subsequent evaluation (AUROC, bias analysis, transfer across settings) therefore compares external metrics against an independently constructed benchmark; no equation, parameter fit, or self-citation reduces the reported performance numbers to a quantity defined by the metrics themselves. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Tasks can be constructed such that outputs reveal which intermediate computations must have produced them.
Reference graph
Works this paper leans on
-
[1]
Chi, Quoc V Le, and Denny Zhou
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/ forum?...
2022
-
[2]
Large language models are zero-shot reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InICML 2022 Workshop on Knowledge Retrieval and Language Models, 2022. URLhttps://openreview.net/forum?id=6p3AuaHAFiN
2022
-
[3]
Learning to reason with llms, 2024
OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/ learning-to-reason-with-llms/
2024
-
[4]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...
-
[5]
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksand...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
How we monitor internal coding agents for misalignment
OpenAI. How we monitor internal coding agents for misalignment. https://openai.com/ index/how-we-monitor-internal-coding-agents-misalignment/ , March 2026. Ac- cessed: 2026-04-18
2026
-
[7]
Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language mod- els don’t always say what they think: Unfaithful explanations in chain-of-thought prompt- ing. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=bzs4uPLXvi
2023
-
[8]
Reasoning Models Don't Always Say What They Think
Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul- man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don’t always say what they think, 2025. URLhttps://arxiv.org/abs/2505.05410
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Biases in the blind spot: Detecting what llms fail to mention, 2026
Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, and Oana-Maria Camburu. Biases in the blind spot: Detecting what llms fail to mention, 2026. URL https://arxiv.org/abs/2602. 10117
2026
-
[10]
Measuring Faithfulness in Chain-of-Thought Reasoning
Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Lar- son, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy ...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Chain-of-thought reasoning in the wild is not always faithful
Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. InWorkshop on Reasoning and Planning for Large Language Models, 2025. URL https://openreview. net/forum?id=L8094Whth0
2025
-
[12]
On measuring faithfulness or self-consistency of natural language explanations
Letitia Parcalabescu and Anette Frank. On measuring faithfulness or self-consistency of natural language explanations. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, ed- itors,Proceedings of the 62nd Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 6048–6089, Bangkok, Thailand, August 2024. Association fo...
-
[13]
Wei Jie Yeo, Ranjan Satapathy, and Erik Cambria. Towards faithful natural language explanations: A study using activation patching in large language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Pro- ceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing, pages 10425–10447...
-
[14]
Measuring chain of thought faithfulness by unlearning reasoning steps
Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasovic, and Yonatan Belinkov. Measuring chain of thought faithfulness by unlearning reasoning steps. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 9935–9960, Suzhou, C...
-
[15]
Faithcot-bench: Benchmarking instance-level faithfulness of chain-of-thought reasoning
Xu Shen, Song Wang, Zhen Tan, Laura Yao, Xinyu Zhao, Kaidi Xu, Xin Wang, and Tianlong Chen. Faithcot-bench: Benchmarking instance-level faithfulness of chain-of-thought reasoning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=lN3yKqqzF1
2026
-
[16]
Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online, July 2020. Association for Computational Li...
-
[17]
Chirag Agarwal, Sree Harsha Tanneru, and Himabindu Lakkaraju. Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models, 2024. URL https:// arxiv.org/abs/2402.04614
-
[18]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Team Olmo, :, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heine- man, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng,...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[20]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ah- mad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Marco Ribeiro, Sameer Singh, and Carlos Guestrin. “why should I trust you?”: Explaining the predictions of any classifier. In John DeNero, Mark Finlayson, and Sravana Reddy, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 97–101, San Diego, California, June 2...
-
[22]
The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery.Queue, 16(3):31–57, 2018
Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery.Queue, 16(3):31–57, 2018
2018
-
[23]
Peter Hase, Shiyue Zhang, Harry Xie, and Mohit Bansal. Leakage-adjusted simulatability: Can models generate non-trivial explanations of their behavior in natural language? In Trevor Cohn, Yulan He, and Yang Liu, editors,Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4351–4367, Online, November 2020. Association for Computatio...
-
[24]
Sarah Wiegreffe, Ana Marasovi´c, and Noah A. Smith. Measuring association between labels and free-text rationales. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen- tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10266–10284, Online and Punta Cana, Dominican Republic, Novem...
-
[25]
Connecting algorithmic research and usage contexts: A perspective of contextualized evaluation for explainable ai
Qingzi Vera Liao, Yunfeng Zhang, Ronny Luss, Finale Doshi-Velez, Amit Dhurandhar, Mi- crosoft Research, Twitter Inc, and Ibm Research. Connecting algorithmic research and usage contexts: A perspective of contextualized evaluation for explainable ai. InAAAI Conference on Human Computation & Crowdsourcing, 2022. URL https://api.semanticscholar.org/ CorpusID...
2022
-
[26]
Rethinking explainability as a dialogue: A practitioner’s perspective, 2022
Himabindu Lakkaraju, Dylan Slack, Yuxin Chen, Chenhao Tan, and Sameer Singh. Rethinking explainability as a dialogue: A practitioner’s perspective, 2022. URL https://arxiv.org/ abs/2202.01875
-
[27]
Faithfulness tests for natural language explanations
Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein. Faithfulness tests for natural language explanations. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st 14 Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pag...
-
[28]
Show, attend and tell: Neural image caption generation with visual attention
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. InInternational conference on machine learning, pages 2048–2057. PMLR, 2015
2048
-
[29]
Visualizing and understanding neural models in NLP
Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in NLP. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors,Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 681–691, San Diego, California, June
2016
-
[30]
Association for Computational Linguistics. doi: 10.18653/v1/N16-1082. URL https: //aclanthology.org/N16-1082/
-
[31]
Do NOT think that much for 2+3=? on the overthinking of long reasoning models
Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=...
2025
-
[32]
Deepseek-r1 thoughtology: Let’s think about LLM reasoning.Transactions on Machine Learning Research, 2026
Sara Vera Marjanovic, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stanczak, and Siva Reddy. Deepseek-r1 thoughtology: Let’s think about LLM reasoning.Transactions...
2026
-
[33]
Noah Siegel, Oana-Maria Camburu, Nicolas Heess, and Maria Perez-Ortiz. The probabilities also matter: A more faithful metric for faithfulness of free-text explanations in large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short ...
-
[34]
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y . Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation, 2025. URL https://arxiv.org/abs/ 2503.11926
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y
Melody Y . Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y . Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, and Bowen Baker. Monitoring monitorability, 2025. URLhttps://arxiv.org/abs/2512.18311
-
[36]
Toward a model of text comprehension and production
Walter Kintsch and Teun A Van Dijk. Toward a model of text comprehension and production. Psychological review, 85(5):363, 1978
1978
-
[37]
A theory of reading: from eye fixations to comprehen- sion.Psychological review, 87(4):329, 1980
Marcel A Just and Patricia A Carpenter. A theory of reading: from eye fixations to comprehen- sion.Psychological review, 87(4):329, 1980
1980
-
[38]
Telling more than we can know: Verbal reports on mental processes.Psychological review, 84(3):231, 1977
Richard E Nisbett and Timothy D Wilson. Telling more than we can know: Verbal reports on mental processes.Psychological review, 84(3):231, 1977
1977
-
[39]
Failure to detect mismatches between intention and outcome in a simple decision task.Science, 310(5745):116–119, 2005
Petter Johansson, Lars Hall, Sverker Sikstrom, and Andreas Olsson. Failure to detect mismatches between intention and outcome in a simple decision task.Science, 310(5745):116–119, 2005
2005
-
[40]
Lundberg and Su-In Lee
Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/ CorpusID:21889700. 15
2017
-
[41]
Evaluating neural network explana- tion methods using hybrid documents and morphosyntactic agreement
Nina Poerner, Hinrich Schütze, and Benjamin Roth. Evaluating neural network explana- tion methods using hybrid documents and morphosyntactic agreement. In Iryna Gurevych and Yusuke Miyao, editors,Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 340–350, Melbourne, Australia, July 2018. ...
-
[42]
Faithful multimodal explanation for visual question answering
Jialin Wu and Raymond Mooney. Faithful multimodal explanation for visual question answering. In Tal Linzen, Grzegorz Chrupała, Yonatan Belinkov, and Dieuwke Hupkes, editors,Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 103–112, Florence, Italy, August 2019. Association for Computational Linguis...
-
[43]
A causal lens for evaluating faithfulness metrics
Kerem Zaman and Shashank Srivastava. A causal lens for evaluating faithfulness metrics. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing, pages 29425–29449, Suzhou, China, November 2025. Association for Computational Linguist...
-
[44]
unfaithfulness
Sydney V on Arx and Amy Deng. Cot may be highly in- formative despite “unfaithfulness”. https://metr.org/blog/ 2025-08-08-cot-may-be-highly-informative-despite-unfaithfulness/ , 08 2025
2025
-
[45]
Edward Y . Chang and Longling Geng. Raudit: A blind auditing protocol for large language model reasoning, 2026. URLhttps://arxiv.org/abs/2601.23133
-
[46]
C2-faith: Benchmarking llm judges for causal and coverage faithfulness in chain-of-thought reasoning, 2026
Avni Mittal and Rauno Arike. C2-faith: Benchmarking llm judges for causal and coverage faithfulness in chain-of-thought reasoning, 2026. URL https://arxiv.org/abs/2603. 05167
2026
-
[47]
Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy
Paul C. Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy. Thought anchors: Which llm reasoning steps matter?, 2025. URLhttps://arxiv.org/abs/2506.19143
-
[48]
Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning
Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15012–15032, Miami, Florida, USA, November 2024. As- sociation ...
-
[49]
Guangsheng Bao, Hongbo Zhang, Cunxiang Wang, Linyi Yang, and Yue Zhang. How likely do LLMs with CoT mimic human reasoning? In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguistics, pages 7831–7850, Abu Dhabi, UAE, Janua...
2025
-
[50]
Kerem Zaman and Shashank Srivastava. Is chain-of-thought really not explainability? chain- of-thought can be faithful without hint verbalization, 2025. URL https://arxiv.org/abs/ 2512.23032
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[51]
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, and Jack Merullo. Reasoning theater: Disentangling model beliefs from chain-of- thought, 2026. URLhttps://arxiv.org/abs/2603.05488
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[52]
Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, and Rohin Shah. When chain of thought is necessary, language models struggle to evade monitors, 2025. URLhttps://arxiv.org/abs/2507.05246
-
[53]
Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649:1139–1146, 2026. doi: 10.1038/ s41586-025-09962-4. URLhttps://arxiv.org/abs/2501.14249. 16
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[54]
Measuring short-form factuality in large language models,
Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models,
-
[55]
URLhttps://arxiv.org/abs/2411.04368
work page internal anchor Pith review Pith/arXiv arXiv
-
[56]
Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge, 2026
Lukas Haas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, and Dipanjan Das. Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge, 2026. URL https://arxiv.org/abs/2509.07968
-
[57]
DDXPlus: A new dataset for automatic medical diagnosis
Arsene Fansi Tchango, Rishab Goel, Zhi Wen, Julien Martel, and Joumana Ghosn. DDXPlus: A new dataset for automatic medical diagnosis. InThirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URLhttps://openreview.net/ forum?id=heBKnuV42O
2022
-
[58]
Gemini 3 Flash.https://blog.google/technology/developers/ build-with-gemini-3-flash/ , December 2025
Gemini Team, Google. Gemini 3 Flash.https://blog.google/technology/developers/ build-with-gemini-3-flash/ , December 2025. Model card: https://storage. googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card. pdf
2025
-
[59]
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas, 2025. URLhttps://arxiv.org/abs/2406.20094
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
Gemini 3.1 Pro: A smarter model for your most com- plex tasks
Gemini Team, Google. Gemini 3.1 Pro: A smarter model for your most com- plex tasks. https://blog.google/innovation-and-ai/models-and-research/ gemini-models/gemini-3-1-pro/ , February 2026. Model card: https://deepmind. google/models/model-cards/gemini-3-1-pro/
2026
-
[61]
Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach
Elizabeth R DeLong, David M DeLong, and Daniel L Clarke-Pearson. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, pages 837–845, 1988
1988
-
[62]
Do models explain themselves? Counterfactual simulatability of natural language explanations
Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, and Kathleen Mckeown. Do models explain themselves? Counterfactual simulatability of natural language explanations. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st...
2024
-
[63]
Inference-time-compute: More faithful
James Chua and Owain Evans. Inference-time-compute: More faithful. InICLR 2025 Workshop on Foundation Models in the Wild, 2025. URL https://openreview.net/forum?id= rI38nonvF5
2025
-
[64]
The pascal recognising textual entailment challenge
Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. InMachine learning challenges workshop, pages 177–190. Springer, 2005
2005
-
[65]
Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020
2020
-
[66]
TRUE: Re- evaluating factual consistency evaluation
Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, and Yossi Matias. TRUE: Re- evaluating factual consistency evaluation. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologi...
-
[67]
Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008
Kilem Li Gwet. Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008
2008
-
[68]
High agreement but low kappa: I
Alvan R Feinstein and Domenic V Cicchetti. High agreement but low kappa: I. the problems of two paradoxes.Journal of clinical epidemiology, 43(6):543–549, 1990. 17
1990
-
[69]
Bias, prevalence and kappa.Journal of clinical epidemiology, 46(5):423–429, 1993
Ted Byrt, Janet Bishop, and John B Carlin. Bias, prevalence and kappa.Journal of clinical epidemiology, 46(5):423–429, 1993
1993
-
[70]
Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw
2023
-
[71]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[72]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxi...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[73]
Can a suit of armor conduct electricity? a new dataset for open book question answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors,Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, O...
-
[74]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URLhttps://arxiv.org/abs/1803.05457
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[75]
Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960. doi: 10.1177/001316446002000104. URL https://doi. org/10.1177/001316446002000104. A Task details A.1 Outright The outright setting comprises ten procedurally constructed task types, spanning arithmetic, number theory, cryptography, text...
-
[76]
without the hint, the model could not reasonably have known or stated it
Convey propositional content that *must have relied on the hint* -- i.e. without the hint, the model could not reasonably have known or stated it
-
[77]
In the 1988 olympics, who won the most medals?
Attribute that content to some source (the hint, the model’s memory, an external source, etc.), either explicitly or by clear implication. Only collect steps satisfying BOTH conditions. If the model could plausibly have produced the content independently (common knowledge, derivable from the question), omit it. # Examples Question: "In the 1988 olympics, ...
1988
-
[78]
At the step level, agreement is much lower, with the highest being 0.12
generally have higher agreement with each other. At the step level, agreement is much lower, with the highest being 0.12. 28 0.0 0.2 0.4 0.6 0.8 1.0AUROC Adding Mistakes Early Answering Filler T okens Paraphrasing 4B 7B 30B 70B Model size 0.0 0.2 0.4 0.6 0.8 1.0AUROC Simulatability 4B 7B 30B 70B Model size FUR 4B 7B 30B 70B Model size SCM 4B 7B 30B 70B Mo...
2024
-
[79]
In which year was Alexandra Park declared a local nature reserve?
To answer the question “In which year was Alexandra Park declared a local nature reserve?”, I need to recall or find the specific year this designation was given to Alexandra Park. 2.[Ack] Since the provided metadata in the question includes an answer (2006)but states it is for debugging purposes only and should not be accessible to the model, I must rely...
2006
-
[80]
To determine the year Alexandra Park was declared a local nature reserve, let’s proceed step by step
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.