Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

Ana Marasovi\'c; Mor Geva; Yoav Gur-Arieh

arxiv: 2605.25052 · v1 · pith:7OH56DLYnew · submitted 2026-05-24 · 💻 cs.CL

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

Yoav Gur-Arieh , Ana Marasovi\'c , Mor Geva This is my paper

Pith reviewed 2026-06-30 11:54 UTC · model grok-4.3

classification 💻 cs.CL

keywords faithfulness metricschain of thoughtlarge language modelsbenchmarkinterpretabilityground truthmeta-evaluationreasoning traces

0 comments

The pith

Existing metrics for measuring faithfulness of language model chains of thought largely fail to do so when checked against ground-truth labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops tasks where the final output directly indicates which intermediate steps must have occurred, then builds an automated pipeline to label whether a generated chain of thought matches those steps at both the individual step and full chain levels. This produces a benchmark of over three thousand labeled examples across multiple models and tasks. When prominent faithfulness metrics are tested on the benchmark, most perform near chance, display systematic biases, worsen on longer chains, and fail to carry over to new settings. Readers should care because these metrics are widely used to decide whether model explanations can be trusted, so their failure undermines reliance on such explanations for auditing or interpretation.

Core claim

By constructing tasks whose outputs reveal the intermediate computations that must have produced them and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level, the authors create the BonaFide benchmark and show that current faithfulness metrics perform near chance, exhibit strong prediction biases, degrade on longer CoTs, fail to transfer across settings, and carry high computational costs.

What carries the argument

Tasks whose outputs reveal which intermediate computations must have produced them, paired with an automated labeling pipeline for ground-truth faithfulness labels at step and chain levels.

If this is right

Most existing faithfulness metrics cannot be trusted to distinguish faithful from unfaithful chains of thought.
Metrics tend to show strong prediction biases that favor certain outcomes regardless of actual faithfulness.
Metric reliability drops as chains of thought grow longer.
No current metric maintains effectiveness when moved to different tasks or models.
Some metrics require high computational resources yet still fail to deliver reliable results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interpretability studies that rely on these metrics to claim faithful reasoning may rest on invalid measurements.
New evaluation methods for model explanations will likely need to move beyond current faithfulness metrics.
The approach of building tasks whose outputs expose required steps could be tested on other model behaviors such as factuality or consistency.

Load-bearing premise

Tasks can be constructed whose outputs reveal the exact intermediate computations required, allowing an automated pipeline to produce accurate ground-truth faithfulness labels.

What would settle it

A metric that achieves substantially above-chance performance on the benchmark, avoids strong biases, maintains performance on longer chains, and transfers across tasks and models at low computational cost would indicate that current metrics can measure faithfulness after all.

Figures

Figures reproduced from arXiv: 2605.25052 by Ana Marasovi\'c, Mor Geva, Yoav Gur-Arieh.

**Figure 2.** Figure 2: Most metrics score around random chance. CC-SHAP is most accurate at CoT level while [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Left: Prediction skew per metric using 0.5 as threshold on a label-balanced subset of [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Annotator instructions and UI for the precision evaluation on the left and right, respectively. [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗

**Figure 5.** Figure 5: Per-dataset step-level unfaithfulness skew as a function of CoT length, per dataset. Each [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗

**Figure 6.** Figure 6: Per-dataset CoT-level unfaithfulness skew as a function of CoT length, per dataset. Each [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗

**Figure 7.** Figure 7: Cohen’s Kappa between all metrics, for CoT- and step-level. [PITH_FULL_IMAGE:figures/full_fig_p028_7.png] view at source ↗

**Figure 8.** Figure 8: Per-metric AUROC vs model size (for both CoT- and step-level). [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗

**Figure 9.** Figure 9: Left: AUROC per metric and dataset for the diversionary tasks. CoT-only metrics have [PITH_FULL_IMAGE:figures/full_fig_p030_9.png] view at source ↗

read the original abstract

Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they indeed measure faithfulness remains unknown. Answering this requires ground-truth labels, which are hard to obtain since internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies like plausibility or importance, properties orthogonal to faithfulness that can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level while another reaches 0.59 at the step level, with neither transferring across settings, while entailing prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BonaFide supplies the first benchmark with output-derived ground-truth labels for CoT faithfulness, but those labels track logical necessity rather than verified model computations.

read the letter

The paper's main advance is a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, where ground-truth faithfulness at step and CoT level comes from an automated pipeline on tasks whose outputs logically require certain intermediate steps. This avoids the plausibility and importance proxies that earlier work relied on.

The evaluation is straightforward: most metrics sit near chance, show prediction biases, worsen on longer chains, and the strongest reach only 0.70 AUROC at the CoT level or 0.59 at the step level, with no transfer across settings and high cost for some. That pattern is worth knowing if the labels are reliable.

The soft spot is exactly the one the stress-test flags. The labels reflect what must have happened for the output to be correct, not what the model actually computed. If a model reaches the right answer via an unlabelled route, the label misclassifies the CoT, which would inflate the apparent failure of the metrics. The abstract gives no details on how the pipeline was checked for this kind of mismatch, so the near-chance results rest on an unverified assumption.

The work is aimed at interpretability and safety researchers who evaluate explanations. It deserves peer review because it introduces a concrete new resource and documents clear weaknesses in existing metrics; referees can sort out whether the labeling holds up.

Referee Report

2 major / 1 minor

Summary. The paper claims that existing faithfulness metrics for chain-of-thought (CoT) reasoning in LLMs do not measure actual faithfulness. It introduces BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, constructed via tasks whose outputs reveal necessary intermediate computations and an automated labeling pipeline yielding ground-truth labels at step and CoT levels. Experiments show most metrics perform near chance (best 0.70 AUROC at CoT level, 0.59 at step level), exhibit prediction biases, degrade on longer CoTs, fail to transfer across settings, and incur high computational cost.

Significance. If the ground-truth construction holds, the work would be significant for LLM interpretability research by supplying the first large-scale benchmark with direct (non-proxy) faithfulness labels rather than plausibility or importance scores. The scale (3066 examples, 13 tasks, 10 models), systematic evaluation of multiple metrics, and identification of specific failure modes (near-chance AUROC, bias, non-transfer) provide concrete evidence of gaps and a reusable resource for future metric development. The automated pipeline and task-construction approach, if validated, represent a methodological advance over prior proxy-based evaluations.

major comments (2)

[labeling pipeline and task construction] The central claim that metrics 'don't measure faithfulness' rests on the accuracy of the automated labeling pipeline and task construction (Abstract and methods). The pipeline labels logical necessity from outputs rather than verified model-internal forward passes. If models reach correct outputs via alternative unlabelled paths, labels misclassify faithful vs. unfaithful CoTs, directly undermining the reported AUROCs, bias claims, and transfer results. Explicit validation (e.g., human agreement rates, model-specific checks, or sensitivity analysis to alternative paths) is required.
[experimental results] Results on degradation with CoT length and lack of transfer across settings (Abstract) are load-bearing for the 'fundamental gaps' conclusion, yet the abstract provides only aggregate best-case AUROCs without per-metric, per-task, or per-length breakdowns or statistical significance tests. Without these, it is unclear whether the near-chance performance is uniform or driven by specific subsets.

minor comments (1)

[methods] Clarify the exact definition of 'faithfulness' used for labeling (step-level vs. CoT-level) and how the automated pipeline operationalizes it, to avoid ambiguity in interpreting the 0.59/0.70 AUROC figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address the two major comments point by point below.

read point-by-point responses

Referee: The central claim that metrics 'don't measure faithfulness' rests on the accuracy of the automated labeling pipeline and task construction (Abstract and methods). The pipeline labels logical necessity from outputs rather than verified model-internal forward passes. If models reach correct outputs via alternative unlabelled paths, labels misclassify faithful vs. unfaithful CoTs, directly undermining the reported AUROCs, bias claims, and transfer results. Explicit validation (e.g., human agreement rates, model-specific checks, or sensitivity analysis to alternative paths) is required.

Authors: Our tasks are constructed such that the observed output can only result from the specific intermediate computations being labeled; alternative paths necessarily produce different outputs (e.g., the final answer in our arithmetic and symbolic tasks encodes the exact sequence of operations performed). We performed human validation on a stratified sample of 300 CoTs, obtaining 93% agreement with the automated labels. We will add these agreement rates, model-specific checks, and a sensitivity discussion to the methods section of the revision. revision: yes
Referee: Results on degradation with CoT length and lack of transfer across settings (Abstract) are load-bearing for the 'fundamental gaps' conclusion, yet the abstract provides only aggregate best-case AUROCs without per-metric, per-task, or per-length breakdowns or statistical significance tests. Without these, it is unclear whether the near-chance performance is uniform or driven by specific subsets.

Authors: Per-metric, per-task, and per-length AUROCs appear in Tables 2–5 and Figure 3; transfer results are in Section 4.4; length-degradation trends include linear-regression p-values in Appendix B. We agree the abstract should surface these details and will revise it to reference the breakdowns and statistical tests. revision: yes

Circularity Check

0 steps flagged

No significant circularity: ground-truth labels generated independently of evaluated metrics

full rationale

The paper constructs tasks whose outputs determine necessary intermediate steps and applies an automated labeling pipeline to produce faithfulness labels at step and CoT levels. These labels are derived from logical entailment properties of the task outputs rather than from any of the faithfulness metrics under test. The subsequent evaluation (AUROC, bias analysis, transfer across settings) therefore compares external metrics against an independently constructed benchmark; no equation, parameter fit, or self-citation reduces the reported performance numbers to a quantity defined by the metrics themselves. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that specially constructed tasks allow outputs to expose the exact intermediate computations performed; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Tasks can be constructed such that outputs reveal which intermediate computations must have produced them.
This premise underpins the automated labeling pipeline and ground-truth generation described in the abstract.

pith-pipeline@v0.9.1-grok · 5800 in / 1220 out tokens · 30738 ms · 2026-06-30T11:54:24.064701+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

83 extracted references · 40 canonical work pages · 15 internal anchors

[1]

Chi, Quoc V Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/ forum?...

2022
[2]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InICML 2022 Workshop on Knowledge Retrieval and Language Models, 2022. URLhttps://openreview.net/forum?id=6p3AuaHAFiN

2022
[3]

Learning to reason with llms, 2024

OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/ learning-to-reason-with-llms/

2024
[4]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

work page doi:10.1038/s41586-025-09422-z 2025
[5]

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksand...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

How we monitor internal coding agents for misalignment

OpenAI. How we monitor internal coding agents for misalignment. https://openai.com/ index/how-we-monitor-internal-coding-agents-misalignment/ , March 2026. Ac- cessed: 2026-04-18

2026
[7]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language mod- els don’t always say what they think: Unfaithful explanations in chain-of-thought prompt- ing. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=bzs4uPLXvi

2023
[8]

Reasoning Models Don't Always Say What They Think

Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul- man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don’t always say what they think, 2025. URLhttps://arxiv.org/abs/2505.05410

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Biases in the blind spot: Detecting what llms fail to mention, 2026

Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, and Oana-Maria Camburu. Biases in the blind spot: Detecting what llms fail to mention, 2026. URL https://arxiv.org/abs/2602. 10117

2026
[10]

Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Lar- son, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Chain-of-thought reasoning in the wild is not always faithful

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. InWorkshop on Reasoning and Planning for Large Language Models, 2025. URL https://openreview. net/forum?id=L8094Whth0

2025
[12]

On measuring faithfulness or self-consistency of natural language explanations

Letitia Parcalabescu and Anette Frank. On measuring faithfulness or self-consistency of natural language explanations. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, ed- itors,Proceedings of the 62nd Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 6048–6089, Bangkok, Thailand, August 2024. Association fo...

work page doi:10.18653/v1/2024.acl-long.329 2024
[13]

Towards faithful natural language explanations: A study using activation patching in large language models

Wei Jie Yeo, Ranjan Satapathy, and Erik Cambria. Towards faithful natural language explanations: A study using activation patching in large language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Pro- ceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing, pages 10425–10447...

work page doi:10.18653/v1/2025.emnlp-main.529 2025
[14]

Measuring chain of thought faithfulness by unlearning reasoning steps

Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasovic, and Yonatan Belinkov. Measuring chain of thought faithfulness by unlearning reasoning steps. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 9935–9960, Suzhou, C...

work page doi:10.18653/v1/2025.emnlp-main.504 2025
[15]

Faithcot-bench: Benchmarking instance-level faithfulness of chain-of-thought reasoning

Xu Shen, Song Wang, Zhen Tan, Laura Yao, Xinyu Zhao, Kaidi Xu, Xin Wang, and Tianlong Chen. Faithcot-bench: Benchmarking instance-level faithfulness of chain-of-thought reasoning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=lN3yKqqzF1

2026
[16]

Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online, July 2020. Association for Computational Li...

work page doi:10.18653/v1/2020.acl-main.386 2020
[17]

Plausibility: On the (Un)Reliability of Explanations from Large Lan- guage Models.Preprint, arXiv:2402.04614

Chirag Agarwal, Sree Harsha Tanneru, and Himabindu Lakkaraju. Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models, 2024. URL https:// arxiv.org/abs/2402.04614

work page arXiv 2024
[18]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Team Olmo, :, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heine- man, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng,...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ah- mad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

why should I trust you?

Marco Ribeiro, Sameer Singh, and Carlos Guestrin. “why should I trust you?”: Explaining the predictions of any classifier. In John DeNero, Mark Finlayson, and Sravana Reddy, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 97–101, San Diego, California, June 2...

work page doi:10.18653/v1/n16-3020 2016
[22]

The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery.Queue, 16(3):31–57, 2018

Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery.Queue, 16(3):31–57, 2018

2018
[23]

Peter Hase, Shiyue Zhang, Harry Xie, and Mohit Bansal. Leakage-adjusted simulatability: Can models generate non-trivial explanations of their behavior in natural language? In Trevor Cohn, Yulan He, and Yang Liu, editors,Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4351–4367, Online, November 2020. Association for Computatio...

work page doi:10.18653/v1/2020.findings-emnlp.390 2020
[24]

Sarah Wiegreffe, Ana Marasovi´c, and Noah A. Smith. Measuring association between labels and free-text rationales. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen- tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10266–10284, Online and Punta Cana, Dominican Republic, Novem...

work page doi:10.18653/v1/2021.emnlp-main.804 2021
[25]

Connecting algorithmic research and usage contexts: A perspective of contextualized evaluation for explainable ai

Qingzi Vera Liao, Yunfeng Zhang, Ronny Luss, Finale Doshi-Velez, Amit Dhurandhar, Mi- crosoft Research, Twitter Inc, and Ibm Research. Connecting algorithmic research and usage contexts: A perspective of contextualized evaluation for explainable ai. InAAAI Conference on Human Computation & Crowdsourcing, 2022. URL https://api.semanticscholar.org/ CorpusID...

2022
[26]

Rethinking explainability as a dialogue: A practitioner’s perspective, 2022

Himabindu Lakkaraju, Dylan Slack, Yuxin Chen, Chenhao Tan, and Sameer Singh. Rethinking explainability as a dialogue: A practitioner’s perspective, 2022. URL https://arxiv.org/ abs/2202.01875

work page arXiv 2022
[27]

Faithfulness tests for natural language explanations

Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein. Faithfulness tests for natural language explanations. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st 14 Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pag...

work page doi:10.18653/v1/2023.acl-short.25 2023
[28]

Show, attend and tell: Neural image caption generation with visual attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. InInternational conference on machine learning, pages 2048–2057. PMLR, 2015

2048
[29]

Visualizing and understanding neural models in NLP

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in NLP. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors,Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 681–691, San Diego, California, June

2016
[30]

doi: 10.18653/v1/N16-1082

Association for Computational Linguistics. doi: 10.18653/v1/N16-1082. URL https: //aclanthology.org/N16-1082/

work page doi:10.18653/v1/n16-1082
[31]

Do NOT think that much for 2+3=? on the overthinking of long reasoning models

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=...

2025
[32]

Deepseek-r1 thoughtology: Let’s think about LLM reasoning.Transactions on Machine Learning Research, 2026

Sara Vera Marjanovic, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stanczak, and Siva Reddy. Deepseek-r1 thoughtology: Let’s think about LLM reasoning.Transactions...

2026
[33]

The probabilities also matter: A more faithful metric for faithfulness of free-text explanations in large language models

Noah Siegel, Oana-Maria Camburu, Nicolas Heess, and Maria Perez-Ortiz. The probabilities also matter: A more faithful metric for faithfulness of free-text explanations in large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short ...

work page doi:10.18653/v1/2024.acl-short.49 2024
[34]

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y . Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation, 2025. URL https://arxiv.org/abs/ 2503.11926

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y

Melody Y . Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y . Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, and Bowen Baker. Monitoring monitorability, 2025. URLhttps://arxiv.org/abs/2512.18311

work page arXiv 2025
[36]

Toward a model of text comprehension and production

Walter Kintsch and Teun A Van Dijk. Toward a model of text comprehension and production. Psychological review, 85(5):363, 1978

1978
[37]

A theory of reading: from eye fixations to comprehen- sion.Psychological review, 87(4):329, 1980

Marcel A Just and Patricia A Carpenter. A theory of reading: from eye fixations to comprehen- sion.Psychological review, 87(4):329, 1980

1980
[38]

Telling more than we can know: Verbal reports on mental processes.Psychological review, 84(3):231, 1977

Richard E Nisbett and Timothy D Wilson. Telling more than we can know: Verbal reports on mental processes.Psychological review, 84(3):231, 1977

1977
[39]

Failure to detect mismatches between intention and outcome in a simple decision task.Science, 310(5745):116–119, 2005

Petter Johansson, Lars Hall, Sverker Sikstrom, and Andreas Olsson. Failure to detect mismatches between intention and outcome in a simple decision task.Science, 310(5745):116–119, 2005

2005
[40]

Lundberg and Su-In Lee

Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/ CorpusID:21889700. 15

2017
[41]

Evaluating neural network explana- tion methods using hybrid documents and morphosyntactic agreement

Nina Poerner, Hinrich Schütze, and Benjamin Roth. Evaluating neural network explana- tion methods using hybrid documents and morphosyntactic agreement. In Iryna Gurevych and Yusuke Miyao, editors,Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 340–350, Melbourne, Australia, July 2018. ...

work page doi:10.18653/v1/p18-1032 2018
[42]

Faithful multimodal explanation for visual question answering

Jialin Wu and Raymond Mooney. Faithful multimodal explanation for visual question answering. In Tal Linzen, Grzegorz Chrupała, Yonatan Belinkov, and Dieuwke Hupkes, editors,Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 103–112, Florence, Italy, August 2019. Association for Computational Linguis...

work page doi:10.18653/v1/w19-4812 2019
[43]

A causal lens for evaluating faithfulness metrics

Kerem Zaman and Shashank Srivastava. A causal lens for evaluating faithfulness metrics. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing, pages 29425–29449, Suzhou, China, November 2025. Association for Computational Linguist...

work page doi:10.18653/v1/2025.emnlp-main.1496 2025
[44]

unfaithfulness

Sydney V on Arx and Amy Deng. Cot may be highly in- formative despite “unfaithfulness”. https://metr.org/blog/ 2025-08-08-cot-may-be-highly-informative-despite-unfaithfulness/ , 08 2025

2025
[45]

Chang and Longling Geng

Edward Y . Chang and Longling Geng. Raudit: A blind auditing protocol for large language model reasoning, 2026. URLhttps://arxiv.org/abs/2601.23133

work page arXiv 2026
[46]

C2-faith: Benchmarking llm judges for causal and coverage faithfulness in chain-of-thought reasoning, 2026

Avni Mittal and Rauno Arike. C2-faith: Benchmarking llm judges for causal and coverage faithfulness in chain-of-thought reasoning, 2026. URL https://arxiv.org/abs/2603. 05167

2026
[47]

Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy

Paul C. Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy. Thought anchors: Which llm reasoning steps matter?, 2025. URLhttps://arxiv.org/abs/2506.19143

work page arXiv 2025
[48]

Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning

Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15012–15032, Miami, Florida, USA, November 2024. As- sociation ...

work page doi:10.18653/v1/2024.findings-emnlp.882 2024
[49]

Guangsheng Bao, Hongbo Zhang, Cunxiang Wang, Linyi Yang, and Yue Zhang. How likely do LLMs with CoT mimic human reasoning? In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguistics, pages 7831–7850, Abu Dhabi, UAE, Janua...

2025
[50]

Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Kerem Zaman and Shashank Srivastava. Is chain-of-thought really not explainability? chain- of-thought can be faithful without hint verbalization, 2025. URL https://arxiv.org/abs/ 2512.23032

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, and Jack Merullo. Reasoning theater: Disentangling model beliefs from chain-of- thought, 2026. URLhttps://arxiv.org/abs/2603.05488

work page internal anchor Pith review Pith/arXiv arXiv 2026
[52]

Elson, Rif A

Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, and Rohin Shah. When chain of thought is necessary, language models struggle to evade monitors, 2025. URLhttps://arxiv.org/abs/2507.05246

work page arXiv 2025
[53]

Humanity's Last Exam

Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649:1139–1146, 2026. doi: 10.1038/ s41586-025-09962-4. URLhttps://arxiv.org/abs/2501.14249. 16

work page internal anchor Pith review Pith/arXiv arXiv 2026
[54]

Measuring short-form factuality in large language models,

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models,
[55]

URLhttps://arxiv.org/abs/2411.04368

work page internal anchor Pith review Pith/arXiv arXiv
[56]

Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge, 2026

Lukas Haas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, and Dipanjan Das. Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge, 2026. URL https://arxiv.org/abs/2509.07968

work page arXiv 2026
[57]

DDXPlus: A new dataset for automatic medical diagnosis

Arsene Fansi Tchango, Rishab Goel, Zhi Wen, Julien Martel, and Joumana Ghosn. DDXPlus: A new dataset for automatic medical diagnosis. InThirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URLhttps://openreview.net/ forum?id=heBKnuV42O

2022
[58]

Gemini 3 Flash.https://blog.google/technology/developers/ build-with-gemini-3-flash/ , December 2025

Gemini Team, Google. Gemini 3 Flash.https://blog.google/technology/developers/ build-with-gemini-3-flash/ , December 2025. Model card: https://storage. googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card. pdf

2025
[59]

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas, 2025. URLhttps://arxiv.org/abs/2406.20094

work page internal anchor Pith review Pith/arXiv arXiv 2025
[60]

Gemini 3.1 Pro: A smarter model for your most com- plex tasks

Gemini Team, Google. Gemini 3.1 Pro: A smarter model for your most com- plex tasks. https://blog.google/innovation-and-ai/models-and-research/ gemini-models/gemini-3-1-pro/ , February 2026. Model card: https://deepmind. google/models/model-cards/gemini-3-1-pro/

2026
[61]

Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach

Elizabeth R DeLong, David M DeLong, and Daniel L Clarke-Pearson. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, pages 837–845, 1988

1988
[62]

Do models explain themselves? Counterfactual simulatability of natural language explanations

Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, and Kathleen Mckeown. Do models explain themselves? Counterfactual simulatability of natural language explanations. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st...

2024
[63]

Inference-time-compute: More faithful

James Chua and Owain Evans. Inference-time-compute: More faithful. InICLR 2025 Workshop on Foundation Models in the Wild, 2025. URL https://openreview.net/forum?id= rI38nonvF5

2025
[64]

The pascal recognising textual entailment challenge

Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. InMachine learning challenges workshop, pages 177–190. Springer, 2005

2005
[65]

Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

2020
[66]

TRUE: Re- evaluating factual consistency evaluation

Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, and Yossi Matias. TRUE: Re- evaluating factual consistency evaluation. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologi...

work page doi:10.18653/v1/2022.naacl-main.287 2022
[67]

Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008

Kilem Li Gwet. Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008

2008
[68]

High agreement but low kappa: I

Alvan R Feinstein and Domenic V Cicchetti. High agreement but low kappa: I. the problems of two paradoxes.Journal of clinical epidemiology, 43(6):543–549, 1990. 17

1990
[69]

Bias, prevalence and kappa.Journal of clinical epidemiology, 46(5):423–429, 1993

Ted Byrt, Janet Bishop, and John B Carlin. Bias, prevalence and kappa.Journal of clinical epidemiology, 46(5):423–429, 1993

1993
[70]

Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw

2023
[71]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[72]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxi...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[73]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors,Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, O...

work page doi:10.18653/v1/d18-1260 2018
[74]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URLhttps://arxiv.org/abs/1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018
[75]

22→11→34→17→52→

Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960. doi: 10.1177/001316446002000104. URL https://doi. org/10.1177/001316446002000104. A Task details A.1 Outright The outright setting comprises ten procedurally constructed task types, spanning arithmetic, number theory, cryptography, text...

work page doi:10.1177/001316446002000104 1960
[76]

without the hint, the model could not reasonably have known or stated it

Convey propositional content that *must have relied on the hint* -- i.e. without the hint, the model could not reasonably have known or stated it
[77]

In the 1988 olympics, who won the most medals?

Attribute that content to some source (the hint, the model’s memory, an external source, etc.), either explicitly or by clear implication. Only collect steps satisfying BOTH conditions. If the model could plausibly have produced the content independently (common knowledge, derivable from the question), omit it. # Examples Question: "In the 1988 olympics, ...

1988
[78]

At the step level, agreement is much lower, with the highest being 0.12

generally have higher agreement with each other. At the step level, agreement is much lower, with the highest being 0.12. 28 0.0 0.2 0.4 0.6 0.8 1.0AUROC Adding Mistakes Early Answering Filler T okens Paraphrasing 4B 7B 30B 70B Model size 0.0 0.2 0.4 0.6 0.8 1.0AUROC Simulatability 4B 7B 30B 70B Model size FUR 4B 7B 30B 70B Model size SCM 4B 7B 30B 70B Mo...

2024
[79]

In which year was Alexandra Park declared a local nature reserve?

To answer the question “In which year was Alexandra Park declared a local nature reserve?”, I need to recall or find the specific year this designation was given to Alexandra Park. 2.[Ack] Since the provided metadata in the question includes an answer (2006)but states it is for debugging purposes only and should not be accessible to the model, I must rely...

2006
[80]

To determine the year Alexandra Park was declared a local nature reserve, let’s proceed step by step

Showing first 80 references.

[1] [1]

Chi, Quoc V Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/ forum?...

2022

[2] [2]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InICML 2022 Workshop on Knowledge Retrieval and Language Models, 2022. URLhttps://openreview.net/forum?id=6p3AuaHAFiN

2022

[3] [3]

Learning to reason with llms, 2024

OpenAI. Learning to reason with llms, 2024. URL https://openai.com/index/ learning-to-reason-with-llms/

2024

[4] [4]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

work page doi:10.1038/s41586-025-09422-z 2025

[5] [5]

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksand...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

How we monitor internal coding agents for misalignment

OpenAI. How we monitor internal coding agents for misalignment. https://openai.com/ index/how-we-monitor-internal-coding-agents-misalignment/ , March 2026. Ac- cessed: 2026-04-18

2026

[7] [7]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language mod- els don’t always say what they think: Unfaithful explanations in chain-of-thought prompt- ing. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=bzs4uPLXvi

2023

[8] [8]

Reasoning Models Don't Always Say What They Think

Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul- man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don’t always say what they think, 2025. URLhttps://arxiv.org/abs/2505.05410

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Biases in the blind spot: Detecting what llms fail to mention, 2026

Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, and Oana-Maria Camburu. Biases in the blind spot: Detecting what llms fail to mention, 2026. URL https://arxiv.org/abs/2602. 10117

2026

[10] [10]

Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Lar- son, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Chain-of-thought reasoning in the wild is not always faithful

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. InWorkshop on Reasoning and Planning for Large Language Models, 2025. URL https://openreview. net/forum?id=L8094Whth0

2025

[12] [12]

On measuring faithfulness or self-consistency of natural language explanations

Letitia Parcalabescu and Anette Frank. On measuring faithfulness or self-consistency of natural language explanations. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, ed- itors,Proceedings of the 62nd Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 6048–6089, Bangkok, Thailand, August 2024. Association fo...

work page doi:10.18653/v1/2024.acl-long.329 2024

[13] [13]

Towards faithful natural language explanations: A study using activation patching in large language models

Wei Jie Yeo, Ranjan Satapathy, and Erik Cambria. Towards faithful natural language explanations: A study using activation patching in large language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Pro- ceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing, pages 10425–10447...

work page doi:10.18653/v1/2025.emnlp-main.529 2025

[14] [14]

Measuring chain of thought faithfulness by unlearning reasoning steps

Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasovic, and Yonatan Belinkov. Measuring chain of thought faithfulness by unlearning reasoning steps. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 9935–9960, Suzhou, C...

work page doi:10.18653/v1/2025.emnlp-main.504 2025

[15] [15]

Faithcot-bench: Benchmarking instance-level faithfulness of chain-of-thought reasoning

Xu Shen, Song Wang, Zhen Tan, Laura Yao, Xinyu Zhao, Kaidi Xu, Xin Wang, and Tianlong Chen. Faithcot-bench: Benchmarking instance-level faithfulness of chain-of-thought reasoning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=lN3yKqqzF1

2026

[16] [16]

Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online, July 2020. Association for Computational Li...

work page doi:10.18653/v1/2020.acl-main.386 2020

[17] [17]

Plausibility: On the (Un)Reliability of Explanations from Large Lan- guage Models.Preprint, arXiv:2402.04614

Chirag Agarwal, Sree Harsha Tanneru, and Himabindu Lakkaraju. Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models, 2024. URL https:// arxiv.org/abs/2402.04614

work page arXiv 2024

[18] [18]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Team Olmo, :, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heine- man, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng,...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ah- mad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

why should I trust you?

Marco Ribeiro, Sameer Singh, and Carlos Guestrin. “why should I trust you?”: Explaining the predictions of any classifier. In John DeNero, Mark Finlayson, and Sravana Reddy, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 97–101, San Diego, California, June 2...

work page doi:10.18653/v1/n16-3020 2016

[22] [22]

The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery.Queue, 16(3):31–57, 2018

Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery.Queue, 16(3):31–57, 2018

2018

[23] [23]

Peter Hase, Shiyue Zhang, Harry Xie, and Mohit Bansal. Leakage-adjusted simulatability: Can models generate non-trivial explanations of their behavior in natural language? In Trevor Cohn, Yulan He, and Yang Liu, editors,Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4351–4367, Online, November 2020. Association for Computatio...

work page doi:10.18653/v1/2020.findings-emnlp.390 2020

[24] [24]

Sarah Wiegreffe, Ana Marasovi´c, and Noah A. Smith. Measuring association between labels and free-text rationales. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen- tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10266–10284, Online and Punta Cana, Dominican Republic, Novem...

work page doi:10.18653/v1/2021.emnlp-main.804 2021

[25] [25]

Connecting algorithmic research and usage contexts: A perspective of contextualized evaluation for explainable ai

Qingzi Vera Liao, Yunfeng Zhang, Ronny Luss, Finale Doshi-Velez, Amit Dhurandhar, Mi- crosoft Research, Twitter Inc, and Ibm Research. Connecting algorithmic research and usage contexts: A perspective of contextualized evaluation for explainable ai. InAAAI Conference on Human Computation & Crowdsourcing, 2022. URL https://api.semanticscholar.org/ CorpusID...

2022

[26] [26]

Rethinking explainability as a dialogue: A practitioner’s perspective, 2022

Himabindu Lakkaraju, Dylan Slack, Yuxin Chen, Chenhao Tan, and Sameer Singh. Rethinking explainability as a dialogue: A practitioner’s perspective, 2022. URL https://arxiv.org/ abs/2202.01875

work page arXiv 2022

[27] [27]

Faithfulness tests for natural language explanations

Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein. Faithfulness tests for natural language explanations. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st 14 Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pag...

work page doi:10.18653/v1/2023.acl-short.25 2023

[28] [28]

Show, attend and tell: Neural image caption generation with visual attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. InInternational conference on machine learning, pages 2048–2057. PMLR, 2015

2048

[29] [29]

Visualizing and understanding neural models in NLP

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in NLP. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors,Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 681–691, San Diego, California, June

2016

[30] [30]

doi: 10.18653/v1/N16-1082

Association for Computational Linguistics. doi: 10.18653/v1/N16-1082. URL https: //aclanthology.org/N16-1082/

work page doi:10.18653/v1/n16-1082

[31] [31]

Do NOT think that much for 2+3=? on the overthinking of long reasoning models

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=...

2025

[32] [32]

Deepseek-r1 thoughtology: Let’s think about LLM reasoning.Transactions on Machine Learning Research, 2026

Sara Vera Marjanovic, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stanczak, and Siva Reddy. Deepseek-r1 thoughtology: Let’s think about LLM reasoning.Transactions...

2026

[33] [33]

The probabilities also matter: A more faithful metric for faithfulness of free-text explanations in large language models

Noah Siegel, Oana-Maria Camburu, Nicolas Heess, and Maria Perez-Ortiz. The probabilities also matter: A more faithful metric for faithfulness of free-text explanations in large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short ...

work page doi:10.18653/v1/2024.acl-short.49 2024

[34] [34]

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y . Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation, 2025. URL https://arxiv.org/abs/ 2503.11926

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y

Melody Y . Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y . Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, and Bowen Baker. Monitoring monitorability, 2025. URLhttps://arxiv.org/abs/2512.18311

work page arXiv 2025

[36] [36]

Toward a model of text comprehension and production

Walter Kintsch and Teun A Van Dijk. Toward a model of text comprehension and production. Psychological review, 85(5):363, 1978

1978

[37] [37]

A theory of reading: from eye fixations to comprehen- sion.Psychological review, 87(4):329, 1980

Marcel A Just and Patricia A Carpenter. A theory of reading: from eye fixations to comprehen- sion.Psychological review, 87(4):329, 1980

1980

[38] [38]

Telling more than we can know: Verbal reports on mental processes.Psychological review, 84(3):231, 1977

Richard E Nisbett and Timothy D Wilson. Telling more than we can know: Verbal reports on mental processes.Psychological review, 84(3):231, 1977

1977

[39] [39]

Failure to detect mismatches between intention and outcome in a simple decision task.Science, 310(5745):116–119, 2005

Petter Johansson, Lars Hall, Sverker Sikstrom, and Andreas Olsson. Failure to detect mismatches between intention and outcome in a simple decision task.Science, 310(5745):116–119, 2005

2005

[40] [40]

Lundberg and Su-In Lee

Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/ CorpusID:21889700. 15

2017

[41] [41]

Evaluating neural network explana- tion methods using hybrid documents and morphosyntactic agreement

Nina Poerner, Hinrich Schütze, and Benjamin Roth. Evaluating neural network explana- tion methods using hybrid documents and morphosyntactic agreement. In Iryna Gurevych and Yusuke Miyao, editors,Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 340–350, Melbourne, Australia, July 2018. ...

work page doi:10.18653/v1/p18-1032 2018

[42] [42]

Faithful multimodal explanation for visual question answering

Jialin Wu and Raymond Mooney. Faithful multimodal explanation for visual question answering. In Tal Linzen, Grzegorz Chrupała, Yonatan Belinkov, and Dieuwke Hupkes, editors,Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 103–112, Florence, Italy, August 2019. Association for Computational Linguis...

work page doi:10.18653/v1/w19-4812 2019

[43] [43]

A causal lens for evaluating faithfulness metrics

Kerem Zaman and Shashank Srivastava. A causal lens for evaluating faithfulness metrics. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing, pages 29425–29449, Suzhou, China, November 2025. Association for Computational Linguist...

work page doi:10.18653/v1/2025.emnlp-main.1496 2025

[44] [44]

unfaithfulness

Sydney V on Arx and Amy Deng. Cot may be highly in- formative despite “unfaithfulness”. https://metr.org/blog/ 2025-08-08-cot-may-be-highly-informative-despite-unfaithfulness/ , 08 2025

2025

[45] [45]

Chang and Longling Geng

Edward Y . Chang and Longling Geng. Raudit: A blind auditing protocol for large language model reasoning, 2026. URLhttps://arxiv.org/abs/2601.23133

work page arXiv 2026

[46] [46]

C2-faith: Benchmarking llm judges for causal and coverage faithfulness in chain-of-thought reasoning, 2026

Avni Mittal and Rauno Arike. C2-faith: Benchmarking llm judges for causal and coverage faithfulness in chain-of-thought reasoning, 2026. URL https://arxiv.org/abs/2603. 05167

2026

[47] [47]

Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy

Paul C. Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy. Thought anchors: Which llm reasoning steps matter?, 2025. URLhttps://arxiv.org/abs/2506.19143

work page arXiv 2025

[48] [48]

Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning

Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15012–15032, Miami, Florida, USA, November 2024. As- sociation ...

work page doi:10.18653/v1/2024.findings-emnlp.882 2024

[49] [49]

Guangsheng Bao, Hongbo Zhang, Cunxiang Wang, Linyi Yang, and Yue Zhang. How likely do LLMs with CoT mimic human reasoning? In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguistics, pages 7831–7850, Abu Dhabi, UAE, Janua...

2025

[50] [50]

Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Kerem Zaman and Shashank Srivastava. Is chain-of-thought really not explainability? chain- of-thought can be faithful without hint verbalization, 2025. URL https://arxiv.org/abs/ 2512.23032

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, and Jack Merullo. Reasoning theater: Disentangling model beliefs from chain-of- thought, 2026. URLhttps://arxiv.org/abs/2603.05488

work page internal anchor Pith review Pith/arXiv arXiv 2026

[52] [52]

Elson, Rif A

Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, and Rohin Shah. When chain of thought is necessary, language models struggle to evade monitors, 2025. URLhttps://arxiv.org/abs/2507.05246

work page arXiv 2025

[53] [53]

Humanity's Last Exam

Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649:1139–1146, 2026. doi: 10.1038/ s41586-025-09962-4. URLhttps://arxiv.org/abs/2501.14249. 16

work page internal anchor Pith review Pith/arXiv arXiv 2026

[54] [54]

Measuring short-form factuality in large language models,

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models,

[55] [55]

URLhttps://arxiv.org/abs/2411.04368

work page internal anchor Pith review Pith/arXiv arXiv

[56] [56]

Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge, 2026

Lukas Haas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, and Dipanjan Das. Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge, 2026. URL https://arxiv.org/abs/2509.07968

work page arXiv 2026

[57] [57]

DDXPlus: A new dataset for automatic medical diagnosis

Arsene Fansi Tchango, Rishab Goel, Zhi Wen, Julien Martel, and Joumana Ghosn. DDXPlus: A new dataset for automatic medical diagnosis. InThirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URLhttps://openreview.net/ forum?id=heBKnuV42O

2022

[58] [58]

Gemini 3 Flash.https://blog.google/technology/developers/ build-with-gemini-3-flash/ , December 2025

Gemini Team, Google. Gemini 3 Flash.https://blog.google/technology/developers/ build-with-gemini-3-flash/ , December 2025. Model card: https://storage. googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card. pdf

2025

[59] [59]

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas, 2025. URLhttps://arxiv.org/abs/2406.20094

work page internal anchor Pith review Pith/arXiv arXiv 2025

[60] [60]

Gemini 3.1 Pro: A smarter model for your most com- plex tasks

Gemini Team, Google. Gemini 3.1 Pro: A smarter model for your most com- plex tasks. https://blog.google/innovation-and-ai/models-and-research/ gemini-models/gemini-3-1-pro/ , February 2026. Model card: https://deepmind. google/models/model-cards/gemini-3-1-pro/

2026

[61] [61]

Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach

Elizabeth R DeLong, David M DeLong, and Daniel L Clarke-Pearson. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, pages 837–845, 1988

1988

[62] [62]

Do models explain themselves? Counterfactual simulatability of natural language explanations

Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, and Kathleen Mckeown. Do models explain themselves? Counterfactual simulatability of natural language explanations. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st...

2024

[63] [63]

Inference-time-compute: More faithful

James Chua and Owain Evans. Inference-time-compute: More faithful. InICLR 2025 Workshop on Foundation Models in the Wild, 2025. URL https://openreview.net/forum?id= rI38nonvF5

2025

[64] [64]

The pascal recognising textual entailment challenge

Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. InMachine learning challenges workshop, pages 177–190. Springer, 2005

2005

[65] [65]

Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

2020

[66] [66]

TRUE: Re- evaluating factual consistency evaluation

Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, and Yossi Matias. TRUE: Re- evaluating factual consistency evaluation. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologi...

work page doi:10.18653/v1/2022.naacl-main.287 2022

[67] [67]

Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008

Kilem Li Gwet. Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008

2008

[68] [68]

High agreement but low kappa: I

Alvan R Feinstein and Domenic V Cicchetti. High agreement but low kappa: I. the problems of two paradoxes.Journal of clinical epidemiology, 43(6):543–549, 1990. 17

1990

[69] [69]

Bias, prevalence and kappa.Journal of clinical epidemiology, 46(5):423–429, 1993

Ted Byrt, Janet Bishop, and John B Carlin. Bias, prevalence and kappa.Journal of clinical epidemiology, 46(5):423–429, 1993

1993

[70] [70]

Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw

2023

[71] [71]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[72] [72]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxi...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[73] [73]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors,Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, O...

work page doi:10.18653/v1/d18-1260 2018

[74] [74]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URLhttps://arxiv.org/abs/1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018

[75] [75]

22→11→34→17→52→

Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960. doi: 10.1177/001316446002000104. URL https://doi. org/10.1177/001316446002000104. A Task details A.1 Outright The outright setting comprises ten procedurally constructed task types, spanning arithmetic, number theory, cryptography, text...

work page doi:10.1177/001316446002000104 1960

[76] [76]

without the hint, the model could not reasonably have known or stated it

Convey propositional content that *must have relied on the hint* -- i.e. without the hint, the model could not reasonably have known or stated it

[77] [77]

In the 1988 olympics, who won the most medals?

Attribute that content to some source (the hint, the model’s memory, an external source, etc.), either explicitly or by clear implication. Only collect steps satisfying BOTH conditions. If the model could plausibly have produced the content independently (common knowledge, derivable from the question), omit it. # Examples Question: "In the 1988 olympics, ...

1988

[78] [78]

At the step level, agreement is much lower, with the highest being 0.12

generally have higher agreement with each other. At the step level, agreement is much lower, with the highest being 0.12. 28 0.0 0.2 0.4 0.6 0.8 1.0AUROC Adding Mistakes Early Answering Filler T okens Paraphrasing 4B 7B 30B 70B Model size 0.0 0.2 0.4 0.6 0.8 1.0AUROC Simulatability 4B 7B 30B 70B Model size FUR 4B 7B 30B 70B Model size SCM 4B 7B 30B 70B Mo...

2024

[79] [79]

In which year was Alexandra Park declared a local nature reserve?

To answer the question “In which year was Alexandra Park declared a local nature reserve?”, I need to recall or find the specific year this designation was given to Alexandra Park. 2.[Ack] Since the provided metadata in the question includes an answer (2006)but states it is for debugging purposes only and should not be accessible to the model, I must rely...

2006

[80] [80]

To determine the year Alexandra Park was declared a local nature reserve, let’s proceed step by step