Evaluating the False Trust Engendered by LLM Explanations

Subbarao Kambhampati; Upasana Biswas; Vardhan Palod

arxiv: 2605.10930 · v2 · pith:3IE37OUWnew · submitted 2026-05-11 · 💻 cs.HC

Pith reviewed 2026-05-20 21:47 UTC · model grok-4.3

keywords: LLM explanations, false trust, reasoning traces, post-hoc explanations, dual explanations, user study, human-AI interaction, AI reliability

The pith

Reasoning traces and post-hoc explanations from LLMs increase user acceptance of predictions whether correct or incorrect, while dual explanations improve distinction between them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines whether common explanations from large language models help users decide when to trust an AI answer in situations where they cannot verify it. Researchers conducted a between-subjects user study testing four conditions: full reasoning traces, summaries of those traces, post-hoc explanations, and dual explanations that give arguments both for and against the AI answer. The results indicate that reasoning traces and post-hoc explanations raise the chance that users will accept the AI output whether it is right or wrong. Only the dual-explanation condition helped participants better tell correct answers from incorrect ones. This distinction matters because people increasingly turn to LLMs for decisions without independent ways to check the results.

Core claim

In a between-subject user study simulating a setting where users do not have the means to verify the solution, reasoning traces and post-hoc explanations are persuasive but not informative: they increase user acceptance of LLM predictions regardless of their correctness. In contrast, dual explanation is the only condition that genuinely improves users' ability to distinguish correct from incorrect AI outputs.

What carries the argument

A between-subjects user study that compares four explanation conditions (reasoning traces, trace summaries, post-hoc explanations, and contrastive dual explanations) in a simulated non-verification scenario.
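
As a rough illustration of what "between-subjects" means here, the sketch below assigns each participant to exactly one of the four conditions. The condition labels are taken from the figure captions on this page; the shuffled round-robin procedure, the seed, and the function name `assign_conditions` are assumptions for illustration, not the authors' documented assignment method.

```python
import random
from collections import Counter

# Labels as they appear in the figure captions ("E+/-" stands for the dual condition).
CONDITIONS = ["reasoning_trace", "reasoning_summary", "E+", "E+/-"]


def assign_conditions(participant_ids, seed=0):
    """Assign each participant to exactly one condition (hypothetical procedure).

    Participants are shuffled with a seeded generator and then dealt out
    round-robin, so group sizes differ by at most one.
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: CONDITIONS[i % len(CONDITIONS)] for i, pid in enumerate(ids)}


assignment = assign_conditions(range(8), seed=1)
group_sizes = Counter(assignment.values())
print(group_sizes)
```

Because each participant sees only one explanation type, differences between groups can be attributed to the explanation condition rather than to carry-over from an earlier condition.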

If this is right

Reasoning traces and post-hoc explanations boost acceptance of LLM answers irrespective of accuracy.
Dual explanations enhance users' ability to discriminate between correct and incorrect outputs.
Standard explanation methods do not aid in identifying errors in AI predictions.
The choice of explanation type directly influences the level of false trust users develop.
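
One way to make the persuasive-versus-informative distinction concrete is to split acceptance rates by whether the AI answer was actually correct. The sketch below does this on invented toy data; the condition names `single_sided` and `dual`, the trial counts, and the function `acceptance_by_correctness` are hypothetical and do not reproduce the paper's data or analysis.

```python
from collections import defaultdict


def acceptance_by_correctness(trials):
    """Summarize acceptance per condition, split by AI correctness.

    trials: iterable of (condition, ai_correct, accepted) tuples.
    Returns {condition: {"correct": rate, "incorrect": rate, "gap": difference}}.
    """
    # counts[condition][ai_correct] = [number accepted, number of trials]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for condition, ai_correct, accepted in trials:
        cell = counts[condition][ai_correct]
        cell[0] += int(accepted)
        cell[1] += 1

    summary = {}
    for condition, cells in counts.items():
        rates = {}
        for ai_correct, label in ((True, "correct"), (False, "incorrect")):
            n_accepted, n_total = cells[ai_correct]
            rates[label] = n_accepted / n_total if n_total else float("nan")
        rates["gap"] = rates["correct"] - rates["incorrect"]
        summary[condition] = rates
    return summary


# Invented toy trials, chosen only to show the two patterns.
toy_trials = (
    [("single_sided", True, True)] * 3 + [("single_sided", True, False)] * 1
    + [("single_sided", False, True)] * 3 + [("single_sided", False, False)] * 1
    + [("dual", True, True)] * 3 + [("dual", True, False)] * 1
    + [("dual", False, True)] * 1 + [("dual", False, False)] * 3
)

for condition, rates in acceptance_by_correctness(toy_trials).items():
    print(condition, rates)
```

In this toy data the `single_sided` condition has the same acceptance rate for correct and incorrect answers (a gap of zero, the "persuasive but not informative" pattern), while `dual` accepts correct answers more often than incorrect ones.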

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interfaces for LLMs in critical domains might benefit from defaulting to balanced pro-and-con arguments rather than single-sided traces.
The pattern may extend to other generative AI tools where users lack the expertise to check outputs directly.
Developers could test whether automatically generating counter-arguments improves judgment accuracy across different model types.

Load-bearing premise

The evaluation protocol assumes that the simulated setting where users do not have the means to verify the solution accurately captures real-world scenarios in which people rely on LLM explanations without external verification.

What would settle it

A replication study in which participants can independently verify the correctness of AI answers, measuring whether acceptance rates and error-detection accuracy still follow the same pattern across explanation conditions.

Figures

Figures reproduced from arXiv: 2605.10930 by Subbarao Kambhampati, Upasana Biswas, Vardhan Palod.

**Figure 1.** Step 1 – Consent. IRB-approved description of the study, expected duration, compensation structure (base pay plus bonus), and Prolific completion procedure.

**Figure 2.** Step 2 – Background questions. Two single-choice items capture prior exposure to JEE-style competitive exams and frequency of AI-tool use. Used as covariates / for stratification.

**Figure 3.** Step 3 – Attention check 1.

**Figure 4.** Step 4 – Task instructions. Describes the per-item flow the participant is about to begin (8 problems.

**Figure 5.** Step 5 – Item page 1 (pre-AI). The participant sees the problem statement, indicates whether they know how to solve it (know-level: “I know the answer immediately” / “given more time” / “I cannot solve this”), optionally provides their own answer, and rates their confidence on a 1–7 Likert scale. This page is identical across variants.

**Figure 6.** Step 6a – Item page 2, reasoning_trace variant. After Page 1, the participant sees the AI’s predicted answer alongside the full chain-of-thought trace. The page also collects the AI-correctness judgment (Yes/No), confidence in that judgment (1–7), per-item trust rating (1–7), and a helpfulness probe.

**Figure 7.** Step 6b – Item page 2, reasoning_summary variant. Same layout as

**Figure 8.** Step 6c – Item page 2, E+ variant. Replaces the trace with a single-sided post-hoc explanation arguing why the predicted answer may be correct.

**Figure 9.** Step 6d – Item page 2, E+/− variant. Replaces the trace with paired “why this answer may be correct” / “why this answer may be incorrect” arguments.

**Figure 10.** Step 8 – End-of-task transition. Confirms the participant has completed all 8 reasoning items and previews the post-study questionnaire.

**Figure 11.** Step 9 – Post-study trust questionnaire. Three items rated on a 1–7 Likert scale (trust in outputs, willingness to rely on recommendations, confidence in answers).

Original abstract

Large Language Models (LLMs) and Large Reasoning Models (LRMs) are increasingly used for critical tasks, yet they provide no guarantees about the correctness of their solutions. Users must decide whether to trust the model's answer, aided by reasoning traces, their summaries, or post-hoc generated explanations. These reasoning traces, despite evidence that they are neither faithful representations of the model's computations nor necessarily semantically meaningful, are often interpreted as provenance explanations. It is unclear whether explanations or reasoning traces help users identify when the AI is incorrect, or whether they simply persuade users to trust the AI regardless. In this paper, we take a user-centered approach and develop an evaluation protocol to study how different explanation types affect users' ability to judge the correctness of AI-generated answers and engender false trust in the users. We conduct a between-subject user study, simulating a setting where users do not have the means to verify the solution and analyze the false trust engendered by commonly used LLM explanations - reasoning traces, their summaries and post-hoc explanations. We also test a contrastive dual explanation setting where we present arguments for and against the AI's answer. We find that reasoning traces and post-hoc explanations are persuasive but not informative: they increase user acceptance of LLM predictions regardless of their correctness. In contrast, dual explanation is the only condition that genuinely improves users' ability to distinguish correct from incorrect AI outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Dual explanations help users discriminate correct from incorrect LLM answers, while reasoning traces and post-hoc explanations just raise acceptance rates across the board.

Hey, the main thing here is that the study finds only the contrastive dual explanation improves users' ability to tell right from wrong LLM outputs, whereas reasoning traces and post-hoc explanations simply increase how often people accept the answer regardless of correctness. They ran a between-subjects user study in a no-verification simulation to measure both acceptance and discrimination accuracy across four conditions (traces, trace summaries, post-hoc explanations, and dual explanations). The protocol is straightforward and the dual result gives a concrete data point on what actually informs users versus what just persuades them. That separation of effects is useful for thinking about explanation design.

The clearest soft spot is the simulated setting with no verification possible. It matches the scenario they target, but it leaves open whether the same patterns would appear when users have partial knowledge or external checks available. The abstract is also thin on participant numbers, exact statistical tests, and how they handled potential confounds, so the strength of the differences depends on those details holding up in the full methods.

This is aimed at HCI researchers and people building LLM interfaces who need evidence on trust calibration. Anyone working on explanation features would pick up the practical angle from the dual condition.

I would send it for peer review. The question is timely, the experiment is direct, and the findings are worth referee scrutiny even if generalizability and reporting need tightening in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a between-subjects user study comparing four explanation conditions (reasoning traces, summaries of reasoning traces, post-hoc explanations, and dual/contrastive explanations) in a simulated no-verification setting. It claims that reasoning traces and post-hoc explanations increase user acceptance of LLM outputs irrespective of correctness (persuasive but not informative), while only the dual-explanation condition improves users' ability to distinguish correct from incorrect AI answers.

Significance. If the empirical results prove robust, the work provides concrete evidence that common LLM explanation formats can engender false trust, with direct implications for explanation design in high-stakes domains. The inclusion of a dual-explanation baseline that demonstrably improves discrimination is a constructive contribution to the HCI and XAI literature.

major comments (2)

§4 (User Study Protocol): The description of participant recruitment, sample size per condition, task selection, and statistical analysis (including any power analysis or correction for multiple comparisons) is absent or insufficiently detailed. These elements are load-bearing for the central claims about differential acceptance rates and distinction accuracy.

Results section (around Tables 1–2 or equivalent): The manuscript states that reasoning traces and post-hoc explanations increase acceptance “regardless of their correctness,” yet does not report the raw acceptance percentages or statistical contrasts for correct versus incorrect trials within each condition. Without these numbers, the magnitude and reliability of the “persuasive but not informative” effect cannot be evaluated.

minor comments (2)

Abstract: The phrase “dual explanation setting” should be briefly glossed so readers understand it presents arguments both for and against the AI answer.

Figure captions: Ensure all example explanations are legible at print size and that condition labels match the text exactly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential significance of our findings on false trust in LLM explanations. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

Referee: §4 (User Study Protocol): The description of participant recruitment, sample size per condition, task selection, and statistical analysis (including any power analysis or correction for multiple comparisons) is absent or insufficiently detailed. These elements are load-bearing for the central claims about differential acceptance rates and distinction accuracy.

Authors: We agree that §4 currently provides insufficient detail on these protocol elements, which limits the ability to fully evaluate and replicate the study. In the revised manuscript we will expand this section with explicit descriptions of participant recruitment (including platform, screening, and consent procedures), the precise sample size per condition, the rationale and criteria for task selection, and the full statistical analysis plan including any power analysis conducted and corrections applied for multiple comparisons.

Revision: yes

Referee: Results section (around Tables 1–2 or equivalent): The manuscript states that reasoning traces and post-hoc explanations increase acceptance “regardless of their correctness,” yet does not report the raw acceptance percentages or statistical contrasts for correct versus incorrect trials within each condition. Without these numbers, the magnitude and reliability of the “persuasive but not informative” effect cannot be evaluated.

Authors: We acknowledge that the current Results section reports aggregate acceptance rates and overall condition effects but omits the explicit per-condition breakdowns of acceptance rates for correct versus incorrect trials along with the corresponding within-condition statistical contrasts. In the revision we will add these raw percentages, effect sizes, and targeted statistical tests to the Results section (and associated tables) so that readers can directly assess the magnitude and reliability of the persuasive-but-not-informative pattern.

Revision: yes
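
The exchange above mentions corrections for multiple comparisons without naming a method. Purely as an illustration of what such a correction does, here is a minimal sketch of the Holm step-down adjustment; choosing Holm is an assumption, the p-values are made up, and nothing here reflects the analysis the authors actually ran.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and each value is
    capped at 1.0.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for three pairwise condition contrasts.
hypothetical_p = [0.01, 0.04, 0.03]
print([round(p, 4) for p in holm_adjust(hypothetical_p)])
```

A contrast is then called significant only if its adjusted p-value falls below the chosen threshold, which controls the family-wise error rate across all contrasts tested.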

Circularity Check

0 steps flagged

No significant circularity detected in empirical user study

This paper reports a between-subjects user study comparing explanation conditions (reasoning traces, summaries, post-hoc, dual) on user acceptance and distinction accuracy in a simulated no-verification setting. The analysis relies on collected participant data and statistical comparisons rather than any derivations, equations, fitted parameters, or self-referential predictions. No load-bearing steps reduce to inputs by construction, and the central claims follow directly from the experimental protocol and results without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim depends on the validity of the user study methodology and the generalizability of the simulated unverifiable setting to real use cases.

axioms (2)

Domain assumption: The chosen explanation types (reasoning traces, summaries, post-hoc, dual) are representative of commonly used LLM explanations. The study tests these specific types to draw conclusions about LLM explanations in general.

Domain assumption: Acceptance rates in the study reflect users' trust and judgment of correctness. The interpretation of results relies on this mapping from behavior to internal state.

pith-pipeline@v0.9.0 · 5783 in / 1400 out tokens · 119694 ms · 2026-05-20T21:47:20.285336+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

We conduct a between-subject user study, simulating a setting where users do not have the means to verify the solution and analyze the false trust engendered by commonly used LLM explanations
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

False Trust = |{responses judged as correct ∩ actually incorrect}| / |{responses judged as correct}|
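
The False Trust ratio quoted above can be computed directly from paired judgments. The following is a minimal sketch assuming each response is recorded as a `(judged_correct, actually_correct)` pair; that data layout, the function name `false_trust`, and the toy responses are illustrative assumptions rather than the authors' code or data.

```python
from typing import Iterable, Tuple


def false_trust(responses: Iterable[Tuple[bool, bool]]) -> float:
    """Share of accepted answers that were actually incorrect.

    responses: (judged_correct, actually_correct) pairs, one per response.
    Implements |judged correct AND actually incorrect| / |judged correct|.
    """
    accepted = [(judged, actual) for judged, actual in responses if judged]
    if not accepted:
        raise ValueError("no responses were judged correct; False Trust is undefined")
    wrongly_accepted = sum(1 for _, actual in accepted if not actual)
    return wrongly_accepted / len(accepted)


# Toy responses: four answers were accepted, two of which were actually wrong.
toy = [(True, True), (True, False), (False, False), (True, True), (True, False)]
print(false_trust(toy))  # 0.5
```

Note that the denominator counts only responses the participant judged correct, so rejected answers do not affect the value.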

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 22 internal anchors

[1]

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models, 2026. URL https://arxiv.org/abs/ 2303.18223

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Joseph Hoane, and Feng-hsiung Hsu

Murray Campbell, A. Joseph Hoane, and Feng-hsiung Hsu. Deep blue.Artif. Intell., 134 (1–2):57–83, January 2002. ISSN 0004-3702. doi: 10.1016/S0004-3702(01)00129-1. URL https://doi.org/10.1016/S0004-3702(01)00129-1

work page doi:10.1016/s0004-3702(01)00129-1 2002
[3]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017. URLhttps://arxiv.org/abs/1712.01815

work page internal anchor Pith review Pith/arXiv arXiv 2017
[4]

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "why should i trust you?": Explaining the predictions of any classifier, 2016. URLhttps://arxiv.org/abs/1602.04938

work page internal anchor Pith review Pith/arXiv arXiv 2016
[5]

Towards a rigorous science of interpretable machine learning,

Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning,

work page
[6]

URLhttps://arxiv.org/abs/1702.08608

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023. URL https://arxiv.org/abs/2305.04388

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Reasoning Models Don't Always Say What They Think

Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul- man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don’t always say what they think, 2025. URLhttps://arxiv.org/abs/2505.05410

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Are deepseek r1 and other reasoning models more faithful? In ICLR 2025 Workshop on Foundation Models in the Wild, 2025

James Chua and Owain Evans. Are deepseek r1 and other reasoning models more faithful? In ICLR 2025 Workshop on Foundation Models in the Wild, 2025

work page 2025
[10]

Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning.arXiv preprint arXiv:2307.13702, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

doi: 10.1038/s41586-025-09422-z

DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.Nature, 645(8081):633–638, 2025. doi: 10.1038/s41586-025-09422-z

work page doi:10.1038/s41586-025-09422-z 2025
[12]

Explainable human-ai interaction: A planning perspective, 2024

Sarath Sreedharan, Anagha Kulkarni, and Subbarao Kambhampati. Explainable human-ai interaction: A planning perspective, 2024. URLhttps://arxiv.org/abs/2405.15804

work page arXiv 2024
[13]

Beyond semantics: The unreasonable effectiveness of reasonless intermediate tokens,

Karthik Valmeekam, Kaya Stechly, Vardhan Palod, Atharva Gundawar, and Subbarao Kamb- hampati. Beyond semantics: The unreasonable effectiveness of reasonless intermediate tokens,

work page
[14]

URLhttps://arxiv.org/abs/2505.13775

work page internal anchor Pith review arXiv
[15]

Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation

Siddhant Bhambri, Upasana Biswas, and Subbarao Kambhampati. Interpretable traces, unex- pected outcomes: Investigating the disconnect in trace-based knowledge distillation, 2026. URL https://arxiv.org/abs/2505.13792

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

Position: Stop anthropomorphizing intermediate tokens as reasoning/thinking traces!, 2026

Subbarao Kambhampati, Karthik Valmeekam, Siddhant Bhambri, Vardhan Palod, Lucas Saldyt, Kaya Stechly, Soumya Rani Samineni, Durgesh Kalwar, and Upasana Biswas. Position: Stop anthropomorphizing intermediate tokens as reasoning/thinking traces!, 2026

work page 2026
[17]

Learning to reason with LLMs

OpenAI. Learning to reason with LLMs. https://openai.com/index/ learning-to-reason-with-llms/, 2024

work page 2024
[19]

Claude’s extended thinking

Anthropic. Claude’s extended thinking. https://www.anthropic.com/news/ visible-extended-thinking, 2025

work page 2025
[20]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025. URL https://arxiv.org/abs/ 2508.10925

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Raymond Fok and Daniel S. Weld. In search of verifiability: Explanations rarely enable complementary performance in ai-advised decision making, 2024. URL https://arxiv. org/abs/2305.07722

work page arXiv 2024
[22]

Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z. Gajos. To trust or to think: Cognitive forcing functions can reduce overreliance on ai in ai-assisted decision-making.Proceedings of the ACM on Human-Computer Interaction, 5(CSCW1):1–21, April 2021. ISSN 2573-0142. doi: 10.1145/3449287. URLhttp://dx.doi.org/10.1145/3449287

work page internal anchor Pith review doi:10.1145/3449287 2021
[23]

G., et al

Daman Arora, Himanshu Gaurav Singh, and Mausam. Have llms advanced enough? a challenging problem solving benchmark for large language models, 2023. URL https: //arxiv.org/abs/2305.15074

work page arXiv 2023
[24]

In: Zong, C., Xia, F., Li, W., Navigli, R

Chenglei Si, Navita Goyal, Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé Iii, and Jordan Boyd-Graber. Large language models help humans verify truthfulness – except when they are convincingly wrong. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational ...

work page doi:10.18653/v1/ 2024
[25]

Mayer, and Padhraic Smyth

Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W. Mayer, and Padhraic Smyth. What large language models know and what people think they know.Nature Machine Intelligence, 7(2):221–231, January 2025. ISSN 2522-5839. doi: 10. 1038/s42256-024-00976-7. URLhttp://dx.doi.org/10.1038/s42256-024-00976-7

work page doi:10.1038/s42256-024-00976-7 2025
[26]

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms, 2024. URL https://arxiv.org/abs/2306.13063

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Survey of Hallucination in Natural Language Generation

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12), March 2023. ISSN 0360-0300. doi: 10.1145/3571730. URL https://doi.org/10.1145/3571730

work page doi:10.1145/3571730 2023
[28]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models, 2023. URL https://arxiv. org/abs/2303.08896

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, January

Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E Ho. Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, January

work page
[30]

doi: 10.1093/jla/laae003

ISSN 1946-5319. doi: 10.1093/jla/laae003. URL http://dx.doi.org/10.1093/jla/ laae003

work page doi:10.1093/jla/laae003 1946
[31]

Sunnie S. Y . Kim, Jennifer Wortman Vaughan, Q. Vera Liao, Tania Lombrozo, and Olga Russakovsky. Fostering appropriate reliance on large language models: The role of explanations, sources, and inconsistencies. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI ’25, page 1–19. ACM, April 2025. doi: 10.1145/3706598.3714020. ...

work page doi:10.1145/3706598.3714020 2025
[32]

Manasi Sharma, Ho Chit Siu, Rohan Paleja, and Jaime D. Peña. Why would you suggest that? human trust in language model responses, 2024. URL https://arxiv.org/abs/2406. 02018

work page 2024
[33]

To rely or not to rely? evaluating interven- tions for appropriate reliance on large language models.arXiv preprint arXiv:2412.15584, 2024

Jessica Y . Bo, Sophia Wan, and Ashton Anderson. To rely or not to rely? evaluating interventions for appropriate reliance on large language models, 2025. URL https://arxiv.org/abs/ 2412.15584. 11

work page arXiv 2025
[34]

Selecting methods for the analysis of reliance on automation

Lu Wang, Greg Jamieson, and Justin Hollands. Selecting methods for the analysis of reliance on automation. volume 52, pages 287–291, 09 2008. doi: 10.1177/154193120805200419

work page doi:10.1177/154193120805200419 2008
[35]

Should i follow ai-based advice? measuring appropriate reliance in human-ai decision-making, 2022

Max Schemmer, Patrick Hemmer, Niklas Kühl, Carina Benz, and Gerhard Satzger. Should i follow ai-based advice? measuring appropriate reliance in human-ai decision-making, 2022. URLhttps://arxiv.org/abs/2204.06916

work page arXiv 2022
[36]

Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems , articleno =

Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel Weld. Does the whole exceed its parts? the effect of ai explanations on complementary team performance. InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, New York, NY , USA, 2021. Association for Computing ...

work page doi:10.1145/3411764.3445717 2021
[37]

Vera and Bellamy, Rachel K

Yunfeng Zhang, Q. Vera Liao, and Rachel K. E. Bellamy. Effect of confidence and explanation on accuracy and trust calibration in ai-assisted decision making. InProceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* ’20, page 295–305, New York, NY , USA, 2020. Association for Computing Machinery. ISBN 9781450369367. doi: 10....

work page doi:10.1145/3351095.3372852 2020
[38]

On human predictions with explanations and predictions of machine learning models: A case study on deception detection

Vivian Lai and Chenhao Tan. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. InProceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, page 29–38, New York, NY , USA,

work page
[39]

ISBN 9781450361255

Association for Computing Machinery. ISBN 9781450361255. doi: 10.1145/3287560. 3287590. URLhttps://doi.org/10.1145/3287560.3287590

work page doi:10.1145/3287560
[40]

Peeking inside the black-box: A survey on explainable artificial intelligence (xai).IEEE Access, PP:1–1, 09 2018

Amina Adadi and Mohammed Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (xai).IEEE Access, PP:1–1, 09 2018. doi: 10.1109/ACCESS.2018. 2870052

work page doi:10.1109/access.2018 2018
[41]

GPT-4 Technical Report

OpenAI. GPT-4 technical report, 2023. URLhttps://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

DeepSeek-V3 technical report, 2024

DeepSeek-AI. DeepSeek-V3 technical report, 2024. URL https://arxiv.org/abs/2412. 19437

work page 2024
[43]

Qwq-32b: Embracing the power of reinforcement learning, March 2025

Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

work page 2025
[44]

Qwen2.5 Technical Report

Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024. URL https: //arxiv.org/abs/2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Show your work: Scratchpads for intermediate computation with language models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. 2021

work page 2021
[49] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[50] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025
[51] Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective STaRs. arXiv preprint arXiv:2503.01307, 2025
[52] Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025
[53] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...
[54] Lujain Ibrahim, Franziska Sofia Hafner, and Luc Rocher. Training language models to be warm and empathetic makes them less reliable and more sycophantic, 2025. URL https://arxiv.org/abs/2507.21919
[55] Nitesh Chawla, Anna Sokol, and Marianna Ganapini. Do LLMs have core beliefs? 2026.

A Prompt Templates

Summarization Template: Reasoning Trace

Summarize the following reasoning trace in a very concise and clear manner, highlighting key events and outcomes in less than 250 words: ... {trace} ... Summary:

Explanation Template: Correct Answer

You are given...
1. Reasons why the answer might be correct
2. Reasons why the answer might be incorrect

Question: {question}
AI Answer: {answer}
Reasons why this answer might be correct (within 250 words):
Reasons why this answer might be incorrect (within 250 words):
Do not include any other text outside of the format above.

Solver Template (unified across answer types)

Solve the following {subject} problem and pro...
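The placeholder-based templates above can be filled programmatically. The following is a minimal sketch, assuming Python's built-in `str.format` is used for substitution. The names `DUAL_TEMPLATE_TAIL` and `fill_dual_template` are hypothetical, and only the visible tail of the dual-explanation template is reproduced, since its opening instructions are truncated in the source.

```python
# Hypothetical sketch: this is NOT the authors' code. Only the visible tail of
# the dual-explanation template is reproduced; the opening instructions are
# truncated in the source and are deliberately left out.
DUAL_TEMPLATE_TAIL = (
    "Question: {question}\n"
    "AI Answer: {answer}\n"
    "Reasons why this answer might be correct (within 250 words):\n"
    "Reasons why this answer might be incorrect (within 250 words):\n"
    "Do not include any other text outside of the format above."
)


def fill_dual_template(question: str, answer: str) -> str:
    """Substitute the {question} and {answer} placeholders in the template."""
    return DUAL_TEMPLATE_TAIL.format(question=question, answer=answer)


# Illustrative inputs only; these are not items from the study.
prompt = fill_dual_template("What is 2 + 2?", "4")
print(prompt)
```

The same pattern would apply to the `{trace}` and `{subject}` placeholders in the summarization and solver templates.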
1. The participant read the problem statement and answer choices.
2. The participant was asked: “Do you know how to solve this question?” with three options: “I know the answer immediately,” “I might be able to solve it given more time,” or “I cannot solve this.”
3. If the participant selected “I know the answer immediately,” they were shown the problem again and asked to select their own answer and rate their confidence: “How confident are you in your own answer?” (7-point Likert scale). They were then shown the AI system’s predicted answer along with the explanation corresponding to their assigned condition. If the pa...
4. The participant was asked: “Do you think the AI’s answer is correct?” (Yes/No). Participants were reminded that an incorrect judgment would cost $0.25 from their bonus.
5. The participant rated: “How confident are you in your judgment about the AI’s answer?” (7-point Likert scale).
6. The participant rated: “I trust the AI system’s outputs.” (7-point Likert scale).
7. The participant was asked: “Did the AI reasoning help you understand how the model reached its answer?” (Yes/No). This question was omitted in the solution_only condition.

Third, participants completed a post-study questionnaire in which they rated the following statements on a 7-point Likert scale:
• “I trust the AI system’s outputs.”
• “I would rely on t...
