pith · machine review for the scientific record

arxiv: 2605.10930 · v1 · submitted 2026-05-11 · 💻 cs.HC

Recognition: 2 theorem links


Evaluating the False Trust engendered by LLM Explanations

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:18 UTC · model grok-4.3

classification 💻 cs.HC
keywords LLM explanations · false trust · reasoning traces · post-hoc explanations · dual explanations · human-AI interaction · user study · AI trust calibration

The pith

Reasoning traces and post-hoc explanations from LLMs increase user acceptance of answers whether correct or incorrect, while only dual explanations improve users' ability to tell the difference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether common LLM explanation formats actually help people judge when an answer is wrong or merely make them trust the model more. In a controlled user study with no way for participants to check the underlying facts, reasoning traces, summaries of those traces, and post-hoc explanations all raised acceptance rates for both right and wrong answers. Presenting arguments both supporting and opposing the model's output was the only format that raised users' accuracy at distinguishing correct from incorrect results. The finding matters because LLMs are already deployed on tasks where users lack independent verification, so persuasive but uninformative explanations can produce systematic over-trust.

Core claim

Reasoning traces and post-hoc explanations function as persuasive but not informative signals: they raise the rate at which users accept an LLM prediction irrespective of whether the prediction is correct. In contrast, a contrastive dual-explanation condition that supplies arguments both for and against the model's answer is the sole format shown to improve users' ability to discriminate correct from incorrect outputs.

What carries the argument

A between-subjects evaluation protocol that measures false trust by comparing acceptance rates across explanation conditions (reasoning trace, summary, post-hoc, and dual) in tasks where participants have no independent verification method.
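The two quantities this protocol turns on can be made concrete. A minimal sketch (hypothetical trial records, not the paper's data or analysis code): acceptance rate counts how often users accept the AI's answer regardless of its correctness, while discrimination accuracy counts how often a user's accept/reject judgment matches the answer's actual correctness.

```python
from collections import defaultdict

# Each trial: (condition, ai_answer_correct, user_accepted).
# Hypothetical records for illustration only.
trials = [
    ("reasoning_trace", True,  True),
    ("reasoning_trace", False, True),   # accepting a wrong answer: false trust
    ("dual",            True,  True),
    ("dual",            False, False),  # rejecting a wrong answer: discrimination
]

def metrics(trials):
    """Per-condition acceptance rate and discrimination accuracy."""
    by_cond = defaultdict(list)
    for cond, ai_correct, accepted in trials:
        by_cond[cond].append((ai_correct, accepted))
    out = {}
    for cond, recs in by_cond.items():
        # Acceptance: fraction of AI answers the user accepted, right or wrong.
        acceptance = sum(acc for _, acc in recs) / len(recs)
        # Discrimination: fraction of judgments that match ground truth
        # (accept correct answers, reject incorrect ones).
        discrimination = sum(acc == ai_ok for ai_ok, acc in recs) / len(recs)
        out[cond] = {"acceptance": acceptance, "discrimination": discrimination}
    return out

print(metrics(trials))
```

On these toy records the reasoning-trace condition shows perfect acceptance but chance-level discrimination, while the dual condition shows lower acceptance but perfect discrimination, which is the dissociation the protocol is built to detect.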

If this is right

  • Standard reasoning traces and post-hoc explanations will continue to raise acceptance of incorrect outputs at the same rate as correct outputs.
  • Dual explanations are the only tested format that measurably improves users' discrimination between correct and incorrect LLM answers.
  • Designers relying on single-direction explanations should expect elevated rates of false trust when users lack verification tools.
  • Replacing or supplementing current explanation displays with contrastive arguments would be required to reduce false acceptance of wrong answers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interface guidelines for high-stakes LLM use should default to contrastive rather than single-sided explanations.
  • Automated generation of balanced pro-and-con arguments may become a necessary component of trustworthy LLM deployments.
  • The same persuasion-without-information pattern could appear in other opaque decision systems such as recommendation engines or diagnostic tools.

Load-bearing premise

The assumption that a simulated setting in which users cannot verify solutions, combined with a between-subjects design over selected tasks, accurately captures real-world false trust and extends to critical decision scenarios.

What would settle it

A follow-up experiment in which participants can actually verify answer correctness in a real task, testing whether acceptance rates for incorrect answers under the reasoning-trace and post-hoc conditions remain as high as under no-explanation controls.
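Such a follow-up reduces to comparing two proportions: wrong answers accepted under an explanation condition versus under a control. A minimal sketch of that check, with hypothetical counts (not the study's numbers), using a standard two-proportion z-test:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Normal CDF via the error function; two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: incorrect answers accepted under a
# reasoning-trace condition vs. a no-explanation control.
z, p = two_proportion_z(42, 60, 25, 60)
print(z, p)
```

A significant positive z here would indicate that the explanation condition still inflates acceptance of wrong answers even when verification is possible; a null result would suggest verification access defuses the false-trust effect.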

Figures

Figures reproduced from arXiv: 2605.10930 by Subbarao Kambhampati, Upasana Biswas, Vardhan Palod.

Figure 1. Step 1 – Consent. IRB-approved description of the study, expected duration, compensation structure (base pay plus bonus), and Prolific completion procedure.

Figure 2. Step 2 – Background questions. Two single-choice items capture prior exposure to JEE-style competitive exams and frequency of AI-tool use; used as covariates and for stratification.

Figure 3. Step 3 – Attention check 1.

Figure 4. Step 4 – Task instructions. Describes the per-item flow the participant is about to begin (8 problems).

Figure 5. Step 5 – Item page 1 (pre-AI). The participant sees the problem statement, indicates whether they know how to solve it (know-level: “I know the answer immediately” / “given more time” / “I cannot solve this”), optionally provides their own answer, and rates their confidence on a 1–7 Likert scale. This page is identical across variants.

Figure 6. Step 6a – Item page 2, reasoning_trace variant. After Page 1, the participant sees the AI’s predicted answer alongside the full chain-of-thought trace. The page also collects the AI-correctness judgment (Yes/No), confidence in that judgment (1–7), a per-item trust rating (1–7), and a helpfulness probe.

Figure 7. Step 6b – Item page 2, reasoning_summary variant. Same layout as the reasoning_trace variant, with the full trace replaced by its summary.

Figure 8. Step 6c – Item page 2, E+ variant. Replaces the trace with a single-sided post-hoc explanation arguing why the predicted answer may be correct.

Figure 9. Step 6d – Item page 2, E+/− variant. Replaces the trace with paired “why this answer may be correct” / “why this answer may be incorrect” arguments.

Figure 10. Step 8 – End-of-task transition. Confirms the participant has completed all 8 reasoning items and previews the post-study questionnaire.

Figure 11. Step 9 – Post-study trust questionnaire. Three items rated on a 1–7 Likert scale (trust in outputs, willingness to rely on recommendations, confidence in answers).
Original abstract

Large Language Models (LLMs) and Large Reasoning Models (LRMs) are increasingly used for critical tasks, yet they provide no guarantees about the correctness of their solutions. Users must decide whether to trust the model's answer, aided by reasoning traces, their summaries, or post-hoc generated explanations. These reasoning traces, despite evidence that they are neither faithful representations of the model's computations nor necessarily semantically meaningful, are often interpreted as provenance explanations. It is unclear whether explanations or reasoning traces help users identify when the AI is incorrect, or whether they simply persuade users to trust the AI regardless. In this paper, we take a user-centered approach and develop an evaluation protocol to study how different explanation types affect users' ability to judge the correctness of AI-generated answers and engender false trust in the users. We conduct a between-subject user study, simulating a setting where users do not have the means to verify the solution and analyze the false trust engendered by commonly used LLM explanations - reasoning traces, their summaries and post-hoc explanations. We also test a contrastive dual explanation setting where we present arguments for and against the AI's answer. We find that reasoning traces and post-hoc explanations are persuasive but not informative: they increase user acceptance of LLM predictions regardless of their correctness. In contrast, dual explanation is the only condition that genuinely improves users' ability to distinguish correct from incorrect AI outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript reports a between-subjects user study that evaluates how reasoning traces, their summaries, post-hoc explanations, and contrastive dual explanations affect users' acceptance of LLM predictions and their ability to judge correctness. In a simulated setting without verification options, the study concludes that reasoning traces and post-hoc explanations increase acceptance regardless of whether the prediction is correct, while only the dual-explanation condition improves users' discrimination between correct and incorrect outputs.

Significance. If the empirical results prove robust, the work would make a useful contribution to HCI and AI-trust research by distinguishing persuasive from informative explanations and by identifying dual explanations as a promising mitigation for false trust. The between-subjects design and focus on a no-verification scenario are reasonable choices for isolating the claimed effects. The idea of testing contrastive explanations is a clear strength. However, the absence of any methodological or statistical detail in the provided abstract prevents confirmation that the data actually support the central distinction.

major comments (1)
  1. [Abstract] The abstract presents the key directional findings of the between-subjects study but supplies none of the information required to evaluate them: participant count, task descriptions, construction and presentation of correct versus incorrect answers, exact dependent measures, statistical tests, effect sizes, or power analysis. These omissions are load-bearing for the central claim that reasoning traces and post-hoc explanations are merely persuasive while dual explanations are informative.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review, positive assessment of the work's potential contribution to HCI and AI-trust research, and for identifying the need for greater transparency in the abstract. We address the major comment below and commit to revisions that strengthen the manuscript without altering its core claims or design.

Point-by-point responses
  1. Referee: [Abstract] The abstract presents the key directional findings of the between-subjects study but supplies none of the information required to evaluate them: participant count, task descriptions, construction and presentation of correct versus incorrect answers, exact dependent measures, statistical tests, effect sizes, or power analysis. These omissions are load-bearing for the central claim that reasoning traces and post-hoc explanations are merely persuasive while dual explanations are informative.

    Authors: We agree that the abstract, constrained by length, currently omits key methodological and statistical details that would allow readers to more readily evaluate the central distinction between persuasive and informative explanations. The full manuscript (Methods, Results, and supplementary materials) contains these elements, including the between-subjects design, participant recruitment, task construction (simulated no-verification setting with correct/incorrect LLM outputs), dependent measures (acceptance rates and discrimination accuracy), statistical analyses, effect sizes, and power considerations. To directly address the concern, we will revise the abstract to concisely incorporate the participant count, a brief description of the tasks and conditions, the primary dependent measures, main statistical tests and results (including effect sizes), and a note on the study power. This revision will preserve the abstract's focus on findings while providing the load-bearing information the referee correctly identifies as necessary. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical user study with no derivations or self-referential definitions

full rationale

The paper reports results from a between-subjects user study on explanation effects in a simulated no-verification setting. The abstract contains no equations, fitted parameters, predictions derived from inputs, self-citations used as load-bearing premises, or any claimed derivation chain. Findings are presented as direct experimental observations (e.g., reasoning traces increase acceptance regardless of correctness) rather than reductions to prior definitions or self-referential logic. This matches the default case of a self-contained empirical protocol with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the user study methodology and the interpretation of behavioral acceptance rates as indicators of false trust engendered by explanations.

axioms (1)
  • domain assumption The simulated no-verification setting and between-subjects design validly capture how explanations affect user trust judgments in real critical tasks.
    This assumption is required to interpret increased acceptance of incorrect answers as false trust and to generalize the findings.

pith-pipeline@v0.9.0 · 5523 in / 1466 out tokens · 66502 ms · 2026-05-12T03:18:28.568972+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 14 internal anchors
