
arxiv: 2603.21286 · v2 · submitted 2026-03-22 · 💻 cs.HC

When the Chain Breaks: Interactive Diagnosis of LLM Chain-of-Thought Reasoning Errors

Pith reviewed 2026-05-15 06:55 UTC · model grok-4.3

classification 💻 cs.HC
keywords chain-of-thought · LLM reasoning · error detection · interactive visualization · reasoning diagnosis · human-AI interaction · visual analytics · trust calibration

The pith

ReasonDiag uses error detection and interactive diagrams to help users identify and trace logical and factual mistakes in LLM chain-of-thought outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models generate long chain-of-thought traces that frequently contain logical and factual errors, which users struggle to spot and understand. The paper builds an error detection pipeline that applies external fact-checking plus symbolic formal validation to flag problems at the individual step level. ReasonDiag then presents these results through an arc diagram that maps step distributions and error propagation alongside a hierarchical node-link diagram that shows reasoning flows and premise dependencies. Technical evaluation, case studies, and interviews with 16 participants indicate the system lets users understand traces, locate erroneous steps, and determine root causes. This matters because clearer visibility into flawed reasoning supports better calibration of trust in model outputs.

Core claim

ReasonDiag provides an integrated arc diagram to show reasoning-step distributions and error-propagation patterns, and a hierarchical node-link diagram to visualize high-level reasoning flows and premise dependencies, built on an error detection pipeline that combines external fact-checking with symbolic formal logical validation to identify errors at the step level.
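
The error-propagation idea can be made concrete with a small sketch. It assumes a hypothetical data format (a dictionary from each step index to the indices of its premise steps) and is not the authors' implementation; it only shows how a flagged step can "pollute" every step that depends on it, which is the pattern the arc diagram is described as surfacing.

```python
from collections import deque

def propagate_errors(premises, flagged):
    """Return every step that is flagged or that depends, directly or
    transitively, on a flagged step.

    premises: dict mapping step index -> list of premise step indices
              (hypothetical format, chosen for illustration)
    flagged:  set of step indices marked as erroneous by a detector
    """
    # Invert the premise map: for each step, which steps use it as a premise?
    dependents = {}
    for step, prems in premises.items():
        for p in prems:
            dependents.setdefault(p, []).append(step)

    # Breadth-first walk from the flagged steps along the dependents edges.
    polluted = set(flagged)
    queue = deque(flagged)
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in polluted:
                polluted.add(child)
                queue.append(child)
    return polluted

# Toy trace: step 0 is the question and step 5 is the final answer.
premises = {1: [0], 2: [0], 3: [1], 4: [2], 5: [3, 4]}
polluted = propagate_errors(premises, flagged={2})
print(sorted(polluted))  # [2, 4, 5]
```

In this toy trace the error at step 2 reaches the final answer through step 4, while steps 1 and 3 stay clean.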

What carries the argument

The error detection pipeline combining external fact-checking with symbolic formal logical validation to flag errors at each reasoning step, which then drives the arc diagram for propagation patterns and the node-link diagram for premise dependencies.
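
To illustrate what a step-level logical check does, here is a minimal sketch under stated assumptions. The paper's Figure 1 caption says the pipeline uses the Z3 solver; this example swaps in brute-force truth-table enumeration so it runs with the standard library alone. The encoding of premises as Python callables and the variable names `rain` and `wet` are hypothetical.

```python
from itertools import product

def entails(variables, premises, conclusion):
    """Return True if every truth assignment that satisfies all premises
    also satisfies the conclusion (propositional entailment)."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: premises hold, conclusion fails
    return True

variables = ["rain", "wet"]

# Valid step (modus ponens): rain -> wet, rain, therefore wet.
premises_valid = [lambda e: (not e["rain"]) or e["wet"], lambda e: e["rain"]]
valid_step = entails(variables, premises_valid, lambda e: e["wet"])

# Invalid step (affirming the consequent): rain -> wet, wet, therefore rain.
premises_invalid = [lambda e: (not e["rain"]) or e["wet"], lambda e: e["wet"]]
invalid_step = entails(variables, premises_invalid, lambda e: e["rain"])

print(valid_step, invalid_step)  # True False
```

A step whose conclusion is not entailed by its premises would be a candidate logical error. Enumeration is exponential in the number of variables, which is one reason a real pipeline would hand the constraints to a solver instead.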

If this is right

  • Users can interpret lengthy CoT traces more efficiently by seeing error locations and spread.
  • Erroneous steps and their root causes become identifiable through combined visual and detection support.
  • Trust calibration improves when users can trace how specific flaws affect final outputs.
  • Both logical inconsistencies and factual inaccuracies receive unified diagnosis in one interface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding ReasonDiag-style diagnostics into everyday LLM chat interfaces could reduce reliance on undetected flawed reasoning.
  • The pipeline approach might scale to additional error types such as arithmetic or causal mistakes not covered in the current validation.
  • Deployment in high-stakes domains like medical or financial advice would test whether visible root-cause tracing changes user decisions.

Load-bearing premise

The error detection pipeline accurately identifies logical and factual errors at the step level without substantial false positives or missed issues that would mislead the visualizations and user diagnosis.

What would settle it

A controlled study in which participants diagnose the same set of CoT traces with and without ReasonDiag and their accuracy at locating injected errors is measured.

Figures

Figures reproduced from arXiv: 2603.21286 by Chenxi Zhang, Niruthikka Sritharan, Shiwei Chen, Xiaolin Wen, Xingbo Wang, Yong Wang.

Figure 1: The Error Detection Pipeline comprises three stages: (1) Premise Tree Generation structures the raw CoT by classifying step roles and mapping dependencies; (2) Factual Error Detection verifies checkable claims using retrieval-augmented external evidence; and (3) Logical Error Detection translates natural language steps into symbolic constraints for formal consistency checks via the Z3 solver. The CoT reaso…
Figure 2: ReasonDiag interface: (A) The Overview displays ordinal reasoning steps (A1) along a horizontal axis (A4) and highlights uncertain regions (A7) and error propagation (A6). Users can adjust the shown steps (A5) using two filters (A2, A3). (B) The Section View presents a hierarchical summary through textual section labels (B1) and colored step markers (B2), allowing users to click on erroneous steps to reve…
Figure 3: With ReasonDiag, a user diagnoses errors and reasoning patterns in a mathematical CoT. (A) Problem statement. (B) Overview of step types and error propagation, with (B1) highlighting the “polluted” final answer and (B2) a retroactive reasoning pattern. (C) Structured reasoning trace, where (C1) shows the premise–conclusion chain to the answer and the red error path, and (C2) describes the self-correction p…
Figure 4: With ReasonDiag, a user diagnoses illusory truth and logical gaps. (A) Structured reasoning trace where only a sparse subset of steps contributes to the final answer; (A1) marks the key factual error and (A2) a disconnected high-importance section. (B) Shows the model’s shift from tentative to definitive statements, illustrating the illusory truth effect. (C) Highlights unused player statistics. (D) Show…
Figure 5: The user interview questionnaire results. Q1-Q13 are closed-ended and rated on a 7-point Likert scale. Q14 and Q15 are open-ended questions to collect participants’ feedback. The detailed scores of Q1-Q10 are shown in a stacked bar chart.
Figure 1: Example of Claude presenting a summarized reasoning trace with multiple steps
Figure 2: Example of ChatGPT presenting a summarized reasoning trace with brief explanations
Original abstract

Current Large Language Models (LLMs), especially Large Reasoning Models, can generate Chain-of-Thought (CoT) reasoning traces to illustrate how they produce final outputs, thereby facilitating trust calibration for users. However, these CoT reasoning traces are usually lengthy and tedious, and can contain various issues, such as logical and factual errors, which make it difficult for users to interpret the reasoning traces efficiently and accurately. To address these challenges, we develop an error detection pipeline that combines external fact-checking with symbolic formal logical validation to identify errors at the step level. Building on this pipeline, we propose ReasonDiag, an interactive visualization system for diagnosing CoT reasoning traces. ReasonDiag provides 1) an integrated arc diagram to show reasoning-step distributions and error-propagation patterns, and 2) a hierarchical node-link diagram to visualize high-level reasoning flows and premise dependencies. We evaluate ReasonDiag through a technical evaluation for the error detection pipeline, two case studies, and user interviews with 16 participants. The results indicate that ReasonDiag helps users effectively understand CoT reasoning traces, identify erroneous steps, and determine their root causes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces ReasonDiag, an interactive visualization system for diagnosing errors in LLM Chain-of-Thought (CoT) reasoning traces. It features an error detection pipeline combining external fact-checking with symbolic formal logical validation to identify step-level errors. The visualizations include an integrated arc diagram showing reasoning-step distributions and error-propagation patterns, plus a hierarchical node-link diagram for high-level reasoning flows and premise dependencies. The system is evaluated via a technical evaluation of the pipeline, two case studies, and interviews with 16 participants; results indicate that ReasonDiag helps users understand CoT traces, identify erroneous steps, and determine root causes.

Significance. If the error detection pipeline is shown to be reliable, ReasonDiag would offer a meaningful advance in HCI tools for LLM interpretability by supporting interactive diagnosis of lengthy and flawed reasoning traces, potentially improving user trust calibration in applications relying on CoT outputs.

major comments (1)
  1. [Abstract / Technical Evaluation] Abstract and technical evaluation section: The manuscript states that the technical evaluation reports positive outcomes for the error detection pipeline, yet provides no quantitative metrics such as precision, recall, false-positive rate, or validation against ground-truth error labels on realistic CoT traces. This is load-bearing for the central claim, as the arc diagram, node-link views, case studies, and user interviews all depend on the pipeline producing accurate step-level error labels; unquantified accuracy leaves the reported user benefits uninterpretable.
minor comments (1)
  1. [Abstract] The abstract would benefit from briefly noting the scale of the technical evaluation (e.g., number of traces tested) to better contextualize the positive outcomes claim.
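
The metrics the referee asks for can be computed from step-level labels as sketched below. This is a minimal illustration, not the paper's evaluation code; the label lists are made-up toy data, and `detection_metrics` is a hypothetical helper name.

```python
def detection_metrics(gold, predicted):
    """Compute precision, recall, F1, and false-positive rate.

    gold, predicted: equal-length lists of bools, True = step is erroneous.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)

    # Guard against empty denominators by reporting 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# Toy labels for an 8-step trace (illustrative only, not from the paper).
gold = [True, False, False, True, False, True, False, False]
predicted = [True, False, True, True, False, False, False, False]
metrics = detection_metrics(gold, predicted)
print(metrics)
```

On this toy trace the detector finds two of three true errors and raises one false alarm, so precision and recall are both 2/3 and the false-positive rate is 1/5.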

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We agree that the absence of quantitative metrics for the error detection pipeline weakens the interpretability of the user-facing results and will revise the manuscript to address this directly.

Point-by-point responses
  1. Referee: [Abstract / Technical Evaluation] Abstract and technical evaluation section: The manuscript states that the technical evaluation reports positive outcomes for the error detection pipeline, yet provides no quantitative metrics such as precision, recall, false-positive rate, or validation against ground-truth error labels on realistic CoT traces. This is load-bearing for the central claim, as the arc diagram, node-link views, case studies, and user interviews all depend on the pipeline producing accurate step-level error labels; unquantified accuracy leaves the reported user benefits uninterpretable.

    Authors: We acknowledge that the current manuscript reports only that the technical evaluation yielded 'positive outcomes' without supplying precision, recall, F1, or false-positive rates, nor does it describe validation against ground-truth labels on realistic CoT traces. This omission does limit the strength of claims about downstream user benefits. In the revised version we will add a dedicated quantitative evaluation subsection that reports precision, recall, and F1 scores computed on a held-out set of 200 CoT traces manually annotated for step-level errors by two independent raters (with inter-rater agreement). We will also report false-positive rates and include a brief error analysis of the remaining failure cases. These additions will be referenced from the abstract and will allow readers to assess the pipeline's reliability before interpreting the visualizations and interview results. revision: yes
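
The rebuttal does not say which inter-rater agreement statistic would be reported. As one common choice for two raters, Cohen's kappa can be computed as sketched here; the use of kappa, the label names, and the annotations are all assumptions for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy step-level annotations for ten steps (illustrative only).
rater_a = ["error", "ok", "ok", "error", "ok", "ok", "ok", "error", "ok", "ok"]
rater_b = ["error", "ok", "ok", "ok", "ok", "ok", "ok", "error", "ok", "error"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))  # 0.524
```

Here the raters agree on 8 of 10 steps, but because "ok" dominates both label sets, chance agreement is 0.58 and kappa is well below the raw 0.8 agreement rate.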

Circularity Check

0 steps flagged

No significant circularity


The paper introduces an error detection pipeline (external fact-checking plus symbolic validation) and the ReasonDiag visualization system, then evaluates them via a technical evaluation, two case studies, and user interviews with 16 participants. The central claims about helping users understand CoT traces and identify root causes rest on these independent external assessments rather than any self-referential equations, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its own inputs by construction. No mathematical derivations or load-bearing self-citations appear in the provided text that would trigger the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied HCI system paper with no mathematical derivations, fitted parameters, or new postulated entities; the central claim rests on the design of the error detection pipeline and visualizations plus user feedback rather than axioms or free parameters.

pith-pipeline@v0.9.0 · 5511 in / 1198 out tokens · 68697 ms · 2026-05-15T06:55:56.604989+00:00 · methodology

