When the Chain Breaks: Interactive Diagnosis of LLM Chain-of-Thought Reasoning Errors
Pith reviewed 2026-05-15 06:55 UTC · model grok-4.3
The pith
ReasonDiag uses error detection and interactive diagrams to help users identify and trace logical and factual mistakes in LLM chain-of-thought outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReasonDiag provides an integrated arc diagram to show reasoning-step distributions and error-propagation patterns, and a hierarchical node-link diagram to visualize high-level reasoning flows and premise dependencies, built on an error detection pipeline that combines external fact-checking with symbolic formal logical validation to identify errors at the step level.
What carries the argument
The error detection pipeline combining external fact-checking with symbolic formal logical validation to flag errors at each reasoning step, which then drives the arc diagram for propagation patterns and the node-link diagram for premise dependencies.
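To make the "symbolic formal logical validation" half of the pipeline concrete, the sketch below checks whether a step's premises entail its conclusion. It is only an illustration under stated assumptions: a brute-force truth table over propositional variables stands in for a real solver, and the names `entails`, `valid_step`, and `invalid_step`, as well as the rain/wet example, are hypothetical and not taken from the paper.

```python
from itertools import product


def entails(premises, conclusion, variables):
    """Return True if every truth assignment that satisfies all premises
    also satisfies the conclusion (brute-force propositional check)."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: premises hold, conclusion fails
    return True


# Hypothetical premise shared by both steps: "if it rains, the ground is wet".
def rain_implies_wet(env):
    return (not env["rain"]) or env["wet"]


# Modus ponens: rain -> wet, rain, therefore wet (a valid step).
valid_step = entails(
    [rain_implies_wet, lambda env: env["rain"]],
    lambda env: env["wet"],
    ["rain", "wet"],
)

# Affirming the consequent: rain -> wet, wet, therefore rain (a logical error).
invalid_step = entails(
    [rain_implies_wet, lambda env: env["wet"]],
    lambda env: env["rain"],
    ["rain", "wet"],
)

print(valid_step, invalid_step)
```

A step whose check fails would be flagged as a logical error; factual errors would need the separate external fact-checking path, which this sketch does not model.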
If this is right
- Users can interpret lengthy CoT traces more efficiently by seeing error locations and spread.
- Erroneous steps and their root causes become identifiable through combined visual and detection support.
- Trust calibration improves when users can trace how specific flaws affect final outputs.
- Both logical inconsistencies and factual inaccuracies receive unified diagnosis in one interface.
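The root-cause and error-spread ideas in the list above can be sketched as a walk over premise dependencies. This is a minimal sketch, assuming the dependencies form a directed acyclic graph and that the detector outputs a set of flagged step indices; the function name `trace_errors` and the example graph are hypothetical, and the paper may compute propagation differently.

```python
from collections import deque


def trace_errors(premises_of, flagged):
    """premises_of maps each step to the steps it uses as premises;
    flagged is the set of steps the detector marked as erroneous.
    Returns (root_causes, downstream_steps)."""
    # Invert the premise map: for each step, which later steps depend on it?
    dependents = {}
    for step, premises in premises_of.items():
        for premise in premises:
            dependents.setdefault(premise, []).append(step)

    # Breadth-first walk from every flagged step to collect what it taints.
    downstream = set()
    queue = deque(flagged)
    while queue:
        step = queue.popleft()
        for child in dependents.get(step, []):
            if child not in downstream:
                downstream.add(child)
                queue.append(child)

    # A root cause is a flagged step that no other flagged step leads to.
    root_causes = set(flagged) - downstream
    return root_causes, downstream


# Hypothetical trace: step 0 is the question; steps 2 and 4 were flagged.
premises_of = {0: [], 1: [0], 2: [1], 3: [0], 4: [2, 3], 5: [4]}
root_causes, downstream = trace_errors(premises_of, {2, 4})
print(root_causes, downstream)
```

In this example step 4 is flagged but depends on flagged step 2, so only step 2 is reported as a root cause, while steps 4 and 5 are reported as affected.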
Where Pith is reading between the lines
- Embedding ReasonDiag-style diagnostics into everyday LLM chat interfaces could reduce reliance on undetected flawed reasoning.
- The pipeline approach might scale to additional error types such as arithmetic or causal mistakes not covered in the current validation.
- Deployment in high-stakes domains like medical or financial advice would test whether visible root-cause tracing changes user decisions.
Load-bearing premise
The error detection pipeline accurately identifies logical and factual errors at the step level without substantial false positives or missed issues that would mislead the visualizations and user diagnosis.
What would settle it
A controlled study in which participants diagnose the same set of CoT traces with and without ReasonDiag, with their accuracy at locating injected errors measured in each condition.
Original abstract
Current Large Language Models (LLMs), especially Large Reasoning Models, can generate Chain-of-Thought (CoT) reasoning traces to illustrate how they produce final outputs, thereby facilitating trust calibration for users. However, these CoT reasoning traces are usually lengthy and tedious, and can contain various issues, such as logical and factual errors, which make it difficult for users to interpret the reasoning traces efficiently and accurately. To address these challenges, we develop an error detection pipeline that combines external fact-checking with symbolic formal logical validation to identify errors at the step level. Building on this pipeline, we propose ReasonDiag, an interactive visualization system for diagnosing CoT reasoning traces. ReasonDiag provides 1) an integrated arc diagram to show reasoning-step distributions and error-propagation patterns, and 2) a hierarchical node-link diagram to visualize high-level reasoning flows and premise dependencies. We evaluate ReasonDiag through a technical evaluation for the error detection pipeline, two case studies, and user interviews with 16 participants. The results indicate that ReasonDiag helps users effectively understand CoT reasoning traces, identify erroneous steps, and determine their root causes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ReasonDiag, an interactive visualization system for diagnosing errors in LLM Chain-of-Thought (CoT) reasoning traces. It features an error detection pipeline combining external fact-checking with symbolic formal logical validation to identify step-level errors. The visualizations include an integrated arc diagram showing reasoning-step distributions and error-propagation patterns, plus a hierarchical node-link diagram for high-level reasoning flows and premise dependencies. The system is evaluated via a technical evaluation of the pipeline, two case studies, and interviews with 16 participants; results indicate that ReasonDiag helps users understand CoT traces, identify erroneous steps, and determine root causes.
Significance. If the error detection pipeline is shown to be reliable, ReasonDiag would offer a meaningful advance in HCI tools for LLM interpretability by supporting interactive diagnosis of lengthy and flawed reasoning traces, potentially improving user trust calibration in applications relying on CoT outputs.
Major comments (1)
- [Abstract / Technical Evaluation] Abstract and technical evaluation section: The manuscript states that the technical evaluation reports positive outcomes for the error detection pipeline, yet provides no quantitative metrics such as precision, recall, false-positive rate, or validation against ground-truth error labels on realistic CoT traces. This is load-bearing for the central claim, as the arc diagram, node-link views, case studies, and user interviews all depend on the pipeline producing accurate step-level error labels; unquantified accuracy leaves the reported user benefits uninterpretable.
Minor comments (1)
- [Abstract] The abstract would benefit from briefly noting the scale of the technical evaluation (e.g., number of traces tested) to put the claim of positive outcomes in context.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We agree that the absence of quantitative metrics for the error detection pipeline weakens the interpretability of the user-facing results and will revise the manuscript to address this directly.
Point-by-point responses
- Referee: [Abstract / Technical Evaluation] Abstract and technical evaluation section: The manuscript states that the technical evaluation reports positive outcomes for the error detection pipeline, yet provides no quantitative metrics such as precision, recall, false-positive rate, or validation against ground-truth error labels on realistic CoT traces. This is load-bearing for the central claim, as the arc diagram, node-link views, case studies, and user interviews all depend on the pipeline producing accurate step-level error labels; unquantified accuracy leaves the reported user benefits uninterpretable.
  Authors: We acknowledge that the current manuscript reports only that the technical evaluation yielded 'positive outcomes', without supplying precision, recall, F1, or false-positive rates and without describing validation against ground-truth labels on realistic CoT traces. This omission limits the strength of the claims about downstream user benefits. In the revised version we will add a dedicated quantitative evaluation subsection that reports precision, recall, and F1 scores computed on a held-out set of 200 CoT traces manually annotated for step-level errors by two independent raters, together with their inter-rater agreement. We will also report false-positive rates and include a brief error analysis of the remaining failure cases. These additions will be referenced from the abstract and will allow readers to assess the pipeline's reliability before interpreting the visualizations and interview results.
  Revision: yes
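The metrics named in this exchange can be computed from step-level labels as follows. This is a sketch under stated assumptions: labels are binary (1 = erroneous step, 0 = correct step), Cohen's kappa is used as the inter-rater agreement statistic (the rebuttal does not say which statistic the authors will report), and the function names and toy label lists are hypothetical, not results from the paper.

```python
def detection_metrics(predicted, gold):
    """Precision, recall, F1, and false-positive rate for binary step labels."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    tn = sum(1 for p, g in pairs if not p and not g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning binary labels to the same steps."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    rate_a = sum(rater_a) / n  # share of steps rater A marked erroneous
    rate_b = sum(rater_b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for eight steps, for illustration only.
gold = [1, 0, 0, 1, 1, 0, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]
metrics = detection_metrics(predicted, gold)
kappa = cohen_kappa([1, 0, 0, 1, 1, 0, 0, 0], [1, 0, 0, 1, 0, 0, 0, 0])
print(metrics, kappa)
```

Reporting the false-positive rate alongside precision and recall matters here because a spurious flag would be drawn into the diagrams just like a real error.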
Circularity Check
No significant circularity
Full rationale
The paper introduces an error detection pipeline (external fact-checking plus symbolic validation) and the ReasonDiag visualization system, then evaluates them via a technical evaluation, two case studies, and user interviews with 16 participants. The central claims about helping users understand CoT traces and identify root causes rest on these independent external assessments rather than any self-referential equations, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its own inputs by construction. No mathematical derivations or load-bearing self-citations appear in the provided text that would trigger the enumerated circularity patterns.