When the Chain Breaks: Interactive Diagnosis of LLM Chain-of-Thought Reasoning Errors

Chenxi Zhang; Niruthikka Sritharan; Shiwei Chen; Xiaolin Wen; Xingbo Wang; Yong Wang

arxiv: 2603.21286 · v2 · submitted 2026-03-22 · 💻 cs.HC


Pith reviewed 2026-05-15 06:55 UTC · model grok-4.3

classification 💻 cs.HC

keywords chain-of-thought · LLM reasoning · error detection · interactive visualization · reasoning diagnosis · human-AI interaction · visual analytics · trust calibration


The pith

ReasonDiag uses error detection and interactive diagrams to help users identify and trace logical and factual mistakes in LLM chain-of-thought outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models generate long chain-of-thought traces that frequently contain logical and factual errors, which users struggle to spot and understand. The paper builds an error detection pipeline that applies external fact-checking plus symbolic formal validation to flag problems at the individual step level. ReasonDiag then presents these results through an arc diagram that maps step distributions and error propagation alongside a hierarchical node-link diagram that shows reasoning flows and premise dependencies. Technical evaluation, case studies, and interviews with 16 participants indicate the system lets users understand traces, locate erroneous steps, and determine root causes. This matters because clearer visibility into flawed reasoning supports better calibration of trust in model outputs.

Core claim

ReasonDiag provides an integrated arc diagram to show reasoning-step distributions and error-propagation patterns, and a hierarchical node-link diagram to visualize high-level reasoning flows and premise dependencies, built on an error detection pipeline that combines external fact-checking with symbolic formal logical validation to identify errors at the step level.
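As an illustration of what step-level "symbolic formal logical validation" can mean, the sketch below checks whether a step's conclusion follows from its premises. It is not the authors' implementation: the Figure 1 caption says the pipeline uses the Z3 solver, whereas this sketch substitutes a brute-force truth-table check over propositional variables so that it runs with the standard library alone. The function name `entails` and the toy formulas are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List

Assignment = Dict[str, bool]
Formula = Callable[[Assignment], bool]


def entails(variables: List[str], premises: List[Formula], conclusion: Formula) -> bool:
    """Return True if every assignment satisfying all premises also satisfies the conclusion."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample found: the step does not follow
    return True


# Hypothetical toy step: premises "rain -> wet" and "rain"; conclusion "wet".
variables = ["rain", "wet"]
implication: Formula = lambda e: (not e["rain"]) or e["wet"]

valid_step = entails(variables, [implication, lambda e: e["rain"]], lambda e: e["wet"])

# Affirming the consequent: from "rain -> wet" and "wet", conclude "rain".
invalid_step = entails(variables, [implication, lambda e: e["wet"]], lambda e: e["rain"])

print(valid_step, invalid_step)
```

A step whose check fails would be a candidate logical error. A solver-backed version would typically ask whether the premises together with the negated conclusion are satisfiable, which is the same question posed differently.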

What carries the argument

The error detection pipeline combining external fact-checking with symbolic formal logical validation to flag errors at each reasoning step, which then drives the arc diagram for propagation patterns and the node-link diagram for premise dependencies.
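The link between step-level flags and the propagation patterns shown in the diagrams can be sketched as a graph traversal. This is a minimal sketch under assumed data structures, not the system's actual code: `premises` (a map from each step to the steps it relies on), `flagged`, and `polluted_steps` are hypothetical names, and the trace is a toy example.

```python
from typing import Dict, List, Set


def polluted_steps(premises: Dict[int, List[int]], flagged: Set[int]) -> Set[int]:
    """Return steps that are flagged or depend, directly or transitively, on a flagged step."""
    # Invert the premise map: for each step, which steps use it as a premise?
    dependents: Dict[int, List[int]] = {step: [] for step in premises}
    for step, used in premises.items():
        for p in used:
            dependents.setdefault(p, []).append(step)

    polluted = set(flagged)
    stack = list(flagged)
    while stack:
        current = stack.pop()
        for nxt in dependents.get(current, []):
            if nxt not in polluted:
                polluted.add(nxt)
                stack.append(nxt)
    return polluted


# Hypothetical trace: step -> premises it relies on (step 0 is the question).
premises = {0: [], 1: [0], 2: [0], 3: [1], 4: [2, 3], 5: [2]}
result = polluted_steps(premises, flagged={1})
print(sorted(result))  # [1, 3, 4]
```

Here the error flagged at step 1 reaches step 4 through step 3, while steps 2 and 5 stay clean because they never rely on step 1.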

If this is right

Users can interpret lengthy CoT traces more efficiently by seeing error locations and spread.
Erroneous steps and their root causes become identifiable through combined visual and detection support.
Trust calibration improves when users can trace how specific flaws affect final outputs.
Both logical inconsistencies and factual inaccuracies receive unified diagnosis in one interface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Embedding ReasonDiag-style diagnostics into everyday LLM chat interfaces could reduce reliance on undetected flawed reasoning.
The pipeline approach might scale to additional error types such as arithmetic or causal mistakes not covered in the current validation.
Deployment in high-stakes domains like medical or financial advice would test whether visible root-cause tracing changes user decisions.

Load-bearing premise

The error detection pipeline accurately identifies logical and factual errors at the step level without substantial false positives or missed issues that would mislead the visualizations and user diagnosis.

What would settle it

A controlled study in which participants diagnose the same set of CoT traces with and without ReasonDiag and their accuracy at locating injected errors is measured.

Figures

Figures reproduced from arXiv: 2603.21286 by Chenxi Zhang, Niruthikka Sritharan, Shiwei Chen, Xiaolin Wen, Xingbo Wang, Yong Wang.

**Figure 1.** The Error Detection Pipeline comprises three stages: (1) Premise Tree Generation structures the raw CoT by classifying step roles and mapping dependencies; (2) Factual Error Detection verifies checkable claims using retrieval-augmented external evidence; and (3) Logical Error Detection translates natural language steps into symbolic constraints for formal consistency checks via the Z3 solver. The CoT reaso…

**Figure 2.** ReasonDiag interface: (A) The Overview displays ordinal reasoning steps (A1) along a horizontal axis (A4) and highlights uncertain regions (A7) and error propagation (A6). Users can adjust the shown steps (A5) using two filters (A2, A3). (B) The Section View presents a hierarchical summary through textual section labels (B1) and colored step markers (B2), allowing users to click on erroneous steps to reve…

**Figure 3.** With ReasonDiag, a user diagnoses errors and reasoning patterns in a mathematical CoT. (A) Problem statement. (B) Overview of step types and error propagation, with (B1) highlighting the “polluted” final answer and (B2) a retroactive reasoning pattern. (C) Structured reasoning trace, where (C1) shows the premise–conclusion chain to the answer and the red error path, and (C2) describes the self-correction p…

**Figure 4.** With ReasonDiag, a user diagnoses illusory truth and logical gaps. (A) Structured reasoning trace where only a sparse subset of steps contributes to the final answer; (A1) marks the key factual error and (A2) a disconnected high-importance section. (B) Shows the model’s shift from tentative to definitive statements, illustrating the illusory truth effect. (C) Highlights unused player statistics. (D) Show…

**Figure 5.** The user interview questionnaire results. Q1-Q13 are closed-ended and rated on a 7-point Likert scale. Q14 and Q15 are open-ended questions to collect participants’ feedback. The detailed scores of Q1-Q10 are shown in a stacked bar chart.

**Figure 1.** Example of Claude presenting a summarized reasoning trace with multiple steps

**Figure 2.** Example of ChatGPT presenting a summarized reasoning trace with brief explanations

Original abstract

Current Large Language Models (LLMs), especially Large Reasoning Models, can generate Chain-of-Thought (CoT) reasoning traces to illustrate how they produce final outputs, thereby facilitating trust calibration for users. However, these CoT reasoning traces are usually lengthy and tedious, and can contain various issues, such as logical and factual errors, which make it difficult for users to interpret the reasoning traces efficiently and accurately. To address these challenges, we develop an error detection pipeline that combines external fact-checking with symbolic formal logical validation to identify errors at the step level. Building on this pipeline, we propose ReasonDiag, an interactive visualization system for diagnosing CoT reasoning traces. ReasonDiag provides 1) an integrated arc diagram to show reasoning-step distributions and error-propagation patterns, and 2) a hierarchical node-link diagram to visualize high-level reasoning flows and premise dependencies. We evaluate ReasonDiag through a technical evaluation for the error detection pipeline, two case studies, and user interviews with 16 participants. The results indicate that ReasonDiag helps users effectively understand CoT reasoning traces, identify erroneous steps, and determine their root causes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReasonDiag integrates a fact-check plus symbolic error pipeline with arc diagrams for propagation and hierarchical node-link views for flows, but the pipeline's accuracy metrics are missing from the reported evaluation.


The paper's core offering is ReasonDiag, a system that first runs an error detection pipeline on LLM Chain-of-Thought traces and then renders the results in two tailored diagrams so users can spot bad steps and trace their origins more quickly than reading raw text. The pipeline combines external fact-checking with symbolic formal logical validation to flag issues at the step level. The visualizations are an arc diagram that shows step distributions and error spread, plus a hierarchical node-link diagram that lays out high-level reasoning flows and premise dependencies. The authors describe a technical evaluation of the pipeline, two case studies, and interviews with 16 participants, with the feedback indicating that users could understand the traces, locate erroneous steps, and identify root causes more effectively.

This integrated approach is new enough in the HCI and interpretability space to give practitioners a concrete starting point for building debugging tools. The diagrams are a reasonable match for the propagation and dependency patterns that matter in long CoT outputs.

The main gap is the lack of numbers on how well the pipeline actually works. The abstract mentions a technical evaluation but supplies no precision, recall, false-positive rates, or ground-truth agreement scores. Without those, it is hard to know whether the positive user outcomes come from accurate labels or from the interface itself. If the pipeline misflags steps, the diagrams will display misleading patterns and the interview results lose their force. That is a fixable but load-bearing issue for the claims.

The work is aimed at HCI researchers and tool builders who need practical ways to make LLM reasoning more inspectable. A reader looking for visualization patterns or pipeline ideas for AI explanations will find usable details here. I would send it to peer review. The system is motivated by real user problems and the visualizations are grounded, even though the detection accuracy needs tighter quantification before the results can be taken at face value.

Referee Report

1 major / 1 minor

Summary. The paper introduces ReasonDiag, an interactive visualization system for diagnosing errors in LLM Chain-of-Thought (CoT) reasoning traces. It features an error detection pipeline combining external fact-checking with symbolic formal logical validation to identify step-level errors. The visualizations include an integrated arc diagram showing reasoning-step distributions and error-propagation patterns, plus a hierarchical node-link diagram for high-level reasoning flows and premise dependencies. The system is evaluated via a technical evaluation of the pipeline, two case studies, and interviews with 16 participants; results indicate that ReasonDiag helps users understand CoT traces, identify erroneous steps, and determine root causes.

Significance. If the error detection pipeline is shown to be reliable, ReasonDiag would offer a meaningful advance in HCI tools for LLM interpretability by supporting interactive diagnosis of lengthy and flawed reasoning traces, potentially improving user trust calibration in applications relying on CoT outputs.

major comments (1)

[Abstract / Technical Evaluation] Abstract and technical evaluation section: The manuscript states that the technical evaluation reports positive outcomes for the error detection pipeline, yet provides no quantitative metrics such as precision, recall, false-positive rate, or validation against ground-truth error labels on realistic CoT traces. This is load-bearing for the central claim, as the arc diagram, node-link views, case studies, and user interviews all depend on the pipeline producing accurate step-level error labels; unquantified accuracy leaves the reported user benefits uninterpretable.
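The metrics requested here can be computed directly from step-level labels. The sketch below is illustrative only: the label lists are made-up toy data, not results from the paper, and `step_level_metrics` is a hypothetical helper. Each position is one reasoning step, with `True` meaning "erroneous".

```python
from typing import Dict, Sequence


def step_level_metrics(truth: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Compare predicted error flags against ground-truth flags, one entry per step."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    tp = sum(t and p for t, p in zip(truth, predicted))
    fp = sum((not t) and p for t, p in zip(truth, predicted))
    fn = sum(t and (not p) for t, p in zip(truth, predicted))
    tn = sum((not t) and (not p) for t, p in zip(truth, predicted))

    def ratio(num: float, den: float) -> float:
        # Report 0.0 when a metric is undefined (empty denominator).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Toy data (hypothetical): 8 steps, 3 truly erroneous, 3 flagged by the detector.
truth = [False, True, False, False, True, False, True, False]
predicted = [False, True, True, False, False, False, True, False]
metrics = step_level_metrics(truth, predicted)
print(metrics)
```

Reporting the false-positive rate alongside precision matters for a tool like this, because correct steps usually far outnumber erroneous ones in a long trace and even a low rate can produce many spurious highlights.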

minor comments (1)

[Abstract] The abstract would benefit from briefly noting the scale of the technical evaluation (e.g., number of traces tested) to better contextualize the positive outcomes claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We agree that the absence of quantitative metrics for the error detection pipeline weakens the interpretability of the user-facing results and will revise the manuscript to address this directly.


Referee: [Abstract / Technical Evaluation] Abstract and technical evaluation section: The manuscript states that the technical evaluation reports positive outcomes for the error detection pipeline, yet provides no quantitative metrics such as precision, recall, false-positive rate, or validation against ground-truth error labels on realistic CoT traces. This is load-bearing for the central claim, as the arc diagram, node-link views, case studies, and user interviews all depend on the pipeline producing accurate step-level error labels; unquantified accuracy leaves the reported user benefits uninterpretable.

Authors: We acknowledge that the current manuscript reports only that the technical evaluation yielded 'positive outcomes' without supplying precision, recall, F1, or false-positive rates, nor does it describe validation against ground-truth labels on realistic CoT traces. This omission does limit the strength of claims about downstream user benefits. In the revised version we will add a dedicated quantitative evaluation subsection that reports precision, recall, and F1 scores computed on a held-out set of 200 CoT traces manually annotated for step-level errors by two independent raters (with inter-rater agreement). We will also report false-positive rates and include a brief error analysis of the remaining failure cases. These additions will be referenced from the abstract and will allow readers to assess the pipeline's reliability before interpreting the visualizations and interview results.

revision: yes
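The response mentions inter-rater agreement without naming a statistic. One common choice for two raters is Cohen's kappa; using it here is an assumption, not something the rebuttal states. The sketch below computes it on made-up toy annotations, and `cohens_kappa` is a hypothetical helper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label everywhere; kappa is
        # undefined, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy annotations (hypothetical): 10 steps labeled by two raters.
rater_a = ["error", "ok", "ok", "error", "ok", "ok", "ok", "error", "ok", "ok"]
rater_b = ["error", "ok", "ok", "ok", "ok", "ok", "error", "error", "ok", "ok"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

Raw percent agreement would overstate reliability here, since most steps are labeled "ok" by both raters and would agree often by chance alone.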

Circularity Check

0 steps flagged

No significant circularity


The paper introduces an error detection pipeline (external fact-checking plus symbolic validation) and the ReasonDiag visualization system, then evaluates them via a technical evaluation, two case studies, and user interviews with 16 participants. The central claims about helping users understand CoT traces and identify root causes rest on these independent external assessments rather than any self-referential equations, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its own inputs by construction. No mathematical derivations or load-bearing self-citations appear in the provided text that would trigger the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied HCI system paper with no mathematical derivations, fitted parameters, or new postulated entities; the central claim rests on the design of the error detection pipeline and visualizations plus user feedback rather than axioms or free parameters.



