Playing Psychic: Using Thought Trees to Predict Reasoning Models Accuracy on Coding Tasks
Pith reviewed 2026-05-10 07:17 UTC · model grok-4.3
The pith
The structure of a reasoning trace predicts whether an LLM's code solution is correct
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The structure of a reasoning trace, not just its contents, is a strong predictor of correctness. By representing reasoning traces as structured thought-trees, features extracted from these trees allow training a lightweight classifier to predict trace correctness, and flagging and retrying structurally anomalous traces yields consistent gains at lower complexity levels.
What carries the argument
Structured thought-trees: a hierarchical representation of the reasoning trace that captures the branching and sequential steps in the model's thinking, so that features can be extracted from it for correctness prediction.
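As a rough illustration of what such a representation might look like, the sketch below builds a tiny tree and computes a few structural features. The `Thought` class, the feature names, and the example segments are all hypothetical; this page only names depth, branching factor, and anomaly scores as features and does not give the paper's actual construction.

```python
from dataclasses import dataclass, field


@dataclass
class Thought:
    """One node of a hypothetical thought-tree: a reasoning segment and its children."""
    text: str
    children: list["Thought"] = field(default_factory=list)


def depth(node: Thought) -> int:
    """Number of nodes on the longest root-to-leaf path."""
    return 1 + max((depth(c) for c in node.children), default=0)


def max_branching(node: Thought) -> int:
    """Largest number of children under any single node."""
    return max([len(node.children)] + [max_branching(c) for c in node.children])


def size(node: Thought) -> int:
    """Total number of nodes in the tree."""
    return 1 + sum(size(c) for c in node.children)


def structural_features(root: Thought) -> dict[str, int]:
    """Content-free features: they depend only on the tree's shape, not its text."""
    return {"depth": depth(root), "max_branching": max_branching(root), "size": size(root)}


# Toy trace (assumed, not from the paper): one sequential chain and one alternative branch.
root = Thought("read the task", [
    Thought("trace f1 on the input", [Thought("check the loop bound")]),
    Thought("try an alternative reading"),
])
features = structural_features(root)
```

None of these functions reads `text`, which is the sense in which such features describe structure rather than content.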
Load-bearing premise
Features extracted from thought trees remain predictive across different frontier models and task difficulties, and the automatic task generation process does not introduce structural biases that inflate the observed correlation.
What would settle it
Training the classifier on thought-tree features from one set of models and tasks and finding that it predicts correctness no better than chance on a new set of frontier models or harder tasks would falsify the claim that structure is a strong predictor.
Original abstract
Recent advances in large language models (LLMs) have shown that test-time scaling can substantially improve model performance on complex tasks, particularly in the coding domain. Under this paradigm, models use a larger token budget during inference to generate intermediate reasoning traces before producing a final answer. However, current evaluations primarily rely on competitive programming benchmarks, which may not capture the full range of reasoning abilities. In this work, we perform a systematic study of frontier reasoning models to understand their performance on real-world coding benchmarks. To gain more insights into the performance of such models, we devise a programmatic way to automatically generate coding tasks of arbitrary difficulty and structure from existing benchmarks. Using this framework, our analysis reveals that the structure of a reasoning trace, not just its contents, is a strong predictor of correctness. Motivated by this, we propose structured thought-trees as a means to represent reasoning traces. To illustrate their use, we train a lightweight classifier on features extracted from thought-trees to predict trace correctness, and demonstrate that flagging and retrying structurally anomalous traces based on the extracted features yields consistent gains at lower complexity levels.
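The Level 2 and Level 3 prompt templates quoted in this page's reference graph nest functions as `f2(f1({input}))` and `f3(f2(f1({input})))`, which suggests that one way difficulty can be raised is by chaining existing benchmark functions. The sketch below illustrates that idea only; `compose_task` and the three toy functions are hypothetical names, and the paper's actual generator is not described on this page.

```python
from typing import Any, Callable


def compose_task(funcs: list[tuple[str, Callable[[Any], Any]]], task_input: Any) -> dict:
    """Chain functions so each output feeds the next; the level is the chain length."""
    expr = repr(task_input)
    value = task_input
    for name, fn in funcs:
        expr = f"{name}({expr})"   # wrap the expression one level deeper
        value = fn(value)          # execute to obtain the ground-truth output
    return {"level": len(funcs), "assert_stub": f"assert {expr} == ", "expected": value}


# Toy stand-ins for benchmark functions (assumed, not taken from any benchmark).
def f1(xs):
    return sorted(xs)


def f2(xs):
    return [x * 2 for x in xs]


def f3(xs):
    return sum(xs)


task = compose_task([("f1", f1), ("f2", f2), ("f3", f3)], [3, 1, 2])
```

Because the expected output is obtained by running the chain, the correctness label comes from execution rather than from anything in the model's trace.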
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a programmatic framework for automatically generating coding tasks of arbitrary difficulty and structure from existing benchmarks. It represents LLM reasoning traces as thought-trees, extracts structural features from these trees, and trains a lightweight classifier to predict trace correctness. The central claims are that trace structure (independent of content) is a strong predictor of correctness on coding tasks and that flagging/retrying structurally anomalous traces yields consistent performance gains, particularly at lower complexity levels.
Significance. If the empirical claims hold after addressing the noted gaps, the work would provide a novel, low-overhead mechanism for improving test-time scaling in reasoning models by leveraging structural properties of intermediate traces rather than additional sampling or content-based heuristics. This could inform more efficient inference strategies for coding and related domains, though its impact depends on demonstrating robustness beyond the synthetic task distribution.
major comments (2)
- [Abstract] The abstract and methods description assert that structural features from thought-trees are a 'strong predictor' of correctness and that retrying anomalous traces yields 'consistent gains,' yet no quantitative results, baselines, error bars, statistical tests, or details on feature definitions, classifier performance, or task difficulty metrics are provided. This absence makes the central claims unassessable and is load-bearing for any recommendation.
- [§3] §3 (Automatic Task Generation): The programmatic generator is used for both training data and evaluation without reported ablations or controls to isolate whether tree structures (e.g., depth, branching factor, anomaly scores) correlate with solvability due to embedded generation rules rather than intrinsic reasoning properties. This risks confounding the structure-correctness link and undermines generalizability across models and real benchmarks.
minor comments (2)
- The paper should include explicit definitions and pseudocode for the thought-tree construction process and the extracted features to allow reproduction.
- Missing references to prior work on tree-based representations of reasoning (e.g., tree-of-thoughts variants) and test-time scaling methods would improve context.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important opportunities to strengthen the presentation of our empirical results and to further validate the task generation framework. We address each major comment below and have made revisions to the manuscript to incorporate additional details, quantitative summaries, and controls.
- Referee: [Abstract] The abstract and methods description assert that structural features from thought-trees are a 'strong predictor' of correctness and that retrying anomalous traces yields 'consistent gains,' yet no quantitative results, baselines, error bars, statistical tests, or details on feature definitions, classifier performance, or task difficulty metrics are provided. This absence makes the central claims unassessable and is load-bearing for any recommendation.
  Authors: We agree that the abstract would benefit from including key quantitative results to make the central claims immediately assessable. The body of the manuscript (particularly Sections 4 and 5) already reports classifier performance (including accuracy, precision, and recall), baselines such as random and content-based classifiers, error bars from repeated runs, statistical significance tests for the observed gains, explicit feature definitions (e.g., depth, branching factor, anomaly scores), and task difficulty metrics derived from structural complexity. We have revised the abstract to summarize these findings concisely, including the magnitude of gains at lower complexity levels. This change directly addresses the assessability concern without altering the underlying results.
  Revision: yes
- Referee: [§3] §3 (Automatic Task Generation): The programmatic generator is used for both training data and evaluation without reported ablations or controls to isolate whether tree structures (e.g., depth, branching factor, anomaly scores) correlate with solvability due to embedded generation rules rather than intrinsic reasoning properties. This risks confounding the structure-correctness link and undermines generalizability across models and real benchmarks.
  Authors: We acknowledge the risk of confounding if generation rules inadvertently link structure to solvability. The generator is explicitly parameterized to vary structural properties (depth, branching) independently of content-based solvability rules, and we already compute task difficulty metrics separately from the tree features. To isolate the effect and improve generalizability, we have added ablations in the revised manuscript: (1) training on generated tasks but evaluating the classifier and retry strategy on held-out real benchmarks (HumanEval, MBPP) that were never passed through the generator, and (2) reporting partial correlations and regression controls between tree features and correctness while holding generation parameters fixed. These additions demonstrate that the structure-correctness relationship persists beyond the synthetic distribution.
  Revision: yes
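To make the partial-correlation control concrete, here is a minimal sketch of the standard first-order formula, which measures the association between a tree feature and an outcome after removing what a generation parameter explains in both. The data are synthetic toy numbers chosen so the arithmetic is exact, and the variable names are hypothetical; nothing here reproduces the manuscript's analysis.

```python
import math


def pearson(a: list[float], b: list[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x: list[float], y: list[float], z: list[float]) -> float:
    """Correlation of x and y with the linear effect of z removed from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic example: feature and outcome both depend on the generation parameter
# and on nothing else they share.
gen_param = [1, 2, 3, 4]
feature = [2, 1, 2, 5]     # gen_param plus noise orthogonal to gen_param
outcome = [2, -1, 6, 3]    # gen_param plus different, orthogonal noise

raw = pearson(feature, outcome)
controlled = partial_corr(feature, outcome, gen_param)
```

In this constructed case the raw correlation is positive while the controlled one vanishes, which is the pattern the referee's confounding concern describes. A structure-correctness link that is not driven by the generator would instead keep a nonzero partial correlation.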
Circularity Check
No circularity: classifier trained on external correctness labels from task execution
The paper's central result is an empirical observation that tree-derived features correlate with independently measured correctness (whether the model's final code output passes tests). A lightweight classifier is trained to predict this external binary label from features such as depth and branching factor; flagging and retrying is then applied at inference time. No equations, self-definitional loops, or fitted parameters renamed as predictions exist. The programmatic task generator supplies data but does not define correctness or force the reported correlation by construction, leaving the derivation self-contained against the external evaluation signal.
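The inference-time step described here, scoring a trace from its structural features and resampling when it looks anomalous, can be sketched as follows. Everything in the sketch is an assumption made for illustration: `anomaly_score` is a hand-written stand-in for the trained classifier, and the weights, threshold, and retry budget are arbitrary.

```python
from typing import Callable


def anomaly_score(features: dict[str, int]) -> float:
    """Stand-in for a trained classifier: deeper, bushier trees score as more anomalous."""
    return 0.1 * features["depth"] + 0.2 * features["max_branching"]


def answer_with_retry(
    sample: Callable[[], tuple[str, dict[str, int]]],
    threshold: float = 1.0,
    max_retries: int = 2,
) -> tuple[str, int]:
    """Resample while the trace is flagged; return the last answer and the attempts used."""
    for attempt in range(1, max_retries + 2):
        answer, features = sample()
        if anomaly_score(features) < threshold:
            break  # trace looks structurally ordinary, so accept it
    return answer, attempt


# Fake model outputs (assumed): the first trace is flagged, the second is accepted.
traces = iter([
    ("[1, 2]", {"depth": 9, "max_branching": 4}),  # score 1.7, flagged
    ("[2, 1]", {"depth": 3, "max_branching": 2}),  # score 0.7, accepted
])
answer, attempts = answer_with_retry(lambda: next(traces))
```

The scorer never sees the test outcome, so the correctness label used to train and evaluate such a classifier stays external to the retry decision.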
Reference graph
Works this paper leans on
- [1] CRUXEval (Gu et al., 2024) is a code execution reasoning benchmark that consists of 800 standalone Python functions. The task is to predict the output of a given function on a specific input.
- [2] SAFIM (Gong et al., 2024) is a fill-in-the-middle benchmark that includes 17,720 examples from Java, C, C# and Python, with tasks designed for the syntax-aware completion of program structures. In this paper, we evaluate block completion, a task requiring the model to generate an entire code block based on surrounding context and problem background.
- [3] Codelingua (Pan et al., 2024) involves the translation of 1,700 code samples from three benchmarks and two real-world projects. The task asks a model to read a snippet in a source language and reconstruct it in a target language without altering the underlying logic and final outputs. A number of different programming languages are represented in the sourc...
- [4] Level 1 Prompt: Based on the given Python code, which may contain errors, complete the assert statement with the output when executing the code on the given test case. Do not output any extra information, even if the function is incorrect or incomplete. {code} asse...
- [5] Level 2 Prompt: Based on the given Python code, which may contain errors, complete the assert statement with the output when executing the code on the given test case. Do not output any extra information, even if the function is incorrect or incomplete. # f1 {f1 code} # f2 {f2 code} assert f2(f1({input})) == Only return the output of the function without a...
- [6] Level 3 Prompt: Based on the given Python code, which may contain errors, complete the assert statement with the output when executing the code on the given test case. Do not output any extra information, even if the function is incorrect or incomplete. # f1 {f1 code} # f2 {f2 code} # f3 {f3 code} assert f3(f2(f1({input}))) == Only return the output of the...
- [7] Level 1 Prompt: You will be given code with missing lines or blocks that you must fill in. Output only the missing code so that the program will run correctly. Output the missing code as plain text, NOT as markdown code. Do NOT output the entire program or any additional information. {code}
- [8] Level 2 Prompt: You will be given program1.py and program2.py with missing lines or blocks that you must fill in. Output only the missing code so that "python3 program1.py|program2.py" runs correctly. Output the missing code, NOT as markdown code. Do NOT output the...
- [9] Level 3 Prompt: You will be given program1.py, program2.py and program3.py with missing lines or blocks that you must fill in. Output only the missing code so that "python3 program1.py | program2.py | program3.py" runs correctly. Output the missing code, NOT as markdown code. Do NOT output the entire program or explanations or any additional information. #...
- [10] Level 1 DeepSeek-R1 Prompt: You are an AI programming assistant, utilizing the DeepSeek Coder model, developed by DeepSeek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and othe...
- [11] Level 1 QwQ Prompt: Translate the following code from {source language} to {target language}: {code}
- [12] Higher Level Prompt: You are given a set of {source language} programs that are meant to be executed in sequence, where the output of each program is used as the input to the next. Translate the *entire sequence* into a single {target language} program that reproduces the same behavior.
  - Only the first block should handle reading input.
  - Only the last bl...
- [13] Adding more details, explanations, or examples
- [14] Providing evidence or justification
- [15] Refining or clarifying the parent's point.
  - Contrast: The new segment proposes a different, alternative, or opposing idea compared to the parent node.
  - Rephrase: The new segment expresses the exact same core idea as the parent but in different words.
  Return your answer in this JSON format: “‘json {{ "parent id": "thought X", // id of parent thought, or ...
- [16] The extra 1 at the end of a is removed by a[:-1].
  64 | mental execution | Loop sets a[i] for i in range n. Output a[:-1] matches length n. Condition x>i is the simplest valid heuristic.
  65 | mental execution | Final JSON structure: program1 (need=False), program2 (tt assignment), program3 (a[i] assignment).
  Table 2: Complete Reasoning Trace of a SAFIM L3 Example...