arxiv: 2604.05643 · v2 · submitted 2026-04-07 · 💻 cs.CL

Graph-Based Chain-of-Thought Pruning for Reducing Redundant Reflections in Reasoning LLMs

Hongyuan Yuan , Xinran He , Run Shao , Bolei He , Xianwei Xue , Mengke Chen , Qiutong Pan , Haiwei Wang

show 1 more author

Haifeng Li

This is my paper

Pith reviewed 2026-05-10 19:58 UTC · model grok-4.3

classification 💻 cs.CL

keywords chain-of-thought pruningredundant reflectionsgraph-based CoTLLM reasoning efficiencyoverthinking in LLMsdirected acyclic graphsreflection patterns

0 comments

The pith

Converting linear chain-of-thought into graphs and pruning weak or repeated reflections cuts average reasoning tokens by 42 percent while keeping or raising accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that much of the extra computation in reasoning LLMs comes from two bad reflection habits: checking everything indiscriminately and re-checking things already settled. It converts each linear reasoning trace into a directed acyclic graph that makes the dependencies between steps explicit, then applies two pruning rules to drop low-value reflection branches and late-stage re-verifications. These pruned traces are used to train the model in three stages, first teaching it the shorter style, then preferring correct short answers over longer ones, and finally reinforcing both correctness and brevity. The resulting models generate correct answers with far fewer tokens. A reader cares because shorter reasoning traces mean faster responses and lower cost when deploying these models on hard problems.

Core claim

The paper claims that turning a linear chain-of-thought into a directed acyclic graph with explicit dependency edges, then applying branch-level pruning to remove weakly contributing reflection branches and depth-level pruning to remove late re-verifications, produces concise yet correct reasoning traces. These traces are distilled into the model via supervised fine-tuning on the pruned examples, direct preference optimization for shorter correct trajectories, and group relative policy optimization with a length penalty, yielding an average 42 percent reduction in reasoning tokens while accuracy stays the same or improves.

What carries the argument

The directed acyclic graph conversion of a linear CoT trace together with the dual pruning rules that remove low-impact reflection branches and late-stage re-verifications.

If this is right

Reasoning models trained this way produce shorter traces on average while solving the same problems correctly.
The three-stage training pipeline (SFT on pruned traces, DPO for brevity preference, GRPO with length penalty) jointly optimizes accuracy and token efficiency.
Indiscriminate and repetitive reflection patterns can be systematically identified and removed using dependency structure in the graph.
The same pruning logic applies across different reasoning domains where overthinking appears as extra reflection steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph construction and pruning steps could be applied at inference time without retraining to shorten outputs on the fly.
Explicit dependency graphs may help detect other forms of redundancy besides reflection, such as unnecessary intermediate calculations.
If the pruning rules prove stable across domains, they could be used to create smaller specialized reasoning models that retain performance at lower compute cost.

Load-bearing premise

The graph conversion and pruning rules can always tell the difference between reflections that do not affect the final answer and reflections that are actually required for correctness on new problems.

What would settle it

Measuring error rates on a held-out set of reasoning problems after pruning and finding that the pruned model makes more mistakes than the original model on the same problems.

Figures

Figures reproduced from arXiv: 2604.05643 by Bolei He, Haifeng Li, Haiwei Wang, Hongyuan Yuan, Mengke Chen, Qiutong Pan, Run Shao, Xianwei Xue, Xinran He.

**Figure 2.** Figure 2: Training pipeline of our graph-based redundancy-aware post-training framework. We first perform cold [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Stage-wise ablation across five benchmarks. (a) Accuracy (%). (b) Relative reasoning length measured by the average number of generated tokens, normalized to the BASE setting (token ratio; BASE=1.0, lower is better). The compared configurations follow a cumulative training recipe: starting from BASE, we sequentially add SFT, then DPO, and finally GRPO. the results across five benchmarks. We report (a) accu… view at source ↗

**Figure 4.** Figure 4: Changes in reasoning length and reasoning token usage before and after training on AIME24. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: RL training dynamics for different model [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Instruction for updating a graph-structured chain-of-thought [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative comparison of reflection behavior between the base model and our trained model. The left [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

read the original abstract

Extending CoT through RL has been widely used to enhance the reasoning capabilities of LLMs. However, due to the sparsity of reward signals, it can also induce undesirable thinking patterns such as overthinking, i.e., generating redundant intermediate reasoning content. In this work, we argue that a major source of such redundancy is inefficient reflection, which often manifests in two problematic patterns: Indiscriminate Reflection, where the model performs broad, low-impact checks throughout reasoning, and Repetitive Reflection, where it repeatedly re-verifies an already established conclusion. To address this, we introduce a graph-based CoT optimization framework. Specifically, we convert each linear CoT into a directed acyclic graph (DAG) with explicit dependency edges, and design a dual pruning strategy: branch-level pruning removes weakly contributing reflection branches, while depth-level pruning eliminates late-stage re-verification. We distill this behavior via a three-stage pipeline: (1) SFT to initialize the policy on pruned concise traces, (2) DPO to prefer correct but less redundant trajectories, and (3) GRPO with length penalty to jointly optimize answer correctness and efficiency. Experiments show that our approach reduces the average reasoning tokens by 42\% while maintaining or improving accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper turns CoT into a DAG and prunes low-impact or repetitive reflection branches before distilling the behavior back in, but the 42% token cut claim rests on thin experimental detail.

read the letter

The paper's main move is to model chain-of-thought as a directed acyclic graph with explicit dependency edges, then apply two pruning rules: one that drops reflection branches judged to contribute little, and another that cuts late-stage re-verification of already settled points. They follow this with a three-stage distillation process that starts with supervised fine-tuning on the pruned traces, moves to DPO to favor concise correct trajectories, and ends with GRPO that adds a length penalty. The result they report is a 42% drop in average reasoning tokens while accuracy holds or improves slightly. That framing of indiscriminate and repetitive reflection as the main sources of overthinking is a clear way to organize the problem, and the graph conversion gives the pruning rules something concrete to operate on rather than relying on prompt engineering alone. The staged training pipeline is a straightforward way to transfer the pruned behavior into the model without starting from scratch. The work is aimed at teams that already run RL post-training on reasoning models and now want to control inference cost. A reader who needs practical knobs for token budgets would get a usable recipe to test. The soft spot is the lack of supporting detail around the central claim. The abstract gives no information on the baselines, the exact datasets, how pruning thresholds were chosen, or whether any tuning happened on the test distribution. Without ablations that check accuracy on problems where reflections are conditional or necessary for later corrections, it is hard to know whether the dual pruning safely removes only redundancy. The stress-test worry about discarding useful verification steps on unseen cases looks like it could matter if the dependency edges do not capture all relevant conditionals. If the full paper has those controls and shows they hold, the method becomes more convincing; right now the empirical grounding is the part that needs the most work. I would send this to peer review because the idea is concrete and the problem is timely, but the authors should expect requests for stronger baselines, ablations, and clearer definitions of the pruning criteria.

Referee Report

3 major / 2 minor

Summary. The paper claims that redundant reflections in LLM Chain-of-Thought reasoning—specifically indiscriminate and repetitive patterns—can be mitigated by converting linear CoT traces into DAGs with explicit dependency edges, applying a dual pruning strategy (branch-level removal of weakly contributing reflections and depth-level elimination of late re-verifications), and distilling the resulting concise traces via a three-stage pipeline of SFT, DPO, and GRPO with length penalty. Experiments are reported to achieve a 42% reduction in average reasoning tokens while maintaining or improving accuracy.

Significance. If the empirical results hold under rigorous validation, the work offers a structured, graph-based approach to improving reasoning efficiency in LLMs without sacrificing correctness, addressing overthinking induced by RL. The three-stage distillation pipeline and explicit separation of reflection patterns represent a clear methodological contribution. Credit is due for the reproducible pipeline design and the attempt to operationalize redundancy via DAG dependencies, though the absence of detailed experimental protocols limits immediate applicability.

major comments (3)

[Abstract and §4] Abstract and §4: The central claim of a 42% token reduction with maintained or improved accuracy provides no information on the specific baselines used, the datasets and problem types evaluated, statistical significance tests, or whether pruning thresholds were tuned post-hoc on test data; this directly weakens support for the reported efficiency gains.
[§3.2] §3.2: The dual pruning strategy defines branch-level pruning as removing 'weakly contributing reflection branches' and depth-level pruning as eliminating 'late-stage re-verification,' but supplies neither a formal definition of contribution/dependency nor explicit construction rules for the DAG edges; this is load-bearing for the claim that pruning removes only redundancy rather than necessary conditional verification steps.
[§4.3] §4.3: No ablation studies are presented that apply the pruning rules to problems known to require iterative self-correction and then measure accuracy preservation on held-out instances; such results are required to substantiate that the subsequent SFT+DPO+GRPO stages do not optimize on incomplete traces.

minor comments (2)

[Figure 1] Figure 1 (DAG conversion example) would benefit from explicit annotation of the dependency edges and pruning decisions to illustrate the branch-level and depth-level rules.
[§2] The manuscript should include references to prior work on reflection detection and CoT compression (e.g., recent papers on overthinking in o1-style models) to better situate the dual pruning contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity, formalization, and validation of the claims.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4: The central claim of a 42% token reduction with maintained or improved accuracy provides no information on the specific baselines used, the datasets and problem types evaluated, statistical significance tests, or whether pruning thresholds were tuned post-hoc on test data; this directly weakens support for the reported efficiency gains.

Authors: We agree that the abstract and the summary presentation in §4 should be more self-contained to better support the central claim. In the revised manuscript, we will expand the abstract to explicitly list the baselines, the datasets and problem types used in evaluation, and the statistical significance tests applied. We will also add a clarification in §4 that pruning thresholds were selected via cross-validation on a development set, not tuned post-hoc on test data. These revisions will strengthen the empirical presentation without altering the reported results. revision: yes
Referee: [§3.2] §3.2: The dual pruning strategy defines branch-level pruning as removing 'weakly contributing reflection branches' and depth-level pruning as eliminating 'late-stage re-verification,' but supplies neither a formal definition of contribution/dependency nor explicit construction rules for the DAG edges; this is load-bearing for the claim that pruning removes only redundancy rather than necessary conditional verification steps.

Authors: This comment correctly identifies a need for greater formal rigor in §3.2. We will revise this section to include a formal definition of contribution (as the marginal effect on reasoning path validity when a reflection node is removed) and dependency (as logical or semantic precedence between CoT steps). We will also provide explicit, step-by-step rules for constructing the DAG edges from the linear trace. These additions will make clear that the pruning targets only redundant patterns and does not eliminate necessary conditional verifications. revision: yes
Referee: [§4.3] §4.3: No ablation studies are presented that apply the pruning rules to problems known to require iterative self-correction and then measure accuracy preservation on held-out instances; such results are required to substantiate that the subsequent SFT+DPO+GRPO stages do not optimize on incomplete traces.

Authors: We acknowledge that targeted ablations on self-correction-intensive problems would provide stronger evidence. Our existing evaluation covers a range of reasoning tasks that include iterative elements, but to directly address the concern we will add a dedicated ablation subsection in the revised §4.3. This will apply the pruning rules to a subset of problems known to require self-correction, report accuracy on held-out instances, and confirm that the three-stage distillation preserves correctness on the pruned traces. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline validated by experiments

full rationale

The paper presents an applied method that converts linear CoT traces to DAGs, applies explicit branch-level and depth-level pruning rules, and then uses a three-stage distillation process (SFT on pruned traces, DPO for preference, GRPO with length penalty). The central result—a 42% average token reduction with maintained or improved accuracy—is reported as an experimental outcome on held-out data rather than a mathematical prediction or fitted quantity derived from the same inputs. No equations, self-definitional loops, or load-bearing self-citations appear in the derivation chain; the pruning criteria and training stages are described as independent design choices whose effectiveness is tested directly. This is a standard empirical engineering contribution whose claims remain falsifiable outside any internal fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract supplies insufficient detail to enumerate concrete free parameters or axioms; the approach implicitly assumes that graph dependencies can be extracted reliably from linear text and that length-penalized RL preserves correctness.

axioms (1)

domain assumption Reflection branches can be scored for contribution via dependency structure in the DAG
Invoked when defining branch-level pruning

invented entities (1)

Dual pruning strategy on CoT DAG no independent evidence
purpose: To remove low-impact reflections while preserving correctness
New mechanism introduced to target overthinking patterns

pith-pipeline@v0.9.0 · 5545 in / 1331 out tokens · 37645 ms · 2026-05-10T19:58:52.369758+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

convert each linear CoT into a directed acyclic graph (DAG) with explicit dependency edges, and design a dual pruning strategy: branch-level pruning removes weakly contributing reflection branches, while depth-level pruning eliminates late-stage re-verification
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

B(v) = |Desc(v)| ... d(v)/dmax > m ... m=0.9 and k=2
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

three-stage pipeline: (1) SFT ... (2) DPO ... (3) GRPO with length penalty

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages · 1 internal anchor

[1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capa- bility in llms via reinforcement learning.Preprint, arXiv:2501.12948. Yunzhen Feng, Julia Kempe, Cheng Zhang, Parag Jain, and Anthony Hartshorn. 2025. What characterizes effective reasoning? revisiting length, review, and structure of cot.Preprint, arXiv:2509.19284. Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Wait, we don’t need to" wait"! removing thinking tokens improves reasoning efficiency.arXiv preprint arXiv:2506.08343, 2025a

Wait, we don’t need to "wait"! removing think- ing tokens improves reasoning efficiency.Preprint, arXiv:2506.08343. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elic- its reasoning in large language models.Preprint, arXiv:2201.11903. Liang Wen, Yunke Cai, F...

work page arXiv 2023
[3]

•current_step: the current reasoning text segment (continuous content from the CoT)

Inputs •graph: an existing partial reasoning graph (Mermaid code). •current_step: the current reasoning text segment (continuous content from the CoT)

work page
[4]

– Abstraction: donotrecord low-level arithmetic/symbolic manipulation

Node Definition (Reasoning Unit) • A node represents anabstract reasoning unitwith: – Semantic completeness: a clear intent and its reasoning product (e.g., introducing a constraint, deriving a conclusion, setting a sub-goal, establishing a framework). – Abstraction: donotrecord low-level arithmetic/symbolic manipulation. – Dependability: its product can ...

work page
[5]

•Product generation: yields a key conclusion/property/constraint/equivalence used later

Independence CriteriaTreatcurrent_stepas a new reasoning unit (preferInsert) if it satisfies any of: •Goal introduction: introduces a new intermediate goal/subproblem. •Product generation: yields a key conclusion/property/constraint/equivalence used later. •Method switch: changes the reasoning strategy or framework. •Structural advancement: adds structure...

work page
[6]

advance” and “reflect

Update Operations • Merge: Allowed only if current_step can be integrated into exactly one existing node while keeping itabstractand single-purpose.Hard constraints: Review contentcannotbe merged into aprogressnode; Do not merge if it causes one node to mix “advance” and “reflect” as major functions. Must specifytarget_nodeandupdated_node_description. • I...

work page
[7]

• Branch origin principle: new attempts must branch fromstill-valid ancestor products, not from negated/dead-end nodes

Edge Construction Rules •Dependency principle: if node B uses products from node A, add edgeA→Bwith a clear dependency label. • Branch origin principle: new attempts must branch fromstill-valid ancestor products, not from negated/dead-end nodes. •Ordering constraints: –Node IDs increase lexicographically: A–Z, AA–AZ, BA–BZ, . . . –Edges must go from lexic...

work page
[8]

decision

Output Format (Strict JSON) { "decision": "Insert or Merge", "target_node": "(If Merge, the node ID to merge into; if Insert, leave empty)", "new_node": { "id": "(If Insert, a unique ID; if Merge, leave empty)", "description": "(If Insert, a concise description of the new node; if Merge, leave empty)", "type": "progress or review" }, "edges": [ { "from": ...

work page
[9]

…… Let’s subtract equation (2) from equation (1): 9/s − 9/(s + 2) = 1.6, which gives s(s + 2) = 11.25

= 2.4 − t/60. …… Let’s subtract equation (2) from equation (1): 9/s − 9/(s + 2) = 1.6, which gives s(s + 2) = 11.25. …… Solve the quadratic equation and discard the negative solution to get s = 2.5 km/h. Plug s = 2.5 back into equation (1) to find t = 24 minutes. Let me check this with equation (2) to ensure consistency: 9 / 4.5 = 2 hours, and 2.4 − 0.4 =...

work page