arxiv: 2602.03249 · v2 · submitted 2026-02-03 · 💻 cs.AI · cs.LG

Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Wenlei Shi, Yiwei Wang, Xiaodan Liang, Jing Tang

Pith reviewed 2026-05-16 08:24 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords LLM reasoning · Chain-of-Thought · reinforcement learning · context compression · efficient inference · self-summarization · dynamic folding


The pith

LLMs learn to fold their reasoning into compact summaries that match full accuracy after training

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLMs can be trained with reinforcement learning to self-regulate the detail level of their reasoning steps through periodic summarization. This enables an efficient Fold mode that discards earlier thoughts once they have been summarized, which curbs the growth of context length and attention cost. A central result is that the accuracy gap between Fold mode and the full Unfold mode shrinks and eventually disappears during training, which the paper takes as evidence that the summaries capture what is needed for correct answers. The method reports three times the throughput on a 48GB GPU while producing readable step summaries, directly addressing the memory and compute limits of long Chain-of-Thought reasoning.

Core claim

Accordion-Thinking trains LLMs to dynamically summarize their thought process and discard prior tokens in a Fold inference mode. Reinforcement learning incentivizes this behavior, causing the accuracy of the efficient Fold mode to converge to the exhaustive Unfold mode. This convergence demonstrates that the model encodes all essential reasoning information into the compact summaries, achieving effective compression of the reasoning context without loss of solution quality.

What carries the argument

Dynamic summarization mechanism that lets the model periodically fold its reasoning steps into compact summaries while reinforcement learning trains it to preserve necessary information across folds.
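A minimal sketch of the difference between the two inference modes, assuming each reasoning step is stored as a pair of detailed thoughts and a compact summary, and that summaries are wrapped in `<step>...</step>` tags. The `Step` type, the tag format, and both function names are illustrative assumptions, not the paper's code. The sketch only shows which text remains visible to the model when it begins the next step.

```python
from dataclasses import dataclass


@dataclass
class Step:
    thoughts: str  # detailed reasoning for this step
    summary: str   # compact summary written at the end of the step


def unfold_context(prompt: str, steps: list[Step]) -> str:
    """Unfold mode: every earlier thought and summary stays in the context."""
    parts = [prompt]
    for step in steps:
        parts.append(step.thoughts)
        parts.append(f"<step>{step.summary}</step>")
    return "\n".join(parts)


def fold_context(prompt: str, steps: list[Step]) -> str:
    """Fold mode: earlier thoughts are discarded, only the summaries remain."""
    parts = [prompt]
    for step in steps:
        parts.append(f"<step>{step.summary}</step>")
    return "\n".join(parts)


# Toy example (made up for illustration).
steps = [
    Step(thoughts="Let x be the unknown. Try x = 1, fails. Try x = 2, works.",
         summary="x = 2"),
    Step(thoughts="Substitute x = 2 into y = 3x + 1 and simplify.",
         summary="y = 7"),
]
prompt = "Solve for y."
folded = fold_context(prompt, steps)
unfolded = unfold_context(prompt, steps)
print(folded)
```

In this toy setup the folded context keeps the prompt and the two summaries, so anything the later steps need must already be present in a summary. That is the property the reinforcement learning stage is meant to enforce.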

If this is right

Complex reasoning tasks become solvable with far lower token overhead and KV-cache usage.
Structured step summaries give a human-readable record of the reasoning process.
Threefold throughput gains are realized on fixed 48GB GPU memory while accuracy is maintained.
The model learns to perform lossless compression of its own reasoning context during training.
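To make the KV-cache and attention savings concrete, here is a rough token-counting sketch. It assumes a fixed prompt length, equal-length steps, equal-length summaries, and an attention cost per generated token proportional to the number of tokens currently in context. The function name and all numbers are illustrative assumptions, not measurements from the paper.

```python
def context_costs(prompt_len, step_len, summary_len, num_steps, fold):
    """Return (peak_context_tokens, total_attended_tokens) for one reasoning run.

    Each step generates `step_len` thought tokens followed by `summary_len`
    summary tokens. In Fold mode the thought tokens are dropped from the
    context once the summary has been written.
    """
    context = prompt_len   # tokens currently held in the KV cache
    peak = context
    total = 0              # sum over generated tokens of the context size seen
    for _ in range(num_steps):
        for _ in range(step_len + summary_len):
            total += context   # the new token attends to everything in context
            context += 1
            peak = max(peak, context)
        if fold:
            context -= step_len  # discard the thoughts, keep the summary
    return peak, total


# Hypothetical workload: 100-token prompt, 8 steps of 1000 thought tokens
# and 50 summary tokens each.
unfold_peak, unfold_total = context_costs(100, 1000, 50, 8, fold=False)
fold_peak, fold_total = context_costs(100, 1000, 50, 8, fold=True)
print("peak context:", unfold_peak, "vs", fold_peak)
print("total attended tokens:", unfold_total, "vs", fold_total)
```

Under these toy numbers the peak context is 8,500 tokens in Unfold mode and 1,500 tokens in Fold mode. The real ratio depends on how long the model's steps and summaries actually are, which this sketch does not model.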

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-compression approach could extend to multi-step planning or tool-use sequences that currently require long contexts.
Reasoning length may no longer need to scale linearly with problem difficulty once compression is learned.
Training regimes that reward both correctness and brevity could become standard for efficient inference.

Load-bearing premise

Reinforcement learning can train the model to produce summaries that keep every piece of information required for correct final answers, without hidden irreversible losses that only appear on harder problems.

What would settle it

If the Fold mode accuracy remains measurably below the Unfold mode on harder test problems even after extended training, the claim that summaries preserve all essential information would be false.
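One way to operationalize "measurably below" is a paired comparison on the same test problems. The sketch below assumes per-problem 0/1 correctness is available for both modes and uses a percentile bootstrap over problems. The function name, the resampling count, and the toy data are assumptions for illustration; the paper does not specify this procedure.

```python
import random


def paired_gap_ci(unfold_correct, fold_correct, n_boot=2000, alpha=0.05, seed=0):
    """Mean accuracy gap (Unfold minus Fold) with a percentile bootstrap CI.

    Both inputs are equal-length lists of 0/1 outcomes on the same problems.
    """
    if len(unfold_correct) != len(fold_correct):
        raise ValueError("both modes must be scored on the same problems")
    diffs = [u - f for u, f in zip(unfold_correct, fold_correct)]
    n = len(diffs)
    gap = sum(diffs) / n
    rng = random.Random(seed)
    # Resample problems with replacement and recompute the mean gap each time.
    boot = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = boot[int((alpha / 2) * n_boot)]
    hi = boot[int((1 - alpha / 2) * n_boot) - 1]
    return gap, lo, hi


# Toy outcomes (made up): Fold misses two problems that Unfold solves.
unfold = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
fold = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]
gap, lo, hi = paired_gap_ci(unfold, fold)
print(f"gap = {gap:.2f}, {lo:.2f} to {hi:.2f}")
```

An interval that stays above zero on a harder test set after extended training would count against the claim that summaries preserve all essential information. An interval that covers zero is only weak support, especially on small test sets.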

Figures

Figures reproduced from arXiv: 2602.03249 by Jing Tang, Wenlei Shi, Xiaodan Liang, Yinya Huang, Yiwei Wang, Yongxin Wang, Zhicheng Yang, Zhijiang Guo.

**Figure 1.** Comparison of Vanilla CoT and our Accordion CoT. As the generation length increases, the computational complexity per token in Vanilla CoT grows quadratically. In contrast, our Accordion CoT folds the context after each step, reducing the computational complexity for the next token generation and improving inference speed. We force the model to follow the Accordion Format, which splits the whole thinking p…

**Figure 2.** Ablation study on synthetic Accordion data for Qwen2.5-Math-7B and Qwen3-4B-Base on Fold mode. (Plot text recovered from the image: panels Qwen2.5-Math-7B and Qwen3-4B-Base; curves Unfold and Fold; axis labels Training Reward and Reward Gap of the 2 Mode; annotation "gap vanish".)

**Figure 3.** Reward gap between Fold mode and Unfold mode vanishes during Mix-RL training.

2. Standard RL improves reasoning but still neglects compression. The Unfold-RL baseline, which optimizes reasoning using the full context history, significantly boosts general performance compared to the Cold-Start model (e.g., 52.2% vs. 48.5% on Qwen2.5-Math-7B). However, it does not close the compression gap. When Fold-RL mode…

**Figure 4.** Comparison of token efficiency in raw PyTorch.

4.4. Performance Gap Vanish. To understand the dynamic relationship between full-context and compressed reasoning, we visualize the reward trajectories of both Fold and Unfold modes during the Mix-RL training process in…

**Figure 5.** Accordion-Thinking leads to dense attention. Folded reasoning consistently yields darker and more concentrated heat patterns than the unfolded baseline.

**Figure 6.** Statistical results of the training dynamics.

**Figure 7.** Ablation study on max folding steps on Qwen3-4B-Base.

F. Ablation Study on Max Folding Steps and Max Step Length. We first conducted an ablation study on the max folding step and found that the model demonstrates strong adaptive capabilities. As shown in…

**Figure 8.** Case study analysis of summary readability. The step summaries, when pieced together, can serve as a substitute for the final solution. Accordion CoT provides users with instant, readable information about the reasoning process.

H. LLM as a Judge For Readability. You are a lenient evaluator for reasoning summaries. Your task is to judge a summary of prior reasoning on exactly two dimensions: 1. Coverage and…

**Figure 9.** The prompt and expected output format in LLM as a judge for readability.

Original abstract

Scaling test-time compute via long Chain-of-Thought unlocks remarkable gains in reasoning capabilities, yet it faces practical limits due to the linear growth of KV cache and quadratic attention complexity. In this paper, we introduce Accordion-Thinking, an end-to-end framework where LLMs learn to self-regulate the granularity of the reasoning steps through dynamic summarization. This mechanism enables a Fold inference mode, where the model periodically summarizes its thought process and discards former thoughts to reduce dependency on historical tokens. We apply reinforcement learning to incentivize this capability further, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows and eventually vanishes over the course of training. This phenomenon demonstrates that the model learns to encode essential reasoning information into compact summaries, achieving effective compression of the reasoning context. Our Accordion-Thinking demonstrates that with learned self-compression, LLMs can tackle complex reasoning tasks with minimal dependency token overhead without compromising solution quality, and it achieves a three times throughput while maintaining accuracy on a 48GB GPU memory configuration, while the structured step summaries provide a human-readable account of the reasoning process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Accordion-Thinking trains LLMs via RL to self-summarize reasoning steps so Fold mode accuracy eventually matches full Unfold mode, but the abstract gives no experimental details to confirm the summaries are actually lossless rather than the model just reasoning more compactly from the start.


The paper's main observation is that RL training makes the accuracy gap between their compressed Fold inference mode and the exhaustive Unfold mode shrink to zero. The model learns to produce periodic summaries, discards prior tokens, and still reaches the same final answers. They report this yields 3x throughput on a 48GB GPU while keeping the summaries human-readable. That directly targets the KV-cache and quadratic attention costs that cap long CoT scaling.

The training dynamic itself looks like the novel piece: prior summarization work did not produce this exact vanishing-gap behavior under joint RL. The practical payoff and the readable output are clear upsides if the result holds.

The soft spot is the lack of any experimental substance in what we have. No baselines, no error bars, no description of the RL reward, no check on whether Unfold behavior itself simplifies during training. Without those, the vanishing gap could come from the model learning shorter reasoning paths overall instead of faithfully compressing the original trace into summaries. The RL objective only cares about final-answer correctness, so it does not force the summaries to carry every necessary detail.

This is for labs focused on test-time compute efficiency and practical CoT deployment. Someone already running long reasoning traces on fixed hardware would get a concrete idea to test. The work deserves peer review because the problem is real, the method is straightforward, and the observation is worth verifying with proper controls and ablations even if the current write-up is preliminary.

Referee Report

3 major / 1 minor

Summary. The paper introduces Accordion-Thinking, an end-to-end framework in which LLMs learn to dynamically summarize reasoning steps and discard prior context, enabling an efficient Fold inference mode. Reinforcement learning is used to train this capability, with the central claim being that the accuracy gap between Fold and exhaustive Unfold modes narrows and vanishes over training, interpreted as the model learning lossless compression of essential reasoning information into compact summaries. This yields 3x throughput on a 48GB GPU while preserving accuracy and producing human-readable step summaries.

Significance. If the vanishing gap truly reflects faithful compression rather than simplification of reasoning paths, the method would offer a practical route to scaling test-time compute for long CoT without linear KV-cache growth or quadratic attention costs. The readable summaries also address interpretability, which is a secondary but useful contribution for complex reasoning tasks.

major comments (3)

[Abstract / experimental results] The claim that the Fold-Unfold accuracy gap 'progressively narrows and eventually vanishes' is presented without reporting Unfold-mode token usage, step complexity, or branching behavior over the course of RL training. This leaves open the possibility that the gap closes because both modes converge on shorter reasoning traces rather than because summaries faithfully encode the original information.
[Method / training procedure] The RL objective (final-answer accuracy) does not distinguish lossless summary compression from irreversible early pruning of non-critical branches. No ablation (e.g., comparing joint Fold+Unfold training against Unfold-only training, or inspecting summary content for information retention) is described to support the interpretation that summaries 'encode essential reasoning information.'
[Results] The 3x throughput result on 48GB GPU is stated without baseline comparisons (standard CoT, other KV-cache eviction or summarization methods), number of runs, error bars, or dataset-specific breakdowns, rendering the efficiency claim difficult to evaluate for robustness or generality.

minor comments (1)

[Abstract] Notation for Fold and Unfold modes should be defined once in the introduction and used consistently; the abstract introduces them without a clear forward reference to the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, agreeing that additional evidence and comparisons will strengthen the claims, and we will incorporate the suggested revisions in the next version of the manuscript.


Referee: [Abstract / experimental results] The claim that the Fold-Unfold accuracy gap 'progressively narrows and eventually vanishes' is presented without reporting Unfold-mode token usage, step complexity, or branching behavior over the course of RL training. This leaves open the possibility that the gap closes because both modes converge on shorter reasoning traces rather than because summaries faithfully encode the original information.

Authors: We agree that reporting the evolution of token usage, step counts, and branching behavior for Unfold mode during RL training would better substantiate our interpretation of lossless compression. In the revised manuscript we will add plots tracking average token consumption and reasoning-step complexity for both Fold and Unfold modes across training checkpoints, together with a brief analysis of branching frequency. These additions will show that Unfold traces remain substantially longer while Fold achieves compression without accuracy degradation, thereby addressing the alternative explanation of uniform shortening.

Revision: yes

Referee: [Method / training procedure] The RL objective (final-answer accuracy) does not distinguish lossless summary compression from irreversible early pruning of non-critical branches. No ablation (e.g., comparing joint Fold+Unfold training against Unfold-only training, or inspecting summary content for information retention) is described to support the interpretation that summaries 'encode essential reasoning information.'

Authors: The referee correctly notes that the current RL objective alone cannot fully separate faithful compression from early pruning. We will add an ablation study comparing a jointly trained Fold+Unfold model against an Unfold-only baseline, reporting both final accuracy and the resulting Fold-mode efficiency. We will also include quantitative retention metrics (e.g., accuracy when Unfold traces are replaced by their Fold summaries) and representative summary examples with information-preservation annotations to support the claim that essential reasoning content is retained.

Revision: yes

Referee: [Results] The 3x throughput result on 48GB GPU is stated without baseline comparisons (standard CoT, other KV-cache eviction or summarization methods), number of runs, error bars, or dataset-specific breakdowns, rendering the efficiency claim difficult to evaluate for robustness or generality.

Authors: We acknowledge that the efficiency results require additional context for proper evaluation. In the revision we will expand the experimental section to include direct comparisons against standard Chain-of-Thought, recent KV-cache eviction techniques, and other summarization baselines. All throughput numbers will be reported as means over multiple independent runs with standard-error bars, accompanied by per-dataset breakdowns to demonstrate consistency across tasks.

Revision: yes
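The reporting format described in this response (means over independent runs with standard-error bars) can be computed as in the following sketch. The throughput values are made-up placeholders, not results from the paper, and `mean_and_sem` is a hypothetical helper name.

```python
import math
import statistics


def mean_and_sem(values):
    """Return the sample mean and the standard error of the mean."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Placeholder tokens-per-second numbers for three hypothetical runs per setting.
runs = {
    "baseline": [100.0, 104.0, 96.0],
    "candidate": [210.0, 190.0, 200.0],
}
for name, values in runs.items():
    mean, sem = mean_and_sem(values)
    print(f"{name}: {mean:.1f} +/- {sem:.1f} tokens/s")
```

With only a handful of runs the standard error is itself a noisy estimate, so per-dataset breakdowns and the raw run values remain useful alongside the summary numbers.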

Circularity Check

0 steps flagged

No significant circularity in Accordion-Thinking derivation chain

full rationale

The paper's core claim is an empirical observation: after RL training on final-answer accuracy, the measured Fold-Unfold accuracy gap narrows to zero. This is reported as a training outcome rather than a quantity derived from any equation, fitted parameter, or self-citation that reduces the result to its own inputs by construction. No self-definitional loops, ansatzes smuggled via prior work, or uniqueness theorems appear in the provided text; the interpretation that summaries encode essential information follows directly from the observed behavior under the stated RL objective. The derivation therefore remains self-contained and falsifiable against external test sets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that summarization can be learned to be lossless for the tasks tested; no free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: Reinforcement learning can train LLMs to produce summaries that retain all information required for correct downstream reasoning.
Invoked to explain why the accuracy gap vanishes; no independent verification outside the training loop is provided.

pith-pipeline@v0.9.0 · 5524 in / 1249 out tokens · 75830 ms · 2026-05-16T08:24:47.249466+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

"the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows and eventually vanishes... the model learns to encode essential reasoning information into compact summaries"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · echoes

"Fold mode... discards former thoughts... reduces dependency on historical tokens... achieves effective compression of the reasoning context"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MEMENTO: Teaching LLMs to Manage Their Own Context
cs.AI 2026-04 unverdicted novelty 6.0

MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.
