Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning
Pith reviewed 2026-05-16 08:24 UTC · model grok-4.3
The pith
LLMs learn to fold their reasoning into compact summaries that match full accuracy after training
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Accordion-Thinking trains LLMs to dynamically summarize their thought process and discard prior tokens in a Fold inference mode. Reinforcement learning incentivizes this behavior, causing the accuracy of the efficient Fold mode to converge to the exhaustive Unfold mode. This convergence demonstrates that the model encodes all essential reasoning information into the compact summaries, achieving effective compression of the reasoning context without loss of solution quality.
What carries the argument
Dynamic summarization mechanism that lets the model periodically fold its reasoning steps into compact summaries while reinforcement learning trains it to preserve necessary information across folds.
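The fold mechanism described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `run`, `toy_step`, the `generate_step` interface, and the `<step>...</step>` wrapper are hypothetical stand-ins, and context length is counted in characters rather than tokens so the example stays self-contained.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface (an assumption, not from the paper):
# given the visible context, return (detailed_step, summary, is_final).
StepFn = Callable[[str], Tuple[str, str, bool]]


def run(question: str, generate_step: StepFn, fold: bool,
        max_steps: int = 16) -> Tuple[List[str], int]:
    """Run step-wise reasoning and return (summaries, peak context length)."""
    kept: List[str] = []       # text that stays visible to later steps
    summaries: List[str] = []
    peak = 0
    for _ in range(max_steps):
        context = question + "".join(kept)
        detail, summary, is_final = generate_step(context)
        tagged = f"<step>{summary}</step>"
        # The context is longest while the current detailed step is still present.
        peak = max(peak, len(context) + len(detail) + len(tagged))
        # Fold: discard the detail and keep only the summary.
        # Unfold: keep both the detail and the summary.
        kept.append(tagged if fold else detail + tagged)
        summaries.append(summary)
        if is_final:
            break
    return summaries, peak


def toy_step(context: str) -> Tuple[str, str, bool]:
    """Stand-in for the model: a long detail, a short summary, stop at step 2."""
    n = context.count("<step>")
    detail = f"long detailed reasoning for step {n} " * 20
    return detail, f"result of step {n}", n >= 2


fold_summaries, fold_peak = run("Q", toy_step, fold=True)
unfold_summaries, unfold_peak = run("Q", toy_step, fold=False)
```

In this toy setting both modes produce the same summaries, while the Fold run has a smaller peak context because earlier detailed steps are dropped. Whether a real model still reaches the right answer from summaries alone is exactly what the paper's training is meant to secure.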
If this is right
- Complex reasoning tasks become solvable with far lower token overhead and KV-cache usage.
- Structured step summaries give a human-readable record of the reasoning process.
- Throughput triples (3x) on a 48GB GPU memory configuration while accuracy is maintained.
- The model learns to perform lossless compression of its own reasoning context during training.
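To make the KV-cache point concrete, here is a rough back-of-the-envelope sketch in Python. The cache-size formula (keys plus values, per layer, per KV head) is standard for transformer decoders; the model dimensions and token counts in the usage note are purely illustrative assumptions and are not taken from the paper.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both K and V at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def max_concurrent_sequences(budget_bytes: int, per_sequence_bytes: int) -> int:
    """How many sequences fit in a cache budget (ignores weights and activations)."""
    return budget_bytes // per_sequence_bytes
```

For example, with a hypothetical configuration of 32 layers, 8 KV heads, head dimension 128, and 16-bit values, `kv_cache_bytes(1, 32, 8, 128)` gives the per-token cost, and comparing a long Unfold trace against a shorter Fold peak context shows how many more sequences fit in the same memory. The cache grows linearly with retained tokens, so cutting the retained context by some factor cuts per-sequence cache by the same factor.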
Where Pith is reading between the lines
- The same self-compression approach could extend to multi-step planning or tool-use sequences that currently require long contexts.
- Reasoning length may no longer need to scale linearly with problem difficulty once compression is learned.
- Training regimes that reward both correctness and brevity could become standard for efficient inference.
Load-bearing premise
Reinforcement learning can train the model to produce summaries that keep every piece of information required for correct final answers, without hidden irreversible losses that only appear on harder problems.
What would settle it
If the Fold mode accuracy remains measurably below the Unfold mode on harder test problems even after extended training, the claim that summaries preserve all essential information would be false.
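One way to make "measurably below" operational is a paired bootstrap over per-problem correctness. The sketch below is an assumption about how such a check could be run, not a procedure from the paper; `paired_gap_ci` and its parameters are hypothetical names.

```python
import random
from typing import Sequence, Tuple


def paired_gap_ci(fold_correct: Sequence[bool], unfold_correct: Sequence[bool],
                  n_boot: int = 2000, alpha: float = 0.05,
                  seed: int = 0) -> Tuple[float, float, float]:
    """Return (gap, lower, upper) for Unfold accuracy minus Fold accuracy.

    Both sequences must cover the same problems in the same order.
    """
    n = len(fold_correct)
    if n == 0 or n != len(unfold_correct):
        raise ValueError("need two non-empty sequences of equal length")
    # Per-problem difference: +1 if only Unfold is right, -1 if only Fold is right.
    diffs = [int(u) - int(f) for f, u in zip(fold_correct, unfold_correct)]
    gap = sum(diffs) / n
    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        # Resample problems with replacement, keeping the Fold/Unfold pairing.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boots.append(sum(sample) / n)
    boots.sort()
    lower = boots[int((alpha / 2) * n_boot)]
    upper = boots[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return gap, lower, upper
```

If the interval for the gap on a hard test set sits clearly above zero after extended training, that would count against the claim that the summaries preserve all essential information. An interval that straddles zero is only weak evidence of equivalence unless the test set is large.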
Original abstract
Scaling test-time compute via long Chain-of-Thought unlocks remarkable gains in reasoning capabilities, yet it faces practical limits due to the linear growth of KV cache and quadratic attention complexity. In this paper, we introduce Accordion-Thinking, an end-to-end framework where LLMs learn to self-regulate the granularity of the reasoning steps through dynamic summarization. This mechanism enables a Fold inference mode, where the model periodically summarizes its thought process and discards former thoughts to reduce dependency on historical tokens. We apply reinforcement learning to incentivize this capability further, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows and eventually vanishes over the course of training. This phenomenon demonstrates that the model learns to encode essential reasoning information into compact summaries, achieving effective compression of the reasoning context. Our Accordion-Thinking demonstrates that with learned self-compression, LLMs can tackle complex reasoning tasks with minimal dependency token overhead without compromising solution quality, and it achieves a three times throughput while maintaining accuracy on a 48GB GPU memory configuration, while the structured step summaries provide a human-readable account of the reasoning process.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Accordion-Thinking, an end-to-end framework in which LLMs learn to dynamically summarize reasoning steps and discard prior context, enabling an efficient Fold inference mode. Reinforcement learning is used to train this capability, with the central claim being that the accuracy gap between Fold and exhaustive Unfold modes narrows and vanishes over training, interpreted as the model learning lossless compression of essential reasoning information into compact summaries. This yields 3x throughput on a 48GB GPU while preserving accuracy and producing human-readable step summaries.
Significance. If the vanishing gap truly reflects faithful compression rather than simplification of reasoning paths, the method would offer a practical route to scaling test-time compute for long CoT without linear KV-cache growth or quadratic attention costs. The readable summaries also address interpretability, which is a secondary but useful contribution for complex reasoning tasks.
major comments (3)
- [Abstract / experimental results] The claim that the Fold-Unfold accuracy gap 'progressively narrows and eventually vanishes' is presented without reporting Unfold-mode token usage, step complexity, or branching behavior over the course of RL training. This leaves open the possibility that the gap closes because both modes converge on shorter reasoning traces rather than because summaries faithfully encode the original information.
- [Method / training procedure] The RL objective (final-answer accuracy) does not distinguish lossless summary compression from irreversible early pruning of non-critical branches. No ablation (e.g., comparing joint Fold+Unfold training against Unfold-only training, or inspecting summary content for information retention) is described to support the interpretation that summaries 'encode essential reasoning information.'
- [Results] The 3x throughput result on 48GB GPU is stated without baseline comparisons (standard CoT, other KV-cache eviction or summarization methods), number of runs, error bars, or dataset-specific breakdowns, rendering the efficiency claim difficult to evaluate for robustness or generality.
minor comments (1)
- [Abstract] Notation for Fold and Unfold modes should be defined once in the introduction and used consistently; the abstract introduces them without a clear forward reference to the method section.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below, agreeing that additional evidence and comparisons will strengthen the claims, and we will incorporate the suggested revisions in the next version of the manuscript.
- Referee: [Abstract / experimental results] The claim that the Fold-Unfold accuracy gap 'progressively narrows and eventually vanishes' is presented without reporting Unfold-mode token usage, step complexity, or branching behavior over the course of RL training. This leaves open the possibility that the gap closes because both modes converge on shorter reasoning traces rather than because summaries faithfully encode the original information.
  Authors: We agree that reporting the evolution of token usage, step counts, and branching behavior for Unfold mode during RL training would better substantiate our interpretation of lossless compression. In the revised manuscript we will add plots tracking average token consumption and reasoning-step complexity for both Fold and Unfold modes across training checkpoints, together with a brief analysis of branching frequency. These additions will show that Unfold traces remain substantially longer while Fold achieves compression without accuracy degradation, thereby addressing the alternative explanation of uniform shortening.
  Revision: yes
- Referee: [Method / training procedure] The RL objective (final-answer accuracy) does not distinguish lossless summary compression from irreversible early pruning of non-critical branches. No ablation (e.g., comparing joint Fold+Unfold training against Unfold-only training, or inspecting summary content for information retention) is described to support the interpretation that summaries 'encode essential reasoning information.'
  Authors: The referee correctly notes that the current RL objective alone cannot fully separate faithful compression from early pruning. We will add an ablation study comparing a jointly trained Fold+Unfold model against an Unfold-only baseline, reporting both final accuracy and the resulting Fold-mode efficiency. We will also include quantitative retention metrics (e.g., accuracy when Unfold traces are replaced by their Fold summaries) and representative summary examples with information-preservation annotations to support the claim that essential reasoning content is retained.
  Revision: yes
- Referee: [Results] The 3x throughput result on 48GB GPU is stated without baseline comparisons (standard CoT, other KV-cache eviction or summarization methods), number of runs, error bars, or dataset-specific breakdowns, rendering the efficiency claim difficult to evaluate for robustness or generality.
  Authors: We acknowledge that the efficiency results require additional context for proper evaluation. In the revision we will expand the experimental section to include direct comparisons against standard Chain-of-Thought, recent KV-cache eviction techniques, and other summarization baselines. All throughput numbers will be reported as means over multiple independent runs with standard-error bars, accompanied by per-dataset breakdowns to demonstrate consistency across tasks.
  Revision: yes
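The reporting format promised in the last response (means over independent runs with standard-error bars) can be sketched in a few lines of Python. The function names are hypothetical and no measurements from the paper are reproduced here; this only shows the arithmetic a reader would expect behind such a table.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean over independent runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def speedup(method_tps: Sequence[float], baseline_tps: Sequence[float]) -> float:
    """Ratio of mean throughputs (tokens per second) of method over baseline."""
    return statistics.fmean(method_tps) / statistics.fmean(baseline_tps)
```

Reporting `mean_and_sem` per dataset and per baseline, rather than a single aggregate ratio, would let readers judge whether a 3x figure holds consistently or is driven by a few tasks.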
Circularity Check
No significant circularity in Accordion-Thinking derivation chain
The paper's core claim is an empirical observation: after RL training on final-answer accuracy, the measured Fold-Unfold accuracy gap narrows to zero. This is reported as a training outcome rather than a quantity derived from any equation, fitted parameter, or self-citation that reduces the result to its own inputs by construction. No self-definitional loops, ansatzes smuggled via prior work, or uniqueness theorems appear in the provided text; the interpretation that summaries encode essential information follows directly from the observed behavior under the stated RL objective. The derivation therefore remains self-contained and falsifiable against external test sets.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reinforcement learning can train LLMs to produce summaries that retain all information required for correct downstream reasoning.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows and eventually vanishes... the model learns to encode essential reasoning information into compact summaries
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: J_uniquely_calibrated_via_higher_derivative (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Fold mode... discards former thoughts... reduces dependency on historical tokens... achieves effective compression of the reasoning context
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- MEMENTO: Teaching LLMs to Manage Their Own Context
  MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.