arxiv: 2602.03249 · v2 · submitted 2026-02-03 · 💻 cs.AI · cs.LG

Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning

Pith reviewed 2026-05-16 08:24 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLM reasoning · Chain-of-Thought · reinforcement learning · context compression · efficient inference · self-summarization · dynamic folding

The pith

LLMs learn to fold their reasoning into compact summaries that match full accuracy after training

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLMs can be trained with reinforcement learning to self-regulate the detail level of their reasoning steps through periodic summarization. This creates an efficient Fold mode that discards earlier thoughts once they are summarized, which limits the growth of context length and attention cost. A central result is that the accuracy gap to the full Unfold mode shrinks and disappears during training, which the authors take as evidence that the summaries capture what is needed for correct answers. The method delivers three times the throughput on a 48GB GPU while producing readable step summaries, directly addressing the memory and compute limits of long Chain-of-Thought reasoning.

Core claim

Accordion-Thinking trains LLMs to dynamically summarize their thought process and discard prior tokens in a Fold inference mode. Reinforcement learning incentivizes this behavior, causing the accuracy of the efficient Fold mode to converge to the exhaustive Unfold mode. This convergence demonstrates that the model encodes all essential reasoning information into the compact summaries, achieving effective compression of the reasoning context without loss of solution quality.
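
As a rough illustration of the two inference modes described above, the sketch below builds the context a model would see before starting its next reasoning step. The `Step` class, the `visible_context` function, and the token strings are hypothetical names chosen for this example; the paper's actual format and implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class Step:
    thoughts: list[str]  # detailed reasoning tokens produced in this step
    summary: list[str]   # compact summary tokens that close the step


def visible_context(prompt: list[str], done: list[Step], fold: bool) -> list[str]:
    """Tokens the model can attend to before starting the next step."""
    context = list(prompt)
    for step in done:
        if not fold:
            # Unfold mode keeps the full reasoning history.
            context.extend(step.thoughts)
        # Both modes keep the summary; Fold mode keeps nothing else.
        context.extend(step.summary)
    return context


# Toy data (hypothetical): one prompt token and two completed steps.
prompt = ["Q"]
steps = [Step(["t1", "t2", "t3"], ["s1"]), Step(["t4", "t5"], ["s2"])]
fold_ctx = visible_context(prompt, steps, fold=True)
unfold_ctx = visible_context(prompt, steps, fold=False)
```

In this toy setup the Fold context grows only by the summary length per step, while the Unfold context grows by the full step length.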

What carries the argument

A dynamic summarization mechanism that lets the model periodically fold its reasoning steps into compact summaries, while reinforcement learning trains it to preserve the necessary information across folds.

If this is right

  • Complex reasoning tasks become solvable with far lower token overhead and KV-cache usage.
  • Structured step summaries give a human-readable record of the reasoning process.
  • Threefold throughput gains are realized on fixed 48GB GPU memory while accuracy is maintained.
  • The model learns to perform lossless compression of its own reasoning context during training.
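
To make the token-overhead and KV-cache points concrete, here is a minimal counting sketch. It assumes fixed step and summary lengths and counts only how many cached positions each new token attends to; it ignores any recomputation a real system might need after folding. The function name `decode_cost` and all parameters are assumptions made for this example, not quantities from the paper.

```python
def decode_cost(prompt_len, step_len, summary_len, n_steps, fold):
    """Simulate token-by-token decoding; return (peak_cache, total_attended).

    Each generated token attends to every token currently in the cache.
    """
    cache = prompt_len
    peak = cache
    total = 0
    for _ in range(n_steps):
        step_start = cache
        for _ in range(step_len + summary_len):
            total += cache  # the new token attends to the current cache
            cache += 1      # and is then appended to it
        peak = max(peak, cache)
        if fold:
            # Drop the detailed thoughts, keep only this step's summary.
            cache = step_start + summary_len
    return peak, total


unfold_peak, unfold_total = decode_cost(10, 4, 1, 2, fold=False)
fold_peak, fold_total = decode_cost(10, 4, 1, 2, fold=True)
```

Under these assumptions the gap between the two modes widens as the number of steps grows, because the Unfold cache accumulates every step while the Fold cache accumulates only summaries.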

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-compression approach could extend to multi-step planning or tool-use sequences that currently require long contexts.
  • Reasoning length may no longer need to scale linearly with problem difficulty once compression is learned.
  • Training regimes that reward both correctness and brevity could become standard for efficient inference.

Load-bearing premise

Reinforcement learning can train the model to produce summaries that keep every piece of information required for correct final answers, without hidden irreversible losses that only appear on harder problems.
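
One way to probe for losses that only show up on harder problems is to break the Fold versus Unfold accuracy gap down by difficulty. The sketch below is an illustration only: the record layout `(difficulty, fold_correct, unfold_correct)` and the function name `gap_by_difficulty` are assumptions, and the sample data is made up.

```python
from collections import defaultdict


def gap_by_difficulty(records):
    """Return {difficulty: unfold_accuracy - fold_accuracy}.

    records: iterable of (difficulty, fold_correct, unfold_correct).
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [n, fold hits, unfold hits]
    for difficulty, fold_ok, unfold_ok in records:
        entry = counts[difficulty]
        entry[0] += 1
        entry[1] += int(fold_ok)
        entry[2] += int(unfold_ok)
    return {d: (u - f) / n for d, (n, f, u) in sorted(counts.items())}


# Made-up records for illustration.
records = [
    ("easy", True, True),
    ("easy", True, True),
    ("hard", False, True),
    ("hard", True, True),
]
gaps = gap_by_difficulty(records)
```

A gap near zero on easy buckets but clearly positive on hard ones would be the pattern this premise needs to rule out.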

What would settle it

If the Fold mode accuracy remains measurably below the Unfold mode on harder test problems even after extended training, the claim that summaries preserve all essential information would be false.
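
A simple way to check whether a residual gap is "measurable" is a paired test on per-problem outcomes. The sketch below uses an exact McNemar test, which is a technique chosen here for illustration and is not described in the paper; the function name and the sample outcomes are hypothetical.

```python
from math import comb


def mcnemar_exact(fold_correct, unfold_correct):
    """Exact two-sided McNemar test on paired per-problem outcomes.

    Returns (b, c, p_value), where b counts problems that only Unfold
    solved and c counts problems that only Fold solved.
    """
    pairs = list(zip(fold_correct, unfold_correct))
    b = sum(1 for f, u in pairs if u and not f)
    c = sum(1 for f, u in pairs if f and not u)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the two modes never disagree
    k = min(b, c)
    # Two-sided binomial tail under the null that disagreements are 50/50.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Made-up outcomes for four problems.
result = mcnemar_exact([True, False, False, True], [True, True, True, True])
```

A large p-value here only means the test found no evidence of a gap on this sample; it does not by itself show that the two modes are equivalent.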

Figures

Figures reproduced from arXiv: 2602.03249 by Jing Tang, Wenlei Shi, Xiaodan Liang, Yinya Huang, Yiwei Wang, Yongxin Wang, Zhicheng Yang, Zhijiang Guo.

Figure 1. Comparison of Vanilla CoT and our Accordion CoT. As the generation length increases, the computational complexity per token in Vanilla CoT grows quadratically. In contrast, our Accordion CoT folds the context after each step, reducing the computational complexity for the next token generation and improving inference speed. We force the model to follow the Accordion Format, which splits the whole thinking p…

Figure 2. Ablation study on synthetic Accordion data for Qwen2.5-Math-7B and Qwen3-4B-Base on Fold mode.
    [Plot text: panels Qwen2.5-Math-7B and Qwen3-4B-Base; series Unfold and Fold; axes "Training Reward" and "Reward Gap of the 2 Mode"; annotation "gap vanish".]

Figure 3. Reward gap between Fold mode and Unfold mode vanishes during Mix-RL training.
    2. Standard RL improves reasoning but still neglects compression. The Unfold-RL baseline, which optimizes reasoning using the full context history, significantly boosts general performance compared to the Cold-Start model (e.g., 52.2% vs. 48.5% on Qwen2.5-Math-7B). However, it does not close the compression gap. When Fold-RL mode…

Figure 4. Comparison of token efficiency in raw PyTorch.
    4.4. Performance Gap Vanish. To understand the dynamic relationship between full-context and compressed reasoning, we visualize the reward trajectories of both Fold and Unfold modes during the Mix-RL training process in …

Figure 5. Accordion-Thinking leads to dense attention. Folded reasoning consistently yields darker and more concentrated heat patterns than the unfolded baseline.

Figure 6. Statistical results of the training dynamics.

Figure 7. Ablation study on max folding steps on Qwen3-4B-Base.
    F. Ablation Study on Max Folding Steps and Max Step Length. We first conducted an ablation study on the max folding step and found that the model demonstrates strong adaptive capabilities. As shown in …

Figure 8. Case study analysis of summary readability. The step summaries, when pieced together, can serve as a substitute for the final solution. Accordion CoT provides users with instant, readable information about the reasoning process.
    H. LLM as a Judge For Readability. You are a lenient evaluator for reasoning summaries. Your task is to judge a summary of prior reasoning on exactly two dimensions: 1. Coverage and…

Figure 9. The prompt and expected output format in LLM as a judge for readability.
Original abstract

Scaling test-time compute via long Chain-of-Thought unlocks remarkable gains in reasoning capabilities, yet it faces practical limits due to the linear growth of KV cache and quadratic attention complexity. In this paper, we introduce Accordion-Thinking, an end-to-end framework where LLMs learn to self-regulate the granularity of the reasoning steps through dynamic summarization. This mechanism enables a Fold inference mode, where the model periodically summarizes its thought process and discards former thoughts to reduce dependency on historical tokens. We apply reinforcement learning to incentivize this capability further, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows and eventually vanishes over the course of training. This phenomenon demonstrates that the model learns to encode essential reasoning information into compact summaries, achieving effective compression of the reasoning context. Our Accordion-Thinking demonstrates that with learned self-compression, LLMs can tackle complex reasoning tasks with minimal dependency token overhead without compromising solution quality, and it achieves a three times throughput while maintaining accuracy on a 48GB GPU memory configuration, while the structured step summaries provide a human-readable account of the reasoning process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Accordion-Thinking, an end-to-end framework in which LLMs learn to dynamically summarize reasoning steps and discard prior context, enabling an efficient Fold inference mode. Reinforcement learning is used to train this capability, with the central claim being that the accuracy gap between Fold and exhaustive Unfold modes narrows and vanishes over training, interpreted as the model learning lossless compression of essential reasoning information into compact summaries. This yields 3x throughput on a 48GB GPU while preserving accuracy and producing human-readable step summaries.

Significance. If the vanishing gap truly reflects faithful compression rather than simplification of reasoning paths, the method would offer a practical route to scaling test-time compute for long CoT without linear KV-cache growth or quadratic attention costs. The readable summaries also address interpretability, which is a secondary but useful contribution for complex reasoning tasks.

major comments (3)
  1. [Abstract / experimental results] The claim that the Fold-Unfold accuracy gap 'progressively narrows and eventually vanishes' is presented without reporting Unfold-mode token usage, step complexity, or branching behavior over the course of RL training. This leaves open the possibility that the gap closes because both modes converge on shorter reasoning traces rather than because summaries faithfully encode the original information.
  2. [Method / training procedure] The RL objective (final-answer accuracy) does not distinguish lossless summary compression from irreversible early pruning of non-critical branches. No ablation (e.g., comparing joint Fold+Unfold training against Unfold-only training, or inspecting summary content for information retention) is described to support the interpretation that summaries 'encode essential reasoning information.'
  3. [Results] The 3x throughput result on 48GB GPU is stated without baseline comparisons (standard CoT, other KV-cache eviction or summarization methods), number of runs, error bars, or dataset-specific breakdowns, rendering the efficiency claim difficult to evaluate for robustness or generality.
minor comments (1)
  1. [Abstract] Notation for Fold and Unfold modes should be defined once in the introduction and used consistently; the abstract introduces them without a clear forward reference to the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, agreeing that additional evidence and comparisons will strengthen the claims, and we will incorporate the suggested revisions in the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract / experimental results] The claim that the Fold-Unfold accuracy gap 'progressively narrows and eventually vanishes' is presented without reporting Unfold-mode token usage, step complexity, or branching behavior over the course of RL training. This leaves open the possibility that the gap closes because both modes converge on shorter reasoning traces rather than because summaries faithfully encode the original information.

    Authors: We agree that reporting the evolution of token usage, step counts, and branching behavior for Unfold mode during RL training would better substantiate our interpretation of lossless compression. In the revised manuscript we will add plots tracking average token consumption and reasoning-step complexity for both Fold and Unfold modes across training checkpoints, together with a brief analysis of branching frequency. These additions will show that Unfold traces remain substantially longer while Fold achieves compression without accuracy degradation, thereby addressing the alternative explanation of uniform shortening. revision: yes

  2. Referee: [Method / training procedure] The RL objective (final-answer accuracy) does not distinguish lossless summary compression from irreversible early pruning of non-critical branches. No ablation (e.g., comparing joint Fold+Unfold training against Unfold-only training, or inspecting summary content for information retention) is described to support the interpretation that summaries 'encode essential reasoning information.'

    Authors: The referee correctly notes that the current RL objective alone cannot fully separate faithful compression from early pruning. We will add an ablation study comparing a jointly trained Fold+Unfold model against an Unfold-only baseline, reporting both final accuracy and the resulting Fold-mode efficiency. We will also include quantitative retention metrics (e.g., accuracy when Unfold traces are replaced by their Fold summaries) and representative summary examples with information-preservation annotations to support the claim that essential reasoning content is retained. revision: yes

  3. Referee: [Results] The 3x throughput result on 48GB GPU is stated without baseline comparisons (standard CoT, other KV-cache eviction or summarization methods), number of runs, error bars, or dataset-specific breakdowns, rendering the efficiency claim difficult to evaluate for robustness or generality.

    Authors: We acknowledge that the efficiency results require additional context for proper evaluation. In the revision we will expand the experimental section to include direct comparisons against standard Chain-of-Thought, recent KV-cache eviction techniques, and other summarization baselines. All throughput numbers will be reported as means over multiple independent runs with standard-error bars, accompanied by per-dataset breakdowns to demonstrate consistency across tasks. revision: yes
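
The reporting promised in the third response (means over independent runs with standard-error bars) can be sketched as follows. The function names and the throughput numbers are hypothetical and serve only to show the calculation; they are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a list of run results."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return mean(values), stdev(values) / sqrt(len(values))


def speedup(method_runs, baseline_runs):
    """Ratio of mean throughputs, e.g. Fold mode over standard CoT."""
    return mean(method_runs) / mean(baseline_runs)


# Hypothetical tokens-per-second measurements from three runs each.
fold_runs = [300.0, 310.0, 290.0]
baseline_runs = [100.0, 105.0, 95.0]
fold_mean, fold_sem = mean_and_sem(fold_runs)
ratio = speedup(fold_runs, baseline_runs)
```

Reporting the standard error alongside the ratio lets a reader judge whether a claimed threefold gain is stable across runs or driven by a single favorable measurement.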

Circularity Check

0 steps flagged

No significant circularity in Accordion-Thinking derivation chain

full rationale

The paper's core claim is an empirical observation: after RL training on final-answer accuracy, the measured Fold-Unfold accuracy gap narrows to zero. This is reported as a training outcome rather than a quantity derived from any equation, fitted parameter, or self-citation that reduces the result to its own inputs by construction. No self-definitional loops, ansatzes smuggled via prior work, or uniqueness theorems appear in the provided text; the interpretation that summaries encode essential information follows directly from the observed behavior under the stated RL objective. The derivation therefore remains self-contained and falsifiable against external test sets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that summarization can be learned to be lossless for the tasks tested; no free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Reinforcement learning can train LLMs to produce summaries that retain all information required for correct downstream reasoning.
    Invoked to explain why the accuracy gap vanishes; no independent verification outside the training loop is provided.

pith-pipeline@v0.9.0 · 5524 in / 1249 out tokens · 75830 ms · 2026-05-16T08:24:47.249466+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MEMENTO: Teaching LLMs to Manage Their Own Context

    cs.AI 2026-04 unverdicted novelty 6.0

    MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.


    **SUMMARY CONTENT REQUIRMENTS: ** 48 * Text Style: The summaries MUST align closely with the content and the text style ,→of the "final solution" section after ‘</think>‘ in the original response. You ,→can even directly copy the content from the solution section. (except for ,→verify steps) 49 * For Each Step Summary: 50- Any **key variables, quantities,...