
arxiv: 2605.20201 · v2 · pith:263E7XYFnew · submitted 2026-04-06 · 💻 cs.CL · cs.AI · cs.LG

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

Pith reviewed 2026-05-25 07:01 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords long-context reasoning, chain-of-thought, proxy context, supervised fine-tuning, reinforcement learning, distillation, large language models

The pith

ProxyCoT transfers high-quality chain-of-thought from short proxy contexts to full long contexts through supervised fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can accept inputs up to millions of tokens yet still fail at tasks that require complex reasoning over the full sequence. The paper proposes training first on shorter proxy versions of the input to generate reliable chain-of-thought traces, either through reinforcement learning or distillation, and then using supervised fine-tuning to ground those traces in the complete long context. This produces models that reason better on long inputs than direct training baselines while requiring less computation. The same models also show improved performance on out-of-domain long-context tasks. Readers would care because the method offers a concrete route to make existing long-context models more capable without scaling training data or context length further.

Core claim

High-quality chain-of-thought reasoning traces obtained on proxy contexts can be reliably transferred to full long contexts by supervised fine-tuning, yielding consistent gains over strong baselines on long-context reasoning tasks along with reduced computational cost and improved generalization to out-of-domain settings.

What carries the argument

ProxyCoT, the two-stage process of first acquiring chain-of-thought traces on proxy contexts via reinforcement learning or distillation and then aligning those traces to full contexts via supervised fine-tuning.
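The two stages can be sketched as a data-construction routine. This is a minimal illustration, not the authors' implementation: the `Example` fields, the `generate` callable (a stand-in for an RL-tuned policy or a teacher model), the exact-match filter used to decide which traces count as high quality, and the prompt layout are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Example:
    question: str
    answer: str
    full_context: List[str]   # every document in the long input
    proxy_context: List[str]  # the subset that suffices to answer


# Hypothetical signature: (context, question) -> (reasoning trace, predicted answer).
TraceGenerator = Callable[[List[str], str], Tuple[str, str]]


def collect_proxy_traces(
    examples: List[Example], generate: TraceGenerator
) -> List[Tuple[Example, str]]:
    """Stage 1: sample a trace on the proxy context; keep it if the answer matches."""
    kept = []
    for ex in examples:
        trace, predicted = generate(ex.proxy_context, ex.question)
        # Assumed quality filter: exact match against the reference answer.
        if predicted.strip().lower() == ex.answer.strip().lower():
            kept.append((ex, trace))
    return kept


def build_sft_pairs(kept: List[Tuple[Example, str]]) -> List[Dict[str, str]]:
    """Stage 2: pair each proxy-derived trace with the FULL context as SFT input."""
    pairs = []
    for ex, trace in kept:
        prompt = "\n\n".join(ex.full_context) + "\n\nQuestion: " + ex.question
        target = trace + "\nAnswer: " + ex.answer
        pairs.append({"prompt": prompt, "target": target})
    return pairs


def toy_generator(context: List[str], question: str) -> Tuple[str, str]:
    # Toy stand-in that always answers "Paris"; a real system would call a model.
    return ("The proxy context states: " + context[0], "Paris")


examples = [
    Example(
        question="What is the capital of France?",
        answer="Paris",
        full_context=["Paris is the capital of France.", "Noise A", "Noise B"],
        proxy_context=["Paris is the capital of France."],
    ),
    Example(
        question="What is the capital of Italy?",
        answer="Rome",
        full_context=["Rome is the capital of Italy.", "Noise C"],
        proxy_context=["Rome is the capital of Italy."],
    ),
]

kept = collect_proxy_traces(examples, toy_generator)
pairs = build_sft_pairs(kept)
```

In this toy run the second example is dropped because the stand-in generator answers it incorrectly, and the surviving training pair carries the distractor documents in its prompt even though the trace was produced without them.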

If this is right

  • Models trained with ProxyCoT outperform strong baselines on long-context reasoning benchmarks.
  • The training process incurs lower computational overhead than methods that operate on full sequences throughout.
  • Capabilities acquired via ProxyCoT transfer to long-context tasks outside the original training distribution.
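The overhead claim can be made concrete with a rough token count. The sketch below is back-of-the-envelope arithmetic under stated assumptions, not the paper's cost accounting: every number is an illustrative placeholder, and counting tokens ignores the backward pass and the superlinear cost of attention over long sequences.

```python
def trace_generation_tokens(num_examples, rollouts_per_example, context_tokens, trace_tokens):
    """Tokens touched while sampling traces: each rollout reads the context and writes a trace."""
    return num_examples * rollouts_per_example * (context_tokens + trace_tokens)


def sft_tokens(num_examples, context_tokens, trace_tokens):
    """Tokens touched in one supervised pass over (context, trace) pairs."""
    return num_examples * (context_tokens + trace_tokens)


# Illustrative placeholders only; none of these values come from the paper.
NUM_EXAMPLES = 1_000
ROLLOUTS = 8
FULL_TOKENS = 100_000
PROXY_TOKENS = 2_000
TRACE_TOKENS = 500

# Sampling traces directly on full contexts.
full_context_sampling = trace_generation_tokens(
    NUM_EXAMPLES, ROLLOUTS, FULL_TOKENS, TRACE_TOKENS
)

# Sampling traces on proxy contexts, then one SFT pass on full contexts.
proxy_then_sft = trace_generation_tokens(
    NUM_EXAMPLES, ROLLOUTS, PROXY_TOKENS, TRACE_TOKENS
) + sft_tokens(NUM_EXAMPLES, FULL_TOKENS, TRACE_TOKENS)

print(full_context_sampling, proxy_then_sft)
```

With these placeholder values, sampling on full contexts touches 804,000,000 tokens, against 120,500,000 for proxy sampling followed by a single full-context SFT pass. The gap comes from reading the long input once per example instead of once per rollout.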

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proxy-to-full transfer pattern could be tested on other sequence-heavy domains such as long documents or multi-turn dialogues.
  • If the performance gap between proxy and full contexts narrows further, the method might reduce reliance on ever-longer native context windows.
  • Combining ProxyCoT with retrieval methods could further lower the cost of maintaining reasoning quality over very large inputs.

Load-bearing premise

Reasoning traces that work on short proxy contexts remain valid and usable when the model is later shown the complete long input during fine-tuning.

What would settle it

Fine-tuning a model on full contexts using only the proxy-derived traces produces no measurable gain over a baseline trained directly on full contexts, or the resulting model fails to improve on out-of-domain long-context tasks.
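One way to operationalize "no measurable gain" is a paired bootstrap over per-example correctness on a shared test set. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper; the function names, the 95% interval, and the resample count are illustrative choices.

```python
import random


def paired_bootstrap_gain(proxycot_correct, baseline_correct, num_resamples=1000, seed=0):
    """Return the mean accuracy gain and a 95% percentile interval.

    Both inputs are per-example 0/1 correctness lists over the same test set.
    """
    if len(proxycot_correct) != len(baseline_correct) or not proxycot_correct:
        raise ValueError("need two non-empty lists of equal length")
    n = len(proxycot_correct)
    diffs = [p - b for p, b in zip(proxycot_correct, baseline_correct)]
    rng = random.Random(seed)
    # Resample examples with replacement, keeping each pair together.
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(num_resamples)
    )
    lower = means[num_resamples * 25 // 1000]
    upper = means[num_resamples * 975 // 1000 - 1]
    return sum(diffs) / n, (lower, upper)


def shows_measurable_gain(proxycot_correct, baseline_correct):
    """True when the lower end of the interval is above zero."""
    _, (lower, _) = paired_bootstrap_gain(proxycot_correct, baseline_correct)
    return lower > 0
```

Under this reading, the premise would be in trouble if `shows_measurable_gain` returned False on both the in-domain and the out-of-domain long-context test sets.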

Figures

Figures reproduced from arXiv: 2605.20201 by Alexander Gurung, Irina Saparina, Miao Li, Mirella Lapata.

Figure 1: The disparity of model performance in the ...
Figure 2: General two-stage pipeline of ProxyCoT (left), and two instantiations (right): ProxyCoT-ZS and ProxyCoT ...
Figure 3: The prompt template for HotpotQA evaluation with GPT5-mini as the judge.
Original abstract

Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context -- rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.
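The grounding step, supervised fine-tuning on full contexts paired with proxy-derived traces, is commonly implemented by computing the next-token loss on the trace tokens only. The sketch below assumes that convention; the abstract does not say whether the context tokens are masked. Token ids are plain integers here, and the helper names are hypothetical.

```python
# -100 is the default ignore_index of PyTorch's CrossEntropyLoss.
IGNORE_INDEX = -100


def build_sft_labels(context_ids, trace_ids):
    """Concatenate full-context and trace tokens; supervise only the trace positions.

    The one-position shift for next-token prediction is left to the training loop.
    """
    input_ids = list(context_ids) + list(trace_ids)
    labels = [IGNORE_INDEX] * len(context_ids) + list(trace_ids)
    return input_ids, labels


def supervised_fraction(labels):
    """Share of positions that contribute to the loss."""
    return sum(1 for token in labels if token != IGNORE_INDEX) / len(labels)


input_ids, labels = build_sft_labels(context_ids=[11, 12, 13, 14], trace_ids=[91, 92])
```

With a long context and a short trace, only a small share of positions carries a training signal, so the model has to learn to locate the relevant evidence in the full input from the trace tokens alone.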

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ProxyCoT, a two-stage framework for long-context reasoning: high-quality chain-of-thought traces are first obtained on short proxy contexts via reinforcement learning or teacher distillation, then grounded in full long contexts through supervised fine-tuning. The abstract claims this yields consistent outperformance over strong baselines with reduced computational overhead and enables generalization of long-context reasoning to out-of-domain tasks.

Significance. If the empirical results are robust, the approach could offer a practical route to efficient long-context reasoning by leveraging cheaper proxy-context supervision, potentially reducing the data and compute demands of direct long-context training while addressing the observed proxy-to-full performance gap.

major comments (2)
  1. [Abstract] Abstract: the central claims of consistent outperformance, reduced overhead, and out-of-domain generalization are stated without any dataset names, baseline descriptions, evaluation metrics, or ablation results, so the evidence for the claims cannot be assessed from the provided text.
  2. [Method] Method description (implied in abstract): the SFT grounding step presupposes that next-token prediction on full contexts paired with proxy-derived traces will induce reliable attention to the relevant subset, yet the abstract itself notes a significant performance disparity between proxy and full contexts; no mechanism, attention analysis, or ablation is described to rule out the model simply ignoring extraneous tokens or memorizing surface patterns.
minor comments (1)
  1. The abstract refers to 'different datasets' and 'strong baselines' without enumeration or citation; adding these details would improve clarity even if the full experimental section is present.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the paper.

  1. Referee: [Abstract] Abstract: the central claims of consistent outperformance, reduced overhead, and out-of-domain generalization are stated without any dataset names, baseline descriptions, evaluation metrics, or ablation results, so the evidence for the claims cannot be assessed from the provided text.

    Authors: We agree that the abstract would be strengthened by greater specificity. In the revised version we will incorporate the names of the primary datasets, the main baselines, the key evaluation metrics, and a concise reference to the ablation studies demonstrating reduced overhead and out-of-domain generalization. revision: yes

  2. Referee: [Method] Method description (implied in abstract): the SFT grounding step presupposes that next-token prediction on full contexts paired with proxy-derived traces will induce reliable attention to the relevant subset, yet the abstract itself notes a significant performance disparity between proxy and full contexts; no mechanism, attention analysis, or ablation is described to rule out the model simply ignoring extraneous tokens or memorizing surface patterns.

    Authors: The Method section of the full manuscript details the two-stage ProxyCoT procedure, with the SFT stage explicitly pairing proxy-derived traces with full-context inputs to ground the reasoning. While the performance gap is acknowledged, the reported experiments show that this procedure yields consistent gains over baselines. We concede that the current text does not include attention visualizations or targeted ablations that would directly rule out superficial memorization. We will add a short discussion of this point and, if space allows, a supporting ablation in the revision. revision: partial

Circularity Check

0 steps flagged

Empirical training framework with no mathematical derivation or self-referential reduction


The paper proposes ProxyCoT as a two-stage empirical procedure (collect CoT traces on proxy contexts via RL/distillation, then SFT on full contexts) and validates it via experiments. No equations, uniqueness theorems, or predictions are presented that reduce by construction to fitted inputs or self-citations. The central claims rest on reported performance numbers rather than any definitional loop. This is the normal non-circular outcome for an applied ML methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review conducted on abstract only; no explicit free parameters, axioms, or invented entities are stated. The load-bearing premise that proxy and full contexts share the same underlying reasoning process is implicit but unelaborated.


