Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

Alexander Gurung; Irina Saparina; Miao Li; Mirella Lapata

arxiv: 2605.20201 · v2 · pith:263E7XYFnew · submitted 2026-04-06 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-25 07:01 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: long-context reasoning, chain-of-thought, proxy context, supervised fine-tuning, reinforcement learning, distillation, large language models

The pith

ProxyCoT transfers high-quality chain-of-thought from short proxy contexts to full long contexts through supervised fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can accept inputs of millions of tokens yet still fail at tasks that require complex reasoning over the full sequence. The paper proposes first obtaining reliable chain-of-thought traces on shorter proxy versions of the input, through either reinforcement learning or distillation, and then using supervised fine-tuning to ground those traces in the complete long context. This produces models that reason better on long inputs than baselines trained directly on them, while requiring less computation. The same models also show improved performance on out-of-domain long-context tasks. Readers would care because the method offers a concrete route to making existing long-context models more capable without further scaling training data or context length.

Core claim

High-quality chain-of-thought reasoning traces obtained on proxy contexts can be reliably transferred to full long contexts by supervised fine-tuning, yielding consistent gains over strong baselines on long-context reasoning tasks along with reduced computational cost and improved generalization to out-of-domain settings.

What carries the argument

ProxyCoT, the two-stage process of first acquiring chain-of-thought traces on proxy contexts via reinforcement learning or distillation and then aligning those traces to full contexts via supervised fine-tuning.
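
The two-stage process can be sketched as a data-construction step. This is a minimal illustration, not the authors' implementation: the `Example` and `SFTRecord` types, the `generate_trace` and `is_correct` callables, the prompt layout, and the answer-based filter are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    question: str
    full_context: str   # the complete long input
    proxy_context: str  # short subset assumed sufficient to answer
    answer: str         # gold answer, used here only to filter traces


@dataclass
class SFTRecord:
    prompt: str  # full context plus question
    target: str  # trace that was produced from the proxy context


def build_sft_records(
    examples: List[Example],
    generate_trace: Callable[[str, str], str],
    is_correct: Callable[[str, str], bool],
) -> List[SFTRecord]:
    """Pair proxy-derived traces with full-context prompts (hypothetical helper)."""
    records: List[SFTRecord] = []
    for ex in examples:
        # Stage 1: obtain a reasoning trace from the short proxy context only.
        trace = generate_trace(ex.proxy_context, ex.question)
        # Assumed filter: keep only traces that reach the gold answer.
        if not is_correct(trace, ex.answer):
            continue
        # Stage 2 input: the same trace becomes the target for the full context.
        prompt = f"{ex.full_context}\n\nQuestion: {ex.question}"
        records.append(SFTRecord(prompt=prompt, target=trace))
    return records
```

In this sketch `generate_trace` stands in for either an RL-tuned policy or a larger teacher model; the resulting records would then be passed to any standard supervised fine-tuning loop.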

If this is right

Models trained with ProxyCoT outperform strong baselines on long-context reasoning benchmarks.
The training process incurs lower computational overhead than methods that operate on full sequences throughout.
Capabilities acquired via ProxyCoT transfer to long-context tasks outside the original training distribution.
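
The overhead point can be made concrete with a rough token count. The sketch below is an assumption-laden illustration: the function names and every number are hypothetical, and it counts only prompt tokens, ignoring generated tokens, the superlinear cost of attention, and any prefix caching.

```python
def rollout_prompt_tokens(num_prompts: int, rollouts_per_prompt: int, context_tokens: int) -> int:
    """Prompt tokens read if every RL rollout re-processes its context."""
    return num_prompts * rollouts_per_prompt * context_tokens


def sft_prompt_tokens(num_examples: int, epochs: int, context_tokens: int) -> int:
    """Prompt tokens read during supervised fine-tuning."""
    return num_examples * epochs * context_tokens


if __name__ == "__main__":
    # Hypothetical settings, not taken from the paper.
    prompts, rollouts = 1_000, 8
    proxy_len, full_len = 2_000, 100_000

    # RL carried out directly on full contexts.
    direct = rollout_prompt_tokens(prompts, rollouts, full_len)
    # RL on proxy contexts, followed by one SFT epoch on full contexts.
    two_stage = rollout_prompt_tokens(prompts, rollouts, proxy_len) + sft_prompt_tokens(prompts, 1, full_len)

    print(f"direct full-context RL: {direct:,} prompt tokens")
    print(f"proxy RL + full SFT:    {two_stage:,} prompt tokens")
```

Under these made-up settings the saving comes from reading the long context once per example during SFT rather than once per rollout during RL; real savings depend on the actual training configuration.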

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same proxy-to-full transfer pattern could be tested on other sequence-heavy domains such as long documents or multi-turn dialogues.
If the performance gap between proxy and full contexts narrows further, the method might reduce reliance on ever-longer native context windows.
Combining ProxyCoT with retrieval methods could further lower the cost of maintaining reasoning quality over very large inputs.

Load-bearing premise

Reasoning traces that work on short proxy contexts remain valid and usable when the model is later shown the complete long input during fine-tuning.

What would settle it

Fine-tuning a model on full contexts using only the proxy-derived traces produces no measurable gain over a baseline trained directly on full contexts, or the resulting model fails to improve on out-of-domain long-context tasks.
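
One way to decide whether a gain is "measurable" is a paired bootstrap over per-example correctness. The sketch below is a generic statistical check, not a procedure described in the paper; the function name, the 95% percentile interval, and the resample count are assumptions.

```python
import random
from typing import List, Tuple


def paired_bootstrap_gain(
    proxy_correct: List[int],
    baseline_correct: List[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed accuracy gain, lower bound, upper bound) of a 95% interval.

    Both inputs are 0/1 correctness flags for the same evaluation examples,
    in the same order.
    """
    if not proxy_correct or len(proxy_correct) != len(baseline_correct):
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(proxy_correct)
    # Per-example difference keeps the comparison paired.
    diffs = [p - b for p, b in zip(proxy_correct, baseline_correct)]
    observed = sum(diffs) / n
    resampled = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[min(n_resamples - 1, int(0.975 * n_resamples))]
    return observed, lower, upper
```

If the interval includes zero on the in-domain benchmark, or on the out-of-domain tasks, the corresponding claim would not be supported by that evaluation set.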

Figures

Figures reproduced from arXiv: 2605.20201 by Alexander Gurung, Irina Saparina, Miao Li, Mirella Lapata.

**Figure 2.** General two-stage pipeline of ProxyCoT (left), and two instantiations (right): ProxyCoT-ZS and ProxyCoT

**Figure 3.** The prompt template for HotpotQA evaluation with GPT5-mini as the judge.

Original abstract

Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context -- rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ProxyCoT transfers CoT traces from short proxies to long contexts via SFT, but the transfer step rests on an untested assumption about attention.

ProxyCoT collects high-quality CoT traces on short proxy contexts through RL or distillation, then runs supervised fine-tuning on the full long inputs paired with those traces. The goal is to close the performance gap between proxy and full contexts while keeping compute down. This framing is the main new piece: treating the proxy as an explicit training scaffold rather than just a prompt trick. It builds on standard CoT and distillation methods but packages them into a two-stage pipeline aimed at long-context generalization. The reduced-overhead claim follows logically if the heavy RL or teacher work happens only on short inputs. The out-of-domain generalization result is the part that would matter most if it holds.

The soft spot is the grounding step itself. The abstract already flags a significant performance disparity, which means the base model does not reliably surface or attend to the relevant subset inside longer sequences. Standard next-token SFT on full inputs has no built-in mechanism to force the model to learn the right long-range attention patterns instead of ignoring noise or memorizing surface features from the proxy traces. Without ablations on attention, controls for post-hoc fitting, or direct tests of that assumption, both the in-domain gains and the generalization claim are hard to evaluate.

The paper is aimed at people doing practical fine-tuning for long-document or multi-step LLM tasks. It deserves a serious referee if the full experiments include proper baselines, ablations, and checks on whether the SFT actually induces the needed attention behavior. Otherwise the central claim stays unverified.
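
The concern about standard next-token SFT in the letter above can be seen by writing out the usual objective. The notation is assumed here rather than taken from the paper: $x_{\text{full}}$ is the full context, $q$ the question, and $r = (r_1, \dots, r_T)$ a trace obtained on the proxy context $x_{\text{proxy}}$; the loss is assumed to be computed over trace tokens only.

```latex
\mathcal{L}_{\text{SFT}}(\theta)
  = -\sum_{t=1}^{T} \log p_\theta\!\left(r_t \mid x_{\text{full}},\, q,\, r_{<t}\right)
```

Nothing in this objective refers to $x_{\text{proxy}}$ or to which positions of $x_{\text{full}}$ the model attends, so a low loss is compatible both with learning to locate the relevant subset and with fitting surface regularities of the traces.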

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ProxyCoT, a two-stage framework for long-context reasoning: high-quality chain-of-thought traces are first obtained on short proxy contexts via reinforcement learning or teacher distillation, then grounded in full long contexts through supervised fine-tuning. The abstract claims this yields consistent outperformance over strong baselines with reduced computational overhead and enables generalization of long-context reasoning to out-of-domain tasks.

Significance. If the empirical results are robust, the approach could offer a practical route to efficient long-context reasoning by leveraging cheaper proxy-context supervision, potentially reducing the data and compute demands of direct long-context training while addressing the observed proxy-to-full performance gap.

major comments (2)

[Abstract] Abstract: the central claims of consistent outperformance, reduced overhead, and out-of-domain generalization are stated without any dataset names, baseline descriptions, evaluation metrics, or ablation results, so the evidence for the claims cannot be assessed from the provided text.
[Method] Method description (implied in abstract): the SFT grounding step presupposes that next-token prediction on full contexts paired with proxy-derived traces will induce reliable attention to the relevant subset, yet the abstract itself notes a significant performance disparity between proxy and full contexts; no mechanism, attention analysis, or ablation is described to rule out the model simply ignoring extraneous tokens or memorizing surface patterns.

minor comments (1)

The abstract refers to 'different datasets' and 'strong baselines' without enumeration or citation; adding these details would improve clarity even if the full experimental section is present.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the paper.

Referee: [Abstract] Abstract: the central claims of consistent outperformance, reduced overhead, and out-of-domain generalization are stated without any dataset names, baseline descriptions, evaluation metrics, or ablation results, so the evidence for the claims cannot be assessed from the provided text.

Authors: We agree that the abstract would be strengthened by greater specificity. In the revised version we will incorporate the names of the primary datasets, the main baselines, the key evaluation metrics, and a concise reference to the ablation studies demonstrating reduced overhead and out-of-domain generalization.

revision: yes

Referee: [Method] Method description (implied in abstract): the SFT grounding step presupposes that next-token prediction on full contexts paired with proxy-derived traces will induce reliable attention to the relevant subset, yet the abstract itself notes a significant performance disparity between proxy and full contexts; no mechanism, attention analysis, or ablation is described to rule out the model simply ignoring extraneous tokens or memorizing surface patterns.

Authors: The Method section of the full manuscript details the two-stage ProxyCoT procedure, with the SFT stage explicitly pairing proxy-derived traces with full-context inputs to ground the reasoning. While the performance gap is acknowledged, the reported experiments show that this procedure yields consistent gains over baselines. We concede that the current text does not include attention visualizations or targeted ablations that would directly rule out superficial memorization. We will add a short discussion of this point and, if space allows, a supporting ablation in the revision.

revision: partial

Circularity Check

0 steps flagged

Empirical training framework with no mathematical derivation or self-referential reduction

The paper proposes ProxyCoT as a two-stage empirical procedure (collect CoT traces on proxy contexts via RL/distillation, then SFT on full contexts) and validates it via experiments. No equations, uniqueness theorems, or predictions are presented that reduce by construction to fitted inputs or self-citations. The central claims rest on reported performance numbers rather than any definitional loop. This is the normal non-circular outcome for an applied ML methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review conducted on abstract only; no explicit free parameters, axioms, or invented entities are stated. The load-bearing premise that proxy and full contexts share the same underlying reasoning process is implicit but unelaborated.

pith-pipeline@v0.9.0 · 5690 in / 1079 out tokens · 32255 ms · 2026-05-25T07:01:01.971705+00:00 · methodology


