
arxiv: 2510.19669 · v5 · submitted 2025-10-22 · 💻 cs.CL

DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference

Pith reviewed 2026-05-18 04:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM reasoning, token efficiency, adaptive inference, entropy analysis, difficulty classification, overthinking, hidden state probe

The pith

By adapting inference effort to each question's difficulty, LLMs maintain accuracy while cutting token use by up to 22%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a U-shaped pattern in the entropy of token probabilities during LLM reasoning traces, with unexpectedly high entropy on easy problems despite strong accuracy. This pattern indicates overthinking, where models expend unnecessary computation on simple questions. DiffAdapt addresses this by training a small probe on the model's final hidden state to classify incoming questions as easy, normal, or hard. It then applies a fixed strategy for each class, defined by a specific prompt, temperature setting, and maximum token length. The result is reduced token consumption across five models and eight benchmarks without sacrificing performance.
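
The entropy measurement can be made concrete with a minimal sketch. It assumes token-level Shannon entropy averaged over a reasoning trace; this summary does not state the paper's exact definition or binning, and the function names and toy distributions below are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_trace_entropy(trace: Sequence[Sequence[float]]) -> float:
    """Average token-level entropy over every generation step of a trace."""
    if not trace:
        raise ValueError("trace must contain at least one step")
    return sum(token_entropy(step) for step in trace) / len(trace)


# Toy distributions over a 4-token vocabulary (illustrative values only).
confident_step = [0.97, 0.01, 0.01, 0.01]
uncertain_step = [0.25, 0.25, 0.25, 0.25]

low = mean_trace_entropy([confident_step, confident_step])
high = mean_trace_entropy([uncertain_step, uncertain_step])
```

Under this assumed definition, a trace made of near-deterministic steps scores close to zero, while a trace of uniform steps scores log(4) nats, so the U-shaped pattern would appear as higher averages at both ends of the difficulty range.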

Core claim

We observe a consistent U-shaped entropy pattern in reasoning traces across three models: high entropy on easy problems despite high accuracy, low entropy on medium difficulty, and high entropy on hard problems reflecting uncertainty, with a 22-25% entropy reduction from easy to medium regions suggesting overthinking on easy instances. Building on this, we introduce DiffAdapt, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy. Each strategy uses a fixed prompt, temperature, and maximum token length. A small probe classifies the LLM's final hidden state, enabling inexpensive adaptation without fine-tun

What carries the argument

DiffAdapt framework that trains a small probe on final hidden states to classify questions into Easy/Normal/Hard categories and applies a matching preset inference strategy of prompt, temperature, and token limit.
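
The dispatch step described above might look roughly like the sketch below. It assumes a linear probe, which is only one possible reading of "small probe", and every prompt, temperature, token limit, and weight shown is a hypothetical placeholder rather than a value from the paper.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence

LABELS = ("easy", "normal", "hard")


@dataclass(frozen=True)
class Strategy:
    prompt: str
    temperature: float
    max_tokens: int


# Hypothetical presets; the paper's actual prompts and values are not given here.
STRATEGIES: Dict[str, Strategy] = {
    "easy": Strategy("Answer directly and briefly.", 0.3, 512),
    "normal": Strategy("Reason step by step.", 0.6, 4096),
    "hard": Strategy("Reason carefully and verify each step.", 0.8, 16384),
}


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the probe logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(hidden: Sequence[float],
             weights: Sequence[Sequence[float]],
             bias: Sequence[float]) -> str:
    """Assumed linear probe: one weight row per label over a hidden-state vector."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))]


def select_strategy(hidden: Sequence[float],
                    weights: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> Strategy:
    """Map the predicted difficulty label to its fixed inference preset."""
    return STRATEGIES[classify(hidden, weights, bias)]


# Toy 3-dimensional hidden state and hand-set probe weights (illustrative only).
toy_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_bias = [0.0, 0.0, 0.0]
chosen = select_strategy([0.1, 0.2, 2.0], toy_weights, toy_bias)
```

Keeping the presets in a lookup table separates the learned part (the probe) from the hand-chosen part (the strategies), which mirrors the claim that only the probe is trained.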

If this is right

  • Token usage drops by up to 22.4% while accuracy holds or improves on the evaluated benchmarks.
  • The base LLM requires no fine-tuning; only a lightweight probe is trained.
  • The same approach applies across five different models and eight reasoning benchmarks.
  • Fixed long reasoning traces prove wasteful for easier problems once difficulty is detected.
  • Adaptation happens at inference time using only the final hidden state from the trace.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to other open-ended generation tasks where effort should scale with input complexity.
  • Replacing the three discrete categories with a continuous difficulty score might enable smoother token budget control.
  • The entropy pattern observation offers a diagnostic tool for spotting overthinking in other prompting setups.
  • Pairing DiffAdapt with dynamic early-exit rules could produce additional efficiency gains beyond the reported savings.

Load-bearing premise

The probe on final hidden states will correctly sort questions into difficulty classes for which the chosen fixed strategies are optimal.

What would settle it

Run the probe on a held-out set and compare accuracy plus token counts for questions labeled easy when using the short easy strategy versus the full normal strategy on those exact questions.
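
One way to tabulate that comparison is sketched below. The `Run` record, the function names, and the toy numbers are hypothetical, and the sketch assumes both strategies were run on the same probe-labeled easy questions in the same order.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Run:
    correct: bool  # whether the final answer was judged correct
    tokens: int    # tokens generated for this question


def summarize(runs: List[Run]) -> Dict[str, float]:
    """Accuracy and mean token count for one strategy."""
    if not runs:
        raise ValueError("runs must not be empty")
    n = len(runs)
    return {
        "accuracy": sum(r.correct for r in runs) / n,
        "mean_tokens": sum(r.tokens for r in runs) / n,
    }


def compare(easy_runs: List[Run], normal_runs: List[Run]) -> Dict[str, float]:
    """Accuracy change and relative token saving of easy vs. normal strategy."""
    if len(easy_runs) != len(normal_runs):
        raise ValueError("both strategies must cover the same questions")
    easy, normal = summarize(easy_runs), summarize(normal_runs)
    return {
        "accuracy_delta": easy["accuracy"] - normal["accuracy"],
        "token_saving": 1.0 - easy["mean_tokens"] / normal["mean_tokens"],
    }


# Made-up outcomes for three questions the probe labeled easy.
easy_runs = [Run(True, 200), Run(True, 300), Run(False, 100)]
normal_runs = [Run(True, 800), Run(True, 1000), Run(True, 600)]
result = compare(easy_runs, normal_runs)
```

A negative `accuracy_delta` alongside a large `token_saving` would indicate that the short strategy is cheaper but loses answers on questions the probe called easy, which is the failure mode the premise rules out.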

Figures

Figures reproduced from arXiv: 2510.19669 by Eunsol Choi, Xiang Liu, Xiaowen Chu, Xuming Hu.

Figure 1: Visualization of model accuracy (blue bar), generation entropy (red line) per difficulty
Figure 2: Our three inference strategy configurations.
Figure 3: Plots summarizing task performance (y-axis) and efficiency (x-axis) on Qwen3-4B model
Figure 4: Overview of the DiffAdapt framework. Top: (i)
Figure 5: Performance across reasoning LLMs model architectures and domains. The x-axis represents different maximum token limit constraints as a percentage of the full token budget, demonstrating how different strategies perform under varying computational budgets. (a) In-domain: DiffAdapt consistently outperforms fixed strategies. (b) Out-of-domain: Effectiveness maintained under domain shift.
Figure 6: DiffAdapt orthogonality with Length Control RL methods. Performance analysis across three LC-RL trained models on both ID and OOD datasets.
Figure 7: Overthinking phenomenon in DeepSeek-R1-Distill-Qwen-1.5B model showing the char
Figure 8: Overthinking phenomenon in Nemotron-Research-Reasoning-Qwen-1.5B model demon
Figure 9: DeepSeek-R1-Qwen-7B Oracle Analysis: Performance vs. Token Consumption Trade
Figure 10: Nemotron-1.5B Oracle Analysis: Performance vs. Token Consumption Trade-offs. Even
Figure 11: DeepSeek-R1-Llama-8B Oracle Analysis: Performance vs. Token Consumption Trade
Original abstract

Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice 22-25% entropy reduction from easy to medium difficulty regions, suggesting an overthinking phenomenon on easy instances. Building on these insights, we introduce DiffAdapt, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune base LLM but a small probe that classifies LLM's final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4%, establishing a practical path toward compute-efficient reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes DiffAdapt, a framework that first identifies a U-shaped entropy pattern in LLM reasoning traces (high entropy on easy and hard problems, lower on medium) across three models. It then trains a small probe on final hidden states to classify incoming questions into Easy/Normal/Hard buckets and applies one of three fixed inference strategies (prompt variant, temperature, max token length) per bucket. The central empirical claim is that this yields comparable or higher accuracy while cutting token usage by up to 22.4% on five models and eight benchmarks, without fine-tuning the base LLM.

Significance. If the probe reliably predicts difficulty and the hand-chosen strategies are near-optimal, the work supplies a lightweight, non-fine-tuning route to token-efficient reasoning that directly targets the overthinking phenomenon observed in the entropy analysis. The multi-model entropy observation and the decision to adapt only via a small probe are concrete strengths that could be useful for practical deployment.

major comments (3)
  1. [§4 and §5] §4 (Method) and §5 (Experiments): the central efficiency claim depends on the probe correctly mapping questions to the three difficulty buckets, yet no probe accuracy, confusion matrix, or per-bucket classification statistics are reported. Without these numbers it is impossible to determine whether the observed 22.4% token reduction is driven by accurate adaptation or by fortunate alignment of the fixed strategies with the test distribution.
  2. [§5] §5 (Experiments): no ablation replaces the learned probe with oracle difficulty labels or with a stronger classifier. Such an ablation is load-bearing because it would isolate whether token savings survive when classification error is removed; its absence leaves open the possibility that misclassified hard items incur hidden accuracy drops or that misclassified easy items waste tokens.
  3. [§5] §5 (Experiments): the abstract and results tables report aggregate accuracy and token counts but supply no data-split details, statistical significance tests, or error bars. This makes it difficult to judge whether the “comparable or improved accuracy” claim is robust across random seeds or benchmark partitions.
minor comments (2)
  1. [§3] The entropy reduction figure of 22–25% from easy to medium is stated without the precise definition of the entropy measure (token-level or sequence-level) or the exact binning procedure used to define the three regions.
  2. [§4.2] The three fixed strategies are described only at a high level; a table listing the exact prompt template, temperature, and max-length values for each bucket would improve reproducibility.
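
The probe statistics requested in major comment 1 could be tabulated along the lines of the sketch below. The labels and predictions are made up, the function names are hypothetical, and the sketch assumes reference difficulty labels are available for the evaluation questions.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

LABELS = ("easy", "normal", "hard")


def confusion_matrix(truth: Sequence[str],
                     predicted: Sequence[str]) -> Dict[Tuple[str, str], int]:
    """Counts keyed by (true label, predicted label) for all label pairs."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    counts = Counter(zip(truth, predicted))
    return {(t, p): counts[(t, p)] for t in LABELS for p in LABELS}


def per_bucket_recall(matrix: Dict[Tuple[str, str], int]) -> Dict[str, float]:
    """Fraction of each true bucket that the probe labeled correctly."""
    recall = {}
    for t in LABELS:
        row_total = sum(matrix[(t, p)] for p in LABELS)
        recall[t] = matrix[(t, t)] / row_total if row_total else float("nan")
    return recall


# Made-up reference labels and probe predictions for six questions.
truth = ["easy", "easy", "normal", "hard", "hard", "hard"]
predicted = ["easy", "normal", "normal", "hard", "hard", "easy"]

matrix = confusion_matrix(truth, predicted)
recall = per_bucket_recall(matrix)
```

The off-diagonal cell `("hard", "easy")` is the one the referee's second comment worries about most, since a hard question sent to the short strategy risks a hidden accuracy drop.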

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [§4 and §5] §4 (Method) and §5 (Experiments): the central efficiency claim depends on the probe correctly mapping questions to the three difficulty buckets, yet no probe accuracy, confusion matrix, or per-bucket classification statistics are reported. Without these numbers it is impossible to determine whether the observed 22.4% token reduction is driven by accurate adaptation or by fortunate alignment of the fixed strategies with the test distribution.

    Authors: We agree with this observation. Reporting the probe's performance metrics is essential to validate the adaptation mechanism. In the revised manuscript, we will include the accuracy of the difficulty classifier, a confusion matrix across the Easy/Normal/Hard buckets, and per-bucket statistics on how questions are classified in the test sets. This addition will help readers assess the reliability of the difficulty prediction and its contribution to the token savings. revision: yes

  2. Referee: [§5] §5 (Experiments): no ablation replaces the learned probe with oracle difficulty labels or with a stronger classifier. Such an ablation is load-bearing because it would isolate whether token savings survive when classification error is removed; its absence leaves open the possibility that misclassified hard items incur hidden accuracy drops or that misclassified easy items waste tokens.

    Authors: We recognize the importance of such an ablation study. To address this, we will add an oracle experiment in the revised version, where we assume perfect difficulty classification (using labels derived from the entropy patterns observed in the analysis) and apply the corresponding inference strategies. We will compare the accuracy and token usage against the probe-based DiffAdapt to quantify the impact of classification errors. If feasible with the available data, we will also consider a stronger classifier for comparison. revision: yes

  3. Referee: [§5] §5 (Experiments): the abstract and results tables report aggregate accuracy and token counts but supply no data-split details, statistical significance tests, or error bars. This makes it difficult to judge whether the “comparable or improved accuracy” claim is robust across random seeds or benchmark partitions.

    Authors: We will update the Experiments section to provide detailed information on the data splits, including how the probe training and evaluation sets were partitioned. Furthermore, we will report results averaged over multiple random seeds with standard error bars and conduct statistical significance tests (such as Wilcoxon signed-rank tests or t-tests) to substantiate the claims of comparable or improved accuracy and token efficiency. revision: yes
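
The seed-level reporting promised in the third response could take roughly the following form. This is a minimal sketch using only the Python standard library: it computes a mean with standard error and a paired t statistic, and it does not implement the Wilcoxon signed-rank test that the response also mentions. The per-seed accuracies are made up, and turning the statistic into a p-value would additionally require a Student t distribution with n - 1 degrees of freedom, which is not shown.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of the per-seed differences a[i] - b[i]."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff, stderr = mean_and_stderr(diffs)
    if stderr == 0.0:
        raise ValueError("all paired differences are identical")
    return mean_diff / stderr


# Made-up accuracies over four seeds (illustrative only).
adaptive = [0.80, 0.82, 0.81, 0.83]
baseline = [0.79, 0.80, 0.80, 0.80]

adaptive_mean, adaptive_se = mean_and_stderr(adaptive)
t_stat = paired_t_statistic(adaptive, baseline)
```

Pairing by seed removes seed-to-seed variance that both methods share, so the comparison reflects the difference between strategies rather than the luck of a particular run.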

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on external benchmarks

full rationale

The paper's central claims rest on empirical measurements of token usage and accuracy across five models and eight benchmarks after training a probe on final hidden states. The U-shaped entropy observation motivates the Easy/Normal/Hard buckets but is not used to derive the final performance numbers by construction. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce the reported 22.4% token reduction to an input by definition. The method is self-contained against held-out benchmark results.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach depends on an empirical entropy pattern and a learned probe; no new physical entities are introduced, but several modeling choices function as free parameters.

free parameters (2)
  • entropy thresholds for Easy/Normal/Hard
    The boundaries separating the three difficulty regions are chosen or fitted to produce the reported token savings.
  • probe training hyperparameters
    The small classifier on final hidden states requires training data and hyperparameters whose exact values are not stated in the abstract.
axioms (1)
  • domain assumption: The observed U-shaped entropy pattern generalizes across models and benchmarks beyond the three models tested.
    The method assumes this pattern is stable enough to serve as a reliable difficulty signal.




Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

  2. DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.

  3. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  4. SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking

    cs.AI 2026-04 unverdicted novelty 5.0

    SAT reduces reasoning tokens by up to 40% across multiple large reasoning models and benchmarks by adaptively pruning steps based on difficulty while maintaining or improving accuracy.
