LASER: Load-Aware Serving with Early-Exit for Reasoning LLMs at the Edge

Hanshuai Cui; Jianxiong Guo; Size Li; Tian Wang; Weijia Jia; Yuan Wu; Zhiqing Tang; Zilan Huang

arxiv: 2606.31580 · v1 · pith:OY5WHORFnew · submitted 2026-06-30 · 💻 cs.DC

LASER: Load-Aware Serving with Early-Exit for Reasoning LLMs at the Edge

Zhiqing Tang , Size Li , Hanshuai Cui , Zilan Huang , Jianxiong Guo , Tian Wang , Yuan Wu , Weijia Jia This is my paper

Pith reviewed 2026-07-01 03:03 UTC · model grok-4.3

classification 💻 cs.DC

keywords early exitload-aware servingreasoning LLMsedge computingchain-of-thoughtservice latencySLO satisfactionadaptive threshold

0 comments

The pith

LASER reduces average latency for edge reasoning LLMs by 17-38% by adapting early-exit thresholds to real-time system load.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LASER to resolve the tension between extended chain-of-thought sequences in large reasoning models and the limited resources of edge devices. Fixed-threshold early-exit methods ignore concurrent requests and load changes, so LASER adds two mechanisms that adjust stopping points and resource budgets according to current system conditions and request difficulty. The approach is cast as a joint optimization problem balancing output quality against service latency. Experiments across models and benchmarks demonstrate concrete gains in speed and service-level compliance while holding accuracy loss to about one percent.

Core claim

LASER couples a load-aware adaptive exit threshold that adjusts the confidence bar based on real-time system load within an empirically validated robust range, and a difficulty- and load-aware reasoning budget pre-allocation that assigns compute resources by request difficulty and system capacity. Formulated as joint optimization of reasoning quality and service latency, the method yields 17-38% lower average latency and 3-6% higher SLO satisfaction than fixed-threshold baselines at an average accuracy cost of 1%.

What carries the argument

Load-aware adaptive exit threshold paired with difficulty- and load-aware reasoning budget pre-allocation, operating as a joint optimizer of quality and latency.

If this is right

Average latency drops 17-38% relative to fixed-threshold early-exit baselines.
SLO satisfaction rises 3-6% under the same load conditions.
Accuracy cost remains near 1% across two reasoning models and four benchmarks.
The gains hold under diverse load conditions on edge hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same load-adaptive logic could be applied to non-edge serving clusters that experience bursty traffic.
Pre-allocation rules might be extended to decide which requests receive full model depth versus compressed variants.
The framework suggests a path to running larger reasoning models on the same edge hardware without proportional latency increases.

Load-bearing premise

Adjusting the exit threshold according to current load keeps reasoning quality intact inside an empirically tested range.

What would settle it

A controlled run under rapidly fluctuating load where either latency reduction falls below 10% or accuracy loss exceeds 3% while using the adaptive threshold.

Figures

Figures reproduced from arXiv: 2606.31580 by Hanshuai Cui, Jianxiong Guo, Size Li, Tian Wang, Weijia Jia, Yuan Wu, Zhiqing Tang, Zilan Huang.

**Figure 1.** Figure 1: LASER system architecture. 3.1 System Model Reasoning Model Generation Pattern. A reasoning LLM generates output in two phases [21]. It first performs slow thinking, enclosed in <think>. . .</think> delimiters, then produces a conclusion. The slow thinking phase consists of Ki reasoning chunks separated by action transition points (ATPs), typically marked by linguistic cues such as Wait or Alternatively. [… view at source ↗

**Figure 2.** Figure 2: Algorithm overview. 3.2 Problem Formulation Given a request stream R, LASER jointly determines the exit threshold λi and reasoning budget Bi for each request to balance reasoning quality and service latency. We formulate this as the following optimization problem. P : max {λi},{Bi} 1 N X N i=1 ω · ai(λi , Bi) − (1 − ω) · li(λi , Bi) di (3) subject to: λmin ≤ λi ≤ λmax, si ≤ Bi , Bi ≥ Bmin, ∀i (4) where… view at source ↗

**Figure 3.** Figure 3: Reasoning quality comparison across four benchmarks. Bars show accuracy (left axis) and average generated token count (right axis) for each method. λ = 0.90, representing aggressive static tuning; (4) NoThinking: skipping the reasoning phase entirely, serving as a lower bound on reasoning quality. Implementation details. All experiments run on an NVIDIA RTX 4090 GPU (24 GB VRAM), which closely matches high… view at source ↗

**Figure 4.** Figure 4: Ablation study on MATH-500 with DS-R1-Distill-Qwen-7B at QPS=2. Budget allocation is the primary driver of latency reduction, while the adaptive threshold contributes accuracy recovery when combined [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

read the original abstract

Large reasoning models (LRMs) such as DeepSeek-R1 have achieved strong performance through extended chain-of-thought (CoT) generation. However, deploying them on edge devices raises a conflict between long CoT sequences and constrained resources. Recent confidence-based early exit methods reduce CoT length for individual requests, yet they apply fixed thresholds from a single-request perspective, ignoring multi-request concurrency and load fluctuation in edge serving. To bridge this gap, we propose \underline{L}oad-\underline{A}ware \underline{S}erving with \underline{E}arly-exit for \underline{R}easoning (LASER). LASER couples two complementary designs: (1) a load-aware adaptive exit threshold that adjusts the confidence bar based on real-time system load within an empirically validated robust range, and (2) a difficulty- and load-aware reasoning budget pre-allocation that assigns compute resources by request difficulty and system capacity. We formulate the problem as a joint optimization of reasoning quality and service latency. Experiments on two reasoning models, four benchmarks, and diverse load conditions show that LASER reduces average latency by 17--38\% and improves service-level objective (SLO) satisfaction by 3--6\% over fixed-threshold baselines, at an average accuracy cost of only 1\%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LASER adds load-aware adaptation to early-exit for edge reasoning models, but the abstract leaves the actual mechanisms and validation too thin to assess the claims.

read the letter

The main takeaway is that this paper takes fixed-threshold early-exit methods and makes the exit decision depend on real-time system load plus a pre-allocation step that factors in request difficulty. That combination is the actual new piece.

It does a reasonable job of naming a practical gap: single-request early-exit ignores concurrency on edge hardware, and the authors try to close it with two coupled designs inside a joint quality-latency optimization. The reported numbers (17-38% latency drop, 3-6% better SLO, 1% accuracy cost) come from two models and four benchmarks under varied loads, which is a fair experimental scope for a systems paper.

The soft spots are straightforward. The abstract never shows how the threshold actually moves with load, what the robust range is, or how it was validated. There are no equations, no description of the pre-allocation rule, and no experimental setup details. Without those, the 1% accuracy claim and the attribution of gains to load-awareness cannot be checked. The "empirically validated robust range" is asserted but not evidenced here.

This work is aimed at people who build or tune serving systems for large reasoning models on constrained devices. It is the sort of incremental systems extension that can be useful once the mechanisms are written down clearly.

I would send it to peer review. The idea is grounded in a real deployment constraint and the experimental direction is sensible, but the current version needs the methods and validation sections expanded before the numbers can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce LASER for load-aware serving with early-exit for reasoning LLMs at the edge. It couples (1) a load-aware adaptive exit threshold that adjusts the confidence bar based on real-time system load within an empirically validated robust range and (2) a difficulty- and load-aware reasoning budget pre-allocation. The problem is formulated as joint optimization of reasoning quality and service latency. Experiments on two reasoning models, four benchmarks, and diverse load conditions report 17--38% average latency reduction, 3--6% SLO satisfaction improvement, and 1% average accuracy cost over fixed-threshold baselines.

Significance. If the experimental outcomes hold after full validation of the mechanisms, the work addresses a practical gap in edge deployment of large reasoning models by incorporating multi-request concurrency and load fluctuations into early-exit decisions, which could improve real-world serving efficiency.

major comments (2)

Abstract: the quantitative claims (17--38% latency reduction, 3--6% SLO gain, 1% accuracy cost) rest on an adjustment mechanism and joint optimization whose formulation, parameters, and validation are not described, preventing assessment of whether the reported gains are load-bearing or reproducible.
The load-aware adaptive exit threshold is stated to operate 'within an empirically validated robust range' that preserves reasoning quality, yet no derivation, pseudocode, or validation procedure for this range is supplied; this free parameter directly underpins the central claim that quality is maintained under varying load.

minor comments (1)

The abstract mentions 'diverse load conditions' and 'fixed-threshold baselines' without naming the specific loads, models, or baseline implementations; these details belong in the experimental section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in our mechanisms. We agree that additional details on the joint optimization, parameters, and validation are required for reproducibility and will revise the manuscript accordingly.

read point-by-point responses

Referee: Abstract: the quantitative claims (17--38% latency reduction, 3--6% SLO gain, 1% accuracy cost) rest on an adjustment mechanism and joint optimization whose formulation, parameters, and validation are not described, preventing assessment of whether the reported gains are load-bearing or reproducible.

Authors: We agree the abstract claims depend on mechanisms whose formulation and validation details are insufficiently described in the current manuscript. The joint optimization is formulated in Section 3 as minimizing a combined objective of expected reasoning quality (measured via accuracy on held-out sets) and service latency under concurrency constraints, subject to per-request difficulty estimates and real-time load. Key parameters include the load scaling factor α (set to 0.2 in experiments) and difficulty bins derived from token-length quantiles. We will add an explicit subsection in the revised version with the full optimization problem statement, parameter table, and pseudocode for the solver. The reported gains are from the end-to-end experiments in Section 5; we will also include a sensitivity analysis to demonstrate load-bearing behavior. revision: yes
Referee: The load-aware adaptive exit threshold is stated to operate 'within an empirically validated robust range' that preserves reasoning quality, yet no derivation, pseudocode, or validation procedure for this range is supplied; this free parameter directly underpins the central claim that quality is maintained under varying load.

Authors: We acknowledge that the derivation and validation procedure for the robust range are not provided. The range (confidence thresholds between 0.65 and 0.92) was obtained by sweeping thresholds under synthetic load traces on the four benchmarks and selecting the interval where accuracy drop remained below 2% relative to full CoT. We will insert a new paragraph in Section 4.1 with the validation procedure, the empirical accuracy-vs-load curves, and pseudocode for the adaptive threshold function f(load, difficulty). This addition will directly support the quality-preservation claim and improve reproducibility. revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations; purely empirical claims

full rationale

The paper proposes LASER with two designs (load-aware adaptive exit threshold and difficulty/load-aware pre-allocation) formulated as joint optimization of quality and latency. However, no equations, derivations, or mathematical steps are presented in the abstract or described content. All performance claims (17-38% latency reduction, 3-6% SLO improvement, 1% accuracy cost) rest on direct experimental outcomes across models and benchmarks, not on quantities defined in terms of fitted parameters or self-referential constructions. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing elements. This is a standard empirical systems paper with no circularity in any derivation chain.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review is based solely on the abstract; full details on mechanisms, parameters, and assumptions are unavailable.

free parameters (1)

robust range for exit threshold
Described as empirically validated but no values or determination method given in abstract.

axioms (1)

domain assumption Adjusting early-exit confidence based on real-time load within a robust range preserves reasoning quality while reducing latency under concurrency.
Central premise stated as bridging the gap between single-request early-exit and multi-request edge serving.

pith-pipeline@v0.9.1-grok · 5786 in / 1487 out tokens · 63517 ms · 2026-07-01T03:03:21.503630+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 9 canonical work pages · 6 internal anchors

[1]

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Aggarwal, P., Welleck, S.: L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

Agrawal, A., Kedia, N., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B.S., Tu- manov, A., Ramjee, R.: Taming throughput-latency tradeoff in llm inference with sarathi-serve. In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 117–134 (2024)

2024
[3]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) (2025)

Arora, D., Zanette, A.: Training language models to reason efficiently. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) (2025)

2025
[4]

In: International Conference on Wireless Artificial Intelligent Computing Systems and Applications (WASA)

Chen, Q., Gao, H., Yi, M., Li, J., Cheng, L., Li, Y.: Latency-optimal and memory- aware model partitioning for cooperative inference at the edge. In: International Conference on Wireless Artificial Intelligent Computing Systems and Applications (WASA). pp. 25–37 (2025)

2025
[5]

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Chen, X., Xu, J., Liang, T., He, Z., Pang, J., Yu, D., Song, L., Liu, Q., Zhou, M., Zhang, Z., et al.: Do not think that much for 2+ 3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

In: Proceedings of the 62nd Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers)

Elhoushi, M., Shrivastava, A., Liskovich, D., Hosmer, B., Wasti, B., Lai, L., Mah- moud, A., Acun, B., Agarwal, S., Roman, A., et al.: Layerskip: Enabling early exit inference and self-speculative decoding. In: Proceedings of the 62nd Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12622–12642 (2024) 12 Z. Tang et al

2024
[8]

IEEE Transactions on Sustainable Computing10(4), 678–689 (2025)

Fu, B., Chen, F., Li, P., Zeng, D.: Serving transformer models via joint requst scheduling and batching in the network edge. IEEE Transactions on Sustainable Computing10(4), 678–689 (2025)

2025
[9]

In: ICLR 2025 Workshop on Foundation Models in the Wild (2025)

Fu, Y., Chen, J., Zhuang, Y., Fu, Z., Stoica, I., Zhang, H.: Reasoning without self- doubt: More efficient chain-of-thought through certainty probing. In: ICLR 2025 Workshop on Foundation Models in the Wild (2025)

2025
[10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

In: The Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) (2021)

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the math dataset. In: The Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) (2021)

2021
[12]

Qwen2.5-Coder Technical Report

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al.: Qwen2. 5-coder technical report. arXiv preprint arXiv:2409.12186 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Kang, Y., Sun, X., Chen, L., Zou, W.: C3ot: Generating shorter chain-of-thought without compromising effectiveness. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 24312–24320 (2025)

2025
[14]

In: Proceedings of the 29th symposium on operating systems prin- ciples (SOSP)

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with pagedattention. In: Proceedings of the 29th symposium on operating systems prin- ciples (SOSP). pp. 611–626 (2023)

2023
[15]

arXiv preprint arXiv:2505.17052 (2025)

Park, J., Cho, S., Han, D.: Specedge: Scalable edge-assisted serving framework for interactive llms. arXiv preprint arXiv:2505.17052 (2025)

work page arXiv 2025
[16]

In: First conference on language modeling (COLM) (2024)

Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J., Bowman, S.R.: Gpqa: A graduate-level google-proof q&a benchmark. In: First conference on language modeling (COLM) (2024)

2024
[17]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Snell, C., Lee, J., Xu, K., Kumar, A.: Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

Sun, B., Huang, Z., Zhao, H., Xiao, W., Zhang, X., Li, Y., Lin, W.: Llumnix: Dy- namic scheduling for large language model serving. In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 173–191 (2024)

2024
[19]

Chain of draft: Thinking faster by writing less.arXiv preprint arXiv:2502.18600, 2025a

Xu, S., Xie, W., Zhao, L., He, P.: Chain of draft: Thinking faster by writing less. arXiv preprint arXiv:2502.18600 (2025)

work page arXiv 2025
[20]

IEEE transactions on Mobile computing (2025)

Xu, X., Hu, Y., Cui, G., Qi, L., Dou, W., Cai, Z.: Cadec: a combinatorial auction for dynamic distributed dnn inference scheduling in edge-cloud networks. IEEE transactions on Mobile computing (2025)

2025
[21]

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T

Yang, C., Si, Q., Duan, Y., Zhu, Z., Zhu, C., Li, Q., Chen, M., Lin, Z., Wang, W.: Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895 (2025)

work page arXiv 2025
[22]

In: 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

Yu, G.I., Jeong, J.S., Kim, G.W., Kim, S., Chun, B.G.: Orca: A distributed serving system for{Transformer-Based}generative models. In: 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 521–538 (2022)

2022
[23]

IEEE Internet of Things Journal12(10), 13119– 13131 (2024)

Zhang, M., Shen, X., Cao, J., Cui, Z., Jiang, S.: Edgeshard: Efficient llm inference via collaborative edge computing. IEEE Internet of Things Journal12(10), 13119– 13131 (2024)

2024
[24]

In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., Zhang, H.: Dist- serve: Disaggregating prefill and decoding for goodput-optimized large language model serving. In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 193–210 (2024)

2024

[1] [1]

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Aggarwal, P., Welleck, S.: L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

Agrawal, A., Kedia, N., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B.S., Tu- manov, A., Ramjee, R.: Taming throughput-latency tradeoff in llm inference with sarathi-serve. In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 117–134 (2024)

2024

[3] [3]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) (2025)

Arora, D., Zanette, A.: Training language models to reason efficiently. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) (2025)

2025

[4] [4]

In: International Conference on Wireless Artificial Intelligent Computing Systems and Applications (WASA)

Chen, Q., Gao, H., Yi, M., Li, J., Cheng, L., Li, Y.: Latency-optimal and memory- aware model partitioning for cooperative inference at the edge. In: International Conference on Wireless Artificial Intelligent Computing Systems and Applications (WASA). pp. 25–37 (2025)

2025

[5] [5]

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Chen, X., Xu, J., Liang, T., He, Z., Pang, J., Yu, D., Song, L., Liu, Q., Zhou, M., Zhang, Z., et al.: Do not think that much for 2+ 3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

In: Proceedings of the 62nd Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers)

Elhoushi, M., Shrivastava, A., Liskovich, D., Hosmer, B., Wasti, B., Lai, L., Mah- moud, A., Acun, B., Agarwal, S., Roman, A., et al.: Layerskip: Enabling early exit inference and self-speculative decoding. In: Proceedings of the 62nd Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12622–12642 (2024) 12 Z. Tang et al

2024

[8] [8]

IEEE Transactions on Sustainable Computing10(4), 678–689 (2025)

Fu, B., Chen, F., Li, P., Zeng, D.: Serving transformer models via joint requst scheduling and batching in the network edge. IEEE Transactions on Sustainable Computing10(4), 678–689 (2025)

2025

[9] [9]

In: ICLR 2025 Workshop on Foundation Models in the Wild (2025)

Fu, Y., Chen, J., Zhuang, Y., Fu, Z., Stoica, I., Zhang, H.: Reasoning without self- doubt: More efficient chain-of-thought through certainty probing. In: ICLR 2025 Workshop on Foundation Models in the Wild (2025)

2025

[10] [10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

In: The Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) (2021)

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the math dataset. In: The Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) (2021)

2021

[12] [12]

Qwen2.5-Coder Technical Report

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al.: Qwen2. 5-coder technical report. arXiv preprint arXiv:2409.12186 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Kang, Y., Sun, X., Chen, L., Zou, W.: C3ot: Generating shorter chain-of-thought without compromising effectiveness. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 24312–24320 (2025)

2025

[14] [14]

In: Proceedings of the 29th symposium on operating systems prin- ciples (SOSP)

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with pagedattention. In: Proceedings of the 29th symposium on operating systems prin- ciples (SOSP). pp. 611–626 (2023)

2023

[15] [15]

arXiv preprint arXiv:2505.17052 (2025)

Park, J., Cho, S., Han, D.: Specedge: Scalable edge-assisted serving framework for interactive llms. arXiv preprint arXiv:2505.17052 (2025)

work page arXiv 2025

[16] [16]

In: First conference on language modeling (COLM) (2024)

Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J., Bowman, S.R.: Gpqa: A graduate-level google-proof q&a benchmark. In: First conference on language modeling (COLM) (2024)

2024

[17] [17]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Snell, C., Lee, J., Xu, K., Kumar, A.: Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

Sun, B., Huang, Z., Zhao, H., Xiao, W., Zhang, X., Li, Y., Lin, W.: Llumnix: Dy- namic scheduling for large language model serving. In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 173–191 (2024)

2024

[19] [19]

Chain of draft: Thinking faster by writing less.arXiv preprint arXiv:2502.18600, 2025a

Xu, S., Xie, W., Zhao, L., He, P.: Chain of draft: Thinking faster by writing less. arXiv preprint arXiv:2502.18600 (2025)

work page arXiv 2025

[20] [20]

IEEE transactions on Mobile computing (2025)

Xu, X., Hu, Y., Cui, G., Qi, L., Dou, W., Cai, Z.: Cadec: a combinatorial auction for dynamic distributed dnn inference scheduling in edge-cloud networks. IEEE transactions on Mobile computing (2025)

2025

[21] [21]

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T

Yang, C., Si, Q., Duan, Y., Zhu, Z., Zhu, C., Li, Q., Chen, M., Lin, Z., Wang, W.: Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895 (2025)

work page arXiv 2025

[22] [22]

In: 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

Yu, G.I., Jeong, J.S., Kim, G.W., Kim, S., Chun, B.G.: Orca: A distributed serving system for{Transformer-Based}generative models. In: 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 521–538 (2022)

2022

[23] [23]

IEEE Internet of Things Journal12(10), 13119– 13131 (2024)

Zhang, M., Shen, X., Cao, J., Cui, Z., Jiang, S.: Edgeshard: Efficient llm inference via collaborative edge computing. IEEE Internet of Things Journal12(10), 13119– 13131 (2024)

2024

[24] [24]

In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., Zhang, H.: Dist- serve: Disaggregating prefill and decoding for goodput-optimized large language model serving. In: 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 193–210 (2024)

2024