End-to-End Dynamic Sparsity for Resource-Adaptive LLM Inference

arXiv: 2606.27743 · v1 · pith:6PIULZTInew · submitted 2026-06-26 · cs.IR · cs.AI · cs.LG

Yuhang Chen, Jinhao Duan, Ruichen Zhang, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Parish Aggarwal, Frank Shyu, Luke Simon, Sandeep Pandey, Tianlong Chen, Xi Liu

Pith reviewed 2026-06-29 03:22 UTC · model grok-4.3

classification: cs.IR · cs.AI · cs.LG

keywords: dynamic sparsity · resource adaptive inference · LLM · gating networks · layer skipping · Pareto frontier · end-to-end training · budget conditioned

The pith

L2A trains a single LLM with budget-conditioned gates to dynamically skip layers, prune heads, and reduce tokens while staying within 0.6% of dense accuracy across resource budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLMs can be made resource-adaptive by integrating lightweight gating networks that condition on both the input and the current runtime budget. These gates are trained end-to-end using a unified objective covering task performance, logical consistency, and resource usage through layer skipping, head pruning, and token reduction. A sympathetic reader would care because this allows one model to handle fluctuating cloud resources like spot instances without crashing or wasting compute, unlike static models that need separate tuning for each budget. The result is a single model tracing the full compute-accuracy Pareto frontier on models like Llama-3-8B.

Core claim

We propose Learning to Allocate (L2A), an end-to-end framework where lightweight budget-conditioned and input-aware gating networks are integrated into the LLM and trained via a unified objective that jointly optimizes task performance, logical consistency, and resource costs. This enables the model to adaptively configure its computational footprint with respect to real-time resource dynamics, maximizing reasoning depth when resources permit while enforcing frugality when budgets tighten. A single L2A model traces the entire compute-accuracy Pareto frontier on Llama-3-8B and Qwen-3-4B at up to 34% layer sparsity while staying within 0.6% of the dense baseline on GSM8K and zero-shot OOD tasks.

What carries the argument

Lightweight budget-conditioned and input-aware gating networks integrated into the LLM and trained with a three-axis unified objective for layer skipping, head pruning, and reasoning-token reduction.
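The abstract does not describe the gate architecture, so the following is only a minimal sketch of what "budget-conditioned and input-aware" could mean for a single layer-skipping gate. It assumes a linear read-out of a pooled hidden state plus a scalar budget in [0, 1]; the function names, the weights, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def gate_keep_probability(
    pooled_hidden: Sequence[float],
    budget: float,
    weights: Sequence[float],
    budget_weight: float,
    bias: float,
) -> float:
    """Probability of executing a layer, given input features and a budget in [0, 1]."""
    if len(pooled_hidden) != len(weights):
        raise ValueError("pooled_hidden and weights must have the same length")
    # Input-aware part: a linear read-out of the pooled hidden state.
    logit = sum(h * w for h, w in zip(pooled_hidden, weights))
    # Budget-conditioned part: the budget shifts the logit directly (assumed form).
    logit += budget_weight * budget + bias
    return 1.0 / (1.0 + math.exp(-logit))


def should_run_layer(keep_probability: float, threshold: float = 0.5) -> bool:
    """Hard decision at inference time (hypothetical thresholding rule)."""
    return keep_probability >= threshold
```

With a positive budget_weight, the same input is more likely to keep the layer when the budget is generous and more likely to skip it when the budget is tight, which is the behaviour the pith attributes to the gates.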

If this is right

One model suffices for all resource budgets instead of training separate static models for each.
At 34% realized layer sparsity the accuracy drop stays under 0.6% on GSM8K and holds on out-of-distribution tasks.
Static and heuristic baselines require separate tuning per budget and suffer 5-10% drops at similar inference times.
The approach applies across Llama-3-8B and Qwen-3-4B without post-training adjustments.
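To make the two headline quantities concrete, here is a small sketch of how realized layer sparsity and the accuracy gap might be computed from logged gate decisions. It assumes sparsity is the fraction of skipped layer executions averaged over inputs, and that "within 0.6%" means percentage points; the abstract does not state either definition. For scale, Llama-3-8B has 32 decoder layers, so 34% sparsity would correspond to roughly 11 skipped layers per input on average.

```python
from typing import Sequence


def realized_layer_sparsity(executed: Sequence[Sequence[bool]]) -> float:
    """Fraction of layer executions skipped, pooled over all inputs.

    executed[i][l] is True when layer l ran for input i.
    """
    total = sum(len(row) for row in executed)
    if total == 0:
        raise ValueError("no gate decisions supplied")
    skipped = sum(1 for row in executed for ran in row if not ran)
    return skipped / total


def accuracy_gap_points(dense_accuracy: float, sparse_accuracy: float) -> float:
    """Gap in percentage points; positive means the sparse model is worse."""
    return (dense_accuracy - sparse_accuracy) * 100.0
```

Both helpers take fractions in [0, 1]; the names are illustrative only.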

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deploying in volatile cloud environments could reduce the need for over-provisioning hardware for peak loads.
Extending the gating to other modalities or larger models might further improve efficiency under variable constraints.
The three-axis objective could generalize to additional resource axes like energy consumption if measured during training.

Load-bearing premise

The lightweight gating networks trained with the unified objective produce decisions that preserve task performance and logical consistency across budgets without adding prohibitive overhead.

What would settle it

Evaluate a single L2A model at 34% layer sparsity on a held-out task and a held-out base model. The claim fails if accuracy drops by more than 1% relative to dense, or if the model does not match or beat tuned static baselines at the same latency.
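The test above can be written as a simple predicate. This is one reading of the criterion, assuming the 1% threshold is in percentage points and that the static baseline is measured at the same latency as L2A; the function name and default are hypothetical.

```python
def l2a_claim_survives(
    dense_accuracy: float,
    l2a_accuracy: float,
    best_static_accuracy: float,
    max_drop_points: float = 1.0,
) -> bool:
    """Return True when a held-out measurement is consistent with the claim.

    All accuracies are fractions in [0, 1]. L2A and the static baseline are
    assumed to be measured at the same latency.
    """
    drop_points = (dense_accuracy - l2a_accuracy) * 100.0
    within_tolerance = drop_points <= max_drop_points
    matches_static = l2a_accuracy >= best_static_accuracy
    return within_tolerance and matches_static
```

A single failing held-out pair would not by itself refute the paper, but it would show that the reported gap does not transfer to that setting.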

Original abstract

Large Language Models (LLMs) inference is typically deployed under a static resource assumption, where models execute a fixed computational graph regardless of the runtime environment. However, real-world cloud infrastructure is inherently dynamic, characterized by fluctuating availability (e.g., spot instance preemption) and tiered Quality-of-Service requirements. In such volatile settings, static models are inflexible: they either crash under resource constraints or waste compute on redundant operations. To bridge this gap, we propose Learning to Allocate (L2A), an end-to-end framework for resource-adaptive inference. Unlike prior methods that condition only on input difficulty, we formulate inference as a constrained allocation problem conditioned on both the input and the runtime resource budget itself. We introduce lightweight, budget-conditioned and input-aware gating networks integrated into the LLM. These gates are trained via a unified objective that jointly optimizes task performance, logical consistency, and resource costs along three axes matching how real-world dynamics manifest: layer skipping for memory and depth pressure, head pruning for throughput contention, and reasoning-token reduction for latency tightening. This lets the model learn a budget-aware policy beyond input difficulty alone: it adaptively configures its computational footprint with respect to real-time resource dynamics, maximizing reasoning depth when resources permit while enforcing strict frugality when budgets tighten. A single L2A model traces the entire compute-accuracy Pareto frontier on Llama-3-8B and Qwen-3-4B: at up to 34% realized layer sparsity, it stays within 0.6% of the dense baseline on GSM8K, with the same gap holding zero-shot on out-of-distribution tasks, while every static or heuristic baseline requires a separately tuned model and still drops by 5-10% at comparable inference time.
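The abstract gives no equation for the unified objective. One hypothetical form that is consistent with its description, offered only as a reading aid, is a task loss plus a weighted consistency term plus a hinge penalty on each sparsity axis. Every symbol below is an assumption of this sketch and not the authors' notation: b = (b_layer, b_head, b_token) is the runtime budget, c_a is the realized cost on axis a under the gates, and lambda_c and lambda_r are trade-off weights.

```latex
\[
\mathcal{L}(x, y, b)
  = \mathcal{L}_{\text{task}}(x, y)
  + \lambda_c \, \mathcal{L}_{\text{cons}}(x, b)
  + \lambda_r \sum_{a \in \{\text{layer},\, \text{head},\, \text{token}\}}
      \max\bigl(0,\; c_a(x, b) - b_a\bigr)
\]
```

Under this form the resource term vanishes whenever realized cost stays within budget, so the gates would be free to spend compute when resources permit and be penalized only when they overshoot. Whether the paper uses a hinge, a Lagrangian, or something else is not stated.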

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

One model with budget-and-input conditioned gates for layer, head, and token sparsity keeps accuracy close to dense at moderate sparsity levels, but the abstract leaves the logical consistency term and training details unspecified.


The main point is that a single L2A model learns to adjust layer skipping, head pruning, and token reduction by feeding both the input and the runtime budget into lightweight gates, and it reports staying within 0.6% of the dense baseline on GSM8K at 34% realized layer sparsity while holding the gap on zero-shot out-of-distribution tasks.

The new element is the explicit end-to-end training of gates that condition on the budget itself rather than input difficulty alone. The three sparsity axes are chosen to match common cloud pressures, and the unified objective tries to trade off task performance, consistency, and cost in one pass. That framing directly targets the static-graph problem in volatile environments, and the reported numbers show the model tracing a Pareto curve where static and heuristic baselines each need separate tuning and lose more accuracy.
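For readers who want to check a "traces the Pareto curve" claim against their own measurements, the frontier of a set of (compute cost, accuracy) points can be extracted as follows. This is a generic sketch, not the paper's evaluation code, and any numbers fed to it are the reader's own.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (compute cost, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep the points that no cheaper-or-equal point matches or beats on accuracy."""
    frontier: List[Point] = []
    best_accuracy = float("-inf")
    # Sort by ascending cost; ties are broken by descending accuracy.
    for cost, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            frontier.append((cost, accuracy))
            best_accuracy = accuracy
    return frontier
```

If the single L2A model really dominates, its operating points at different budgets should make up the frontier, with the separately tuned static baselines falling below it.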

The soft spots sit in the missing mechanics. The abstract states the objective jointly optimizes the three factors but gives no equation, no form of the consistency term, and no training hyperparameters or error bars. If the consistency component reduces to the task loss or is weak, the small gap could be tied to the particular training distribution instead of a general property of the dynamic policy. Gate overhead at inference time is also unaddressed, which matters for the resource-adaptive claim.
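A back-of-envelope estimate can bound the gate overhead concern. The sketch below assumes one linear gate per layer that maps the hidden state plus a scalar budget to a single logit, which is a guess at the design and not something the abstract specifies. The Llama-3-8B dimensions used (hidden size 4096, MLP width 14336, 8 key/value heads of dimension 128) come from the public model configuration.

```python
def linear_gate_overhead(hidden_size: int, layer_parameters: int) -> float:
    """Ratio of gate weights to the weights of one transformer layer.

    Assumes a single linear gate per layer (hypothetical design) and uses
    weight count as a rough proxy for per-token compute.
    """
    gate_parameters = hidden_size + 1 + 1  # hidden weights, budget weight, bias
    return gate_parameters / layer_parameters


def llama3_8b_layer_parameters() -> int:
    """Weight count of one Llama-3-8B decoder layer, ignoring norm weights."""
    hidden, intermediate = 4096, 14336
    kv_dim = 8 * 128  # 8 key/value heads of dimension 128
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q, o and k, v
    mlp = 3 * hidden * intermediate  # gate, up, and down projections
    return attention + mlp
```

Under these assumptions the ratio is on the order of 1e-5, so arithmetic cost is unlikely to be the issue. The harder overhead to measure is likely to be elsewhere, for example in pooling the input, in irregular control flow, or in batching requests that skip different layers, and this estimate says nothing about those.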

This is aimed at inference researchers and serving engineers who deal with spot instances and tiered QoS. A reader working on practical efficiency would find the problem setup and the three-axis design useful even before the full methods are checked.

It deserves a serious referee because the deployment mismatch is real and the proposed structure is a clear extension of prior input-only work. The concrete numbers make it worth verifying the objective and baselines.

I would send it to peer review with the expectation that the authors supply the exact objective, training procedure, and controls on the consistency term.

Referee Report

2 major / 0 minor

Summary. The paper proposes Learning to Allocate (L2A), an end-to-end framework for resource-adaptive LLM inference using lightweight budget-conditioned and input-aware gating networks integrated into models like Llama-3-8B and Qwen-3-4B. These gates are trained via a unified objective jointly optimizing task performance, logical consistency, and resource costs across three axes (layer skipping, head pruning, reasoning-token reduction). The central claim is that a single L2A model traces the full compute-accuracy Pareto frontier, achieving within 0.6% of dense baseline on GSM8K at up to 34% realized layer sparsity (with the gap holding zero-shot on OOD tasks), while static or heuristic baselines require separate per-budget tuning and drop 5-10%.

Significance. If the results hold with proper verification, this would be a meaningful contribution to dynamic LLM inference in volatile environments such as cloud spot instances. The end-to-end training of a single budget-aware model that adapts its computational graph without retraining or post-hoc adjustments could reduce deployment overhead compared to maintaining multiple static variants. The three-axis formulation matching real-world dynamics is a conceptual strength if the consistency enforcement is rigorously specified and shown to be non-collapsing.

major comments (2)

[Abstract] Abstract: The unified objective is stated to jointly optimize 'task performance, logical consistency, and resource costs' but supplies no equation, auxiliary loss term, or supervision signal for the logical consistency axis. This is load-bearing for the 0.6% gap claim at 34% sparsity, because without an explicit mechanism the observed performance could be an artifact of the training distribution rather than a general property of the dynamic policy (directly matching the stress-test concern on whether the term prevents harmful skipping patterns).
[Abstract] Abstract / Methods: No training details, baseline implementations, error bars, or verification that the lightweight gates produce decisions preserving logical consistency across budgets (without prohibitive overhead) are provided. This directly impacts the soundness rating and prevents assessment of whether the single-model Pareto claim is reproducible or generalizes beyond the reported GSM8K setting.
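The request for error bars can be made concrete with a simple binomial estimate. The sketch below treats each test problem as an independent trial; the GSM8K test split has 1,319 problems, while the accuracy value plugged in is a placeholder chosen by the reader, not a figure from the paper.

```python
import math


def binomial_standard_error(accuracy: float, num_examples: int) -> float:
    """Standard error of an accuracy estimated from independent examples."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction in [0, 1]")
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / num_examples)
```

At a placeholder accuracy of 0.80 on 1,319 problems, the standard error is roughly 1.1 percentage points, which is larger than the reported 0.6% gap. A paired comparison on the same problems would be tighter than this unpaired estimate, but the calculation suggests why the referee's request for error bars and multiple seeds matters.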

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract's clarity regarding the unified objective and the need for additional implementation details. We address each major comment below and will revise the manuscript to incorporate the requested information.


Referee: [Abstract] Abstract: The unified objective is stated to jointly optimize 'task performance, logical consistency, and resource costs' but supplies no equation, auxiliary loss term, or supervision signal for the logical consistency axis. This is load-bearing for the 0.6% gap claim at 34% sparsity, because without an explicit mechanism the observed performance could be an artifact of the training distribution rather than a general property of the dynamic policy (directly matching the stress-test concern on whether the term prevents harmful skipping patterns).

Authors: We agree that the abstract does not include the explicit equation, auxiliary loss term, or supervision signal for the logical consistency axis. We will revise the abstract to add the full unified objective equation and a brief description of the logical consistency term (including how it is supervised) to clarify the mechanism and address the concern about potential artifacts in the training distribution.

Revision: yes

Referee: [Abstract] Abstract / Methods: No training details, baseline implementations, error bars, or verification that the lightweight gates produce decisions preserving logical consistency across budgets (without prohibitive overhead) are provided. This directly impacts the soundness rating and prevents assessment of whether the single-model Pareto claim is reproducible or generalizes beyond the reported GSM8K setting.

Authors: We acknowledge that the abstract does not provide these details. We will revise the manuscript to include key training details, baseline implementations, error bars, and verification of logical consistency preservation (with overhead analysis) either by expanding the abstract or adding a concise methods paragraph, thereby improving reproducibility and allowing assessment of the Pareto claim.

Revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation chain


The provided abstract and description contain no equations, derivations, or self-citations. The unified objective is described at a high level as jointly optimizing three axes, but no mathematical reduction, fitted parameter, or self-referential definition is exhibited that would make any result equivalent to its inputs by construction. The central claims are framed as empirical outcomes of end-to-end training rather than analytic predictions derived from prior steps within the paper. This is the expected self-contained case for a methods paper without a visible derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are identifiable. The gating networks are learned parameters whose training details are not provided. The unified objective is mentioned but not formalized.

pith-pipeline@v0.9.1-grok · 5900 in / 1288 out tokens · 75001 ms · 2026-06-29T03:22:58.130696+00:00 · methodology


