RAP: Runtime Adaptive Pruning for LLM Inference

2025 · cs.LG · arXiv 2505.17138

1 Pith paper cites this work. Polarity classification is still being indexed.

abstract

Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential way to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or to the heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP tracks the evolving ratio between model parameters and KV-cache during practical execution. Recognizing that FFNs house most parameters, whereas parameter-light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experimental results demonstrate that RAP outperforms state-of-the-art baselines, and it is the first approach to jointly consider model weights and KV-cache on the fly.
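
The budget trade-off described in the abstract can be illustrated with a small sketch. This is not RAP's method: the abstract says an RL agent makes the selection, whereas the code below swaps in a plain greedy utility-per-byte rule purely to show why the preferred set of components shifts as the KV-cache grows. The `Block` type, the `utility` scores, and all byte counts are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    weight_bytes: int         # memory held by the block's parameters
    kv_bytes_per_token: int   # KV-cache bytes per cached token (0 for FFNs)
    utility: float            # hypothetical importance score, not from the paper


def memory_cost(block: Block, cached_tokens: int) -> int:
    """Weights plus the KV-cache this block needs at the current context length."""
    return block.weight_bytes + block.kv_bytes_per_token * cached_tokens


def select_blocks(blocks, budget_bytes, cached_tokens):
    """Greedy stand-in for the RL policy: keep blocks by utility per byte
    until the memory budget is exhausted."""
    ranked = sorted(
        blocks,
        key=lambda b: b.utility / memory_cost(b, cached_tokens),
        reverse=True,
    )
    kept, used = [], 0
    for block in ranked:
        cost = memory_cost(block, cached_tokens)
        if used + cost <= budget_bytes:
            kept.append(block.name)
            used += cost
    return kept, used


# Toy model: FFNs are parameter-heavy, attention layers are KV-cache-heavy.
BLOCKS = [
    Block("ffn0", weight_bytes=100, kv_bytes_per_token=0, utility=1.0),
    Block("attn0", weight_bytes=20, kv_bytes_per_token=1, utility=1.0),
    Block("ffn1", weight_bytes=100, kv_bytes_per_token=0, utility=0.8),
    Block("attn1", weight_bytes=20, kv_bytes_per_token=1, utility=0.6),
]

if __name__ == "__main__":
    for tokens in (10, 100):
        print(tokens, select_blocks(BLOCKS, budget_bytes=250, cached_tokens=tokens))
```

With the same 250-byte budget, the toy selection keeps both attention blocks and one FFN at 10 cached tokens, but only one attention block and one FFN at 100 cached tokens, because each attention block's KV-cache cost grows with the context while FFN cost stays fixed.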

representative citing papers

SecRL-Prune: Structured Reinforcement Learning-Based Pruning of CodeLLMs for Preserving Adversarial Code Mutation

cs.CR · 2026-06-04 · unverdicted · novelty 5.0

SecRL-Prune learns layer-wise pruning policies via RL on CodeLLMs, preserving higher pass@k and var@k than baselines at 10-30% compression on HumanEval and enabling semantics-preserving mutations that reduce malware detections in a case study.
