Budgeted LoRA treats LLM distillation as structured compute allocation under a single global budget, producing student models with tunable inference speedups of 1.74x to 4.05x while controlling perplexity and task accuracy.
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary
fields: cs.LG 3
roles: background 1
polarities: background 1
representative citing papers
An adaptive compute-optimal strategy for scaling LLM test-time compute achieves over 4x efficiency gains versus best-of-N and lets smaller models outperform 14x larger ones on some problems.
citing papers explorer
- Budgeted LoRA: Distillation as Structured Compute Allocation for Efficient Inference
Budgeted LoRA treats LLM distillation as structured compute allocation under a single global budget, producing student models with tunable inference speedups of 1.74x to 4.05x while controlling perplexity and task accuracy.
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
An adaptive compute-optimal strategy for scaling LLM test-time compute achieves over 4x efficiency gains versus best-of-N and lets smaller models outperform 14x larger ones on some problems.
- LT2: Linear-Time Looped Transformers