Model Compression with Exact Budget Constraints via Riemannian Manifolds
Pith reviewed 2026-05-09 19:51 UTC · model grok-4.3
The pith
Under softmax relaxation, the budget constraint forms a Riemannian manifold whose simple geometry allows exact-budget first-order optimization of the model loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the softmax relaxation the level set of logit vectors whose expected cost equals the prescribed budget is a smooth Riemannian manifold whose normal vector equals the cost vector. This structure supplies a closed-form orthogonal projection onto the tangent space, a monotonic retraction obtained by binary search along the cost direction, and vector transport that reduces to an inner product, so that any first-order optimizer can be turned into an exact-budget method by three inexpensive augmentations.
What carries the argument
The Riemannian manifold in logit space defined by the linear budget constraint once probabilities are formed by softmax, together with its closed-form normal, tangent projection, and retraction operator.
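To make the three augmentations concrete, here is a minimal NumPy sketch of the tangent projection, binary-search retraction, and momentum transport described above. It is a reconstruction from the claim, not the authors' implementation: the per-group weights w, the shared per-option cost vector c, the binary-search bracket, and the plain gradient step standing in for Adam are assumptions made for illustration.

```python
# Minimal NumPy sketch of the three manifold operations (tangent projection,
# binary-search retraction, momentum transport). Reconstructed from the claim,
# not the authors' code: the per-group weights `w`, the shared per-option cost
# vector `c`, and the binary-search bracket are assumptions for illustration.
import numpy as np

def softmax(alpha):
    z = alpha - alpha.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_cost(alpha, c, w):
    """C(alpha) = sum_i w_i <softmax(alpha_i), c>."""
    return float(w @ (softmax(alpha) @ c))

def cost_normal(alpha, c, w):
    """Closed-form Euclidean normal: entries w_i * p_ik * (c_k - E_{p_i}[c])."""
    p = softmax(alpha)
    mu = p @ c                                   # per-group expected cost
    return w[:, None] * p * (c[None, :] - mu[:, None])

def project_tangent(v, normal):
    """Orthogonal projection onto the tangent space: one inner product."""
    return v - (np.vdot(v, normal) / np.vdot(normal, normal)) * normal

def retract(alpha, c, w, budget, lo=-50.0, hi=50.0, tol=1e-10):
    """Shift all logits by t * c; expected cost is monotone in t
    (dC/dt = sum_i w_i Var_{p_i}(c) >= 0), so binary search restores C = budget."""
    shift = np.tile(c, (alpha.shape[0], 1))
    def gap(t):
        return expected_cost(alpha + t * shift, c, w) - budget
    assert gap(lo) <= 0.0 <= gap(hi), "bracket assumed to straddle the budget"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if gap(mid) < 0.0 else (lo, mid)
    return alpha + 0.5 * (lo + hi) * shift

def transport(momentum, new_normal):
    """Momentum transport: re-project onto the tangent space at the new point."""
    return project_tangent(momentum, new_normal)

# One exact-budget first-order step (plain gradient stands in for Adam).
rng = np.random.default_rng(0)
N, K = 8, 4
c = np.array([1.0, 2.0, 4.0, 8.0])               # per-option costs, e.g. bit-widths
w = np.ones(N) / N                               # per-group weights (assumed uniform)
budget = 3.0
alpha = retract(rng.normal(size=(N, K)), c, w, budget)    # start on the manifold
g = rng.normal(size=(N, K))                      # stand-in for the loss gradient
normal = cost_normal(alpha, c, w)
step = project_tangent(-0.1 * g, normal)         # tangent-projected descent direction
alpha = retract(alpha + step, c, w, budget)      # back onto the exact-budget level set
momentum = transport(step, cost_normal(alpha, c, w))
print(round(expected_cost(alpha, c, w), 8))      # ~ 3.0 at every iterate
```

Because the projection and the transport are each a single inner product with the closed-form normal, the per-iteration overhead beyond the base optimizer reduces to a handful of vector operations plus the binary search.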
If this is right
- Non-decomposable objectives that depend on the joint assignment become directly optimizable by gradient methods.
- The budget is satisfied exactly at every iterate without introducing Lagrange multipliers or penalty coefficients.
- No constraint-specific hyperparameters are required beyond the settings of the base optimizer.
- The same manifold construction works for any linear cost constraint once the expectation is taken under softmax.
Where Pith is reading between the lines
- The same manifold view could be applied to other linear resource constraints that arise in routing, scheduling, or hardware allocation.
- Because the retraction needs only a fixed number of binary-search steps regardless of dimension, the approach should remain practical when the number of groups N grows to tens of thousands.
- Replacing binary-search retraction with a learned or analytic map might further reduce per-iteration overhead in very large models.
Load-bearing premise
The continuous softmax relaxation must remain sufficiently faithful to the underlying discrete one-hot assignments that optimizing on the manifold produces high-quality discrete solutions after the final projection step.
What would settle it
Solve a small instance whose globally optimal discrete assignment is known by exhaustive enumeration; if the final discrete selection recovered by dynamic programming violates the budget or returns strictly higher loss than the enumerated optimum, the claimed geometric properties do not guarantee optimality.
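A toy version of that check can be scripted directly. The sketch below enumerates every feasible assignment of a small instance, takes the lowest-loss one as ground truth, and compares it with the assignment returned by a budget-constrained knapsack-style dynamic program standing in for the final rounding step; the quadratic loss and the per-group scores fed to the DP are stand-ins for the model loss and the relaxed probabilities, and integer costs are assumed so the DP state space is exact.

```python
# Toy version of the falsification test: enumerate every feasible assignment of a
# small instance, take the lowest-loss one as ground truth, and compare it with the
# assignment returned by a budget-constrained knapsack DP (the rounding step).
# The quadratic loss and the per-group scores are stand-ins, not the paper's
# objective; integer costs are assumed so the DP state space is exact.
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, K, budget = 5, 3, 9
costs = np.array([1, 2, 4])                      # integer per-option costs
A = rng.normal(size=(N * K, N * K))
Q = A @ A.T                                      # couples groups: loss is non-decomposable

def true_loss(assign):
    """Joint loss of a discrete assignment (one option per group)."""
    x = np.zeros(N * K)
    for i, k in enumerate(assign):
        x[i * K + k] = 1.0
    return float(x @ Q @ x)

# Ground truth by exhaustive enumeration over all feasible assignments.
feasible = (a for a in itertools.product(range(K), repeat=N)
            if sum(costs[k] for k in a) <= budget)
best = min(feasible, key=true_loss)

# Rounding step: knapsack DP maximizing a per-group score under the budget.
scores = rng.normal(size=(N, K))                 # stand-in for relaxed probabilities
NEG = -1e18
dp = np.full((N + 1, budget + 1), NEG)
dp[0, :] = 0.0
choice = np.zeros((N + 1, budget + 1), dtype=int)
for i in range(1, N + 1):
    for b in range(budget + 1):
        for k in range(K):
            if costs[k] <= b and dp[i - 1, b - costs[k]] > NEG / 2:
                val = dp[i - 1, b - costs[k]] + scores[i - 1, k]
                if val > dp[i, b]:
                    dp[i, b], choice[i, b] = val, k
assign, b = [], budget
for i in range(N, 0, -1):                        # walk the choices back
    k = choice[i, b]
    assign.append(k)
    b -= costs[k]
assign = assign[::-1]

print("DP assignment within budget:", sum(costs[k] for k in assign) <= budget)
print("enumerated optimum loss:", round(true_loss(best), 3))
print("DP-rounded loss:", round(true_loss(assign), 3))
```

The DP output is feasible by construction; whether its loss matches the enumerated optimum on larger, realistic instances is precisely what this test would probe.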
Original abstract
Assigning one of K options to each of N groups under a total cost budget is a recurring problem in efficient AI, including mixed-precision quantization, non-uniform pruning, and expert selection. The objective, typically model loss, depends jointly on all assignments and does not decompose across groups, preventing combinatorial solvers from directly optimizing the true objective and forcing reliance on proxy formulations. Methods such as evolutionary search evaluate the actual loss but lack gradient information, while penalty-based approaches enforce the budget only approximately and often require extensive hyperparameter tuning. We present a new approach by showing that, under softmax relaxation, the budget constraint defines a smooth Riemannian manifold in logit space with unusually simple geometry. The normal vector admits a closed-form expression, shifting logits along the cost vector changes expected cost monotonically, and vector transport reduces to a single inner product. Building on these properties, we propose Riemannian Constrained Optimization (RCO), which augments a standard Adam step with tangent projection, binary-search retraction, and momentum transport. Combined with Gumbel straight-through estimation and budget-constrained dynamic programming for discrete feasibility, RCO enables first-order optimization of the actual loss under exact budget enforcement without introducing constraint-specific hyperparameters. Across both synthetic benchmarks and realistic LLM compression settings, RCO matches or exceeds state-of-the-art methods while often requiring substantially less wall-clock time. Source code is available at https://github.com/IST-DASLab/RCO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that under a softmax relaxation of discrete group assignments, the exact total-cost budget constraint defines a smooth embedded hypersurface (Riemannian manifold) in logit space whose Euclidean normal vector has a closed-form expression (per-group p_i ⊙ (c_i − μ_i)), whose level sets are crossed monotonically by shifts along the cost vector (derivative equals sum of per-group cost variances), and on which tangent projection and vector transport reduce to a single inner product. Building on these properties, the authors introduce the Riemannian Constrained Optimization (RCO) algorithm that augments an Adam step with tangent projection, binary-search retraction onto the exact budget level set, and momentum transport; combined with Gumbel straight-through estimation and a final budget-constrained dynamic program, RCO optimizes the true non-decomposable loss while enforcing the budget exactly and without constraint-specific hyperparameters. Experiments on synthetic benchmarks and realistic LLM compression tasks show that RCO matches or exceeds prior art while often using less wall-clock time.
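For readers less familiar with Riemannian optimization, the geometry referenced in the summary can be written out explicitly; the notation below (per-group weights w_i, shared option-cost vector c, budget B) is reconstructed from the quoted formulas rather than taken verbatim from the paper.

```latex
% Expected cost and budget manifold under the softmax relaxation
% (reconstruction; w_i, c, and B are assumed notation).
\[
C(\alpha) = \sum_{i=1}^{N} w_i \,\langle \operatorname{softmax}(\alpha_i),\, c \rangle,
\qquad
\mathcal{M} = \{\alpha : C(\alpha) = B\}.
\]
% Closed-form Euclidean normal, obtained by differentiating the softmax directly.
\[
\frac{\partial C}{\partial \alpha_{ik}}
  = w_i \, p_{ik}\,\bigl(c_k - \mathbb{E}_{p_i}[c]\bigr),
\qquad p_i = \operatorname{softmax}(\alpha_i).
\]
% Tangent projection (and momentum transport) for an embedded level set:
% a single inner product with the normal.
\[
P_\alpha(v) = v - \frac{\langle v,\, \nabla C(\alpha)\rangle}{\lVert \nabla C(\alpha)\rVert^2}\,\nabla C(\alpha).
\]
```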
Significance. If the geometric claims and the correctness of the retraction/transport steps hold, the work provides a principled, hyperparameter-free route to exact-budget discrete optimization that directly optimizes the target loss rather than a proxy. This is potentially significant for mixed-precision quantization, non-uniform pruning, and expert routing, where penalty methods require extensive tuning and combinatorial or evolutionary search lacks gradients. The open-source implementation and the reduction of manifold operations to elementary vector operations are additional strengths that support reproducibility and practical adoption.
major comments (1)
- [§3] The central guarantee that binary-search retraction reaches the exact budget level set relies on the monotonicity property (derivative = sum of per-group variances > 0 unless all groups are deterministic). While the skeptic note confirms this follows from standard embedded-hypersurface calculus, the manuscript should include the explicit derivative calculation (likely in §3) rather than stating it only in the abstract, because any hidden approximation here would undermine the “exact budget enforcement” claim.
minor comments (2)
- [Abstract] The abstract asserts that “vector transport reduces to a single inner product”; a one-sentence reminder of the standard formula for the projection onto the tangent space of an embedded hypersurface would improve readability for readers unfamiliar with Riemannian optimization.
- [Experiments] Table captions and axis labels in the experimental section should explicitly state whether the reported cost is the expected cost under the relaxed distribution or the realized cost after the final DP rounding step.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the recommendation for minor revision. We address the single major comment below.
Point-by-point responses
Referee: [§3] The central guarantee that binary-search retraction reaches the exact budget level set relies on the monotonicity property (derivative = sum of per-group variances > 0 unless all groups are deterministic). While the skeptic note confirms this follows from standard embedded-hypersurface calculus, the manuscript should include the explicit derivative calculation (likely in §3) rather than stating it only in the abstract, because any hidden approximation here would undermine the “exact budget enforcement” claim.
Authors: We agree that the explicit derivative calculation should appear in the main text. The monotonicity property follows directly from the chain rule applied to the embedded hypersurface defined by the softmax-relaxed budget constraint. In the revised manuscript we will insert a short derivation in §3 (immediately preceding the binary-search retraction paragraph) that computes dE[cost]/dλ = ∑_i Var_i(cost) and shows the sum is strictly positive unless every group is deterministic. This addition will make the exactness of the level-set retraction fully self-contained without reference to the abstract or any external note. Revision: yes.
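A sketch of the promised derivation, reconstructed from the quantities named in the rebuttal (the per-group weights w_i and the shift parameter λ are notational assumptions):

```latex
% Monotonicity along the cost direction: shift every group's logits by
% \lambda times the cost vector and differentiate the expected cost.
\[
\frac{d}{d\lambda}\, C(\alpha + \lambda\,\tilde c)
  = \sum_{i=1}^{N} \sum_{k=1}^{K} \frac{\partial C}{\partial \alpha_{ik}}\, c_k
  = \sum_{i=1}^{N} w_i \sum_{k=1}^{K} p_{ik}\,\bigl(c_k - \mathbb{E}_{p_i}[c]\bigr)\, c_k
  = \sum_{i=1}^{N} w_i\, \operatorname{Var}_{p_i}(c),
\qquad p_i = \operatorname{softmax}(\alpha_i + \lambda c).
\]
```

The sum is strictly positive unless every group's distribution is a point mass, so the expected cost is strictly increasing in λ and the binary search meets the budget level set at a unique point.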
Circularity Check
No significant circularity
Full rationale
The derivation begins from the explicit definition of expected cost under independent per-group softmax relaxations and computes its Euclidean gradient (the normal) via direct differentiation, yielding the closed-form p ⊙ (c − μ). Monotonicity of cost under shifts along the cost vector follows immediately from the non-negative variance identity, and tangent operations reduce to the standard inner-product projection for any embedded hypersurface. These identities are obtained by elementary calculus on the given manifold definition rather than by fitting, self-referential prediction, or load-bearing self-citation. Standard Riemannian retraction and transport are then applied without additional ansatz or uniqueness theorems imported from prior author work. The overall procedure is therefore self-contained and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Softmax relaxation of discrete group assignments is a valid continuous surrogate that preserves the ability to recover feasible discrete solutions via Gumbel straight-through and dynamic programming.
invented entities (1)
- Riemannian Constrained Optimization (RCO) algorithm (no independent evidence)
Reference graph
Works this paper leans on
- [1] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- [2] William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.
- [3] Xiaojie Jin, Jiang Wang, Joshua Slocum, Ming-Hsuan Yang, Shengyang Dai, Shuicheng Yan, and Jiashi Feng. RC-DARTS: Resource constrained differentiable architecture search. arXiv preprint arXiv:1912.12814, 2019.
- [4] Michael Lasby, Reza Bayat, Ivan Googler, Konstantinos N. Plataniotis, and Mahdi S. Hosseini. REAP the Experts: Why pruning prevails for one-shot MoE compression. arXiv preprint arXiv:2510.13999, 2025.
- [5] Shiyao Li, Xuefei Ning, Ke Hong, Tengxuan Liu, Luning Wang, Xiuhong Li, Kai Zhong, Guohao Dai, Huazhong Yang, and Yu Wang. LLM-MQ: Mixed-precision quantization for efficient LLM deployment. In NeurIPS 2023 Workshop on Efficient Natural Language and Speech Processing, pages 1–5, 2023.
- [6] Zongfang Liu, Shengkun Tang, Boyang Sun, Ping Wang, Zhiqiang Shen, and Xin Yuan. EvoESAP: Non-uniform expert pruning for sparse MoE. arXiv preprint arXiv:2603.06003, 2026.
- [7] Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, and Dan Alistarh. HIGGS: Pushing the limits of large language model quantization via the linearity theorem. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics, volume 1, pages 10857–10886. Association for Computational Linguistics, 2025.
- [8] Junchen Zhao, Ali Derakhshan, Dushyant Bharadwaj, Jayden Kana Hyman, Junhao Dong, Sangeetha Abdu Jyothi, and Ian Harris. IMPQ: Interaction-aware layerwise mixed precision quantization for LLMs. arXiv preprint arXiv:2509.15455, 2025.
- [10] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008. Cited for the definition of a retraction (Definition 4.1) used to verify the retraction axioms.
- [11] Nicolas Boumal. An Introduction to Optimization on Smooth Manifolds. Cambridge University Press, 2023. Cited for the characterization of the Riemannian gradient (Proposition 3.61) used to verify the tangent-projected gradient.
- [15] Oliver Sieberling et al. EvoPress: Towards optimal dynamic model compression via evolutionary search, 2025. Cited as the evolutionary-search baseline and for its recommended search budget of 100 generations.