Towards Fully Parameter-Free Stochastic Optimization: Grid Search with Self-Bounding Analysis
Pith reviewed 2026-05-10 07:46 UTC · model grok-4.3
The pith
Grasp, a grid-search framework with a self-bounding analysis, removes every unverifiable parameter bound from the inputs of stochastic optimization methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Grasp grid-search framework, equipped with a novel self-bounding analysis, yields fully parameter-free stochastic optimization algorithms whose inputs satisfy no unverifiable condition on the true problem parameters. For non-convex problems the method attains near-optimal convergence rates up to logarithmic factors. For convex problems the parameter-free variants remain competitive in both acceleration and universality. The framework also supplies a sharper guarantee for the final model-ensemble step under an interpolated variance characterization.
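For calibration, the benchmark behind "near-optimal" in the non-convex case is the rate of optimally tuned SGD on smooth problems (our rendering of the standard Ghadimi–Lan-style bound, not quoted from the paper; L is the smoothness constant, σ the noise level, Δ the initial optimality gap, and none of these is available to a fully parameter-free method):

```latex
\min_{1 \le t \le T} \; \mathbb{E}\!\left[\|\nabla f(x_t)\|^2\right]
  \;=\; O\!\left( \frac{L\Delta}{T} \;+\; \sigma \sqrt{\frac{L\Delta}{T}} \right),
\qquad \Delta := f(x_1) - \inf_x f(x).
```

Matching this up to logarithmic factors without being told L or σ is the content of the non-convex claim.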
What carries the argument
Grasp, a grid-search framework that uses self-bounding analysis to determine parameter search ranges from the algorithm's own iterates without external bounds.
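As a toy illustration of the grid-search skeleton described here (the fixed log-spaced grid and all names are ours; the paper's contribution is precisely the self-bounding derivation of the grid's endpoints, which this sketch does not reproduce):

```python
import numpy as np

def sgd(step, x0, grad, T, rng):
    """Run T steps of SGD with a fixed step size; return the final iterate."""
    x = x0.copy()
    for _ in range(T):
        x = x - step * grad(x, rng)
    return x

def grid_search_sgd(x0, grad, loss, T, grid, seed=0):
    """One SGD run per candidate step size; keep the candidate whose
    final iterate has the lowest observed loss."""
    best_x, best_val = None, np.inf
    for step in grid:
        rng = np.random.default_rng(seed)   # same noise stream per candidate
        x = sgd(step, x0, grad, T, rng)
        val = loss(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Toy smooth objective f(x) = 0.5 * ||x||^2 with additive gradient noise.
loss = lambda x: 0.5 * float(x @ x)
grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)

grid = np.logspace(-3, 0, num=8)            # illustrative fixed log grid
x_hat, val = grid_search_sgd(np.ones(5), grad, loss, T=200, grid=grid)
```

The parameter-free question is how to choose `grid`'s endpoints without knowing smoothness or noise constants; that is what the self-bounding analysis is claimed to supply.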
Load-bearing premise
The self-bounding analysis produces search ranges that always contain the optimal parameter values for the underlying problem.
What would settle it
A concrete stochastic optimization instance in which the self-bounding ranges exclude the best parameter, causing the method's convergence rate to fall short of a known tuned baseline by more than logarithmic factors.
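A minimal version of such a falsifying experiment might look like the following (the quadratic instance, step grid, and all names are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def run_sgd(step, L, T, sigma, rng, x0=5.0):
    """SGD on f(x) = (L/2) x^2 with noisy gradients; returns final suboptimality."""
    x = x0
    for _ in range(T):
        if abs(x) > 1e100:                  # diverged candidate
            return np.inf
        x = x - step * (L * x + sigma * rng.standard_normal())
    return 0.5 * L * x * x

L, sigma, T = 10.0, 0.5, 500

# Classically tuned step 1/L (uses the unknown constant) vs. a log grid
# that does not; identical noise seeds keep the runs comparable.
tuned = run_sgd(1.0 / L, L, T, sigma, np.random.default_rng(1))
best_grid = min(run_sgd(s, L, T, sigma, np.random.default_rng(1))
                for s in np.logspace(-4, 0, 16))
# A refuting instance would show best_grid worse than tuned by more than
# a polylogarithmic factor in T.
```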
Original abstract
Parameter-free stochastic optimization aims to design algorithms that are agnostic to the underlying problem parameters while still achieving convergence rates competitive with optimally tuned methods. While some parameter-free methods do not require the specific values of the problem parameters, they still rely on prior knowledge, such as the lower or upper bounds of them. We refer to such methods as ``partially parameter-free''. In this work, we target achieving ``fully parameter-free'' methods, i.e., the algorithmic inputs do not need to satisfy any unverifiable condition related to the true problem parameters. We propose a powerful and general grid search framework, named \textsc{Grasp}, with a novel self-bounding analysis technique that effectively determines the search ranges of parameters, in contrast to previous work. Our method demonstrates generality in: (i) the non-convex case, where we propose a fully parameter-free method that achieves near-optimal convergence rate, up to logarithmic factors; (ii) the convex case, where our parameter-free methods are competitive with strong performance in terms of acceleration and universality. Finally, we contribute a sharper guarantee for the model ensemble, a final step of the grid search framework, under interpolated variance characterization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Grasp, a grid search framework for stochastic optimization that uses a novel self-bounding analysis to set algorithmic parameter ranges (e.g., step sizes) without requiring any unverifiable conditions on unknown problem parameters such as smoothness constants, variance bounds, or strong-convexity moduli. In the non-convex stochastic setting the method is claimed to achieve near-optimal convergence rates up to logarithmic factors; in the convex setting it is claimed to deliver competitive accelerated and universal performance; an additional sharper guarantee is provided for the final model-ensemble step under an interpolated variance characterization.
Significance. If the self-bounding analysis is shown to be free of implicit dependence on unknown quantities, the work would constitute a meaningful advance toward truly fully parameter-free stochastic optimization, removing a practical barrier that persists in existing partially parameter-free methods. The claimed generality across non-convex and convex regimes plus the ensemble refinement would be valuable contributions.
Major comments (2)
- [§4] Non-convex analysis: the self-bounding lemmas must be checked to confirm that every upper and lower bound placed on the grid for the step size or other parameters is expressed solely in terms of quantities that are either known a priori or observable during execution. Any appearance of the unknown Lipschitz constant L or the noise-variance bound inside a derived range (even inside a logarithm, or as a loose overestimate) would render the method only partially parameter-free and contradict the central claim.
- [§5] Convex analysis: the universality and acceleration claims rest on the grid ranges being independent of the unknown strong-convexity parameter μ. The self-bounding argument must be shown to produce these ranges without reference to μ or any other unverifiable problem constant.
Minor comments (2)
- Notation for the grid-search ranges and the self-bounding quantities should be introduced with explicit definitions before first use to improve readability.
- The abstract and introduction would benefit from a short table contrasting Grasp with prior partially parameter-free methods on the exact set of required inputs.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review of our manuscript. We address the major comments point by point below, providing clarifications on the self-bounding analysis and noting the revisions made to strengthen the presentation.
Point-by-point responses
- Referee: [§4] Non-convex analysis: the self-bounding lemmas must be checked to confirm that every upper and lower bound placed on the grid for the step size or other parameters is expressed solely in terms of quantities that are either known a priori or observable during execution. Any appearance of the unknown Lipschitz constant L or the noise-variance bound inside a derived range (even inside a logarithm, or as a loose overestimate) would render the method only partially parameter-free and contradict the central claim.
Authors: We appreciate the referee's emphasis on this key requirement for fully parameter-free methods. Upon careful re-examination of the self-bounding lemmas in §4, all upper and lower bounds on the grid for step sizes and related parameters are expressed solely in terms of a priori known quantities (e.g., iteration horizon T and dimension d) and quantities observable during execution (e.g., realized gradient norms or function values at prior iterates). The analysis contains no dependence—explicit, implicit, or via logarithms—on the unknown Lipschitz constant L or noise variance bound. The self-bounding technique achieves this by using the algorithm's own trajectory to adaptively delimit the search range without presupposing problem constants. To make this independence fully transparent, we have added a clarifying paragraph and a short remark in the revised §4. revision: partial
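One simple way to make that concrete is a probe run that pins the candidate range to the horizon T and the observed gradient norms (a hedged sketch of our own; the paper's self-bounding lemmas are more refined than this):

```python
import numpy as np

def observable_step_range(grad, x0, probe_steps, rng, T):
    """Derive a step-size search range from runtime observables only:
    the known horizon T and the largest stochastic-gradient norm seen
    during a short, conservatively stepped probe run."""
    x = x0.copy()
    g_max = 0.0
    probe_step = 1.0 / T                    # conservative probe step, T is known
    for _ in range(probe_steps):
        g = grad(x, rng)
        g_max = max(g_max, float(np.linalg.norm(g)))
        x = x - probe_step * g
    # Endpoints depend only on T and the observed g_max, never on an
    # unknown Lipschitz constant or variance bound.
    lo = 1.0 / (T * max(g_max, 1e-12))
    hi = 1.0 / max(g_max, 1e-12)
    return lo, hi

grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
lo, hi = observable_step_range(grad, np.ones(5), probe_steps=20,
                               rng=np.random.default_rng(0), T=1000)
```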
- Referee: [§5] Convex analysis: the universality and acceleration claims rest on the grid ranges being independent of the unknown strong-convexity parameter μ. The self-bounding argument must be shown to produce these ranges without reference to μ or any other unverifiable problem constant.
Authors: We agree that independence from μ is essential for the claimed universality and acceleration. The self-bounding argument in §5 constructs the grid ranges using only known quantities (such as T) and runtime observables (such as observed function-value gaps or gradient statistics), with no reference to the unknown strong-convexity modulus μ or other unverifiable constants. This independence is what enables the competitive accelerated and universal performance guarantees. We have inserted an explicit statement together with a brief verification in the revised §5 to confirm that the range selection proceeds without invoking μ. revision: partial
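For the ensemble step the referee did not probe, a minimal sketch of the kind of final selection a grid-search framework performs (the construction and names are our assumptions; the paper's contribution is a sharper, interpolated-variance analysis of this step, not this naive version):

```python
import numpy as np

def ensemble_select(candidates, stoch_loss, n_eval, rng):
    """Score each grid candidate with fresh stochastic function evaluations
    and return the one with the lowest average score; the selection itself
    uses no unknown problem constants."""
    scores = [np.mean([stoch_loss(x, rng) for _ in range(n_eval)])
              for x in candidates]
    return candidates[int(np.argmin(scores))]

# Toy objective with additive evaluation noise.
stoch_loss = lambda x, rng: 0.5 * float(x @ x) + 0.05 * rng.standard_normal()
cands = [np.full(3, c) for c in (2.0, 0.5, 0.01)]
winner = ensemble_select(cands, stoch_loss, n_eval=50,
                         rng=np.random.default_rng(0))
```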
Circularity Check
No circularity detected; derivation remains self-contained
Full rationale
The abstract and description present a grid-search framework (Grasp) whose central technique is a self-bounding analysis that sets parameter ranges using only quantities observable during execution or known a priori, without embedding unknown problem constants inside the bounds. No equations, lemmas, or self-citations are supplied that would reduce any claimed convergence rate or range selection to a fitted quantity or prior result by construction. The contrast to previous work is stated at the level of the overall guarantee rather than through a load-bearing uniqueness theorem or ansatz imported from the authors' own earlier papers. Consequently the derivation chain does not collapse into its inputs.