pith. sign in

arxiv: 2510.07086 · v2 · pith:66OWUQQYnew · submitted 2025-10-08 · 💻 cs.LG

Non-Stationary Online Structured Prediction with Surrogate Losses

Pith reviewed 2026-05-18 08:36 UTC · model grok-4.3

classification 💻 cs.LG
keywords online structured predictionsurrogate lossesnon-stationarydynamic regretonline gradient descentFenchel-Young losstarget losspath length
0
0 comments X

The pith

The cumulative target loss in non-stationary online structured prediction is bounded by the surrogate loss of any comparator sequence plus a term depending on its path length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in online structured prediction tasks where the environment can change over time, one can still obtain useful performance guarantees by comparing to a sequence of estimators rather than a single fixed one. Specifically, the total target loss incurred is no worse than the total surrogate loss of the best varying sequence plus a penalty proportional to how much that sequence changes. This matters because in real applications like online classification with drifting data, fixed estimators can perform arbitrarily poorly as time progresses. The result is achieved by merging dynamic regret bounds for online gradient descent with methods that exploit the difference between target and surrogate losses. A Polyak-style learning rate is highlighted as particularly effective for producing these bounds.

Core claim

We obtain an upper bound of F_T + O(1 + P_T) on the cumulative target loss, where F_T is the cumulative surrogate loss of any comparator sequence and P_T is its path length. This bound depends on T only through F_T and P_T, thus offering stronger guarantees under non-stationarity. Our core idea is to combine the dynamic regret analysis of online gradient descent (OGD) with the exploit-the-surrogate-gap technique. This viewpoint sheds light on the usefulness of a Polyak-style learning rate for OGD, which systematically yields target-loss bounds and performs well empirically. We then extend our approach to broader settings beyond prior work via the convolutional Fenchel--Young loss. Finally, a

What carries the argument

Dynamic regret analysis of online gradient descent combined with the exploit-the-surrogate-gap technique applied to surrogate losses like the convolutional Fenchel-Young loss.

Load-bearing premise

The exploit-the-surrogate-gap technique remains valid and combines cleanly with dynamic regret analysis of OGD when the comparator sequence varies arbitrarily.

What would settle it

Construct a non-stationary online structured prediction problem where for some comparator sequence the cumulative target loss grows faster than F_T + O(1 + P_T), violating the bound.

Figures

Figures reproduced from arXiv: 2510.07086 by Han Bao, Shinsaku Sakaue, Yuzhou Cao.

Figure 1
Figure 1. Figure 1: Experimental results for different learning rates under varying numbers of label flips. [PITH_FULL_IMAGE:figures/full_fig_p019_1.png] view at source ↗
read the original abstract

Online structured prediction, including online classification as a special case, is the task of sequentially predicting labels from input features. In this setting, the surrogate regret -- the cumulative excess of the actual target loss (e.g., the 0-1 loss) over the surrogate loss (e.g., the logistic loss) incurred by the best fixed estimator -- has gained attention because it admits a finite bound independent of the time horizon $T$. However, such guarantees break down in non-stationary environments, where every fixed estimator may incur surrogate loss that grows linearly with $T$. To address this limitation, we obtain an upper bound of $F_T + O(1 + P_T)$ on the cumulative target loss, where $F_T$ is the cumulative surrogate loss of any comparator sequence and $P_T$ is its path length. This bound depends on $T$ only through $F_T$ and $P_T$, thus offering stronger guarantees under non-stationarity. Our core idea is to combine the dynamic regret analysis of online gradient descent (OGD) with the exploit-the-surrogate-gap technique. This viewpoint sheds light on the usefulness of a Polyak-style learning rate for OGD, which systematically yields target-loss bounds and performs well empirically. We then extend our approach to broader settings beyond prior work via the convolutional Fenchel--Young loss. Finally, a lower bound shows that the dependence on $F_T$ and $P_T$ is tight.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims an upper bound of F_T + O(1 + P_T) on the cumulative target loss in non-stationary online structured prediction, where F_T is the cumulative surrogate loss of any comparator sequence and P_T is its path length. The bound is obtained by combining dynamic regret analysis of online gradient descent (OGD) with the exploit-the-surrogate-gap technique. The work extends the approach to convolutional Fenchel-Young losses and provides a matching lower bound establishing tightness of the dependence on F_T and P_T.

Significance. If the central derivation holds, the result supplies stronger non-stationary guarantees than stationary surrogate-regret bounds, since the target-loss bound depends on T only through the comparator's own F_T and P_T. The explicit usefulness of Polyak-style step sizes for OGD in producing target-loss bounds, together with the empirical performance and the extension beyond prior work, constitute concrete contributions. The lower bound further strengthens the claims by showing necessity.

major comments (1)
  1. [Section 3 (main theorem and proof of the F_T + O(1 + P_T) bound)] The central claim requires that the exploit-the-surrogate-gap step fully cancels all T-dependent terms (including any O(sqrt(V_T)) contribution) from the dynamic-regret bound of OGD when the comparator sequence u_t is allowed to vary arbitrarily. Standard dynamic-regret analyses for convex Lipschitz losses produce an extra variation term alongside the path-length contribution; the manuscript must show explicitly how the gap inequality, often derived under a fixed comparator, extends to the non-stationary case without leaving a residual sum involving ||u_{t+1}-u_t|| multiplied by a quantity that can grow with T. This cancellation is load-bearing for the stated bound F_T + O(1 + P_T).
minor comments (2)
  1. [Abstract] Clarify the precise statement of the matching lower bound (what quantity it lower-bounds and up to what constants) already in the abstract.
  2. [Introduction and Section 4] Define F_T and P_T with explicit notation before their first use in the introduction and ensure the convolutional Fenchel-Young loss is introduced with its surrogate-gap definition.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying a key point that requires clearer exposition in the proof. We address the major comment below and will revise the manuscript accordingly to strengthen the presentation.

read point-by-point responses
  1. Referee: [Section 3 (main theorem and proof of the F_T + O(1 + P_T) bound)] The central claim requires that the exploit-the-surrogate-gap step fully cancels all T-dependent terms (including any O(sqrt(V_T)) contribution) from the dynamic-regret bound of OGD when the comparator sequence u_t is allowed to vary arbitrarily. Standard dynamic-regret analyses for convex Lipschitz losses produce an extra variation term alongside the path-length contribution; the manuscript must show explicitly how the gap inequality, often derived under a fixed comparator, extends to the non-stationary case without leaving a residual sum involving ||u_{t+1}-u_t|| multiplied by a quantity that can grow with T. This cancellation is load-bearing for the stated bound F_T + O(1 + P_T).

    Authors: We appreciate the referee highlighting the need for explicit verification of this cancellation. In the proof of Theorem 3.1, we first invoke the dynamic regret bound for OGD equipped with Polyak-style step sizes, which for convex Lipschitz losses yields a surrogate regret of O(1 + P_T) with respect to any comparator sequence u_{1:T}; the path length P_T precisely absorbs all variation terms, leaving no residual O(sqrt(V_T)) or T-dependent factor outside of P_T. The exploit-the-surrogate-gap step is then applied pointwise: at each t the target loss is bounded above by the surrogate loss incurred by the algorithm plus a non-negative gap term that depends only on the prediction and label at time t. Because this per-step inequality is local and holds independently of the global comparator sequence (fixed or varying), it combines directly with the dynamic regret bound without introducing additional sums over ||u_{t+1} - u_t|| multiplied by a quantity that can grow with T. The resulting cumulative target loss is therefore at most F_T + O(1 + P_T). We acknowledge that the manuscript would benefit from a more explicit lemma isolating the extension of the gap inequality to time-varying comparators. We will add this lemma (with a short derivation) to Section 3 in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: bound derived from standard dynamic-regret + surrogate-gap combination

full rationale

The claimed bound F_T + O(1 + P_T) is obtained by applying the exploit-the-surrogate-gap technique to the dynamic regret of OGD under Polyak-style stepsizes, then extending via convolutional Fenchel-Young losses. No step reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation; the derivation relies on external dynamic-regret lemmas and the surrogate-gap inequality, both of which are independent of the final non-stationary bound. The paper's own equations do not force the T-independent form; any residual terms are controlled by the path-length analysis rather than by re-using the target result. This is the normal case of a self-contained theoretical argument.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Analysis rests on standard online convex optimization assumptions plus the validity of the surrogate-gap technique in the dynamic-comparator case; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption Dynamic regret analysis of online gradient descent applies when the comparator sequence varies over time
    Invoked to obtain the F_T + O(1 + P_T) bound via combination with surrogate-gap exploitation.
  • domain assumption The surrogate gap between target loss and surrogate loss can be exploited to transfer dynamic-regret guarantees to target loss
    Central to the core idea stated in the abstract.

pith-pipeline@v0.9.0 · 5795 in / 1469 out tokens · 46063 ms · 2026-05-18T08:36:58.987112+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 1 internal anchor

  1. [1]

    BakIr, T

    G. BakIr, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan.Predicting Structured Data. The MIT Press, 2007 (cited on page 1)

  2. [2]

    P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds.Journal of the American Statistical Association, 101(473):138–156, 2006 (cited on page 1)

  3. [3]

    Beck and S

    A. Beck and S. Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions.Mathematical Programming, 164:1–27, 2017 (cited on pages 8, 17)

  4. [4]

    Besançon, S

    M. Besançon, S. Pokutta, and E. S. Wirth. The pivoting framework: Frank–Wolfe algorithms with active set size control. InProceedings of the 28th International Conference on Artificial Intelligence and Statistics, volume 258, pages 271–279. PMLR, 2025 (cited on pages 8, 17)

  5. [5]

    M. Blondel. Structured prediction with projection oracles. InAdvances in Neural Information Processing Systems, volume 32, pages 12167–12178. Curran Associates, Inc., 2019 (cited on pages 3, 4, 6, 8, 14)

  6. [6]

    Blondel, F

    M. Blondel, F. Llinares-Lopez, R. Dadashi, L. Hussenot, and M. Geist. Learning energy networks with generalized Fenchel–Young losses. InAdvances in Neural Information Processing Systems, volume 35, pages 12516–12528. Curran Associates, Inc., 2022 (cited on pages 6, 15, 21)

  7. [7]

    Blondel, A

    M. Blondel, A. F. T. Martins, and V. Niculae. Learning with Fenchel–Young losses.Journal of Machine Learning Research, 21(35):1–69, 2020 (cited on pages 3, 5, 9, 10, 15, 16)

  8. [8]

    Boudart, A

    P. Boudart, A. Rudi, and P. Gaillard. Structured prediction in online learning.arXiv:2406.12366, 2024 (cited on page 3)

  9. [9]

    Boyd and L

    S. Boyd and L. Vandenberghe.Convex Optimization. Cambridge University Press, 2004 (cited on page 15)

  10. [10]

    Y. Cao, H. Bao, L. Feng, and B. An. Establishing linear surrogate regret bounds for convex smooth losses via convolutional Fenchel–Young losses.arXiv:2505.09432, 2025 (cited on pages 3, 4, 8, 9, 16, 17, 20, 21)

  11. [11]

    Cesa-Bianchi and G

    N. Cesa-Bianchi and G. Lugosi.Prediction, Learning, and Games. Cambridge University Press, 2006 (cited on page 2)

  12. [12]

    Ciliberto, L

    C. Ciliberto, L. Rosasco, and A. Rudi. A consistent regularization approach for structured prediction. InAdvances in Neural Information Processing Systems, volume 29, pages 4412–4420. Curran Associates, Inc., 2016 (cited on pages 2, 3)

  13. [13]

    Ciliberto, L

    C. Ciliberto, L. Rosasco, and A. Rudi. A general framework for consistent structured prediction with implicit loss embeddings.Journal of Machine Learning Research, 21(98):1–67, 2020 (cited on pages 2, 3). 11

  14. [14]

    Crammer and Y

    K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines.Journal of Machine Learning Research, 2:265–292, 2001 (cited on pages 15, 17)

  15. [15]

    Daniely, A

    A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. InProceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1405–1411. PMLR, 2015 (cited on page 7)

  16. [16]

    J. M. Danskin. The theory of max-min, with applications.SIAM Journal on Applied Mathematics, 14(4):641–664, 1966 (cited on pages 16, 21)

  17. [17]

    Garber and N

    D. Garber and N. Wolf. Frank–Wolfe with a nearest extreme point oracle. InProceedings of the 34th Conference on Learning Theory, volume 134, pages 2103–2132. PMLR, 2021 (cited on page 16)

  18. [18]

    Herbster and M

    M. Herbster and M. K. Warmuth. Tracking the best linear predictor.Journal of Machine Learning Research, 1:281–309, 2001 (cited on page 3)

  19. [19]

    S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Technical report, 2009 (cited on page 5)

  20. [20]

    Lacoste-Julien and M

    S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank–Wolfe optimization variants. InAdvances in Neural Information Processing Systems, volume 28, pages 496–504. Curran Associates, Inc., 2015 (cited on pages 8, 16)

  21. [21]

    Lafferty, A

    J. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. InProceedings of the 18th International Conference on Machine Learning, pages 282–289. Morgan Kaufmann Publishers Inc., 2001 (cited on page 5)

  22. [22]

    Niculae, A

    V. Niculae, A. F. T. Martins, M. Blondel, and C. Cardie. SparseMAP: Differentiable sparse structured inference. InProceedings of the 35th International Conference on Machine Learning, volume 80, pages 3799–3808. PMLR, 2018 (cited on pages 5, 16)

  23. [23]

    F. Orabona. A modern introduction to online learning.arXiv:1912.13213, 2023. https://arxiv.org/abs/ 1912.13213v6 (cited on pages 2, 7, 18, 19)

  24. [24]

    B. T. Polyak.Introduction to Optimization. Optimization Software, 1987 (cited on page 8)

  25. [25]

    J. D. M. Rennie and N. Srebro. Loss functions for preference levels: Regression with discrete ordered labels. InProceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, pages 180–186. Kluwer Norwell, 2005 (cited on page 15)

  26. [26]

    R. T. Rockafellar.Convex Analysis. Princeton University Press, 1970 (cited on page 9)

  27. [27]

    Rosenblatt

    F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain.Psychological Review, 65(6):386–408, 1958 (cited on pages 2, 3)

  28. [28]

    Sakaue, H

    S. Sakaue, H. Bao, T. Tsuchiya, and T. Oki. Online structured prediction with Fenchel–Young losses and improved surrogate regret for online multiclass classification with logistic loss. InProceedings of the 37th Conference on Learning Theory, volume 247, pages 4458–4486. PMLR, 2024 (cited on pages 2–6, 8, 14, 16, 19)

  29. [29]

    Shibukawa, T

    Y. Shibukawa, T. Tsuchiya, S. Sakaue, and K. Yamanishi. Bandit and delayed feedback in online structured prediction.arXiv:2502.18709, 2025 (cited on pages 2, 3, 11)

  30. [30]

    Srebro, K

    N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. InAdvances in Neural Information Processing Systems, volume 23, pages 2199–2207. Curran Associates, Inc., 2010 (cited on page 7)

  31. [31]

    Van der Hoeven

    D. Van der Hoeven. Exploiting the surrogate gap in online multiclass classification. InAdvances in Neural Information Processing Systems, volume 33, pages 9562–9572. Curran Associates, Inc., 2020 (cited on pages 2, 3, 5–8, 11, 14, 16, 17, 19)

  32. [32]

    Van der Hoeven, F

    D. Van der Hoeven, F. Fusco, and N. Cesa-Bianchi. Beyond bandit feedback in online multiclass classification. InAdvances in Neural Information Processing Systems, volume 34, pages 13280–13291. Curran Associates, Inc., 2021 (cited on pages 2, 3, 5–8, 11, 14, 16, 17, 19). 12

  33. [33]

    A. C.-C. Yao. Probabilistic computations: Toward a unified measure of complexity. InProceedings of the 18th Annual Symposium on Foundations of Computer Science, pages 222–227. IEEE, 1977 (cited on page 21)

  34. [34]

    Zhang, S

    L. Zhang, S. Lu, and Z.-H. Zhou. Adaptive online learning in dynamic environments. InAdvances in Neural Information Processing Systems, volume 31, pages 1323–1333. Curran Associates, Inc., 2018 (cited on page 3)

  35. [35]

    Zhao, Y.-J

    P. Zhao, Y.-J. Zhang, L. Zhang, and Z.-H. Zhou. Dynamic regret of convex and smooth functions. In Advances in Neural Information Processing Systems, volume 33, pages 12510–12520. Curran Associates, Inc., 2020 (cited on page 3)

  36. [36]

    Zhao, Y.-J

    P. Zhao, Y.-J. Zhang, L. Zhang, and Z.-H. Zhou. Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization.Journal of Machine Learning Research, 25(98):1–52, 2024 (cited on pages 2, 3, 7)

  37. [37]

    1 flip” setting (nearly stationary), “Polyak-style

    M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. InProceedings of the 20th International Conference on Machine Learning, pages 928–935. AAAI Press, 2003 (cited on pages 2, 3, 5). 13 A Examples of Structured Prediction Problems Below we discuss a part of examples provided in Blondel [5] and Sakaue et al. [28]. A.1 Mult...