pith. sign in

arxiv: 1907.11584 · v1 · pith:A2DAOPWInew · submitted 2019-07-26 · 💻 cs.LG · stat.ML

Scalable Semi-Supervised SVM via Triply Stochastic Gradients

Pith reviewed 2026-05-24 15:47 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords semi-supervised SVMstochastic gradientkernel learningnon-convex optimizationscalable machine learningrandom features
0
0 comments X

The pith

Triply stochastic gradients let semi-supervised SVMs converge to stationary points on large non-convex problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TSGS³VM to scale kernel-based semi-supervised SVM training to big data. It forms each gradient estimate by sampling one labeled instance, one unlabeled instance, and random features together, then updates the solution with this estimate. Earlier doubly stochastic methods required convexity and could not handle the non-convex S³VM objective. The new analysis proves convergence to a stationary point, and experiments show the method runs faster and handles larger collections than prior S³VM solvers. A reader would care because many practical tasks supply far more unlabeled than labeled examples yet lack scalable training algorithms for them.

Core claim

TSGS³VM samples a labeled instance, an unlabeled instance, and random features in each iteration to form a triply stochastic gradient estimate for the non-convex S³VM objective, then uses the estimate to update the model; the paper proves that this procedure converges to a stationary point and demonstrates empirically that the algorithm is substantially more efficient and scalable than existing S³VM methods.

What carries the argument

The triply stochastic gradient estimator obtained by simultaneously sampling labeled data, unlabeled data, and random features to approximate the non-convex kernel S³VM objective.

If this is right

  • TSGS³VM converges to a stationary point without any convexity assumption on the objective.
  • The training procedure scales to large numbers of both labeled and unlabeled examples.
  • Kernel semi-supervised learning becomes feasible on datasets where prior S³VM solvers are impractical.
  • Per-iteration cost stays independent of the full training-set size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same triply stochastic sampling pattern could be tested on other non-convex semi-supervised objectives such as those arising in deep networks.
  • The method might be adapted to streaming or online settings where labeled and unlabeled examples arrive continuously.
  • Further speed-ups could be obtained by combining the estimator with existing variance-reduction techniques.

Load-bearing premise

Sampling one labeled instance, one unlabeled instance, and random features each iteration produces a gradient estimate whose bias and variance permit convergence analysis for the non-convex S³VM objective.

What would settle it

A run of TSGS³VM on a standard large-scale S³VM benchmark in which the method either fails to reach a stationary point or requires more wall-clock time than a baseline S³VM solver to reach comparable accuracy.

Figures

Figures reproduced from arXiv: 1907.11584 by Bin Gu, Guansheng Zheng, Heng Huang, Wanli Shi, Xiang Geng, Xiang Li.

Figure 1
Figure 1. Figure 1: Illustration of how TSGS3VM converge to a stationary point, where e denotes for the error between ft and ht, the white line denote the objective value R(h). In this toy model we assume all horizontal points in H have the same objective value. 5 Theoretical Guarantees We follow the common goal of non-convex analysis [Ghadimi and Lan, 2013; Gu et al., 2018a; Huo et al., 2018] to bound E||∇R(f)||2 , which mea… view at source ↗
Figure 2
Figure 2. Figure 2: Running time of different S3VM solvers v.s. training size on the eight benchmark data sets, where the lines of BLS3VM and S 3VMlight are incomplete on several datasets due to the corresponding implementations crash on the large-scale training set. CodRna Dota2 HP Higgs IJCNN1 Skin SUSY W6a 0 0.2 0.4 0.6 Test Error BLS3VM S 3VMlight BGS3VM FRS3VM TSGS3VM [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The boxplot of test error for different methods on d [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Test error of different S3VM solvers v.s. iterations on the eight benchmark data sets. [Shi et al., 2019] Wanli Shi, Bin Gu, Xiang Li, Xiang Geng, and Heng Huang. Quadruply stochastic gradients for large scale nonlinear semi-supervised auc optimization. In 28th International Joint Conference on Artificial Intelligence, 2019. [Wang et al., 2007] Junhui Wang, Xiaotong Shen, and Wei Pan. On transductive suppo… view at source ↗
read the original abstract

Semi-supervised learning (SSL) plays an increasingly important role in the big data era because a large number of unlabeled samples can be used effectively to improve the performance of the classifier. Semi-supervised support vector machine (S$^3$VM) is one of the most appealing methods for SSL, but scaling up S$^3$VM for kernel learning is still an open problem. Recently, a doubly stochastic gradient (DSG) algorithm has been proposed to achieve efficient and scalable training for kernel methods. However, the algorithm and theoretical analysis of DSG are developed based on the convexity assumption which makes them incompetent for non-convex problems such as S$^3$VM. To address this problem, in this paper, we propose a triply stochastic gradient algorithm for S$^3$VM, called TSGS$^3$VM. Specifically, to handle two types of data instances involved in S$^3$VM, TSGS$^3$VM samples a labeled instance and an unlabeled instance as well with the random features in each iteration to compute a triply stochastic gradient. We use the approximated gradient to update the solution. More importantly, we establish new theoretic analysis for TSGS$^3$VM which guarantees that TSGS$^3$VM can converge to a stationary point. Extensive experimental results on a variety of datasets demonstrate that TSGS$^3$VM is much more efficient and scalable than existing S$^3$VM algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes TSGS³VM, a triply stochastic gradient algorithm for scaling up semi-supervised SVM (S³VM) to kernel learning. It samples one labeled instance, one unlabeled instance, and random features per iteration to compute an approximate gradient for the non-convex objective, updates the solution accordingly, and provides new theoretical analysis claiming convergence to a stationary point. Experiments on various datasets are said to show superior efficiency and scalability over existing S³VM algorithms.

Significance. Should the convergence theorem hold under the triple sampling scheme for the non-convex objective, the work would offer a valuable extension of doubly stochastic gradient methods to non-convex SSL problems, potentially enabling kernel-based S³VM on large datasets where current methods struggle with scalability. The practical efficiency gains reported could impact applications in big data semi-supervised classification.

major comments (2)
  1. Abstract: The assertion that 'new theoretic analysis for TSGS³VM which guarantees that TSGS³VM can converge to a stationary point' is load-bearing for the central claim, yet the abstract supplies no proof sketch, no conditions on the random feature approximation bias, and no analysis of how the combined Monte-Carlo variance from labeled/unlabeled sampling and random feature bias permit the standard non-convex SGD convergence arguments to go through.
  2. Theoretical analysis section: For the triply stochastic gradient to satisfy the requirements of non-convex SGD, the paper must demonstrate that the bias from the random feature map (interacting with the non-convex unlabeled loss) vanishes at a sufficient rate and that the variance remains bounded independently of the feature dimension; if this is not shown, the descent lemma or Lyapunov function argument would not close.
minor comments (1)
  1. Abstract: The description of the method could specify the form of the triply stochastic gradient more explicitly to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying areas where the presentation of the theoretical claims could be strengthened. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: Abstract: The assertion that 'new theoretic analysis for TSGS³VM which guarantees that TSGS³VM can converge to a stationary point' is load-bearing for the central claim, yet the abstract supplies no proof sketch, no conditions on the random feature approximation bias, and no analysis of how the combined Monte-Carlo variance from labeled/unlabeled sampling and random feature bias permit the standard non-convex SGD convergence arguments to go through.

    Authors: We agree that the abstract is concise and does not contain a proof sketch or explicit conditions. Abstracts are space-limited, but we will revise it to briefly note that convergence to a stationary point holds under standard assumptions on the random-feature bias vanishing with the number of features and bounded variance of the triply stochastic estimator. The full argument appears in Section 4. revision: yes

  2. Referee: Theoretical analysis section: For the triply stochastic gradient to satisfy the requirements of non-convex SGD, the paper must demonstrate that the bias from the random feature map (interacting with the non-convex unlabeled loss) vanishes at a sufficient rate and that the variance remains bounded independently of the feature dimension; if this is not shown, the descent lemma or Lyapunov function argument would not close.

    Authors: The analysis in Section 4 establishes that the random-feature bias term is controlled by the approximation error of the kernel (which vanishes as the number of random features grows) and that the variance of the combined labeled/unlabeled/random-feature estimator is bounded by a constant independent of the feature dimension, because each sampling distribution is fixed. We will add an explicit remark and a supporting lemma stating the bias-vanishing rate to make the application of the non-convex SGD convergence result fully transparent. revision: partial

Circularity Check

0 steps flagged

No circularity: new convergence analysis is independent of fitted inputs or self-citation chains

full rationale

The paper defines TSGS3VM via explicit triple sampling (one labeled instance, one unlabeled instance, and random features per iteration) and states that it establishes a new convergence theorem guaranteeing a stationary point for the non-convex S3VM objective. This derivation chain begins from the standard non-convex SGD framework and applies it to the specific sampling scheme; it does not reduce any claimed prediction or theorem to a fitted parameter, a self-citation load-bearing premise, or an ansatz imported from prior work by the same authors. The DSG reference is used only as background motivation, not as the justification for the new bias/variance control. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view supplies no explicit free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5797 in / 916 out tokens · 18047 ms · 2026-05-24T15:47:00.882369+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

  1. [1]

    Semi-supervised support vector machines

    [Bennett and Demiriz, 1999 ] Kristin P Bennett and A yhan Demiriz. Semi-supervised support vector machines. In Advances in Neural Information processing systems, pages 368–374,

  2. [2]

    Generalized smo algorithm for svm- based multitask learning

    [Cai and Cherkassky, 2012] Feng Cai and Vladimir Cherkassky. Generalized smo algorithm for svm- based multitask learning. IEEE transactions on neural networks and learning systems , 23(6):997–1003,

  3. [3]

    Learning with sgd and random fea- tures

    [Carratino et al., 2018] Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with sgd and random fea- tures. In Advances in Neural Information Processing Sys- tems, pages 10213–10224,

  4. [4]

    LIBSVM: A library for support vector machines

    [Chang and Lin, 2011 ] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and T echnology, 2:27:1–27:27,

  5. [5]

    Semi-supervised classification by low density sepa- ration

    [Chapelle and Zien, 2005 ] Olivier Chapelle and Alexander Zien. Semi-supervised classification by low density sepa- ration. In AISTATS, volume 2005, pages 57–64. Citeseer,

  6. [6]

    Training a support vector machine in the primal

    [Chapelle, 2007] Olivier Chapelle. Training a support vector machine in the primal. Neural computation , 19(5):1155– 1178,

  7. [7]

    Large scale transductive svms

    [Collobert et al., 2006] Ronan Collobert, Fabian Sinz, Jason Weston, and L´ eon Bottou. Large scale transductive svms. Journal of Machine Learning Research , 7(Aug):1687– 1712,

  8. [8]

    Scalable kernel methods via doubly stochastic gradients

    [Dai et al., 2014] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Ad- vances in Neural Information Processing Systems , pages 3041–3049,

  9. [9]

    On the nystr¨ om method for approximating a gram matrix for improved kernel- based learning

    [Drineas and Mahoney, 2005] Petros Drineas and Michael W Mahoney. On the nystr¨ om method for approximating a gram matrix for improved kernel- based learning. journal of machine learning research , 6(Dec):2153–2175,

  10. [10]

    Stochastic first-and zeroth-order methods for non- convex stochastic programming

    [Ghadimi and Lan, 2013 ] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for non- convex stochastic programming. SIAM Journal on Opti- mization, 23(4):2341–2368,

  11. [11]

    Training neural networks using features replay

    [Huo et al., 2018] Zhouyuan Huo, Bin Gu, and Heng Huang. Training neural networks using features replay. In Ad- vances in Neural Information Processing Systems , pages 6659–6668,

  12. [12]

    Transductive inference for text classification using support vector machines

    [Joachims, 1999] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Icml, volume 99, pages 200–209,

  13. [13]

    Bud- geted semi-supervised support vector machine

    [Le et al., 2016] Trung Le, Phuong Duong, Mi Dinh, Tu Dinh Nguyen, Vu Nguyen, and Dinh Q Phung. Bud- geted semi-supervised support vector machine. In UAI,

  14. [14]

    Triply stochastic gradients on multiple kernel learning

    [Li et al., 2017] Xiang Li, Bin Gu, Shuang Ao, Huaimin Wang, and Charles X Ling. Triply stochastic gradients on multiple kernel learning. In UAI,

  15. [15]

    Randomized nonlinear component analysis

    [Lopez-Paz et al., 2014] David Lopez-Paz, Suvrit Sra, Alex Smola, Zoubin Ghahramani, and Bernhard Sch¨ olkopf. Randomized nonlinear component analysis. Computer Science, 4:1359–1367,

  16. [16]

    Random features for large-scale kernel machines

    [Rahimi and Recht, 2008 ] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Ad- vances in neural information processing systems , pages 1177–1184,

  17. [17]

    Stochastic vari- ance reduction for nonconvex optimization

    [Reddi et al., 2016] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic vari- ance reduction for nonconvex optimization. In Interna- tional conference on machine learning , pages 314–323,

  18. [18]

    iterations on the eight benchmark data sets

    0 100 200 300 400 500 600 Iterations 0.1 0.2 0.3 0.4 0.5 0.6Test error TSGS3VM (a) CodRNA 0 50 100 150 200 250 Iterations 0 0.2 0.4 0.6 0.8 1 Test error TSGS3VM (b) W6a 0 50 100 150 200 250 Iterations 0 0.2 0.4 0.6 0.8 1 Test error TSGS3VM (c) IJCNN1 0 50 100 150 Iterations 0.3 0.35 0.4 0.45Test error TSGS3VM (d) SUSY 0 50 100 150 Iterations 0 0.05 0.1 0....

  19. [19]

    On transductive support vector machines

    [Wang et al., 2007] Junhui Wang, Xiaotong Shen, and Wei Pan. On transductive support vector machines. Contem- porary Mathematics, 443:7–20,

  20. [20]

    Scattered data approx- imation, volume

    [Wendland, 2004] Holger Wendland. Scattered data approx- imation, volume

  21. [21]

    Scale up nonlinear component analysis with doubly stochastic gradients

    [Xie et al., 2015] Bo Xie, Yingyu Liang, and Le Song. Scale up nonlinear component analysis with doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 2341–2349,

  22. [22]

    Tackle balancing constraint for incremental semi-supervised support vecto r learning

    [Y uet al., 2019] Shuyang Y u, Bin Gu, Kunpeng Ning, Haiyan Chen, Jian Pei, and Heng Huang. Tackle balancing constraint for incremental semi-supervised support vecto r learning. In Proceedings of the 25th ACM SIGKDD In- ternational Conference on Knowledge Discovery & Data Mining,

  23. [23]

    The concave-convex procedure (cccp)

    [Y uille and Rangarajan, 2002] Alan L Y uille and Anand Rangarajan. The concave-convex procedure (cccp). In Advances in neural information processing systems , pages 1033–1040, 2002