Scalable Semi-Supervised SVM via Triply Stochastic Gradients
Pith reviewed 2026-05-24 15:47 UTC · model grok-4.3
The pith
Triply stochastic gradients let semi-supervised SVMs converge to stationary points on large non-convex problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TSGS³VM samples a labeled instance, an unlabeled instance, and random features in each iteration to form a triply stochastic gradient estimate for the non-convex S³VM objective, then uses the estimate to update the model; the paper proves that this procedure converges to a stationary point and demonstrates empirically that the algorithm is substantially more efficient and scalable than existing S³VM methods.
What carries the argument
The triply stochastic gradient estimator obtained by simultaneously sampling labeled data, unlabeled data, and random features to approximate the non-convex kernel S³VM objective.
If this is right
- TSGS³VM converges to a stationary point without any convexity assumption on the objective.
- The training procedure scales to large numbers of both labeled and unlabeled examples.
- Kernel semi-supervised learning becomes feasible on datasets where prior S³VM solvers are impractical.
- Per-iteration cost stays independent of the full training-set size.
Where Pith is reading between the lines
- The same triply stochastic sampling pattern could be tested on other non-convex semi-supervised objectives such as those arising in deep networks.
- The method might be adapted to streaming or online settings where labeled and unlabeled examples arrive continuously.
- Further speed-ups could be obtained by combining the estimator with existing variance-reduction techniques.
Load-bearing premise
Sampling one labeled instance, one unlabeled instance, and random features each iteration produces a gradient estimate whose bias and variance permit convergence analysis for the non-convex S³VM objective.
What would settle it
A run of TSGS³VM on a standard large-scale S³VM benchmark in which the method either fails to reach a stationary point or requires more wall-clock time than a baseline S³VM solver to reach comparable accuracy.
Figures
read the original abstract
Semi-supervised learning (SSL) plays an increasingly important role in the big data era because a large number of unlabeled samples can be used effectively to improve the performance of the classifier. Semi-supervised support vector machine (S$^3$VM) is one of the most appealing methods for SSL, but scaling up S$^3$VM for kernel learning is still an open problem. Recently, a doubly stochastic gradient (DSG) algorithm has been proposed to achieve efficient and scalable training for kernel methods. However, the algorithm and theoretical analysis of DSG are developed based on the convexity assumption which makes them incompetent for non-convex problems such as S$^3$VM. To address this problem, in this paper, we propose a triply stochastic gradient algorithm for S$^3$VM, called TSGS$^3$VM. Specifically, to handle two types of data instances involved in S$^3$VM, TSGS$^3$VM samples a labeled instance and an unlabeled instance as well with the random features in each iteration to compute a triply stochastic gradient. We use the approximated gradient to update the solution. More importantly, we establish new theoretic analysis for TSGS$^3$VM which guarantees that TSGS$^3$VM can converge to a stationary point. Extensive experimental results on a variety of datasets demonstrate that TSGS$^3$VM is much more efficient and scalable than existing S$^3$VM algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TSGS³VM, a triply stochastic gradient algorithm for scaling up semi-supervised SVM (S³VM) to kernel learning. It samples one labeled instance, one unlabeled instance, and random features per iteration to compute an approximate gradient for the non-convex objective, updates the solution accordingly, and provides new theoretical analysis claiming convergence to a stationary point. Experiments on various datasets are said to show superior efficiency and scalability over existing S³VM algorithms.
Significance. Should the convergence theorem hold under the triple sampling scheme for the non-convex objective, the work would offer a valuable extension of doubly stochastic gradient methods to non-convex SSL problems, potentially enabling kernel-based S³VM on large datasets where current methods struggle with scalability. The practical efficiency gains reported could impact applications in big data semi-supervised classification.
major comments (2)
- Abstract: The assertion that 'new theoretic analysis for TSGS³VM which guarantees that TSGS³VM can converge to a stationary point' is load-bearing for the central claim, yet the abstract supplies no proof sketch, no conditions on the random feature approximation bias, and no analysis of how the combined Monte-Carlo variance from labeled/unlabeled sampling and random feature bias permit the standard non-convex SGD convergence arguments to go through.
- Theoretical analysis section: For the triply stochastic gradient to satisfy the requirements of non-convex SGD, the paper must demonstrate that the bias from the random feature map (interacting with the non-convex unlabeled loss) vanishes at a sufficient rate and that the variance remains bounded independently of the feature dimension; if this is not shown, the descent lemma or Lyapunov function argument would not close.
minor comments (1)
- Abstract: The description of the method could specify the form of the triply stochastic gradient more explicitly to aid readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying areas where the presentation of the theoretical claims could be strengthened. We address each major comment below and indicate the revisions we will make.
read point-by-point responses
-
Referee: Abstract: The assertion that 'new theoretic analysis for TSGS³VM which guarantees that TSGS³VM can converge to a stationary point' is load-bearing for the central claim, yet the abstract supplies no proof sketch, no conditions on the random feature approximation bias, and no analysis of how the combined Monte-Carlo variance from labeled/unlabeled sampling and random feature bias permit the standard non-convex SGD convergence arguments to go through.
Authors: We agree that the abstract is concise and does not contain a proof sketch or explicit conditions. Abstracts are space-limited, but we will revise it to briefly note that convergence to a stationary point holds under standard assumptions on the random-feature bias vanishing with the number of features and bounded variance of the triply stochastic estimator. The full argument appears in Section 4. revision: yes
-
Referee: Theoretical analysis section: For the triply stochastic gradient to satisfy the requirements of non-convex SGD, the paper must demonstrate that the bias from the random feature map (interacting with the non-convex unlabeled loss) vanishes at a sufficient rate and that the variance remains bounded independently of the feature dimension; if this is not shown, the descent lemma or Lyapunov function argument would not close.
Authors: The analysis in Section 4 establishes that the random-feature bias term is controlled by the approximation error of the kernel (which vanishes as the number of random features grows) and that the variance of the combined labeled/unlabeled/random-feature estimator is bounded by a constant independent of the feature dimension, because each sampling distribution is fixed. We will add an explicit remark and a supporting lemma stating the bias-vanishing rate to make the application of the non-convex SGD convergence result fully transparent. revision: partial
Circularity Check
No circularity: new convergence analysis is independent of fitted inputs or self-citation chains
full rationale
The paper defines TSGS3VM via explicit triple sampling (one labeled instance, one unlabeled instance, and random features per iteration) and states that it establishes a new convergence theorem guaranteeing a stationary point for the non-convex S3VM objective. This derivation chain begins from the standard non-convex SGD framework and applies it to the specific sampling scheme; it does not reduce any claimed prediction or theorem to a fitted parameter, a self-citation load-bearing premise, or an ansatz imported from prior work by the same authors. The DSG reference is used only as background motivation, not as the justification for the new bias/variance control. The result is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Semi-supervised support vector machines
[Bennett and Demiriz, 1999 ] Kristin P Bennett and A yhan Demiriz. Semi-supervised support vector machines. In Advances in Neural Information processing systems, pages 368–374,
work page 1999
-
[2]
Generalized smo algorithm for svm- based multitask learning
[Cai and Cherkassky, 2012] Feng Cai and Vladimir Cherkassky. Generalized smo algorithm for svm- based multitask learning. IEEE transactions on neural networks and learning systems , 23(6):997–1003,
work page 2012
-
[3]
Learning with sgd and random fea- tures
[Carratino et al., 2018] Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with sgd and random fea- tures. In Advances in Neural Information Processing Sys- tems, pages 10213–10224,
work page 2018
-
[4]
LIBSVM: A library for support vector machines
[Chang and Lin, 2011 ] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and T echnology, 2:27:1–27:27,
work page 2011
-
[5]
Semi-supervised classification by low density sepa- ration
[Chapelle and Zien, 2005 ] Olivier Chapelle and Alexander Zien. Semi-supervised classification by low density sepa- ration. In AISTATS, volume 2005, pages 57–64. Citeseer,
work page 2005
-
[6]
Training a support vector machine in the primal
[Chapelle, 2007] Olivier Chapelle. Training a support vector machine in the primal. Neural computation , 19(5):1155– 1178,
work page 2007
-
[7]
[Collobert et al., 2006] Ronan Collobert, Fabian Sinz, Jason Weston, and L´ eon Bottou. Large scale transductive svms. Journal of Machine Learning Research , 7(Aug):1687– 1712,
work page 2006
-
[8]
Scalable kernel methods via doubly stochastic gradients
[Dai et al., 2014] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Ad- vances in Neural Information Processing Systems , pages 3041–3049,
work page 2014
-
[9]
On the nystr¨ om method for approximating a gram matrix for improved kernel- based learning
[Drineas and Mahoney, 2005] Petros Drineas and Michael W Mahoney. On the nystr¨ om method for approximating a gram matrix for improved kernel- based learning. journal of machine learning research , 6(Dec):2153–2175,
work page 2005
-
[10]
Stochastic first-and zeroth-order methods for non- convex stochastic programming
[Ghadimi and Lan, 2013 ] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for non- convex stochastic programming. SIAM Journal on Opti- mization, 23(4):2341–2368,
work page 2013
-
[11]
Training neural networks using features replay
[Huo et al., 2018] Zhouyuan Huo, Bin Gu, and Heng Huang. Training neural networks using features replay. In Ad- vances in Neural Information Processing Systems , pages 6659–6668,
work page 2018
-
[12]
Transductive inference for text classification using support vector machines
[Joachims, 1999] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Icml, volume 99, pages 200–209,
work page 1999
-
[13]
Bud- geted semi-supervised support vector machine
[Le et al., 2016] Trung Le, Phuong Duong, Mi Dinh, Tu Dinh Nguyen, Vu Nguyen, and Dinh Q Phung. Bud- geted semi-supervised support vector machine. In UAI,
work page 2016
-
[14]
Triply stochastic gradients on multiple kernel learning
[Li et al., 2017] Xiang Li, Bin Gu, Shuang Ao, Huaimin Wang, and Charles X Ling. Triply stochastic gradients on multiple kernel learning. In UAI,
work page 2017
-
[15]
Randomized nonlinear component analysis
[Lopez-Paz et al., 2014] David Lopez-Paz, Suvrit Sra, Alex Smola, Zoubin Ghahramani, and Bernhard Sch¨ olkopf. Randomized nonlinear component analysis. Computer Science, 4:1359–1367,
work page 2014
-
[16]
Random features for large-scale kernel machines
[Rahimi and Recht, 2008 ] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Ad- vances in neural information processing systems , pages 1177–1184,
work page 2008
-
[17]
Stochastic vari- ance reduction for nonconvex optimization
[Reddi et al., 2016] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic vari- ance reduction for nonconvex optimization. In Interna- tional conference on machine learning , pages 314–323,
work page 2016
-
[18]
iterations on the eight benchmark data sets
0 100 200 300 400 500 600 Iterations 0.1 0.2 0.3 0.4 0.5 0.6Test error TSGS3VM (a) CodRNA 0 50 100 150 200 250 Iterations 0 0.2 0.4 0.6 0.8 1 Test error TSGS3VM (b) W6a 0 50 100 150 200 250 Iterations 0 0.2 0.4 0.6 0.8 1 Test error TSGS3VM (c) IJCNN1 0 50 100 150 Iterations 0.3 0.35 0.4 0.45Test error TSGS3VM (d) SUSY 0 50 100 150 Iterations 0 0.05 0.1 0....
work page 2019
-
[19]
On transductive support vector machines
[Wang et al., 2007] Junhui Wang, Xiaotong Shen, and Wei Pan. On transductive support vector machines. Contem- porary Mathematics, 443:7–20,
work page 2007
-
[20]
Scattered data approx- imation, volume
[Wendland, 2004] Holger Wendland. Scattered data approx- imation, volume
work page 2004
-
[21]
Scale up nonlinear component analysis with doubly stochastic gradients
[Xie et al., 2015] Bo Xie, Yingyu Liang, and Le Song. Scale up nonlinear component analysis with doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 2341–2349,
work page 2015
-
[22]
Tackle balancing constraint for incremental semi-supervised support vecto r learning
[Y uet al., 2019] Shuyang Y u, Bin Gu, Kunpeng Ning, Haiyan Chen, Jian Pei, and Heng Huang. Tackle balancing constraint for incremental semi-supervised support vecto r learning. In Proceedings of the 25th ACM SIGKDD In- ternational Conference on Knowledge Discovery & Data Mining,
work page 2019
-
[23]
The concave-convex procedure (cccp)
[Y uille and Rangarajan, 2002] Alan L Y uille and Anand Rangarajan. The concave-convex procedure (cccp). In Advances in neural information processing systems , pages 1033–1040, 2002
work page 2002
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.