Variance Matters: Improving Domain Adaptation via Stratified Sampling
Pith reviewed 2026-05-17 00:52 UTC · model grok-4.3
The pith
Stratified sampling reduces variance in discrepancy estimates for unsupervised domain adaptation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that stratification objectives derived specifically for correlation alignment and MMD reduce the variance of discrepancy estimates in stochastic settings. The MMD objective is proven theoretically optimal under the paper's assumptions, and a practical optimization procedure translates the reduced variance into improved error bounds and empirical results.
What carries the argument
The stratification objectives for MMD and correlation alignment, which partition samples into strata so that the resulting discrepancy estimates have lower variance.
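The variance-reduction mechanism this rests on is classical stratified sampling. A minimal, self-contained sketch (a toy mean estimator over an artificial two-cluster population, not the paper's discrepancy estimator) shows the effect:

```python
import random

random.seed(0)

# Toy population with two clearly separated strata. Stratified sampling
# fixes the number of draws per stratum, which removes the between-strata
# component from the estimator's variance.
strata = [
    [random.gauss(0.0, 1.0) for _ in range(5000)],   # stratum 1
    [random.gauss(10.0, 1.0) for _ in range(5000)],  # stratum 2
]
population = strata[0] + strata[1]
n = 20  # mini-batch size

def srs_mean():
    """Simple random sampling: draw n points from the pooled population."""
    return sum(random.sample(population, n)) / n

def stratified_mean():
    """Proportional stratified sampling: n/2 points from each stratum."""
    per = n // 2
    draws = random.sample(strata[0], per) + random.sample(strata[1], per)
    return sum(draws) / len(draws)

def empirical_var(estimator, trials=2000):
    """Monte Carlo estimate of the estimator's variance."""
    xs = [estimator() for _ in range(trials)]
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

v_srs = empirical_var(srs_mean)
v_strat = empirical_var(stratified_mean)
print(f"SRS variance:        {v_srs:.4f}")
print(f"Stratified variance: {v_strat:.4f}")
```

With these cluster means, the stratified estimator's variance is roughly an order of magnitude lower: the between-strata spread that dominates simple random sampling is eliminated by construction.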
If this is right
- Reduced variance leads to more reliable discrepancy estimates during training.
- Improved performance on the target domain across multiple datasets.
- Theoretical guarantees in the form of error bounds for the adapted models.
- The k-means optimization provides an efficient way to apply the method in practice.
Where Pith is reading between the lines
- Similar stratification techniques might benefit other variance-sensitive methods in machine learning.
- Extending this to additional discrepancy measures could broaden the applicability.
- Validating the assumptions through varied experimental setups would strengthen the optimality claim.
Load-bearing premise
That the stated assumptions actually hold, so that the MMD objective is variance-minimizing, and that the ad hoc objectives remain effective beyond the tested cases.
What would settle it
A controlled stochastic experiment showing that the proposed MMD stratification fails to achieve lower variance than standard sampling, or that target-domain accuracy does not improve with the method.
Original abstract
Domain shift remains a key challenge in deploying machine learning models to the real world. Unsupervised domain adaptation (UDA) aims to address this by minimising domain discrepancy during training, but the discrepancy estimates suffer from high variance in stochastic settings, which can stifle the theoretical benefits of the method. This paper proposes Variance-Reduced Domain Adaptation via Stratified Sampling (VaRDASS), the first specialised stochastic variance reduction technique for UDA. We consider two specific discrepancy measures -- correlation alignment and the maximum mean discrepancy (MMD) -- and derive ad hoc stratification objectives for these terms. We then present expected and worst-case error bounds, and prove that our proposed objective for the MMD is theoretically optimal (i.e., minimises the variance) under certain assumptions. Finally, a practical k-means style optimisation algorithm is introduced and analysed. Experiments on four domain shift datasets demonstrate improved discrepancy estimation accuracy and target domain performance.
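The empirical MMD the abstract refers to can be sketched from its standard, publicly known definition (the biased V-statistic estimator with a Gaussian kernel, shown here for 1-D samples); this is a generic illustration, not the paper's implementation:

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel k(x, y) = exp(-|x - y|^2 / (2 sigma^2)) for 1-D points."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs, ys."""
    k = lambda a, b: gaussian_kernel(a, b, sigma)
    xx = sum(k(a, b) for a in xs for b in xs) / len(xs) ** 2
    yy = sum(k(a, b) for a in ys for b in ys) / len(ys) ** 2
    xy = sum(k(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return xx + yy - 2 * xy

# Identical samples give zero discrepancy; shifted samples do not.
source = [0.0, 0.1, 0.2, 0.3]
shifted = [2.0, 2.1, 2.2, 2.3]
print(mmd_squared(source, source))   # 0.0
print(mmd_squared(source, shifted))  # clearly positive
```

In mini-batch training this quantity is recomputed on small random batches, which is exactly where the high-variance problem the abstract describes arises.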
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VaRDASS, a stochastic variance reduction technique for unsupervised domain adaptation (UDA) based on stratified sampling. It derives ad hoc stratification objectives for correlation alignment and maximum mean discrepancy (MMD), provides expected and worst-case error bounds, proves that the MMD objective is theoretically optimal (minimizing variance) under certain assumptions, presents a practical k-means-style optimization algorithm, and reports empirical gains in discrepancy estimation accuracy and target-domain performance on four domain-shift datasets.
Significance. If the optimality result and error bounds hold under realistic conditions, the work offers a targeted variance-reduction approach for discrepancy-based UDA methods, which could stabilize training in mini-batch settings and improve generalization. The explicit optimality proof for the MMD stratification objective (when assumptions are met) and the provision of both theoretical bounds and a practical algorithm are notable strengths.
major comments (2)
- §4, Theorem 3 (optimality proof): The claim that the proposed MMD stratification objective minimizes variance relies on assumptions including fixed (oracle) strata, specific kernel properties, and independence conditions on the discrepancy estimator. These are not shown to hold when strata are instead estimated via the paper's k-means procedure on finite samples drawn from shifted source and target domains; without this transfer, the variance-reduction guarantee does not necessarily support the reported error bounds or empirical improvements in the stochastic regime.
- §3.2 (ad hoc stratification objectives): The correlation-alignment objective is presented without a corresponding optimality proof or variance analysis comparable to the MMD case; this asymmetry weakens the unified claim that stratified sampling improves discrepancy estimation across both measures.
minor comments (2)
- [§5] The k-means-style algorithm in §5 is described at a high level but lacks pseudocode or explicit complexity analysis, which would aid reproducibility.
- Notation for the strata and sampling weights is introduced inconsistently between the theoretical sections and the experimental setup; a single consolidated table of symbols would improve clarity.
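Since the paper gives no pseudocode for its §5 algorithm, the following is a plausible sketch of a k-means-style stratification step followed by proportional mini-batch allocation. Every detail here (Lloyd's iterations, proportional allocation, 1-D features) is an assumption for illustration, not the paper's actual algorithm:

```python
import random

random.seed(1)

def kmeans_strata(points, k, iters=20):
    """Plain Lloyd's k-means on 1-D features: a generic stand-in for the
    paper's (unspecified) strata-construction step."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's stratum.
        strata = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: abs(p - centers[i]))
            strata[j].append(p)
        # Update step: recompute centers; keep the old center if a stratum
        # is empty so the index stays valid.
        centers = [sum(s) / len(s) if s else centers[i]
                   for i, s in enumerate(strata)]
    return [s for s in strata if s]

def stratified_minibatch(strata, batch_size):
    """Proportional allocation: each stratum contributes roughly its
    population share of the mini-batch."""
    total = sum(len(s) for s in strata)
    batch = []
    for s in strata:
        m = max(1, round(batch_size * len(s) / total))
        batch.extend(random.sample(s, min(m, len(s))))
    return batch

points = ([random.gauss(0, 1) for _ in range(100)]
          + [random.gauss(8, 1) for _ in range(100)])
strata = kmeans_strata(points, k=2)
batch = stratified_minibatch(strata, batch_size=16)
print(len(strata), len(batch))
```

A real implementation would cluster in the model's feature space and would likely need smarter seeding (e.g. k-means++) and handling of strata smaller than their allocation.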
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate where revisions have been made to the manuscript.
read point-by-point responses
-
Referee: §4, Theorem 3 (optimality proof): The claim that the proposed MMD stratification objective minimizes variance relies on assumptions including fixed (oracle) strata, specific kernel properties, and independence conditions on the discrepancy estimator. These are not shown to hold when strata are instead estimated via the paper's k-means procedure on finite samples drawn from shifted source and target domains; without this transfer, the variance-reduction guarantee does not necessarily support the reported error bounds or empirical improvements in the stochastic regime.
Authors: We thank the referee for this precise observation. Theorem 3 proves optimality of the proposed MMD objective under the explicit assumptions of oracle strata, kernel properties, and independence; these are stated in the theorem. The expected and worst-case error bounds derived earlier in §4 apply to stratified estimators for arbitrary strata and do not rely on optimality. The k-means procedure is presented as a practical heuristic to approximate good strata from data. In the revised manuscript we have added a dedicated paragraph after Theorem 3 that (i) reiterates the oracle-strata assumption, (ii) notes that k-means provides an empirical approximation whose quality depends on sample size and domain shift, and (iii) clarifies that the reported empirical gains are observed directly and do not rest on the theoretical optimality transferring exactly. We believe this makes the scope of the guarantee transparent. revision: yes
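The "arbitrary strata" bounds the authors invoke rest on the classical variance decomposition for a stratified mean, which under proportional allocation reads (a textbook identity in generic notation, not the paper's):

```latex
% Stratified mean with strata weights w_h = N_h / N, per-stratum variances
% \sigma_h^2, and proportional allocation n_h = n w_h:
\operatorname{Var}\!\left(\hat{\mu}_{\mathrm{strat}}\right)
  = \sum_{h=1}^{k} \frac{w_h^{2}\,\sigma_h^{2}}{n_h}
  = \frac{1}{n}\sum_{h=1}^{k} w_h\,\sigma_h^{2}
  \;\le\; \frac{\sigma^{2}}{n}
  = \operatorname{Var}\!\left(\hat{\mu}_{\mathrm{SRS}}\right),
\qquad
\sigma^{2} = \underbrace{\textstyle\sum_h w_h\,\sigma_h^{2}}_{\text{within}}
           + \underbrace{\textstyle\sum_h w_h\,(\mu_h - \mu)^{2}}_{\text{between}} .
```

The gap equals the between-strata term divided by n, so an estimated stratification (such as the paper's k-means step) helps exactly to the extent that it concentrates variation between strata rather than within them.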
-
Referee: §3.2 (ad hoc stratification objectives): The correlation-alignment objective is presented without a corresponding optimality proof or variance analysis comparable to the MMD case; this asymmetry weakens the unified claim that stratified sampling improves discrepancy estimation across both measures.
Authors: We agree that the presentation is asymmetric. The CORAL stratification objective was obtained by minimising an upper bound on the variance of the correlation-alignment estimator, analogous to the derivation for MMD, but we did not establish a formal optimality result. The paper’s unified claim is that stratified sampling can be instantiated for both discrepancy measures and yields measurable variance reduction in practice; it does not assert identical theoretical optimality for both. In the revision we have (i) rephrased the introductory paragraph of §3.2 to avoid any implication of symmetric optimality and (ii) inserted a short paragraph providing an explicit variance expression for the CORAL stratified estimator. A complete optimality proof for CORAL remains future work. revision: partial
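The correlation-alignment discrepancy at issue is, in its standard Deep CORAL form (Sun and Saenko, 2016), the squared Frobenius distance between source and target feature covariances scaled by 1/(4d^2). A dependency-free sketch of that public definition (not the paper's code):

```python
def covariance(features):
    """Sample covariance matrix of a list of d-dimensional feature rows."""
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in features:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (row[i] - means[i]) * (row[j] - means[j])
    return [[c / (n - 1) for c in r] for r in cov]

def coral_loss(source, target):
    """Squared Frobenius distance between source/target covariances,
    scaled by 1/(4 d^2) as in the standard Deep CORAL formulation."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    fro2 = sum((cs[i][j] - ct[i][j]) ** 2
               for i in range(d) for j in range(d))
    return fro2 / (4 * d * d)

# Identical feature batches align exactly; rescaled ones do not.
src = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
tgt = [[v * 3 for v in row] for row in src]
print(coral_loss(src, src))      # 0.0
print(coral_loss(src, tgt) > 0)  # True
```

Because the covariance entries are quadratic in the features, mini-batch estimates of this loss are themselves high-variance, which is the motivation for stratifying it alongside the MMD.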
Circularity Check
No circularity: optimality proof and bounds are independent of fitted inputs
full rationale
The paper derives stratification objectives for MMD and correlation alignment, states expected/worst-case error bounds, and proves the MMD objective minimizes variance under explicit assumptions. These steps are presented as first-principles derivations rather than reductions to self-defined quantities, fitted parameters renamed as predictions, or self-citation chains. The subsequent k-means algorithm is a practical implementation analyzed separately from the theoretical optimality result. Judging from the abstract and surrounding context, no load-bearing step reduces by construction to its own inputs, and the empirical claims are evaluated against external benchmarks rather than self-defined quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the (unspecified) assumptions under which the proposed MMD objective minimises the variance.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: "prove that our proposed objective for the MMD is theoretically optimal (i.e., minimises the variance) under certain assumptions"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.