Variance Matters: Improving Domain Adaptation via Stratified Sampling
Pith reviewed 2026-05-17 00:52 UTC · model grok-4.3
The pith
Stratified sampling reduces variance in discrepancy estimates for unsupervised domain adaptation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that stratification objectives derived specifically for correlation alignment and MMD reduce the variance of discrepancy estimates in stochastic settings. The MMD objective is proven theoretically optimal under the paper's assumptions, and a practical optimization procedure translates the reduced variance into improved error bounds and empirical results.
What carries the argument
The stratification objectives for MMD and correlation alignment, which partition samples into strata so that the resulting discrepancy estimates have lower variance.
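The variance-reduction mechanism this rests on is classical stratified sampling. A minimal, self-contained sketch (a toy mean estimator over an artificial two-cluster population, not the paper's discrepancy estimator) shows the effect:

```python
import random

random.seed(0)

# Toy population with two clearly separated strata. Stratified sampling
# fixes the number of draws per stratum, which removes the between-strata
# component from the estimator's variance.
strata = [
    [random.gauss(0.0, 1.0) for _ in range(5000)],   # stratum 1
    [random.gauss(10.0, 1.0) for _ in range(5000)],  # stratum 2
]
population = strata[0] + strata[1]
n = 20  # mini-batch size

def srs_mean():
    """Simple random sampling: draw n points from the pooled population."""
    return sum(random.sample(population, n)) / n

def stratified_mean():
    """Proportional stratified sampling: n/2 points from each stratum."""
    per = n // 2
    draws = random.sample(strata[0], per) + random.sample(strata[1], per)
    return sum(draws) / len(draws)

def empirical_var(estimator, trials=2000):
    """Monte Carlo estimate of the estimator's variance."""
    xs = [estimator() for _ in range(trials)]
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

v_srs = empirical_var(srs_mean)
v_strat = empirical_var(stratified_mean)
print(f"SRS variance:        {v_srs:.4f}")
print(f"Stratified variance: {v_strat:.4f}")
```

With these cluster means, the stratified estimator's variance is roughly an order of magnitude lower: the between-strata spread that dominates simple random sampling is eliminated by construction.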
If this is right
- Reduced variance leads to more reliable discrepancy estimates during training.
- Improved performance on the target domain across multiple datasets.
- Theoretical guarantees in the form of error bounds for the adapted models.
- The k-means optimization provides an efficient way to apply the method in practice.
Where Pith is reading between the lines
- Similar stratification techniques might benefit other variance-sensitive methods in machine learning.
- Extending this to additional discrepancy measures could broaden the applicability.
- Validating the assumptions through varied experimental setups would strengthen the optimality claim.
Load-bearing premise
That the stated assumptions actually hold, so that the MMD objective is variance-minimizing, and that the ad hoc objectives remain effective beyond the tested cases.
What would settle it
A controlled stochastic experiment showing that the proposed MMD stratification fails to achieve lower variance than standard sampling, or that target-domain accuracy does not improve with the method.
Original abstract
Domain shift remains a key challenge in deploying machine learning models to the real world. Unsupervised domain adaptation (UDA) aims to address this by minimising domain discrepancy during training, but the discrepancy estimates suffer from high variance in stochastic settings, which can stifle the theoretical benefits of the method. This paper proposes Variance-Reduced Domain Adaptation via Stratified Sampling (VaRDASS), the first specialised stochastic variance reduction technique for UDA. We consider two specific discrepancy measures -- correlation alignment and the maximum mean discrepancy (MMD) -- and derive ad hoc stratification objectives for these terms. We then present expected and worst-case error bounds, and prove that our proposed objective for the MMD is theoretically optimal (i.e., minimises the variance) under certain assumptions. Finally, a practical k-means style optimisation algorithm is introduced and analysed. Experiments on four domain shift datasets demonstrate improved discrepancy estimation accuracy and target domain performance.
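The empirical MMD the abstract refers to can be sketched from its standard, publicly known definition (the biased V-statistic estimator with a Gaussian kernel, shown here for 1-D samples); this is a generic illustration, not the paper's implementation:

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel k(x, y) = exp(-|x - y|^2 / (2 sigma^2)) for 1-D points."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs, ys."""
    k = lambda a, b: gaussian_kernel(a, b, sigma)
    xx = sum(k(a, b) for a in xs for b in xs) / len(xs) ** 2
    yy = sum(k(a, b) for a in ys for b in ys) / len(ys) ** 2
    xy = sum(k(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return xx + yy - 2 * xy

# Identical samples give zero discrepancy; shifted samples do not.
source = [0.0, 0.1, 0.2, 0.3]
shifted = [2.0, 2.1, 2.2, 2.3]
print(mmd_squared(source, source))   # 0.0
print(mmd_squared(source, shifted))  # clearly positive
```

In mini-batch training this quantity is recomputed on small random batches, which is exactly where the high-variance problem the abstract describes arises.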
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VaRDASS, a stochastic variance reduction technique for unsupervised domain adaptation (UDA) based on stratified sampling. It derives ad hoc stratification objectives for correlation alignment and maximum mean discrepancy (MMD), provides expected and worst-case error bounds, proves that the MMD objective is theoretically optimal (minimizing variance) under certain assumptions, presents a practical k-means-style optimization algorithm, and reports empirical gains in discrepancy estimation accuracy and target-domain performance on four domain-shift datasets.
Significance. If the optimality result and error bounds hold under realistic conditions, the work offers a targeted variance-reduction approach for discrepancy-based UDA methods, which could stabilize training in mini-batch settings and improve generalization. The explicit optimality proof for the MMD stratification objective (when assumptions are met) and the provision of both theoretical bounds and a practical algorithm are notable strengths.
major comments (2)
- §4, Theorem 3 (optimality proof): The claim that the proposed MMD stratification objective minimizes variance relies on assumptions including fixed (oracle) strata, specific kernel properties, and independence conditions on the discrepancy estimator. These are not shown to hold when strata are instead estimated via the paper's k-means procedure on finite samples drawn from shifted source and target domains; without this transfer, the variance-reduction guarantee does not necessarily support the reported error bounds or empirical improvements in the stochastic regime.
- §3.2 (ad hoc stratification objectives): The correlation-alignment objective is presented without a corresponding optimality proof or variance analysis comparable to the MMD case; this asymmetry weakens the unified claim that stratified sampling improves discrepancy estimation across both measures.
minor comments (2)
- [§5] The k-means-style algorithm in §5 is described at a high level but lacks pseudocode or explicit complexity analysis, which would aid reproducibility.
- Notation for the strata and sampling weights is introduced inconsistently between the theoretical sections and the experimental setup; a single consolidated table of symbols would improve clarity.
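Since the paper gives no pseudocode for its §5 algorithm, the following is a plausible sketch of a k-means-style stratification step followed by proportional mini-batch allocation. Every detail here (Lloyd's iterations, proportional allocation, 1-D features) is an assumption for illustration, not the paper's actual algorithm:

```python
import random

random.seed(1)

def kmeans_strata(points, k, iters=20):
    """Plain Lloyd's k-means on 1-D features: a generic stand-in for the
    paper's (unspecified) strata-construction step."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's stratum.
        strata = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: abs(p - centers[i]))
            strata[j].append(p)
        # Update step: recompute centers; keep the old center if a stratum
        # is empty so the index stays valid.
        centers = [sum(s) / len(s) if s else centers[i]
                   for i, s in enumerate(strata)]
    return [s for s in strata if s]

def stratified_minibatch(strata, batch_size):
    """Proportional allocation: each stratum contributes roughly its
    population share of the mini-batch."""
    total = sum(len(s) for s in strata)
    batch = []
    for s in strata:
        m = max(1, round(batch_size * len(s) / total))
        batch.extend(random.sample(s, min(m, len(s))))
    return batch

points = ([random.gauss(0, 1) for _ in range(100)]
          + [random.gauss(8, 1) for _ in range(100)])
strata = kmeans_strata(points, k=2)
batch = stratified_minibatch(strata, batch_size=16)
print(len(strata), len(batch))
```

A real implementation would cluster in the model's feature space and would likely need smarter seeding (e.g. k-means++) and handling of strata smaller than their allocation.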
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate where revisions have been made to the manuscript.
read point-by-point responses
-
Referee: §4, Theorem 3 (optimality proof): The claim that the proposed MMD stratification objective minimizes variance relies on assumptions including fixed (oracle) strata, specific kernel properties, and independence conditions on the discrepancy estimator. These are not shown to hold when strata are instead estimated via the paper's k-means procedure on finite samples drawn from shifted source and target domains; without this transfer, the variance-reduction guarantee does not necessarily support the reported error bounds or empirical improvements in the stochastic regime.
Authors: We thank the referee for this precise observation. Theorem 3 proves optimality of the proposed MMD objective under the explicit assumptions of oracle strata, kernel properties, and independence; these are stated in the theorem. The expected and worst-case error bounds derived earlier in §4 apply to stratified estimators for arbitrary strata and do not rely on optimality. The k-means procedure is presented as a practical heuristic to approximate good strata from data. In the revised manuscript we have added a dedicated paragraph after Theorem 3 that (i) reiterates the oracle-strata assumption, (ii) notes that k-means provides an empirical approximation whose quality depends on sample size and domain shift, and (iii) clarifies that the reported empirical gains are observed directly and do not rest on the theoretical optimality transferring exactly. We believe this makes the scope of the guarantee transparent. revision: yes
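The "arbitrary strata" bounds the authors invoke rest on the classical variance decomposition for a stratified mean, which under proportional allocation reads (a textbook identity in generic notation, not the paper's):

```latex
% Stratified mean with strata weights w_h = N_h / N, per-stratum variances
% \sigma_h^2, and proportional allocation n_h = n w_h:
\operatorname{Var}\!\left(\hat{\mu}_{\mathrm{strat}}\right)
  = \sum_{h=1}^{k} \frac{w_h^{2}\,\sigma_h^{2}}{n_h}
  = \frac{1}{n}\sum_{h=1}^{k} w_h\,\sigma_h^{2}
  \;\le\; \frac{\sigma^{2}}{n}
  = \operatorname{Var}\!\left(\hat{\mu}_{\mathrm{SRS}}\right),
\qquad
\sigma^{2} = \underbrace{\textstyle\sum_h w_h\,\sigma_h^{2}}_{\text{within}}
           + \underbrace{\textstyle\sum_h w_h\,(\mu_h - \mu)^{2}}_{\text{between}} .
```

The gap equals the between-strata term divided by n, so an estimated stratification (such as the paper's k-means step) helps exactly to the extent that it concentrates variation between strata rather than within them.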
-
Referee: §3.2 (ad hoc stratification objectives): The correlation-alignment objective is presented without a corresponding optimality proof or variance analysis comparable to the MMD case; this asymmetry weakens the unified claim that stratified sampling improves discrepancy estimation across both measures.
Authors: We agree that the presentation is asymmetric. The CORAL stratification objective was obtained by minimising an upper bound on the variance of the correlation-alignment estimator, analogous to the derivation for MMD, but we did not establish a formal optimality result. The paper’s unified claim is that stratified sampling can be instantiated for both discrepancy measures and yields measurable variance reduction in practice; it does not assert identical theoretical optimality for both. In the revision we have (i) rephrased the introductory paragraph of §3.2 to avoid any implication of symmetric optimality and (ii) inserted a short paragraph providing an explicit variance expression for the CORAL stratified estimator. A complete optimality proof for CORAL remains future work. revision: partial
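The correlation-alignment discrepancy at issue is, in its standard Deep CORAL form (Sun and Saenko, 2016), the squared Frobenius distance between source and target feature covariances scaled by 1/(4d^2). A dependency-free sketch of that public definition (not the paper's code):

```python
def covariance(features):
    """Sample covariance matrix of a list of d-dimensional feature rows."""
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in features:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (row[i] - means[i]) * (row[j] - means[j])
    return [[c / (n - 1) for c in r] for r in cov]

def coral_loss(source, target):
    """Squared Frobenius distance between source/target covariances,
    scaled by 1/(4 d^2) as in the standard Deep CORAL formulation."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    fro2 = sum((cs[i][j] - ct[i][j]) ** 2
               for i in range(d) for j in range(d))
    return fro2 / (4 * d * d)

# Identical feature batches align exactly; rescaled ones do not.
src = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
tgt = [[v * 3 for v in row] for row in src]
print(coral_loss(src, src))      # 0.0
print(coral_loss(src, tgt) > 0)  # True
```

Because the covariance entries are quadratic in the features, mini-batch estimates of this loss are themselves high-variance, which is the motivation for stratifying it alongside the MMD.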
Circularity Check
No circularity: optimality proof and bounds are independent of fitted inputs
full rationale
The paper derives stratification objectives for MMD and correlation alignment, states expected/worst-case error bounds, and proves the MMD objective minimizes variance under explicit assumptions. These steps are presented as first-principles derivations rather than reductions to self-defined quantities, fitted parameters renamed as predictions, or self-citation chains. The subsequent k-means algorithm is a practical implementation analyzed separately from the theoretical optimality result. Judging from the abstract and surrounding context, no load-bearing step reduces by construction to its own inputs, and the empirical claims are evaluated against external benchmarks rather than self-defined quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the (unspecified) assumptions under which the proposed MMD objective minimises the variance.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: "prove that our proposed objective for the MMD is theoretically optimal (i.e., minimises the variance) under certain assumptions"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.