pith. machine review for the scientific record.

arxiv: 2605.11231 · v2 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

LiBaGS: Lightweight Boundary Gap Synthesis for Targeted Synthetic Data Selection

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:47 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: synthetic data selection · decision boundary · data augmentation · uncertainty estimation · machine learning · targeted data synthesis

The pith

LiBaGS selects synthetic samples near decision boundaries, while keeping them on the real data manifold, to improve downstream accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LiBaGS as a way to choose which synthetic samples to add to training data so they fill the parts of the distribution that actually help the model. It scores each candidate by how close it is to the current decision boundary, how uncertain the model is about it, how dense the real data is around it, and whether it fits within the support of real examples. A boundary-gap allocation rule then places these samples in sparse boundary areas; soft labels are used near ambiguities, and a diversity check prevents repeats. The process stops when adding more data brings little additional benefit. Experiments indicate this targeted approach beats simple oversampling and uncertainty sampling.
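The abstract names the ingredients of the score but not its equations (the referee report below asks for them). Purely as an illustration of how such a four-term score could be combined, here is a minimal NumPy sketch; the linear weighting, the margin-based proximity, the entropy-based uncertainty, and the k-NN density and support checks are assumptions of this sketch, not the authors' definitions.

```python
import numpy as np

def libags_style_score(probs, cand_X, real_X, k=10, w=(1.0, 1.0, 1.0, 1.0)):
    """Illustrative four-term score for candidate synthetic samples.

    probs  : (n_cand, n_classes) classifier probabilities on the candidates
    cand_X : (n_cand, d) candidate synthetic features
    real_X : (n_real, d) real training features
    Returns one scalar score per candidate (higher = picked earlier).
    """
    # 1) Boundary proximity: small gap between the two largest class probabilities.
    top2 = np.sort(probs, axis=1)[:, -2:]
    proximity = 1.0 - (top2[:, 1] - top2[:, 0])

    # 2) Predictive uncertainty: normalized entropy of the class distribution.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    uncertainty = entropy / np.log(probs.shape[1])

    # 3) Real-data density: inverse mean distance to the k nearest real points.
    dists = np.linalg.norm(cand_X[:, None, :] - real_X[None, :, :], axis=-1)
    density = 1.0 / (1.0 + np.sort(dists, axis=1)[:, :k].mean(axis=1))

    # 4) Support validity: candidate falls inside the k-NN ball of its nearest
    #    real neighbour (a crude on-support proxy, 0/1 valued).
    real_d = np.linalg.norm(real_X[:, None, :] - real_X[None, :, :], axis=-1)
    radius = np.sort(real_d, axis=1)[:, k]      # k-th neighbour distance, self excluded
    nearest = dists.argmin(axis=1)
    support = (dists.min(axis=1) <= radius[nearest]).astype(float)

    w1, w2, w3, w4 = w
    return w1 * proximity + w2 * uncertainty + w3 * density + w4 * support
```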

Core claim

LiBaGS scores candidate synthetic samples by combining decision-boundary proximity, predictive uncertainty, real-data density, and support validity. It applies a boundary-gap allocation rule to target sparse but realistic neighborhoods around decision boundaries, uses a marginal-value stopping rule to decide when enough data has been added, assigns softer labels near ambiguous boundaries, and incorporates a diversity objective to avoid redundant selections. This selection process improves accuracy over classical oversampling, hard augmentation, and other selection criteria.
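Two auxiliary components named in the claim, softer labels near ambiguous boundaries and a diversity objective, admit simple realizations. The smoothing rule, the margin threshold, and the greedy redundancy penalty below are illustrative assumptions, not the paper's stated formulation.

```python
import numpy as np

def soften_labels(probs, hard_labels, n_classes, margin_thresh=0.2, alpha=0.3):
    """Give smoothed targets to candidates whose top-two probability margin is small."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    onehot = np.eye(n_classes)[hard_labels]
    soft = (1 - alpha) * onehot + alpha / n_classes   # uniform label smoothing
    # Ambiguous (small-margin) candidates get the smoothed target, the rest stay hard.
    return np.where((margin < margin_thresh)[:, None], soft, onehot)

def greedy_diverse_pick(scores, cand_X, budget, penalty=1.0):
    """Greedily pick high-score candidates while penalizing near-duplicates."""
    chosen, remaining = [], list(range(len(scores)))
    adj_scores = scores.astype(float).copy()
    for _ in range(min(budget, len(scores))):
        i = max(remaining, key=lambda j: adj_scores[j])
        chosen.append(i)
        remaining.remove(i)
        # Down-weight candidates that sit close to the newly chosen point.
        d = np.linalg.norm(cand_X - cand_X[i], axis=1)
        adj_scores -= penalty * np.exp(-d)
    return chosen
```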

What carries the argument

A combined scoring function based on boundary proximity, uncertainty, density, and support validity, plus a boundary-gap allocation rule and marginal-value stopping criterion.
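Neither the boundary-gap allocation rule nor the marginal-value stopping criterion is spelled out in the material reviewed here. The sketch below assumes one plausible reading: the synthetic budget goes to boundary neighborhoods in proportion to how sparsely real data covers them, and selection stops once the per-round validation gain falls below a threshold.

```python
import numpy as np

def allocate_budget(neighborhood_real_counts, total_budget):
    """Split the synthetic budget across boundary neighborhoods, giving more
    to neighborhoods sparsely covered by real data (assumed allocation rule)."""
    counts = np.asarray(neighborhood_real_counts, dtype=float)
    sparsity = 1.0 / (1.0 + counts)
    weights = sparsity / sparsity.sum()
    return np.round(weights * total_budget).astype(int)

def add_until_marginal_value_drops(train_round, max_rounds, eps=0.002):
    """Stop adding synthetic batches once the validation-accuracy gain of the
    latest round falls below eps. `train_round(r)` retrains with r batches
    of synthetic data and returns validation accuracy."""
    prev_acc = train_round(0)              # accuracy with no synthetic data
    for r in range(1, max_rounds + 1):
        acc = train_round(r)
        if acc - prev_acc < eps:
            return r - 1, prev_acc         # last round that was still worth it
        prev_acc = acc
    return max_rounds, prev_acc
```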

If this is right

  • Models achieve higher accuracy by focusing synthetic data on informative boundary regions rather than adding data uniformly.
  • The method remains effective across different synthetic data generators because it is generator-agnostic.
  • Training becomes more efficient as the stopping rule prevents unnecessary addition of samples.
  • Soft labeling near boundaries reduces the impact of ambiguous synthetic examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar scoring could be adapted for active learning in non-synthetic settings to select real samples.
  • The approach might generalize to tasks beyond classification, such as regression or structured prediction.
  • Future work could test the method on large-scale datasets where boundary estimation is more challenging.

Load-bearing premise

The combined scoring reliably identifies samples that are both informative for the task and stay on the real data manifold without introducing performance-degrading artifacts.

What would settle it

A head-to-head comparison would settle it: the claim fails if samples selected by LiBaGS yield lower accuracy than random or uncertainty-based selection, or turn out to lie clearly off the real data distribution.

Figures

Figures reproduced from arXiv: 2605.11231 by Abhishek Moturu, Anna Goldenberg, Babak Taati.

Figure 1: Qualitative illustration of boundary-gap selection on the two-moons task.
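The figure itself is not reproduced here. For orientation, the following sketch sets up a comparable two-moons task and filters jittered candidates by their distance to the decision boundary; the candidate generator, the simple linear classifier, and the margin threshold are assumptions made for illustration, not the paper's experimental protocol.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
clf = LogisticRegression().fit(X, y)   # stand-in task model for illustration

# Candidate synthetic points: jittered copies of real points (illustrative only).
idx = rng.integers(0, len(X), 500)
cand = X[idx] + rng.normal(0.0, 0.25, size=(500, 2))
probs = clf.predict_proba(cand)

# Keep candidates whose class-probability margin is small, i.e. near the boundary.
margin = np.abs(probs[:, 0] - probs[:, 1])
near_boundary = cand[margin < 0.2]
print(f"{len(near_boundary)} of {len(cand)} candidates lie near the boundary")
```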
original abstract

Synthetic data is useful only when the added samples fill missing parts of the training distribution that matter for the downstream task. We introduce LiBaGS, a lightweight, generator-agnostic method for targeted synthetic training data selection. LiBaGS scores candidate synthetic samples by combining decision-boundary proximity, predictive uncertainty, real-data density, and support validity, so that selected samples are both informative and likely to remain on the real data manifold. We then use a boundary-gap allocation rule that targets sparse but realistic decision-boundary neighborhoods, rather than simply adding more data or selecting only the most uncertain candidates. LiBaGS also learns when enough synthetic samples have been added through a marginal-value stopping rule, assigns softer labels near ambiguous boundaries, and uses a diversity objective to avoid redundant near-duplicate selections. Experiments show that LiBaGS improves accuracy over classical oversampling, hard augmentation, uncertainty and density ablations, and targeted-generation selection criteria.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LiBaGS, a lightweight generator-agnostic method for targeted synthetic data selection. Candidate samples are scored by a linear combination of decision-boundary proximity, predictive uncertainty, real-data density, and support validity; a boundary-gap allocation rule then distributes the synthetic budget toward sparse but realistic boundary neighborhoods. Additional components include a marginal-value stopping criterion, softer labels near ambiguous boundaries, and a diversity objective to avoid near-duplicates. The central claim is that this procedure yields higher downstream accuracy than classical oversampling, hard augmentation, uncertainty-only or density-only ablations, and other targeted-generation baselines.

Significance. If the experimental results are reproducible and the manifold-adherence claim is quantitatively supported, LiBaGS would supply a practical, low-overhead alternative to exhaustive synthetic-data generation or purely uncertainty-driven selection, particularly useful in imbalanced or boundary-sensitive classification tasks.

major comments (2)
  1. [Experiments] Experimental section: the abstract states that LiBaGS improves accuracy over the listed baselines, yet no description of data splits, number of random seeds, statistical significance tests, or error bars is supplied. Without these, the reported gains cannot be assessed for robustness or selection bias.
  2. [Method] Scoring function (implicitly defined in §3): the density and support-validity terms are intended to keep selected points on the real manifold, but in high-dimensional regimes k-NN or kernel density estimates suffer exponential bias; no held-out log-likelihood, reconstruction error, or other quantitative manifold check is reported to confirm that the combined score actually enforces in-manifold selection rather than merely correlating with accuracy.
minor comments (2)
  1. [Method] Provide explicit equations for the four-term score, the boundary-gap allocation rule, and the marginal-value stopping criterion so that the method can be re-implemented without ambiguity.
  2. [Method] Clarify the precise definition of 'support validity' and how it is computed from the real-data distribution.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, agreeing where changes are needed and providing clarifications where appropriate.

point-by-point responses
  1. Referee: [Experiments] Experimental section: the abstract states that LiBaGS improves accuracy over the listed baselines, yet no description of data splits, number of random seeds, statistical significance tests, or error bars is supplied. Without these, the reported gains cannot be assessed for robustness or selection bias.

    Authors: We agree with the referee that the experimental details are insufficiently described. We will revise the manuscript to explicitly state the data splitting procedure (e.g., stratified 5-fold cross-validation or fixed splits), the number of random seeds used for all experiments (10 seeds), report results with mean ± standard deviation, and include p-values from statistical tests (e.g., Wilcoxon signed-rank test) to confirm the significance of accuracy improvements over baselines. revision: yes

  2. Referee: [Method] Scoring function (implicitly defined in §3): the density and support-validity terms are intended to keep selected points on the real manifold, but in high-dimensional regimes k-NN or kernel density estimates suffer exponential bias; no held-out log-likelihood, reconstruction error, or other quantitative manifold check is reported to confirm that the combined score actually enforces in-manifold selection rather than merely correlating with accuracy.

    Authors: We thank the referee for highlighting this important aspect. While our method combines multiple terms to promote in-manifold selection, we recognize that direct quantitative validation of manifold adherence was not provided. In the revised manuscript, we will add a new subsection or paragraph discussing the limitations of density estimation in high dimensions and include additional experiments reporting the average distance to the k-nearest real neighbors for LiBaGS-selected samples versus baselines. This will provide quantitative support for the claim that selected points remain on the real data manifold. revision: yes
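Both promised revisions amount to standard checks. A minimal sketch of how they could be run, assuming SciPy and scikit-learn, with placeholder per-seed accuracies and generic feature matrices standing in for the paper's actual experimental outputs.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.neighbors import NearestNeighbors

# Response 1: report mean +/- std over seeds and a paired significance test.
libags_acc   = np.array([0.874, 0.881, 0.869, 0.877, 0.872])  # placeholder values
baseline_acc = np.array([0.861, 0.866, 0.858, 0.864, 0.860])  # placeholder values
stat, p = wilcoxon(libags_acc, baseline_acc)
print(f"LiBaGS {libags_acc.mean():.3f} +/- {libags_acc.std():.3f}, "
      f"baseline {baseline_acc.mean():.3f} +/- {baseline_acc.std():.3f}, p={p:.4f}")

# Response 2: average distance from selected samples to their k nearest real points,
# a quantitative proxy for staying on the real data manifold.
def mean_knn_distance(selected_X, real_X, k=10):
    nn = NearestNeighbors(n_neighbors=k).fit(real_X)
    dists, _ = nn.kneighbors(selected_X)
    return dists.mean()
```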

Circularity Check

0 steps flagged

No circularity: scoring rules and allocation defined independently of accuracy metric

full rationale

The paper introduces LiBaGS by explicitly defining four scoring terms (boundary proximity, uncertainty, real-data density, support validity) plus allocation, stopping, labeling, and diversity rules as independent heuristics. These are not derived from or fitted to the final accuracy; they are stated as design choices whose combination is then tested empirically against baselines. No equations reduce a claimed prediction to a fitted parameter by construction, no self-citation supplies a uniqueness theorem that forces the method, and no ansatz is smuggled via prior work. The experimental claims therefore rest on external comparison rather than tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; the method rests on standard machine-learning assumptions about decision boundaries and uncertainty estimation without explicit free parameters, new axioms, or invented entities listed.

pith-pipeline@v0.9.0 · 5461 in / 1115 out tokens · 47830 ms · 2026-05-14T20:47:30.536224+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 3 internal anchors
