Recognition: 2 theorem links · Lean Theorem
LiBaGS: Lightweight Boundary Gap Synthesis for Targeted Synthetic Data Selection
Pith reviewed 2026-05-14 20:47 UTC · model grok-4.3
The pith
LiBaGS selects synthetic data near decision boundaries to improve accuracy on the real data manifold.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LiBaGS scores candidate synthetic samples by combining decision-boundary proximity, predictive uncertainty, real-data density, and support validity. It applies a boundary-gap allocation rule to target sparse but realistic neighborhoods around decision boundaries, uses a marginal-value stopping rule to decide when enough data has been added, assigns softer labels near ambiguous boundaries, and incorporates a diversity objective to avoid redundant selections. This selection process improves accuracy over classical oversampling, hard augmentation, and other selection criteria.
What carries the argument
A combined scoring function based on boundary proximity, uncertainty, density, and support validity, plus a boundary-gap allocation rule and marginal-value stopping criterion.
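The section names the four terms but does not reproduce the scoring equation. A minimal sketch of one plausible form, with illustrative weights and a hard support gate (the function, weights, and term shapes are all assumptions, not the paper's formula):

```python
import numpy as np

def combined_score(margin, entropy, density, in_support, w=(1.0, 1.0, 1.0)):
    """Hypothetical four-term LiBaGS-style score (weights w are illustrative)."""
    proximity = np.exp(-np.abs(margin))  # peaks exactly on the decision boundary
    # support validity acts as a hard gate; the soft terms combine linearly
    return in_support * (w[0] * proximity + w[1] * entropy + w[2] * density)

# rank three candidate synthetic samples
margins = np.array([0.05, 1.2, 0.01])     # distance to the decision boundary
entropies = np.array([0.65, 0.10, 0.69])  # predictive uncertainty
densities = np.array([0.40, 0.90, 0.02])  # real-data density estimate
support = np.array([1, 1, 0])             # third candidate is off-support
scores = combined_score(margins, entropies, densities, support)
picked = np.argsort(scores)[::-1]         # spend the budget on top scores first
```

Note how the gate zeroes out the third candidate even though it sits closest to the boundary: informativeness alone does not earn selection.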
If this is right
- Models achieve higher accuracy by focusing synthetic data on informative boundary regions rather than adding data uniformly.
- The method remains effective across different synthetic data generators because it is generator-agnostic.
- Training becomes more efficient as the stopping rule prevents unnecessary addition of samples.
- Soft labeling near boundaries reduces the impact of ambiguous synthetic examples.
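The soft-labeling bullet can be made concrete. A hypothetical rule (not the paper's formula) that blends a one-hot label toward uniform as the sample's boundary margin shrinks:

```python
import numpy as np

def soften_label(hard_label, margin, num_classes=2, scale=1.0):
    """Blend a one-hot label toward uniform as |margin| -> 0 (illustrative rule)."""
    onehot = np.eye(num_classes)[hard_label]
    uniform = np.full(num_classes, 1.0 / num_classes)
    confidence = np.tanh(abs(margin) / scale)  # 0 at the boundary, -> 1 far away
    return confidence * onehot + (1.0 - confidence) * uniform

far_label = soften_label(1, margin=5.0)    # far from boundary: nearly one-hot
near_label = soften_label(1, margin=0.05)  # near boundary: close to uniform
```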
Where Pith is reading between the lines
- Similar scoring could be adapted for active learning in non-synthetic settings to select real samples.
- The approach might generalize to tasks beyond classification, such as regression or structured prediction.
- Future work could test the method on large-scale datasets where boundary estimation is more challenging.
Load-bearing premise
The combined scoring reliably identifies samples that are both informative for the task and stay on the real data manifold without introducing performance-degrading artifacts.
What would settle it
An experiment showing that samples selected by LiBaGS yield lower accuracy than random or uncertainty-based selection, or fall clearly off the real data distribution, would refute the claim; controlled comparisons showing the opposite would confirm it.
Original abstract
Synthetic data is useful only when the added samples fill missing parts of the training distribution that matter for the downstream task. We introduce LiBaGS, a lightweight, generator-agnostic method for targeted synthetic training data selection. LiBaGS scores candidate synthetic samples by combining decision-boundary proximity, predictive uncertainty, real-data density, and support validity, so that selected samples are both informative and likely to remain on the real data manifold. We then use a boundary-gap allocation rule that targets sparse but realistic decision-boundary neighborhoods, rather than simply adding more data or selecting only the most uncertain candidates. LiBaGS also learns when enough synthetic samples have been added through a marginal-value stopping rule, assigns softer labels near ambiguous boundaries, and uses a diversity objective to avoid redundant near-duplicate selections. Experiments show that LiBaGS improves accuracy over classical oversampling, hard augmentation, uncertainty and density ablations, and targeted-generation selection criteria.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LiBaGS, a lightweight generator-agnostic method for targeted synthetic data selection. Candidate samples are scored by a linear combination of decision-boundary proximity, predictive uncertainty, real-data density, and support validity; a boundary-gap allocation rule then distributes the synthetic budget toward sparse but realistic boundary neighborhoods. Additional components include a marginal-value stopping criterion, softer labels near ambiguous boundaries, and a diversity objective to avoid near-duplicates. The central claim is that this procedure yields higher downstream accuracy than classical oversampling, hard augmentation, uncertainty-only or density-only ablations, and other targeted-generation baselines.
Significance. If the experimental results are reproducible and the manifold-adherence claim is quantitatively supported, LiBaGS would supply a practical, low-overhead alternative to exhaustive synthetic-data generation or purely uncertainty-driven selection, particularly useful in imbalanced or boundary-sensitive classification tasks.
major comments (2)
- [Experiments] Experimental section: the abstract states that LiBaGS improves accuracy over the listed baselines, yet no description of data splits, number of random seeds, statistical significance tests, or error bars is supplied. Without these, the reported gains cannot be assessed for robustness or selection bias.
- [Method] Scoring function (implicitly defined in §3): the density and support-validity terms are intended to keep selected points on the real manifold, but in high-dimensional regimes k-NN or kernel density estimates suffer exponential bias; no held-out log-likelihood, reconstruction error, or other quantitative manifold check is reported to confirm that the combined score actually enforces in-manifold selection rather than merely correlating with accuracy.
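The high-dimensional bias the referee points to is easy to reproduce: with uniform points, the ratio of farthest to nearest distance from a query collapses toward 1 as dimension grows, which undermines k-NN density estimates (a standard illustration, not an experiment from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n=2000):
    """Ratio of farthest to nearest distance from the origin to n uniform points."""
    pts = rng.uniform(-1.0, 1.0, size=(n, dim))
    d = np.linalg.norm(pts, axis=1)
    return d.max() / d.min()

low_dim = distance_contrast(2)     # strong contrast: "nearest" is meaningful
high_dim = distance_contrast(500)  # contrast collapses toward 1
```

When the contrast is near 1, a k-NN density estimate assigns nearly the same value everywhere, so the density term stops discriminating on-manifold from off-manifold candidates.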
minor comments (2)
- [Method] Provide explicit equations for the four-term score, the boundary-gap allocation rule, and the marginal-value stopping criterion so that the method can be re-implemented without ambiguity.
- [Method] Clarify the precise definition of 'support validity' and how it is computed from the real-data distribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, agreeing where changes are needed and providing clarifications where appropriate.
Point-by-point responses
- Referee: [Experiments] Experimental section: the abstract states that LiBaGS improves accuracy over the listed baselines, yet no description of data splits, number of random seeds, statistical significance tests, or error bars is supplied. Without these, the reported gains cannot be assessed for robustness or selection bias.
  Authors: We agree with the referee that the experimental details are insufficiently described. We will revise the manuscript to state the data-splitting procedure explicitly (e.g., stratified 5-fold cross-validation or fixed splits), report the number of random seeds used in all experiments (10), give results as mean ± standard deviation, and include p-values from statistical tests (e.g., the Wilcoxon signed-rank test) to confirm the significance of the accuracy improvements over baselines. revision: yes
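As a sketch of the promised reporting, per-seed paired accuracies (illustrative numbers, not the paper's results) can be compared with SciPy's Wilcoxon signed-rank test:

```python
import numpy as np
from scipy.stats import wilcoxon

# hypothetical per-seed test accuracies over 10 seeds (illustrative values)
libags = np.array([0.912, 0.905, 0.918, 0.909, 0.915,
                   0.911, 0.907, 0.913, 0.910, 0.916])
baseline = np.array([0.898, 0.894, 0.901, 0.890, 0.899,
                     0.896, 0.8935, 0.900, 0.8955, 0.8975])

print(f"LiBaGS  : {libags.mean():.3f} ± {libags.std(ddof=1):.3f}")
print(f"baseline: {baseline.mean():.3f} ± {baseline.std(ddof=1):.3f}")
stat, p = wilcoxon(libags, baseline)  # paired, non-parametric across seeds
```

The paired test is the right shape here because each seed yields one accuracy for each method on the same split.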
- Referee: [Method] Scoring function (implicitly defined in §3): the density and support-validity terms are intended to keep selected points on the real manifold, but in high-dimensional regimes k-NN or kernel density estimates suffer exponential bias; no held-out log-likelihood, reconstruction error, or other quantitative manifold check is reported to confirm that the combined score actually enforces in-manifold selection rather than merely correlating with accuracy.
  Authors: We thank the referee for highlighting this important aspect. While our method combines multiple terms to promote in-manifold selection, we recognize that direct quantitative validation of manifold adherence was not provided. In the revised manuscript, we will add a new subsection or paragraph discussing the limitations of density estimation in high dimensions and include additional experiments reporting the average distance to the k nearest real neighbors for LiBaGS-selected samples versus baselines. This will provide quantitative support for the claim that selected points remain on the real data manifold. revision: yes
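The promised k-nearest-real-neighbor check can be sketched with a KD-tree; the data below are synthetic stand-ins, not the paper's datasets:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(1000, 8))         # stand-in real data
on_manifold = rng.normal(0.0, 1.0, size=(100, 8))   # synthetic, same distribution
off_manifold = rng.normal(6.0, 1.0, size=(100, 8))  # synthetic, shifted off-support

tree = cKDTree(real)

def mean_knn_dist(samples, k=5):
    """Average distance from each sample to its k nearest real neighbors."""
    dists, _ = tree.query(samples, k=k)
    return dists.mean()

near = mean_knn_dist(on_manifold)   # small: samples sit on the real manifold
far = mean_knn_dist(off_manifold)   # large: samples are off-support
```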
Circularity Check
No circularity: scoring rules and allocation defined independently of accuracy metric
full rationale
The paper introduces LiBaGS by explicitly defining four scoring terms (boundary proximity, uncertainty, real-data density, support validity) plus allocation, stopping, labeling, and diversity rules as independent heuristics. These are not derived from or fitted to the final accuracy; they are stated as design choices whose combination is then tested empirically against baselines. No equations reduce a claimed prediction to a fitted parameter by construction, no self-citation supplies a uniqueness theorem that forces the method, and no ansatz is smuggled via prior work. The experimental claims therefore rest on external comparison rather than tautological re-labeling of inputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  J_m(q) = ∫ r(z) / (n p(z) + m q(z)) dz … q*(z) = (1/m) [√(r(z)/λ) − n p(z)]_+
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Δ_j(t_j) = r_j / (c_j + t_j) − r_j / (c_j + t_j + 1)
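The linked expression is a diminishing-returns marginal gain, which is the shape that makes greedy per-region allocation and a marginal-value stopping rule sensible. A quick numeric check (the values of r_j and c_j are arbitrary):

```python
def marginal_gain(r, c, t):
    """Risk reduction from adding the (t+1)-th synthetic sample to a region
    with local rate r and effective prior count c (illustrative values)."""
    return r / (c + t) - r / (c + t + 1)

gains = [marginal_gain(r=1.0, c=2.0, t=t) for t in range(6)]
# gains shrink monotonically; a marginal-value rule stops adding samples
# once the gain dips below a threshold, here 0.02
stop_at = next(t for t, g in enumerate(gains) if g < 0.02)
```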
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] E. Alpaydin and F. Alimoglu. Pen-Based Recognition of Handwritten Digits. UCI Machine Learning Repository, 1996. DOI: 10.24432/C5MG6K.
- [2] Reyhane Askari-Hemmat, Mohammad Pezeshki, Elvis Dohmatob, Florian Bordes, Pietro Astolfi, Melissa Hall, Jakob Verbeek, Michal Drozdzal, and Adriana Romero-Soriano. Improving the scaling laws of synthetic data with deliberate practice. arXiv preprint arXiv:2502.15588, 2025.
- [3] Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mohammad Norouzi, and David J Fleet. Synthetic data from diffusion models improves ImageNet classification. arXiv preprint arXiv:2304.08466, 2023.
- [4] Sukarna Barua, Md Monirul Islam, Xin Yao, and Kazuyuki Murase. MWMOTE: majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2):405–425, 2012.
- [5] Colin Bellinger, Christopher Drummond, and Nathalie Japkowicz. Manifold-based synthetic oversampling with manifold conformance estimation. Machine Learning, 107(3):605–637, 2018.
- [6] Chumphol Bunkhumpornpat, Krung Sinapiromsaran, and Chidchanok Lursinsap. Safe-Level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 475–482. Springer, 2009.
- [7] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
- [8] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
- [9] Damien Dablain, Bartosz Krawczyk, and Nitesh V Chawla. DeepSMOTE: Fusing deep learning and SMOTE for imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 34(9):6390–6404, 2022.
- [10] Georgios Douzas and Fernando Bacao. Geometric SMOTE: a geometrically enhanced drop-in replacement for SMOTE. Information Sciences, 501:118–135, 2019.
- [11] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059. PMLR, 2016.
- [12] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
- [13] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
- [14] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pages 878–887. Springer, 2005.
- [15] David S Hayden, Mao Ye, Timur Garipov, Gregory P Meyer, Carl Vondrick, Zhao Chen, Yuning Chai, Eric Wolff, and Siddhartha S Srinivasa. Generative data mining with longtail-guided diffusion. arXiv preprint arXiv:2502.01980, 2025.
- [16] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pages 1322–1328. IEEE, 2008.
- [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [18] Reyhane Askari Hemmat, Mohammad Pezeshki, Florian Bordes, Michal Drozdzal, and Adriana Romero-Soriano. Feedback-guided data synthesis for imbalanced classification. arXiv preprint arXiv:2310.00158, 2023.
- [19] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [20] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
- [21] Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood, and Karthik Nandakumar. DiffuseMix: Label-preserving data augmentation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27621–27630, 2024.
- [22] Jae Myung Kim, Jessica Bader, Stephan Alaniz, Cordelia Schmid, and Zeynep Akata. DataDream: Few-shot guided dataset generation. In European Conference on Computer Vision, pages 252–268. Springer, 2024.
- [23] Jang-Hyun Kim, Wonho Choo, Hosan Jeong, and Hyun Oh Song. Co-Mixup: Saliency guided joint mixup with supermodular diversity. arXiv preprint arXiv:2102.03065, 2021.
- [24] Jeeyung Kim, Erfan Esmaeili, and Qiang Qiu. Generate what matters: Steering diffusion models for targeted data generation to improve classification. OpenReview preprint, 2025.
- [25] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [26] Soroush Abbasi Koohpayegani, Anuj Singh, KL Navaneet, Hamed Pirsiavash, and Hadi Jamali-Rad. GeNIe: Generative hard negative images through diffusion. arXiv preprint arXiv:2312.02548, 2023.
- [27] Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability, 3(71-104):3, 2014.
- [28] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research), 2009.
- [29] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017.
- [30] F Last, G Douzas, and F Bacao. Oversampling for imbalanced learning based on k-means and SMOTE. arXiv preprint arXiv:1711.00837, 2017.
- [31] Zhiteng Li, Lele Chen, Jerone Andrews, Yunhao Ba, Yulun Zhang, and Alice Xiang. GenDataAgent: On-the-fly dataset augmentation with synthetic data. In The Thirteenth International Conference on Learning Representations, 2025.
- [32] Yijun Liang, Shweta Bhardwaj, and Tianyi Zhou. Diffusion curriculum: Synthetic-to-real data curriculum via image-guided diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1697–1707, 2025.
- [33] Kexin Liu, Hao Zhang, Yabin Wang, Chenxin Cai, Tingting Wu, and Jie Liu. ExploreAugment: Adaptive exploratory data augmentation based on boundary awareness. OpenReview preprint, 2025.
- [34] Christoph Mayer and Radu Timofte. Adversarial sampling for active learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3071–3079, 2020.
- [35] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978.
- [36] Dang Nguyen, Jiping Li, Jinghao Zheng, and Baharan Mirzasoleiman. Do we need all the synthetic data? Targeted synthetic image augmentation via diffusion models. arXiv preprint arXiv:2505.21574, 2025.
- [37] Joshua Niemeijer, Jan Ehrhardt, Hristina Uzunova, and Heinz Handels. TSynD: Targeted synthetic data generation for enhanced medical image classification: Leveraging epistemic uncertainty to improve model performance. In International Workshop on Simulation and Synthesis in Medical Imaging, pages 69–78. Springer, 2024.
- [38] Seulki Park, Youngkyu Hong, Byeongho Heo, Sangdoo Yun, and Jin Young Choi. The majority can help the minority: Context-rich minority oversampling for long-tailed classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6887–6896, 2022.
- [39] Kanil Patel, William Beluch, Dan Zhang, Michael Pfeiffer, and Bin Yang. On-manifold adversarial data augmentation improves uncertainty calibration. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 8029–8036. IEEE, 2021.
- [40] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
- [41] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
- [42] Burr Settles. Active learning literature survey, 2009.
- [43] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1–48, 2019.
- [44] Bernard W Silverman. Density Estimation for Statistics and Data Analysis. Routledge, 2018.
- [45] Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A Smith, and Yejin Choi. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9275–9293, 2020.
- [46] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159, 2018.
- [47] Vladimir N Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
- [48] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pages 6438–6447. PMLR, 2019.
- [49] Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Anima Anandkumar, and Zhangyang Wang. AugMax: Adversarial composition of random augmentations for robust training. Advances in Neural Information Processing Systems, 34:237–250, 2021.
- [50] Liantao Wang, Xuelei Hu, Bo Yuan, and Jianfeng Lu. Active learning via query synthesis and nearest neighbour search. Neurocomputing, 147:426–434, 2015.
- [51] Yanghao Wang and Long Chen. Inversion circle interpolation: Diffusion-based image augmentation for data-scarce classification. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25560–25569, 2025.
- [52] Zerun Wang, Jiafeng Mao, Xueting Wang, and Toshihiko Yamasaki. Training data synthesis with difficulty controlled diffusion model. arXiv preprint arXiv:2411.18109, 2024.
- [53] Zhicai Wang, Longhui Wei, Tan Wang, Heyu Chen, Yanbin Hao, Xiang Wang, Xiangnan He, and Qi Tian. Enhance image classification via inter-class image mixup with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17223–17233, 2024.
- [54] Zixuan Wu, So Won Jeong, Yating Liu, Yeo Jin Jung, and Claire Donnat. Filtering with confidence: When data augmentation meets conformal prediction. arXiv preprint arXiv:2509.21479, 2025.
- [55] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032, 2019.
- [56] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
discussion (0)