arXiv preprint arXiv:2302.06960
3 Pith papers cite this work.
citing papers explorer
- On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity
  On-policy self-distillation with sampled demonstrations reduces rollout diversity by amplifying existing probability gaps in the base model, unlike ideal RL, which preserves ratios among correct outputs.
- Data Selection Through Iterative Self-Filtering for Vision-Language Settings
  An iterative bootstrapped self-filtering approach selects balanced, clean, and diverse subsets from noisy vision-language datasets to train improved CLIP models.
- OrderDP: A Theoretically Guaranteed Lossless Dynamic Data Pruning Framework
  OrderDP is a plug-and-play data pruning method that selects a random subset, then the top-q samples within it, to guarantee unbiased surrogate-loss training, with a convergence analysis and over 40% training cost reduction on CIFAR and ImageNet.