SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning
Pith reviewed 2026-06-30 22:00 UTC · model grok-4.3
The pith
SLAP selects entire batches of instruction data via Hessian-approximated gradients and stratified sampling, letting models reach or exceed full-dataset performance on dialogue, translation, and QA tasks with 20-40 percent less data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SLAP is a batch-aware data selection framework that evaluates the learnability of entire batch compositions rather than individual samples, ensures comprehensive data distribution coverage through distribution-aware stratified sampling while maximizing intra-batch diversity through relative distance optimization, and leverages Hessian-approximated gradient information for dynamic batch selection, achieving superior performance with 20-40% less training data compared to full dataset training while maintaining or improving model capabilities across multiple architectures and tasks.
What carries the argument
The dynamic batch selection mechanism that scores learnability of complete batch compositions using Hessian-approximated gradient information.
If this is right
- SLAP-selected subsets outperform full datasets on multi-turn dialogue, multilingual translation, and question answering.
- The gains hold across LLaMA and ChatGLM architectures.
- Training data volume drops by 20-40% with no capability loss.
- Overall computational cost of instruction tuning falls substantially.
Where Pith is reading between the lines
- The batch-level scoring could be applied to other fine-tuning settings such as preference alignment.
- Further savings might appear if SLAP is combined with model compression techniques.
- Limits of the method would become clearer through tests on models larger than those reported.
- Practice could move toward repeated on-policy selection during training instead of one-time static pruning.
Load-bearing premise
That batch-level learnability scores from Hessian approximations reliably pick data compositions that generalize across model architectures and tasks.
What would settle it
An experiment on a held-out model or task where the SLAP-selected 60-80% subset produces statistically lower performance than the full dataset.
Figures
read the original abstract
Instruction tuning has optimized the specialized capabilities of large language models (LLMs), but it often requires extensive datasets and prolonged training times. The challenge lies in developing specific capabilities by identifying useful data and efficiently fine-tuning. High-quality and diverse pruned data can help models achieve lossless performance at a lower cost. In this paper, we propose \textbf{SLAP}, a novel batch-aware data selection framework that evaluates the learnability of entire batch compositions rather than individual. SLAP ensures comprehensive data distribution coverage through distribution-aware stratified sampling while maximizing intra-batch diversity through relative distance optimization. By leveraging Hessian-approximated gradient information for dynamic batch selection, SLAP significantly outperforms existing state-of-the-art methods across multiple model architectures (LLaMA, ChatGLM) and diverse downstream tasks including multi-turn dialogue, multilingual translation, and question answering. Most notably, SLAP achieves superior performance with 20-40\% less training data compared to full dataset training, substantially reducing computational costs while maintaining or improving model capabilities. These results establish SLAP as a powerful approach for efficient and effective instruction tuning of large language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SLAP, a batch-aware data selection framework for efficient instruction tuning of LLMs. It combines distribution-aware stratified sampling for coverage, relative distance optimization for intra-batch diversity, and Hessian-approximated gradient information for dynamic batch selection. The central claim is that SLAP outperforms existing SOTA methods on LLaMA and ChatGLM across tasks (multi-turn dialogue, multilingual translation, QA), achieving superior performance with 20-40% less training data than full-dataset training while maintaining or improving capabilities.
Significance. If the empirical claims were substantiated, the work could meaningfully advance data-efficient fine-tuning by reducing compute costs for LLM instruction tuning. However, the provided manuscript consists solely of an abstract with no experimental details, quantitative results, baselines, error bars, ablation studies, or methodology sections, so the significance cannot be assessed. The method's reliance on Hessian approximations and batch-level learnability for generalization across architectures and tasks remains unevaluated.
major comments (2)
- [Abstract] Abstract: the central performance claim (superior results with 20-40% less data across models and tasks) is stated without any supporting experimental evidence, tables, baselines, or implementation details, rendering the claim impossible to evaluate or reproduce from the manuscript.
- [Abstract] Abstract: the description of the dynamic batch selection mechanism (Hessian-approximated gradients combined with batch-level learnability) provides no procedure, approximation details, or pseudocode, so it is impossible to assess whether this component reliably identifies generalizable data compositions as claimed.
minor comments (2)
- [Abstract] The title refers to 'Stratified Loss-based Pruning' and 'On-Policy' but the abstract describes a 'batch-aware data selection framework' without explaining the loss-based pruning aspect or the on-policy component.
- [Abstract] The abstract asserts outperformance over 'existing state-of-the-art methods' but does not name those methods or indicate how they were implemented for comparison.
Simulated Author's Rebuttal
We thank the referee for the comments. We acknowledge that the submitted manuscript was limited to the abstract and contained no experimental sections, results, or methodological details. The revised version will address this by expanding to a full paper.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central performance claim (superior results with 20-40% less data across models and tasks) is stated without any supporting experimental evidence, tables, baselines, or implementation details, rendering the claim impossible to evaluate or reproduce from the manuscript.
Authors: We agree that the abstract alone provides no evidence for the claims. The revised manuscript will include full experimental results with tables, baselines, error bars, ablation studies, and implementation details across the reported models and tasks. revision: yes
-
Referee: [Abstract] Abstract: the description of the dynamic batch selection mechanism (Hessian-approximated gradients combined with batch-level learnability) provides no procedure, approximation details, or pseudocode, so it is impossible to assess whether this component reliably identifies generalizable data compositions as claimed.
Authors: We agree that the abstract lacks the necessary procedural details. The revised manuscript will add a dedicated methodology section with the exact procedure, Hessian approximation method, batch-level learnability formulation, and pseudocode. revision: yes
Circularity Check
No circularity: method relies on external Hessian approximation and stratified sampling without self-referential reduction
full rationale
The abstract and description present SLAP as a batch-aware selection framework that applies Hessian-approximated gradients, distribution-aware stratified sampling, and relative distance optimization. No equations, derivation steps, or self-citations are supplied that would make any claimed prediction equivalent to its inputs by construction. The approach invokes standard external techniques (Hessian approximation) rather than defining quantities in terms of the target performance gains. The central claim of 20-40% data reduction therefore rests on empirical validation outside the method's own definitions, yielding a self-contained derivation with no detectable circular steps.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
AI@Meta: Llama 3 model card (2024), https://github.com/meta-llama/llama3/ blob/main/MODEL_CARD.md
2024
-
[2]
Cornell University - arXiv,Cornell University - arXiv (Oct 2015)
Babenko, A., Lempitsky, V.: Aggregating deep convolutional features for image retrieval. Cornell University - arXiv,Cornell University - arXiv (Oct 2015)
2015
-
[3]
Cook, W.J., Cunningham, W.H., Pulleyblank, W.R., Schrijver, A.: Combinatorial Optimization, vol. 605. Springer (1998)
1998
-
[4]
Advances in Neural Information Processing Systems36, 8513–8527 (2023)
Deng, Z., Cui, P., Zhu, J.: Towards accelerated model training via bayesian data se- lection. Advances in Neural Information Processing Systems36, 8513–8527 (2023)
2023
-
[5]
Evans, T., Parthasarathy, N., Merzic, H., Henaff, O.J.: Data curation via joint example selection further accelerates multimodal learning (2024), https://arxiv. org/abs/2406.17711
-
[6]
arXiv preprint arXiv:2306.11670 (2023)
Everaert, D., Potts, C.: Gio: Gradient information optimization for training dataset selection. arXiv preprint arXiv:2306.11670 (2023)
-
[7]
GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Rojas, D., Feng, G., Zhao, H., Lai, H., Yu, H., Wang, H., Sun, J., Zhang, J., Cheng, J., Gui, J., Tang, J., Zhang, J., Li, J., Zhao, L., Wu, L., Zhong, L., Liu, M., Huang, M., Zhang, P., Zheng, Q., Lu, R., Duan, S., Zhang, S., Cao, S., Yang, S., Tam, W.L., Zhao, W., Liu, X., Xia, X., Zhang, X., Gu, ...
2024
-
[8]
In: International Conference on Database and Expert Systems Applications
Guo, C., Zhao, B., Bai, Y.: Deepcore: A comprehensive library for coreset selection in deep learning. In: International Conference on Database and Expert Systems Applications. pp. 181–195. Springer (2022)
2022
-
[9]
Scaling Laws and Interpretability of Learning from Repeated Data
Hernandez, D., Brown, T., Conerly, T., DasSarma, N., Drain, D., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Henighan, T., Hume, T., et al.: Scaling laws and interpretability of learning from repeated data. arXiv preprint arXiv:2205.10487 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[10]
arXiv preprint arXiv:2406.04872 (2024)
Hong, F., Lyu, Y., Yao, J., Zhang, Y., Tsang, I.W., Wang, Y.: Diversified batch selection for training acceleration. arXiv preprint arXiv:2406.04872 (2024)
-
[11]
arXiv: Databases,arXiv: Databases (Jan 2018)
Hsieh, K., Ananthanarayanan, G., Bodik, P., Bahl, P., Philipose, M., Gibbons, P., Mutlu, O.: Focus: Querying large video datasets with low latency and low cost. arXiv: Databases,arXiv: Databases (Jan 2018)
2018
-
[12]
In: International conference on machine learning
Jiang,L.,Zhou,Z.,Leung,T.,Li,L.J.,Fei-Fei,L.:Mentornet:Learningdata-driven curriculum for very deep neural networks on corrupted labels. In: International conference on machine learning. pp. 2304–2313. PMLR (2018)
2018
-
[13]
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2017), https: //arxiv.org/abs/1412.6980
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[14]
In: International conference on machine learning
Koh, P.W., Liang, P.: Understanding black-box predictions via influence functions. In: International conference on machine learning. pp. 1885–1894. PMLR (2017)
2017
-
[15]
In: Text sum- marization branches out
Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text sum- marization branches out. pp. 74–81 (2004) 14 F. Author et al
2004
- [16]
-
[17]
In: International Conference on Machine Learning
Mindermann, S., Brauner, J.M., Razzak, M.T., Sharma, M., Kirsch, A., Xu, W., Höltgen, B., Gomez, A.N., Morisot, A., Farquhar, S., et al.: Prioritized training on points that are learnable, worth learning, and not yet learnt. In: International Conference on Machine Learning. pp. 15630–15649. PMLR (2022)
2022
-
[18]
Advances in neural information processing sys- tems35, 27730–27744 (2022)
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in neural information processing sys- tems35, 27730–27744 (2022)
2022
-
[19]
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
2002
-
[20]
Advances in neural information processing systems34, 20596–20607 (2021)
Paul, M., Ganguli, S., Dziugaite, G.K.: Deep learning on a data diet: Finding important examples early in training. Advances in neural information processing systems34, 20596–20607 (2021)
2021
-
[21]
In: International Conference on Machine Learning
Pooladzandi, O., Davini, D., Mirzasoleiman, B.: Adaptive second order coresets for data-efficient machine learning. In: International Conference on Machine Learning. pp. 17848–17869. PMLR (2022)
2022
- [22]
-
[23]
Schwenk, H., Chaudhary, V., Sun, S., Gong, H., Guzmán, F.: Wikimatrix: Mining 135m parallel sentences in 1620 language pairs from wikipedia (2019), https:// arxiv.org/abs/1907.05791
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[24]
Active Learning for Convolutional Neural Networks: A Core-Set Approach
Sener, O., Savarese, S.: Active learning for convolutional neural networks: A core- set approach. arXiv preprint arXiv:1708.00489 (2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[25]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Jun- Mei Song, Mingchuan Zhang, Y
Shao, Y., Li, L., Dai, J., Qiu, X.: Character-llm: A trainable agent for role-playing. arXiv preprint arXiv:2310.10158 (2023)
-
[26]
Advances in Neural In- formation Processing Systems35, 19523–19536 (2022)
Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., Morcos, A.: Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural In- formation Processing Systems35, 19523–19536 (2022)
2022
-
[27]
arXiv preprint arXiv:2305.12816 (2023)
Wang, X., Zhou, W., Zhang, Q., Zhou, J., Gao, S., Wang, J., Zhang, M., Gao, X., Chen, Y., Gui, T.: Farewell to aimless large-scale pretraining: Influential subset selection for language model. arXiv preprint arXiv:2305.12816 (2023)
-
[28]
arXiv preprint arXiv:2310.00746 (2023)
Wang, Z.M., Peng, Z., Que, H., Liu, J., Zhou, W., Wu, Y., Guo, H., Gan, R., Ni, Z., Zhang, M., et al.: Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv:2310.00746 (2023)
-
[29]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wei, H., Feng, L., Chen, X., An, B.: Combating noisy labels by agreement: A joint training method with co-regularization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13726–13735 (2020)
2020
-
[30]
Advances in neural information processing systems35, 24824–24837 (2022)
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems35, 24824–24837 (2022)
2022
- [32]
-
[33]
In: The Eleventh International Conference on Learning Representations (2022)
Xia, X., Liu, J., Yu, J., Shen, X., Han, B., Liu, T.: Moderate coreset: A univer- sal method of data selection for real-world data-efficient deep learning. In: The Eleventh International Conference on Learning Representations (2022)
2022
-
[34]
arXiv preprint arXiv:2205.09329 (2022)
Yang, S., Xie, Z., Peng, H., Xu, M., Sun, M., Li, P.: Dataset pruning: Re- ducing training data by examining generalization influence. arXiv preprint arXiv:2205.09329 (2022)
-
[35]
arXiv preprint arXiv:2106.01085 (2021)
Yoon, J., Madaan, D., Yang, E., Hwang, S.J.: Online coreset selection for rehearsal- based continual learning. arXiv preprint arXiv:2106.01085 (2021)
-
[36]
arXiv preprint arXiv:2210.15809 (2023)
Zheng, H., Liu, R., Lai, F., Prakash, A.: Coverage-centric coreset selection for high pruning rates. arXiv preprint arXiv:2210.15809 (2023)
-
[37]
arXiv preprint arXiv:2406.04273 (2024)
Zheng, H., Tsai, E., Lu, Y., Sun, J., Bartoldson, B.R., Kailkhura, B., Prakash, A.: Elfs: Enhancing label-free coreset selection via clustering-based pseudo-labeling. arXiv preprint arXiv:2406.04273 (2024)
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.