
arxiv: 2605.23969 · v1 · pith:HU4V53SJnew · submitted 2026-05-13 · 💻 cs.CL

SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning

Pith reviewed 2026-06-30 22:00 UTC · model grok-4.3

classification 💻 cs.CL
keywords instruction tuning · data pruning · batch selection · Hessian approximation · stratified sampling · data efficiency · large language models · loss-based pruning

The pith

SLAP selects entire batches of instruction data via Hessian-approximated gradients and stratified sampling, letting models reach or exceed full-dataset performance on dialogue, translation, and QA tasks with 20-40 percent less data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SLAP to reduce the data and compute demands of instruction tuning by shifting from per-example pruning to evaluation of whole batch compositions for learnability. It applies distribution-aware stratified sampling to maintain coverage, relative distance optimization to increase variety inside each batch, and dynamic selection driven by Hessian-approximated gradient signals. If these steps succeed, the resulting subsets produce stronger or equal results than the complete training set across LLaMA and ChatGLM models while cutting data volume substantially. A reader would care because current instruction tuning still relies on large fixed datasets that are expensive to collect and train on, and a reliable batch-level filter could lower that cost barrier without loss of capability.
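The stratified-sampling step can be sketched from the Figure 1 caption, which says a batch is divided into K strata by loss and |S| samples are drawn with probability proportional to normalized exp(loss), after which the count per stratum is recorded. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `stratum_quotas`, the equal-width loss bins, and drawing without replacement are all illustrative choices that the source does not specify.

```python
import math
import random


def stratum_quotas(losses, num_strata, num_select, seed=0):
    """Split a batch into loss strata and decide how many samples to keep per stratum.

    Returns (strata, quotas): strata[i] lists the sample indices in stratum i,
    quotas[i] is how many of the num_select softmax draws fell into it.
    """
    rng = random.Random(seed)
    lo, hi = min(losses), max(losses)
    # Assumption: equal-width loss bins (the source does not state the binning rule).
    width = (hi - lo) / num_strata or 1.0
    stratum_of = [min(int((loss - lo) / width), num_strata - 1) for loss in losses]
    strata = [[] for _ in range(num_strata)]
    for idx, s in enumerate(stratum_of):
        strata[s].append(idx)

    # Unnormalized softmax weights exp(loss), shifted by the max for stability.
    weights = {idx: math.exp(loss - hi) for idx, loss in enumerate(losses)}

    # Assumption: draws are without replacement, each proportional to remaining weights.
    quotas = [0] * num_strata
    for _ in range(min(num_select, len(losses))):
        pool = list(weights)
        picked = rng.choices(pool, weights=[weights[i] for i in pool], k=1)[0]
        quotas[stratum_of[picked]] += 1
        del weights[picked]
    return strata, quotas
```

Because the draws favor high-loss samples but every stratum keeps a nonzero chance of being hit, the quotas lean toward harder examples without necessarily emptying the low-loss strata, which is one plausible reading of "distribution-aware" coverage.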

Core claim

SLAP is a batch-aware data selection framework that evaluates the learnability of entire batch compositions rather than individual samples, ensures comprehensive data distribution coverage through distribution-aware stratified sampling while maximizing intra-batch diversity through relative distance optimization, and leverages Hessian-approximated gradient information for dynamic batch selection, achieving superior performance with 20-40% less training data compared to full dataset training while maintaining or improving model capabilities across multiple architectures and tasks.

What carries the argument

The dynamic batch selection mechanism that scores learnability of complete batch compositions using Hessian-approximated gradient information.
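The text available here does not say which Hessian approximation SLAP uses. As a purely illustrative stand-in, the sketch below preconditions a per-sample gradient with an Adam-style running second-moment estimate, which is sometimes treated as a cheap diagonal curvature proxy. The function names, the choice of a diagonal approximation, and the hyperparameters `beta2` and `eps` are assumptions made for this example only.

```python
import math


def update_second_moment(v, grad, beta2=0.999):
    """Exponential moving average of squared gradients (as in Adam's second moment)."""
    return [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]


def preconditioned_feature(grad, v, eps=1e-8):
    """Scale each gradient coordinate by the inverse square root of its curvature proxy.

    Assumption: this diagonal preconditioning stands in for the paper's
    "Hessian-approximated gradient"; the actual method may differ.
    """
    return [g / (math.sqrt(vi) + eps) for g, vi in zip(grad, v)]
```

The resulting vectors could then serve as per-sample features for distance computations, with coordinates of consistently large gradient magnitude scaled down so they do not dominate the L2 distance.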

If this is right

  • SLAP-selected subsets outperform full datasets on multi-turn dialogue, multilingual translation, and question answering.
  • The gains hold across LLaMA and ChatGLM architectures.
  • Training data volume drops by 20-40% with no capability loss.
  • Overall computational cost of instruction tuning falls substantially.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The batch-level scoring could be applied to other fine-tuning settings such as preference alignment.
  • Further savings might appear if SLAP is combined with model compression techniques.
  • Limits of the method would become clearer through tests on models larger than those reported.
  • Practice could move toward repeated on-policy selection during training instead of one-time static pruning.

Load-bearing premise

That batch-level learnability scores from Hessian approximations reliably pick data compositions that generalize across model architectures and tasks.

What would settle it

An experiment on a held-out model or task where the SLAP-selected 60-80% subset produces statistically lower performance than the full dataset.
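One way such a comparison could be checked is a one-sided paired bootstrap over per-example scores. This is a generic sketch, not a procedure taken from the paper: the function name `paired_bootstrap_p`, the use of per-example scores paired across the two training runs, and the resample count are all assumptions.

```python
import random


def paired_bootstrap_p(full_scores, subset_scores, num_resamples=2000, seed=0):
    """Fraction of bootstrap resamples in which the subset model is NOT worse.

    full_scores[i] and subset_scores[i] are scores on the same evaluation example
    from the full-data model and the pruned-data model. A small return value
    suggests the pruned-data model is reliably worse.
    """
    rng = random.Random(seed)
    diffs = [s - f for f, s in zip(full_scores, subset_scores)]
    n = len(diffs)
    not_worse = 0
    for _ in range(num_resamples):
        # Resample paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) >= 0:
            not_worse += 1
    return not_worse / num_resamples
```

Pairing by example removes variance due to example difficulty, so the test isolates the effect of the training-data choice.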

Figures

Figures reproduced from arXiv: 2605.23969 by Hao Chen, Jianhang Ding, Renshu Gu, Run Zou, Wen Wu, Yifan Ding.

Figure 1. The workflow of SLAP. Step 1: We divide a batch of data into K strata based on loss. Then, we select |S| data according to the probability of normalized exp(loss) and calculate the number of data in each stratum. Step 2: We calculate the Hessian-approximated gradient Ht of the data as features. Step 3: For stratum 1, we randomly initialize a point. We calculate the L2 distance to the first point and select…
Figure 2. Maximizing L2 Distance Within the Batch. Step 1: For stratum 1, randomly initialize a point and calculate the L2 distance from the points in the same stratum to the first point. Step 2-3: Select the point that is farthest from the first point as the second point, then update the minimum distance from the remaining points to the selected points and iteratively choose |Si | (e.g. 3) samples. Steps 4 and 7: F…
Figure 3. The data distribution under hard sampling, CCS, and SLAP.
Figure 4. Evaluation with different k on NetLit using ChatGLM3 model
Figure 6. Evaluation on different datasets using ChatGLM3 model with pruning
Figure 7. Evaluation with different pruning rates on NetLit using ChatGLM3 model
Figure 8. Evaluation on LLaMaQA using ChatGLM3 and LLaMa3 with pruning
Figure 9. GPT-4 and human evaluation scores for LLM generated responses on…
Figure 10. Comparison of FLOPs for Pruning and Full data.
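The within-stratum selection described in the Figure 2 caption (random first point, then repeatedly taking the point whose minimum L2 distance to the already-selected set is largest) matches the familiar greedy farthest-point heuristic. The sketch below covers a single stratum only; how later strata are seeded is truncated in the caption and is not reconstructed here. The function name and argument layout are hypothetical.

```python
import math
import random


def farthest_point_select(features, candidates, quota, seed=0):
    """Greedily pick `quota` indices from `candidates` that are spread out in L2 distance.

    features: sequence of equal-length numeric vectors, indexed by sample id.
    candidates: sample ids belonging to one stratum.
    """
    if quota <= 0 or not candidates:
        return []
    rng = random.Random(seed)
    first = rng.choice(candidates)  # random initial point, as in Step 1 of the caption
    selected = [first]
    # Minimum distance from each remaining candidate to the selected set.
    min_dist = {
        i: math.dist(features[i], features[first]) for i in candidates if i != first
    }
    while len(selected) < quota and min_dist:
        nxt = max(min_dist, key=min_dist.get)  # farthest from everything chosen so far
        selected.append(nxt)
        del min_dist[nxt]
        for i in min_dist:
            min_dist[i] = min(min_dist[i], math.dist(features[i], features[nxt]))
    return selected
```

Keeping only the running minimum distance per candidate means each round costs one distance computation per remaining point rather than a full pairwise comparison.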
Original abstract

Instruction tuning has optimized the specialized capabilities of large language models (LLMs), but it often requires extensive datasets and prolonged training times. The challenge lies in developing specific capabilities by identifying useful data and efficiently fine-tuning. High-quality and diverse pruned data can help models achieve lossless performance at a lower cost. In this paper, we propose SLAP, a novel batch-aware data selection framework that evaluates the learnability of entire batch compositions rather than individual samples. SLAP ensures comprehensive data distribution coverage through distribution-aware stratified sampling while maximizing intra-batch diversity through relative distance optimization. By leveraging Hessian-approximated gradient information for dynamic batch selection, SLAP significantly outperforms existing state-of-the-art methods across multiple model architectures (LLaMA, ChatGLM) and diverse downstream tasks including multi-turn dialogue, multilingual translation, and question answering. Most notably, SLAP achieves superior performance with 20-40% less training data compared to full dataset training, substantially reducing computational costs while maintaining or improving model capabilities. These results establish SLAP as a powerful approach for efficient and effective instruction tuning of large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SLAP, a batch-aware data selection framework for efficient instruction tuning of LLMs. It combines distribution-aware stratified sampling for coverage, relative distance optimization for intra-batch diversity, and Hessian-approximated gradient information for dynamic batch selection. The central claim is that SLAP outperforms existing SOTA methods on LLaMA and ChatGLM across tasks (multi-turn dialogue, multilingual translation, QA), achieving superior performance with 20-40% less training data than full-dataset training while maintaining or improving capabilities.

Significance. If the empirical claims were substantiated, the work could meaningfully advance data-efficient fine-tuning by reducing compute costs for LLM instruction tuning. However, the provided manuscript consists solely of an abstract with no experimental details, quantitative results, baselines, error bars, ablation studies, or methodology sections, so the significance cannot be assessed. The method's reliance on Hessian approximations and batch-level learnability for generalization across architectures and tasks remains unevaluated.

major comments (2)
  1. [Abstract] The central performance claim (superior results with 20-40% less data across models and tasks) is stated without any supporting experimental evidence, tables, baselines, or implementation details, rendering the claim impossible to evaluate or reproduce from the manuscript.
  2. [Abstract] The description of the dynamic batch selection mechanism (Hessian-approximated gradients combined with batch-level learnability) provides no procedure, approximation details, or pseudocode, so it is impossible to assess whether this component reliably identifies generalizable data compositions as claimed.
minor comments (2)
  1. [Abstract] The title refers to 'Stratified Loss-based Pruning' and 'On-Policy' but the abstract describes a 'batch-aware data selection framework' without explaining the loss-based pruning aspect or the on-policy component.
  2. [Abstract] The abstract asserts outperformance over 'existing state-of-the-art methods' but does not name those methods or indicate how they were implemented for comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We acknowledge that the submitted manuscript was limited to the abstract and contained no experimental sections, results, or methodological details. The revised version will address this by expanding to a full paper.

Point-by-point responses
  1. Referee: [Abstract] The central performance claim (superior results with 20-40% less data across models and tasks) is stated without any supporting experimental evidence, tables, baselines, or implementation details, rendering the claim impossible to evaluate or reproduce from the manuscript.

    Authors: We agree that the abstract alone provides no evidence for the claims. The revised manuscript will include full experimental results with tables, baselines, error bars, ablation studies, and implementation details across the reported models and tasks. revision: yes

  2. Referee: [Abstract] The description of the dynamic batch selection mechanism (Hessian-approximated gradients combined with batch-level learnability) provides no procedure, approximation details, or pseudocode, so it is impossible to assess whether this component reliably identifies generalizable data compositions as claimed.

    Authors: We agree that the abstract lacks the necessary procedural details. The revised manuscript will add a dedicated methodology section with the exact procedure, Hessian approximation method, batch-level learnability formulation, and pseudocode. revision: yes

Circularity Check

0 steps flagged

No circularity: method relies on external Hessian approximation and stratified sampling without self-referential reduction

full rationale

The abstract and description present SLAP as a batch-aware selection framework that applies Hessian-approximated gradients, distribution-aware stratified sampling, and relative distance optimization. No equations, derivation steps, or self-citations are supplied that would make any claimed prediction equivalent to its inputs by construction. The approach invokes standard external techniques (Hessian approximation) rather than defining quantities in terms of the target performance gains. The central claim of 20-40% data reduction therefore rests on empirical validation outside the method's own definitions, yielding a self-contained derivation with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no concrete free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5736 in / 1143 out tokens · 27719 ms · 2026-06-30T22:00:02.070522+00:00 · methodology

