pith. sign in

arxiv: 2606.18307 · v1 · pith:JEXKL4CBnew · submitted 2026-06-16 · 💻 cs.LG · cs.AI

DRIFT: Refining Instruction Data via On-Policy Data Attribution

Pith reviewed 2026-06-27 01:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords influence functionsdata attributionsupervised fine-tuninginstruction tuningLLM data curationon-policy rolloutsperformance ceiling
0
0 comments X

The pith

DRIFT refines SFT data by using on-policy influence functions with model rollouts as validation targets to raise performance ceilings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that influence functions can be made effective for instance-level data refinement in supervised fine-tuning by replacing external validation targets with the model's own on-policy rollouts. This change is claimed to close the parameter proximity gap and better satisfy the local neighborhood assumption of influence estimation. The method then adds signed weighting by trajectory correctness and debiasing against gradient norm to let a small set of queries attribute the full dataset. If these steps work, the resulting data distribution should improve the final model's capabilities on both instruction and reasoning tasks beyond what existing curation baselines achieve.

Core claim

DRIFT (Data Refinement via On-Policy Influence Functions for Supervised Fine-Tuning) replaces external reference data with the model's on-policy rollouts as validation targets for influence estimation. It further applies signed weighting based on trajectory correctness and debiases influence scores against gradient norm bias, enabling a small set of validation queries to attribute influence across the full training dataset and thereby raise the performance upper bound.

What carries the argument

On-policy influence functions that use the model's own rollouts as validation targets, combined with signed weighting by correctness and debiasing of gradient-norm bias.

If this is right

  • A small set of validation queries can serve as reliable anchors for attributing influence over the entire training dataset.
  • Signed weighting by trajectory correctness improves the quality of influence attribution for refinement.
  • Debiased influence scores reduce the dominance of high-gradient-norm instances in the selected data.
  • The refined data distribution raises the performance ceiling on both instruction-following and reasoning tasks for 7B models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same on-policy attribution principle could be tested on other fine-tuning regimes that also generate their own trajectories.
  • If the proximity-gap reduction holds, the method may allow smaller validation sets without loss of attribution reliability.
  • The approach suggests that training-distribution alignment in validation targets matters more for influence estimation than previously emphasized in off-policy settings.

Load-bearing premise

The model's on-policy rollouts serve as suitable validation targets that minimize the parameter proximity gap and satisfy the local neighborhood assumption of influence functions.

What would settle it

An experiment that selects data with standard off-policy influence functions, retrains the same 7B models, and obtains equal or higher performance on the instruction and reasoning benchmarks would falsify the claimed advantage of on-policy targets.

Figures

Figures reproduced from arXiv: 2606.18307 by Lincheng Li, Tianyu Yu, Yuan Yao, Zefan Wang.

Figure 1
Figure 1. Figure 1: Evolution of model proximity during continual training. The plots illustrate the rel￾ative changes (∆) in parameter L2 norm and sparsity compared to the uniform weighting baseline. On-policy validation causes signif￾icantly fewer parameter shifts, which we hy￾pothesize better aligns with the local neighbor￾hood assumption of IF. where τ controls the neighborhood size. This frame￾work identifies several cri… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of DRIFT (Data Refinement via On-Policy Influence Functions for Supervised Fine-Tuning). We first sample on-policy responses for a small set of validation queries. We then assign each rollout positive or negative weights according to correctness, aggregate their projected gradients into task-specific validation anchors, and use these anchors to score the original SFT corpus. After debiasing the re… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative example of DRIFT data attribution. While the validation query is a pure [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Top-1 retrieval performance of different filtering methods [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗
read the original abstract

Optimizing the training data distribution for Supervised Fine-Tuning (SFT) dictates the capability of Large Language Models (LLMs). While existing data curation methods excel at accelerating training under constrained budgets, they are less suited to elevating the capability upper bound. The challenge here is no longer to identify a smaller subset that preserves performance, but to refine the data distribution toward instances most capable of improving the final model. To address this problem, we explore instance-level data attribution using Influence Functions (IF). We identify that standard IF formulations struggle in this setting due to two structural limitations: a proximity gap caused by off-policy validation targets, and a severe bias towards gradient norm. We propose DRIFT (Data Refinement via On-Policy Influence Functions for Supervised Fine-Tuning). Instead of relying on external reference data, DRIFT utilizes the model's on-policy rollouts as validation targets, which empirically minimizes the parameter proximity gap and better aligns with the local neighborhood assumption of IF. It further applies signed weighting based on trajectory correctness and debiases influence scores against the gradient hacking issue, allowing a small set of validation queries to act as reliable anchors for attributing the full dataset. Experiments on 7B-parameter instruction and reasoning models show that DRIFT consistently raises the performance ceiling on both, outperforming existing data curation baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DRIFT, a method to refine SFT training data for LLMs via instance-level influence functions. It replaces off-policy validation targets with the model's on-policy rollouts to close the parameter proximity gap, introduces signed weighting by trajectory correctness, and applies gradient-norm debiasing. Experiments on 7B instruction and reasoning models report that the resulting data selection raises the performance ceiling and outperforms prior curation baselines.

Significance. If the influence scores remain reliable after the on-policy and debiasing modifications, DRIFT would offer a concrete mechanism for elevating capability ceilings rather than merely accelerating convergence under fixed budgets. The on-policy construction is a clear technical contribution worth testing, provided the local-neighborhood assumption of IF can be shown to hold in the non-convex SFT setting.

major comments (2)
  1. [Abstract and §3] Abstract and §3: The central justification—that on-policy rollouts 'empirically minimizes the parameter proximity gap and better aligns with the local neighborhood assumption of IF'—is load-bearing for the validity of the influence scores. No diagnostic is reported (e.g., Spearman correlation between DRIFT scores and leave-one-out retraining deltas, or trace of the Hessian at the fine-tuned point) to confirm that the quadratic approximation remains valid after signed weighting and debiasing.
  2. [§4] §4 (experimental results): The claim that DRIFT 'consistently raises the performance ceiling' on both instruction and reasoning 7B models requires explicit reporting of the number of independent runs, standard deviations, and statistical tests; without these, it is impossible to determine whether the reported gains exceed the variance of the baseline methods.
minor comments (2)
  1. [§3] Notation for the signed-weighting term and the debiasing factor should be introduced with explicit equations rather than prose descriptions only.
  2. [§4] The manuscript should include a short ablation isolating the contribution of on-policy targets versus the signed-weighting and debiasing steps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3: The central justification—that on-policy rollouts 'empirically minimizes the parameter proximity gap and better aligns with the local neighborhood assumption of IF'—is load-bearing for the validity of the influence scores. No diagnostic is reported (e.g., Spearman correlation between DRIFT scores and leave-one-out retraining deltas, or trace of the Hessian at the fine-tuned point) to confirm that the quadratic approximation remains valid after signed weighting and debiasing.

    Authors: We agree that direct diagnostics would strengthen the claim that the quadratic approximation remains valid. The on-policy construction is motivated by reducing the proximity gap, with downstream gains providing supporting evidence, but we will add a diagnostic subsection in the revision. Specifically, we will report Spearman correlation between DRIFT scores and performance deltas obtained from retraining after removing high-influence instances on a held-out validation subset, along with a note on the Hessian trace at the fine-tuned point where computationally feasible. revision: yes

  2. Referee: [§4] §4 (experimental results): The claim that DRIFT 'consistently raises the performance ceiling' on both instruction and reasoning 7B models requires explicit reporting of the number of independent runs, standard deviations, and statistical tests; without these, it is impossible to determine whether the reported gains exceed the variance of the baseline methods.

    Authors: We acknowledge that variance and statistical testing are necessary to support claims of consistent improvement. The reported results used single runs owing to the substantial compute required for 7B-scale SFT. In the revised manuscript we will rerun the primary comparisons across at least three independent random seeds, report means and standard deviations, and include paired statistical tests (e.g., t-tests) against the strongest baselines to demonstrate that the observed gains exceed run-to-run variance. revision: yes

Circularity Check

0 steps flagged

No circularity detected; claims rest on empirical fixes without self-referential derivation.

full rationale

The provided abstract and text describe DRIFT as an empirical refinement of influence functions using on-policy rollouts to address proximity gap and gradient bias, with performance gains shown via experiments. No equations or derivation chain is exhibited that reduces any prediction or attribution score to its own inputs by construction. The on-policy choice is presented as an empirical alignment step rather than a fitted parameter renamed as output. No self-citation load-bearing or ansatz smuggling is visible in the given material. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only abstract available; ledger populated from stated assumptions in the abstract.

axioms (2)
  • domain assumption Influence functions remain valid when validation targets are replaced by on-policy model rollouts.
    Central to the claim that the proximity gap is minimized.
  • domain assumption Signed weighting by trajectory correctness and gradient-norm debiasing remove the dominant bias in standard IF scores.
    Required for the small validation set to serve as reliable anchors.

pith-pipeline@v0.9.1-grok · 5768 in / 1244 out tokens · 23572 ms · 2026-06-27T01:54:55.236838+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 8 linked inside Pith

  1. [1]

    Efficient online data mixing for language model pre-training.arXiv preprint arXiv:2312.02406,

    Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. Efficient online data mixing for language model pre-training.arXiv preprint arXiv:2312.02406,

  2. [2]

    What is your data worth to gpt? llm-scale data valuation with influence functions.arXiv preprint arXiv:2405.13954,

    Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, et al. What is your data worth to gpt? llm-scale data valuation with influence functions.arXiv preprint arXiv:2405.13954,

  3. [3]

    If-guide: Influence function- guided detoxification of llms.arXiv preprint arXiv:2506.01790,

    Zachary Coalson, Juhan Bae, Nicholas Carlini, and Sanghyun Hong. If-guide: Influence function- guided detoxification of llms.arXiv preprint arXiv:2506.01790,

  4. [4]

    Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296,

    Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296,

  5. [5]

    Prateek Humane, Paolo Cudrano, Daniel Z Kaplan, Matteo Matteucci, Supriyo Chakraborty, and Irina Rish

    URL https: //github.com/huggingface/open-r1. Prateek Humane, Paolo Cudrano, Daniel Z Kaplan, Matteo Matteucci, Supriyo Chakraborty, and Irina Rish. Influence functions for efficient data selection in reasoning.arXiv preprint arXiv:2510.06108,

  6. [6]

    Large-scale data selection for instruction tuning.arXiv preprint arXiv:2503.01807,

    Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, and Pradeep Dasigi. Large-scale data selection for instruction tuning.arXiv preprint arXiv:2503.01807,

  7. [7]

    Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974,

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974,

  8. [8]

    Datainf: Efficiently estimating data influence in lora-tuned llms and diffusion models.arXiv preprint arXiv:2310.00902,

    Yongchan Kwon, Eric Wu, Kevin Wu, and James Zou. Datainf: Efficiently estimating data influence in lora-tuned llms and diffusion models.arXiv preprint arXiv:2310.00902,

  9. [9]

    Do influence functions work on large language models

    10 Zhe Li, Wei Zhao, Yige Li, and Jun Sun. Do influence functions work on large language models. arXiv preprint arXiv:2409.19998, 3, 2024b. Bill Yuchen Lin, Ronan Le Bras, and Yejin Choi. Zebralogic: Benchmarking the logical reason- ing ability of language models,

  10. [10]

    Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin

    URL https: //openreview.net/forum?id=1qvx610Cu7. Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. Regmix: Data mixture as regression for language model pre-training.arXiv preprint arXiv:2407.01492,

  11. [11]

    Bm25s: Orders of magnitude faster lexical search via eager sparse scoring.arXiv preprint arXiv:2407.03618,

    Xing Han Lù. Bm25s: Orders of magnitude faster lexical search via eager sparse scoring.arXiv preprint arXiv:2407.03618,

  12. [12]

    Grass: Compute efficient low-memory llm training with structured sparse gradients

    Aashiq Muhamed, Oscar Li, David Woodruff, Mona Diab, and Virginia Smith. Grass: Compute efficient low-memory llm training with structured sparse gradients. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14978–15003,

  13. [13]

    Reinforcement learning finetunes small subnetworks in large language models.arXiv preprint arXiv:2505.11711,

    Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur, and Hao Peng. Reinforcement learning finetunes small subnetworks in large language models.arXiv preprint arXiv:2505.11711,

  14. [14]

    Olmo 3.arXiv preprint arXiv:2512.13961,

    Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961,

  15. [15]

    Trak: Attributing model behavior at scale.arXiv preprint arXiv:2303.14186,

    Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. Trak: Attributing model behavior at scale.arXiv preprint arXiv:2303.14186,

  16. [16]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022,

  17. [17]

    Megatron-lm: Training multi-billion parameter language models using model parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catan- zaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053,

  18. [18]

    The curse of recursion: Training on generated data makes models forget.arXiv preprint arXiv:2305.17493,

    Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Ander- son. The curse of recursion: Training on generated data makes models forget.arXiv preprint arXiv:2305.17493,

  19. [19]

    Beyond human data: Scaling self-training for problem-solving with language models.arXiv preprint arXiv:2312.06585,

    Avi Singh, John Hui, Zhihui Yin, Jianlin Tu, et al. Beyond human data: Scaling self-training for problem-solving with language models.arXiv preprint arXiv:2312.06585,

  20. [20]

    Challenging big-bench tasks and whether chain-of-thought can solve them.arXiv preprint arXiv:2210.09261,

    11 Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, , and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them.arXiv preprint arXiv:2210.09261,

  21. [21]

    Greats: Online selection of high-quality data for llm training in every iteration.Advances in Neural Information Processing Systems, 37:131197–131223, 2024a

    Jiachen T Wang, Tong Wu, Dawn Song, Prateek Mittal, and Ruoxi Jia. Greats: Online selection of high-quality data for llm training in every iteration.Advances in Neural Information Processing Systems, 37:131197–131223, 2024a. Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et...

  22. [22]

    Qurating: Selecting high-quality data for training language models.arXiv preprint arXiv:2402.09739,

    Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. Qurating: Selecting high-quality data for training language models.arXiv preprint arXiv:2402.09739,

  23. [23]

    Less: Selecting influential data for targeted instruction tuning.arXiv preprint arXiv:2402.04333,

    Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. Less: Selecting influential data for targeted instruction tuning.arXiv preprint arXiv:2402.04333,

  24. [24]

    Doremi: Optimizing data mixtures speeds up language model pretraining.Advances in Neural Information Processing Systems, 36:69798–69818, 2023a

    Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining.Advances in Neural Information Processing Systems, 36:69798–69818, 2023a. Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S Liang. Data selection fo...

  25. [25]

    D3: Diversity, difficulty, and dependability-aware data selection for sample-efficient llm instruction tuning.arXiv preprint arXiv:2503.11441,

    Jia Zhang, Chen-Xi Zhang, Yao Liu, Yi-Xuan Jin, Xiao-Wen Yang, Bo Zheng, Yi Liu, and Lan-Zhe Guo. D3: Diversity, difficulty, and dependability-aware data selection for sample-efficient llm instruction tuning.arXiv preprint arXiv:2503.11441,

  26. [26]

    Data-efficient rlvr via off-policy influence guidance.arXiv preprint arXiv:2510.26491,

    Erle Zhu, Dazhi Jiang, Yuan Wang, Xujun Li, Jiale Cheng, Yuxian Gu, Yilin Niu, Aohan Zeng, Jie Tang, Minlie Huang, et al. Data-efficient rlvr via off-policy influence guidance.arXiv preprint arXiv:2510.26491,

  27. [27]

    package. For each validation set sample, we assign a linear score between 0 and 1 to the training samples based on their BM25 similarity ranking, and then obtain the aggregated score by summing them up. DSIR.We instantiate the Data Selection with Importance Resampling (DSIR) (Xie et al., 2023b) framework with hashed n-gram features for tractable importanc...

  28. [28]

    as our method. The critical distinction is that it computes scores against static, external validation sets (the original reference responses) and uses standard cosine similarity without our proposed log-space debiasing. B Motivating Analysis of Pure On-Policy Targets B.1 Theoretical Justification of the Proximity Gap Mitigation In this section, we provid...