arxiv: 2605.15675 · v1 · pith:KSBHS3VAnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI

Interaction-Aware Influence Functions for Group Attribution

Pith reviewed 2026-05-20 20:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords influence functions · group attribution · pairwise interactions · second-order approximation · data selection · instruction tuning · leave-one-out · machine learning

The pith

Adding a pairwise interaction term to influence functions improves estimates of how groups of training examples jointly affect model behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard practice estimates the effect of removing a group of training examples by summing the members' individual influence-function scores, but this sum misses cases where examples within the group are redundant or complementary. The paper derives an interaction-aware estimator by taking a second-order Taylor expansion of the target quantity around the trained parameters, which introduces a term measuring the alignment between pairs of examples' effects. On six dataset-model pairs, the new estimator tracks leave-group-out retraining more closely than the usual first-order sum. When the same estimator is used to greedily select instruction-tuning data for Llama-3.1-8B, it produces models that outperform both prior influence methods and representation-similarity baselines on five of seven downstream tasks.

Core claim

By expanding the target function to second order around the trained parameters, we obtain an estimator that augments the standard sum with a pairwise interaction term that captures the alignment between two examples' effects on the target.

What carries the argument

The second-order Taylor expansion of the target function around the trained model parameters, which supplies the pairwise interaction term that augments the usual first-order sum.
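
One way to see where a pairwise term comes from, as a generic sketch rather than the paper's own equation: write $f$ for the target, $\hat\theta$ for the trained parameters, and $\Delta_i$ for the parameter shift attributed to removing example $i$ (all notation assumed here for illustration), and assume the shift for a group $S$ is approximated by $\Delta_S = \sum_{i \in S} \Delta_i$.

```latex
\begin{aligned}
f(\hat\theta + \Delta_S) - f(\hat\theta)
  &\approx \nabla f(\hat\theta)^\top \Delta_S
   + \tfrac{1}{2}\, \Delta_S^\top \nabla^2 f(\hat\theta)\, \Delta_S \\
  &= \underbrace{\sum_{i \in S} \nabla f(\hat\theta)^\top \Delta_i}_{\text{first-order sum}}
   + \tfrac{1}{2} \sum_{i \in S} \Delta_i^\top \nabla^2 f(\hat\theta)\, \Delta_i
   + \underbrace{\sum_{\substack{i, j \in S \\ i < j}} \Delta_i^\top \nabla^2 f(\hat\theta)\, \Delta_j}_{\text{pairwise interactions}}
\end{aligned}
```

The cross term $\Delta_i^\top \nabla^2 f(\hat\theta)\, \Delta_j$ changes sign with the relative direction of the two shifts under the target's curvature, which is one reading of "alignment between two examples' effects". The paper's exact definition and scaling of its terms may differ from this sketch.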

If this is right

  • The estimator tracks leave-group-out retraining more closely than first-order influence on six dataset-model pairs spanning logistic regression, MLPs, and ResNet-9.
  • Greedy selection guided by the interaction-aware scores beats prior influence-based and representation-similarity baselines on five of seven downstream tasks for instruction tuning of Llama-3.1-8B.
  • The pairwise term distinguishes redundant examples from complementary ones, allowing group attributions that simple summation cannot provide.
  • The same estimator remains useful even in regimes where standard influence-based selection performs worse than random selection.
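
The comparison in the first bullet can be illustrated on a toy problem. The sketch below swaps in ridge regression (not one of the paper's models) because leave-group-out retraining is then available in closed form. The data, sizes, and names (`fit`, `target`, `estimates`, `true_change`) are hypothetical, and the "with pairwise term" estimate is a generic second-order expansion that may differ from the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam = 200, 5, 100, 1.0               # hypothetical toy sizes and ridge strength
X, X_te = rng.normal(size=(n, d)), rng.normal(size=(m, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)
y_te = X_te @ w_true + 0.1 * rng.normal(size=m)

def fit(X, y):
    """Closed-form minimizer of 0.5 * sum of squared errors + 0.5 * lam * ||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def target(theta):
    """Target function: half the mean squared error on held-out data."""
    r = X_te @ theta - y_te
    return 0.5 * np.mean(r ** 2)

theta = fit(X, y)
H = X.T @ X + lam * np.eye(d)                 # Hessian of the training objective
grad_f = X_te.T @ (X_te @ theta - y_te) / m   # gradient of the target at theta
hess_f = X_te.T @ X_te / m                    # Hessian of the target

# Per-example loss gradients g_i and first-order parameter shifts Delta_i = H^{-1} g_i.
G = X * (X @ theta - y)[:, None]
Delta = np.linalg.solve(H, G.T).T             # shape (n, d)

first_order = Delta @ grad_f                  # individual influence of removing each example
interaction = Delta @ hess_f @ Delta.T        # (n, n) matrix of pairwise curvature terms

def estimates(S):
    """Return (first-order sum, sum plus second-order self and pairwise terms) for group S."""
    S = np.asarray(S)
    fo = first_order[S].sum()
    so = fo + 0.5 * interaction[np.ix_(S, S)].sum()
    return fo, so

def true_change(S):
    """Ground truth: change in the target after retraining without group S."""
    keep = np.setdiff1d(np.arange(n), S)
    return target(fit(X[keep], y[keep])) - target(theta)

S = rng.choice(n, size=20, replace=False)
fo, so = estimates(S)
print(f"first-order sum: {fo:+.5f}  with pairwise term: {so:+.5f}  retrain: {true_change(S):+.5f}")
```

Repeating the last three lines over many random groups and group sizes gives paired estimates and ground-truth values that can be compared by rank correlation.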

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Higher-order terms beyond pairwise interactions could be derived similarly if larger groups are the focus of attribution.
  • The alignment captured by the interaction term may help explain why certain data subsets produce synergistic gains when used together for fine-tuning.
  • The approach could be tested on other attribution problems such as feature attribution or neuron pruning where joint effects are also ignored by first-order methods.

Load-bearing premise

The second-order Taylor expansion around the trained parameters remains accurate enough for the group sizes and model scales considered.
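
A standard bound makes the dependence on group size concrete, under an assumption this page does not attribute to the paper: that the Hessian of the target $f$ is $L$-Lipschitz along the segment from $\hat\theta$ to $\hat\theta + \Delta_S$. Here $\hat\theta$ denotes the trained parameters and $\Delta_S$ the parameter shift for group $S$ (notation assumed for illustration).

```latex
\left| f(\hat\theta + \Delta_S) - f(\hat\theta)
  - \nabla f(\hat\theta)^\top \Delta_S
  - \tfrac{1}{2}\, \Delta_S^\top \nabla^2 f(\hat\theta)\, \Delta_S \right|
\;\le\; \frac{L}{6}\, \lVert \Delta_S \rVert^{3}
```

The bound grows cubically in $\lVert \Delta_S \rVert$, which typically increases with group size, so the premise is most at risk for large groups. It covers only the expansion of the target; error in approximating $\Delta_S$ itself, for example from approximate inverse-Hessian products, is a separate source of error.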

What would settle it

Direct leave-group-out retraining experiments on a new model scale or larger group sizes that show the interaction-aware estimates diverging from the true change in the target would falsify the estimator's accuracy claim.
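
Figure 1 reports Spearman rank correlation between estimated and ground-truth group influences, and the check described above could use the same metric. A minimal sketch, assuming no tied values; the function name and the numbers are hypothetical placeholders, not results from the paper.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation, assuming no tied values in either input."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    rank_a = np.argsort(np.argsort(a))   # rank of each entry, 0 = smallest
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Hypothetical placeholder values, one entry per evaluated group (not results from the paper).
estimated = [0.12, -0.05, 0.30, 0.01, -0.20]
retrained = [0.10, -0.02, 0.25, 0.03, -0.30]
rho = spearman(estimated, retrained)
print(f"Spearman correlation: {rho:.3f}")
```

A drop in this correlation as group size or model scale increases would be the kind of divergence that counts against the estimator.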

Figures

Figures reproduced from arXiv: 2605.15675 by Dongwoo Kim, Jaeseung Heo, Jungseul Ok, Kyeongheung Yun, Sehyun Hwang, Youngbin Choi.

Figure 1: Spearman rank correlation between estimated and ground-truth group influences across six […]
Figure 2: Representative images from the class pairs with the lowest (left) and highest (right) average pairwise interaction. Results […]
Figure 3: Held-out test loss after retraining on the selected subset on MNIST (left) and FashionMNIST […]
Figure 4: Sensitivity analysis of the two-layer MLP over the damping hyperparameter on MNIST […]

Original abstract

Influence functions approximate how removing a training example changes a quantity of interest, called the target function, such as a held-out loss. To estimate the influence of a group of examples, the standard practice is to sum the individual influences of its members. However, this sum does not capture how examples jointly affect the target: a pair of examples may be redundant or complementary, but the sum cannot distinguish these cases. We propose an interaction-aware influence function that characterizes how interactions between examples influence the target. By expanding the target to second order around the trained parameters, we obtain an estimator that augments the standard sum with a pairwise interaction term that captures the alignment between two examples' effects on the target. We empirically evaluate our estimator in two settings. First, on six dataset-model pairs spanning logistic regression, MLPs, and ResNet-9, our estimator tracks leave-group-out retraining substantially better than first-order influence across all settings. Second, when used as a greedy selection rule for instruction-tuning data on Llama-3.1-8B, it beats prior influence-based and representation-similarity baselines on five of seven downstream tasks, in a regime where standard influence-based selection underperforms random selection.
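
The abstract does not spell out the greedy selection rule, so the following is only a sketch of how a pairwise term can enter a greedy loop. It assumes a per-example benefit score `scores[i]` and a symmetric interaction matrix `Q` (both hypothetical inputs, with larger values meaning more useful), and greedily maximizes `F(S) = sum(scores[S]) + 0.5 * sum(Q[S, S])`.

```python
import numpy as np

def greedy_select(scores, Q, k):
    """Greedily pick k items to maximize
    F(S) = sum_i scores[i] + 0.5 * sum_{i,j in S} Q[i, j], with Q symmetric."""
    scores = np.asarray(scores, dtype=float)
    Q = np.asarray(Q, dtype=float)
    gain = scores + 0.5 * np.diag(Q)          # marginal gain relative to the empty set
    available = np.ones(len(scores), dtype=bool)
    selected = []
    for _ in range(k):
        i = int(np.argmax(np.where(available, gain, -np.inf)))
        selected.append(i)
        available[i] = False
        gain = gain + Q[i]                    # picking i shifts item j's gain by Q[i, j]
    return selected

# Hypothetical example: items 0 and 1 score highest individually but are redundant.
scores = [1.0, 0.9, 0.5]
Q = np.array([[0.0, -0.8, 0.0],
              [-0.8, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
with_pairs = greedy_select(scores, Q, k=2)
without_pairs = greedy_select(scores, np.zeros((3, 3)), k=2)
print(with_pairs, without_pairs)
```

With the interaction matrix set to zero the loop reduces to top-k selection by individual score, which is the behavior a first-order sum implies.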

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes an interaction-aware influence function for group attribution by augmenting standard first-order influence functions with a pairwise interaction term obtained via second-order Taylor expansion of the target function around trained parameters. This term captures alignment between examples' effects. Empirically, the estimator tracks leave-group-out retraining better than first-order baselines on six dataset-model pairs (logistic regression, MLPs, ResNet-9). When used for greedy instruction-tuning data selection on Llama-3.1-8B, it outperforms prior influence-based and representation-similarity baselines on five of seven downstream tasks.

Significance. If the second-order approximation remains accurate at the scales considered, the method offers a computationally tractable way to account for example interactions in influence estimation, which could improve data selection and attribution tasks where groups exhibit redundancy or complementarity. The small-model validation against leave-group-out retraining provides direct evidence of improved fidelity; the large-model results suggest practical utility but depend on untested transfer of the approximation.

major comments (3)
  1. [§3.2, Eq. (7)] The second-order interaction term is derived from the Taylor expansion, but no remainder-term bound or analysis of approximation error is provided for the group sizes and model scales in the Llama-3.1-8B experiments. This is load-bearing: if higher-order terms or Hessian-vector approximation errors become comparable to the interaction term, the downstream gains cannot be confidently attributed to interaction awareness.
  2. [§5.3] Direct validation against leave-group-out retraining is performed only on logistic regression, MLPs, and ResNet-9. The Llama-3.1-8B greedy selection results lack any analogous direct check (as retraining is infeasible) and instead rely on downstream task performance, which could be driven by factors other than the claimed interaction term.
  3. [Table 1 and §5.1] While consistent improvement over first-order baselines is reported across six dataset-model pairs, no error bars, statistical significance tests, or sensitivity analysis to the Hessian approximation method are included, weakening the claim that the interaction term is the source of the improvement.
minor comments (2)
  1. The notation for the target function and influence quantities is introduced without a consolidated table of symbols, making it harder to follow the transition from first-order to interaction-aware estimators.
  2. Figure 2 caption does not specify the exact group sizes used in the leave-group-out experiments, which is relevant for assessing the regime where the second-order term is expected to matter.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating whether revisions have been made.

  1. Referee: [§3.2, Eq. (7)] The second-order interaction term is derived from the Taylor expansion, but no remainder-term bound or analysis of approximation error is provided for the group sizes and model scales in the Llama-3.1-8B experiments. This is load-bearing: if higher-order terms or Hessian-vector approximation errors become comparable to the interaction term, the downstream gains cannot be confidently attributed to interaction awareness.

    Authors: We agree that a formal remainder-term bound would strengthen the claims. However, obtaining a tight, non-vacuous bound on higher-order terms for high-dimensional models and non-trivial group sizes without strong additional assumptions is technically challenging and beyond the scope of the current work. In the revision we add an expanded discussion of approximation quality, drawing on the small-model leave-group-out results to empirically characterize when the second-order term remains dominant, and we explicitly flag the lack of a general bound as a limitation for the Llama-scale experiments. revision: partial

  2. Referee: [§5.3] Direct validation against leave-group-out retraining is performed only on logistic regression, MLPs, and ResNet-9. The Llama-3.1-8B greedy selection results lack any analogous direct check (as retraining is infeasible) and instead rely on downstream task performance, which could be driven by factors other than the claimed interaction term.

    Authors: We acknowledge the limitation. Direct leave-group-out validation is computationally infeasible at the Llama-3.1-8B scale. In the revised manuscript we have expanded §5.3 to state this caveat explicitly, to clarify that downstream gains constitute indirect evidence, and to note that the pattern of improvement is consistent with the small-model regime where direct validation against retraining was possible. revision: yes

  3. Referee: [Table 1 and §5.1] While consistent improvement over first-order baselines is reported across six dataset-model pairs, no error bars, statistical significance tests, or sensitivity analysis to the Hessian approximation method are included, weakening the claim that the interaction term is the source of the improvement.

    Authors: We thank the referee for this observation. The revised manuscript updates Table 1 with error bars computed over multiple random seeds, adds paired statistical significance tests between our estimator and the first-order baseline, and includes a new sensitivity analysis in §5.1 that compares results under exact versus approximate Hessian-vector products. revision: yes

Circularity Check

0 steps flagged

Derivation via second-order Taylor expansion is self-contained and does not reduce to inputs by construction

The paper derives the interaction-aware estimator by performing a direct second-order Taylor expansion of the target function around the trained parameters, augmenting the standard first-order sum with an explicit pairwise interaction term whose coefficients are the mixed second derivatives of the loss. This is a standard calculus construction using the same loss, gradient, and Hessian primitives as classical influence functions, but extended rather than redefined. No equation reduces a prediction to a fitted quantity, no uniqueness theorem is imported via self-citation, and no ansatz is smuggled in. The empirical comparisons to leave-group-out retraining on small models serve as an external benchmark, keeping the central claim independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central estimator rests on a second-order Taylor expansion whose validity depends on local smoothness of the loss and on the ability to compute or approximate the relevant Hessian-vector products; no new entities are postulated and no parameters appear to be fitted specifically to produce the interaction term.

axioms (1)
  • Domain assumption: the loss is twice differentiable in a neighborhood of the trained parameters.
    Required for the second-order Taylor expansion to be defined.
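
Hessian-vector products are the computational primitive this ledger refers to, and Figure 4 indicates that a damping hyperparameter is involved. The sketch below shows one common pattern under stated assumptions: a central-difference Hessian-vector product built from a gradient function, and a conjugate-gradient solve of `(H + damping * I) x = v`. The function names, the finite-difference approach, and the toy quadratic are illustrative choices, not the paper's implementation.

```python
import numpy as np

def hvp(grad_fn, theta, v, eps=1e-4):
    """Central-difference Hessian-vector product along the unit direction of v."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    u = v / norm
    return norm * (grad_fn(theta + eps * u) - grad_fn(theta - eps * u)) / (2.0 * eps)

def damped_ihvp(grad_fn, theta, v, damping=0.1, iters=50, tol=1e-20):
    """Solve (H + damping * I) x = v by conjugate gradient, using only HVPs.
    Assumes H + damping * I is positive definite."""
    x = np.zeros_like(v)
    r = v.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Ap = hvp(grad_fn, theta, p) + damping * p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy quadratic loss 0.5 * theta^T A theta - b^T theta, whose Hessian is exactly A.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M.T @ M + np.eye(4)
b = rng.normal(size=4)
grad_fn = lambda th: A @ th - b
theta0 = rng.normal(size=4)
v = rng.normal(size=4)
x = damped_ihvp(grad_fn, theta0, v, damping=0.1)
```

On the toy quadratic the result can be compared against a direct solve with `np.linalg.solve(A + 0.1 * np.eye(4), v)`. For neural networks an automatic-differentiation HVP would usually replace the finite difference.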

pith-pipeline@v0.9.0 · 5756 in / 1321 out tokens · 43639 ms · 2026-05-20T20:38:20.565437+00:00 · methodology

