Interaction-Aware Influence Functions for Group Attribution

Dongwoo Kim; Jaeseung Heo; Jungseul Ok; Kyeongheung Yun; Sehyun Hwang; Youngbin Choi

arxiv: 2605.15675 · v1 · pith:KSBHS3VAnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI

Jaeseung Heo, Kyeongheung Yun, Youngbin Choi, Sehyun Hwang, Jungseul Ok, Dongwoo Kim

Pith reviewed 2026-05-20 20:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords influence functions · group attribution · pairwise interactions · second-order approximation · data selection · instruction tuning · leave-one-out · machine learning

The pith

Adding a pairwise interaction term to influence functions improves estimates of how groups of training examples jointly affect model behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The standard way to estimate the effect of removing a group of training examples is to sum the influence-function estimates of its members, but this misses cases where examples within the group are redundant or complementary. The paper derives an interaction-aware estimator by taking a second-order Taylor expansion of the target quantity around the trained parameters, which introduces a term measuring the alignment between pairs of examples' effects. On six dataset-model pairs, the new estimator tracks leave-group-out retraining more closely than the first-order sum. When the same estimator is used to greedily select instruction-tuning data for Llama-3.1-8B, the resulting models outperform both prior influence-based methods and representation-similarity baselines on five of seven downstream tasks.

Core claim

By expanding the target function to second order around the trained parameters, we obtain an estimator that augments the standard sum with a pairwise interaction term that captures the alignment between two examples' effects on the target.

What carries the argument

The second-order Taylor expansion of the target function around the trained model parameters, which supplies the pairwise interaction term that augments the usual first-order sum.
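
As a rough sketch of where the pairwise term comes from (the notation is assumed here for illustration and need not match the paper's Eq. (7)): write $f$ for the target, $\hat\theta$ for the trained parameters, and $\delta_i$ for the approximate parameter shift caused by removing example $i$, so that removing a group $S$ shifts the parameters by roughly $\Delta_S = \sum_{i \in S} \delta_i$. Expanding $f$ to second order then gives:

```latex
\begin{aligned}
f(\hat\theta + \Delta_S) - f(\hat\theta)
&\approx \nabla f(\hat\theta)^\top \Delta_S
  + \tfrac{1}{2}\, \Delta_S^\top \nabla^2 f(\hat\theta)\, \Delta_S \\
&= \underbrace{\sum_{i \in S} \nabla f(\hat\theta)^\top \delta_i}_{\text{first-order sum}}
  + \underbrace{\tfrac{1}{2} \sum_{i \in S} \delta_i^\top \nabla^2 f(\hat\theta)\, \delta_i}_{\text{self terms}}
  + \underbrace{\sum_{\substack{i, j \in S \\ i < j}} \delta_i^\top \nabla^2 f(\hat\theta)\, \delta_j}_{\text{pairwise interactions}}
\end{aligned}
```

In classical influence functions the shift is usually taken as $\delta_i \approx \tfrac{1}{n} H^{-1} \nabla \ell(z_i, \hat\theta)$, with $H$ the training-loss Hessian. Whether the paper uses exactly this shift, or also corrects the shift itself to second order, is not stated in this summary.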

If this is right

The estimator tracks leave-group-out retraining more closely than first-order influence on six dataset-model pairs spanning logistic regression, MLPs, and ResNet-9.
Greedy selection guided by the interaction-aware scores beats prior influence-based and representation-similarity baselines on five of seven downstream tasks for instruction tuning of Llama-3.1-8B.
The pairwise term distinguishes redundant examples from complementary ones, allowing group attributions that simple summation cannot provide.
The same estimator remains useful even in regimes where standard influence-based selection performs worse than random selection.
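
A minimal sketch of how pairwise terms can change a greedy selection, under stated assumptions: the scoring rule `sum_i s_i + sum_{i<j} M_ij`, the function name `greedy_select`, and the toy numbers are all hypothetical and are not taken from the paper, whose actual selection rule is not given in this summary. Self terms are assumed to be folded into the per-example scores.

```python
import numpy as np

def greedy_select(scores, interactions, k):
    """Greedily pick k indices to maximise sum_i s_i + sum_{i<j} M_ij.

    scores:       (n,) hypothetical first-order benefit of each example.
    interactions: (n, n) symmetric hypothetical pairwise terms; a negative
                  entry means the two examples overlap in what they add.
    """
    marginal = np.asarray(scores, dtype=float).copy()
    interactions = np.asarray(interactions, dtype=float)
    available = np.ones(marginal.shape[0], dtype=bool)
    selected = []
    for _ in range(k):
        # Mask out examples that were already chosen.
        candidates = np.where(available, marginal, -np.inf)
        best = int(np.argmax(candidates))
        selected.append(best)
        available[best] = False
        # Adding `best` shifts every other example's marginal gain by M[:, best].
        marginal += interactions[:, best]
    return selected

# Toy data (illustrative only): examples 0 and 1 score well but overlap.
scores = [1.0, 0.9, 0.5]
interactions = np.zeros((3, 3))
interactions[0, 1] = interactions[1, 0] = -0.8

top_k = sorted(np.argsort(scores)[::-1][:2].tolist())
greedy = greedy_select(scores, interactions, k=2)
print("first-order top-2:", top_k)
print("interaction-aware greedy:", greedy)
```

In this toy case the first-order ranking keeps both overlapping examples, while the greedy rule swaps one of them for the lower-scoring but non-overlapping example.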

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Higher-order terms beyond pairwise interactions could be derived similarly if larger groups are the focus of attribution.
The alignment captured by the interaction term may help explain why certain data subsets produce synergistic gains when used together for fine-tuning.
The approach could be tested on other attribution problems such as feature attribution or neuron pruning where joint effects are also ignored by first-order methods.

Load-bearing premise

The second-order Taylor expansion around the trained parameters remains accurate enough for the group sizes and model scales considered.

What would settle it

Direct leave-group-out retraining experiments on a new model scale or larger group sizes that show the interaction-aware estimates diverging from the true change in the target would falsify the estimator's accuracy claim.
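
The kind of check described above can be sketched on a model small enough to retrain exactly. The example below is an illustration under assumptions, not the paper's experiment: it uses ridge regression on synthetic data, the classical first-order parameter shift, and a retraining objective that keeps the original `1/n` normalization. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, lam = 200, 50, 5, 0.1          # arbitrary toy sizes
w_true = rng.normal(size=d)
X, Xt = rng.normal(size=(n, d)), rng.normal(size=(m, d))
y = X @ w_true + 0.5 * rng.normal(size=n)
yt = Xt @ w_true + 0.5 * rng.normal(size=m)

def fit(keep):
    """Minimise (1/n) * sum_{i in keep} 0.5*(x_i.w - y_i)^2 + 0.5*lam*|w|^2."""
    Xk, yk = X[keep], y[keep]
    return np.linalg.solve(Xk.T @ Xk / n + lam * np.eye(d), Xk.T @ yk / n)

def target(w):
    """Held-out mean squared loss, playing the role of the target function."""
    return 0.5 * np.mean((Xt @ w - yt) ** 2)

w_hat = fit(np.arange(n))
H = X.T @ X / n + lam * np.eye(d)        # training-loss Hessian
grad_f = Xt.T @ (Xt @ w_hat - yt) / m    # gradient of the target at w_hat
hess_f = Xt.T @ Xt / m                   # Hessian of the target

group = np.arange(20)                    # remove the first 20 examples
keep = np.setdiff1d(np.arange(n), group)

# Classical first-order shift for each removed example: (1/n) H^{-1} grad_i.
per_example_grads = X[group] * (X[group] @ w_hat - y[group])[:, None]
deltas = np.linalg.solve(H, per_example_grads.T).T / n
delta_sum = deltas.sum(axis=0)

first_order = grad_f @ delta_sum
second_order = first_order + 0.5 * delta_sum @ hess_f @ delta_sum

# Ground truth: retrain without the group and measure the target change.
w_retrained = fit(keep)
actual = target(w_retrained) - target(w_hat)
delta_exact = w_retrained - w_hat
exact_expansion = grad_f @ delta_exact + 0.5 * delta_exact @ hess_f @ delta_exact

print(f"actual change       {actual:+.6f}")
print(f"first-order sum     {first_order:+.6f}")
print(f"with pairwise terms {second_order:+.6f}")
```

Because this target is quadratic in the parameters, the second-order expansion is exact once the true parameter shift is plugged in (`exact_expansion` equals `actual`), so any remaining gap in `second_order` comes from approximating the shift. That guarantee disappears for non-quadratic targets and larger models, which is the regime the falsification test above is about.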

Figures

Figures reproduced from arXiv: 2605.15675 by Dongwoo Kim, Jaeseung Heo, Jungseul Ok, Kyeongheung Yun, Sehyun Hwang, Youngbin Choi.

**Figure 2.** Representative images from the class pairs with the lowest (left) and highest (right) average pairwise interaction. Results …

**Figure 3.** Held-out test loss after retraining on the selected subset on MNIST (left) and FashionMNIST …

**Figure 4.** Sensitivity analysis of the two-layer MLP over the damping hyperparameter on MNIST …

Original abstract

Influence functions approximate how removing a training example changes a quantity of interest, called the target function, such as a held-out loss. To estimate the influence of a group of examples, the standard practice is to sum the individual influences of its members. However, this sum does not capture how examples jointly affect the target: a pair of examples may be redundant or complementary, but the sum cannot distinguish these cases. We propose an interaction-aware influence function that characterizes how interactions between examples influence the target. By expanding the target to second order around the trained parameters, we obtain an estimator that augments the standard sum with a pairwise interaction term that captures the alignment between two examples' effects on the target. We empirically evaluate our estimator in two settings. First, on six dataset-model pairs spanning logistic regression, MLPs, and ResNet-9, our estimator tracks leave-group-out retraining substantially better than first-order influence across all settings. Second, when used as a greedy selection rule for instruction-tuning data on Llama-3.1-8B, it beats prior influence-based and representation-similarity baselines on five of seven downstream tasks, in a regime where standard influence-based selection underperforms random selection.
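
To make the point about the sum concrete, here is a small hand-built illustration. Every number in it is hypothetical: `grad_f` and `hess_f` stand in for the gradient and Hessian of the target at the trained parameters, and `delta_a`, `delta_b` stand in for the parameter shifts caused by removing two examples. It only assumes that the group estimate is a second-order expansion in the summed shift.

```python
import numpy as np

grad_f = np.array([1.0, 0.0, 0.0])      # hypothetical gradient of the target
hess_f = np.eye(3)                      # hypothetical (PSD) Hessian of the target

delta_a = np.array([-0.1, 0.2, 0.0])    # shift from removing example a
delta_b = np.array([-0.1, -0.2, 0.0])   # same first-order effect, other direction

def first_order(deltas):
    """Standard group estimate: sum of individual first-order influences."""
    return sum(grad_f @ d for d in deltas)

def interaction_aware(deltas):
    """First-order sum plus the quadratic term in the summed shift."""
    total = np.sum(deltas, axis=0)
    return first_order(deltas) + 0.5 * total @ hess_f @ total

duplicates = [delta_a, delta_a]         # example a together with an exact copy
mixed = [delta_a, delta_b]

for name, grp in [("duplicates", duplicates), ("mixed", mixed)]:
    print(name, first_order(grp), interaction_aware(grp))

cross_dup = delta_a @ hess_f @ delta_a  # pairwise term for the duplicate pair
cross_mix = delta_a @ hess_f @ delta_b  # pairwise term for the mixed pair
```

Both groups have the same first-order sum, yet the pairwise term is positive for the aligned pair and negative for the pair whose shifts partly oppose each other, so the interaction-aware estimates differ. Which sign the paper associates with redundancy versus complementarity is not stated in this summary.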

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes an interaction-aware influence function for group attribution by augmenting standard first-order influence functions with a pairwise interaction term obtained via second-order Taylor expansion of the target function around trained parameters. This term captures alignment between examples' effects. Empirically, the estimator tracks leave-group-out retraining better than first-order baselines on six dataset-model pairs (logistic regression, MLPs, ResNet-9). When used for greedy instruction-tuning data selection on Llama-3.1-8B, it outperforms prior influence-based and representation-similarity baselines on five of seven downstream tasks.

Significance. If the second-order approximation remains accurate at the scales considered, the method offers a computationally tractable way to account for example interactions in influence estimation, which could improve data selection and attribution tasks where groups exhibit redundancy or complementarity. The small-model validation against leave-group-out retraining provides direct evidence of improved fidelity; the large-model results suggest practical utility but depend on untested transfer of the approximation.

major comments (3)

[§3.2, Eq. (7)] The second-order interaction term is derived from the Taylor expansion, but no remainder-term bound or analysis of approximation error is provided for the group sizes and model scales in the Llama-3.1-8B experiments; this is load-bearing because, if higher-order terms or Hessian-vector approximation errors become comparable to the interaction term, the downstream gains cannot be confidently attributed to interaction awareness.
[§5.3] Direct validation against leave-group-out retraining is performed only on logistic regression, MLPs, and ResNet-9; the Llama-3.1-8B greedy selection results lack any analogous direct check (as retraining is infeasible) and instead rely on downstream task performance, which could be driven by factors other than the claimed interaction term.
[Table 1 and §5.1] While consistent improvement over first-order baselines is reported across six dataset-model pairs, no error bars, statistical significance tests, or sensitivity analysis to the Hessian approximation method are included, weakening the claim that the interaction term is the source of the improvement.

minor comments (2)

The notation for the target function and influence quantities is introduced without a consolidated table of symbols, making it harder to follow the transition from first-order to interaction-aware estimators.
Figure 2 caption does not specify the exact group sizes used in the leave-group-out experiments, which is relevant for assessing the regime where the second-order term is expected to matter.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating whether revisions have been made.

Referee: [§3.2, Eq. (7)] The second-order interaction term is derived from the Taylor expansion, but no remainder-term bound or analysis of approximation error is provided for the group sizes and model scales in the Llama-3.1-8B experiments; this is load-bearing because, if higher-order terms or Hessian-vector approximation errors become comparable to the interaction term, the downstream gains cannot be confidently attributed to interaction awareness.

Authors: We agree that a formal remainder-term bound would strengthen the claims. However, obtaining a tight, non-vacuous bound on higher-order terms for high-dimensional models and non-trivial group sizes without strong additional assumptions is technically challenging and beyond the scope of the current work. In the revision we add an expanded discussion of approximation quality, drawing on the small-model leave-group-out results to empirically characterize when the second-order term remains dominant, and we explicitly flag the lack of a general bound as a limitation for the Llama-scale experiments. revision: partial

Referee: [§5.3] Direct validation against leave-group-out retraining is performed only on logistic regression, MLPs, and ResNet-9; the Llama-3.1-8B greedy selection results lack any analogous direct check (as retraining is infeasible) and instead rely on downstream task performance, which could be driven by factors other than the claimed interaction term.

Authors: We acknowledge the limitation. Direct leave-group-out validation is computationally infeasible at the Llama-3.1-8B scale. In the revised manuscript we have expanded §5.3 to state this caveat explicitly, to clarify that downstream gains constitute indirect evidence, and to note that the pattern of improvement is consistent with the small-model regime where direct validation against retraining was possible. revision: yes

Referee: [Table 1 and §5.1] While consistent improvement over first-order baselines is reported across six dataset-model pairs, no error bars, statistical significance tests, or sensitivity analysis to the Hessian approximation method are included, weakening the claim that the interaction term is the source of the improvement.

Authors: We thank the referee for this observation. The revised manuscript updates Table 1 with error bars computed over multiple random seeds, adds paired statistical significance tests between our estimator and the first-order baseline, and includes a new sensitivity analysis in §5.1 that compares results under exact versus approximate Hessian-vector products. revision: yes

Circularity Check

0 steps flagged

Derivation via second-order Taylor expansion is self-contained and does not reduce to inputs by construction

The paper derives the interaction-aware estimator by performing a direct second-order Taylor expansion of the target function around the trained parameters, augmenting the standard first-order sum with an explicit pairwise interaction term whose coefficients are the mixed second derivatives of the loss. This is a standard calculus construction using the same loss, gradient, and Hessian primitives as classical influence functions, but extended rather than redefined. No equation reduces a prediction to a fitted quantity, no uniqueness theorem is imported via self-citation, and no ansatz is smuggled in. The empirical comparisons to leave-group-out retraining on small models serve as an external benchmark, keeping the central claim independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central estimator rests on a second-order Taylor expansion whose validity depends on local smoothness of the loss and on the ability to compute or approximate the relevant Hessian-vector products; no new entities are postulated and no parameters appear to be fitted specifically to produce the interaction term.

axioms (1)

domain assumption The loss is twice differentiable in a neighborhood of the trained parameters.
Required for the second-order Taylor expansion to be defined.
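
The ledger's dependence on Hessian-vector products can be illustrated with a generic solver. The sketch below is an assumption-laden example rather than the paper's procedure: it solves a damped system `(H + damping * I) x = b` by conjugate gradient using only a Hessian-vector product callback. The function name `damped_inverse_hvp`, the least-squares Hessian, and the damping value are all hypothetical, and adding damping to the Hessian before inversion is a common practice assumed here, not something this summary confirms for the paper.

```python
import numpy as np

def damped_inverse_hvp(hvp, b, damping, max_iters=100, tol=1e-20):
    """Solve (H + damping * I) x = b by conjugate gradient, using only HVPs."""
    x = np.zeros_like(b)
    r = b.copy()                 # residual b - (H + damping*I) x at x = 0
    p = r.copy()                 # current search direction
    rs = r @ r
    for _ in range(max_iters):
        if rs < tol:             # squared residual norm is small enough
            break
        Ap = hvp(p) + damping * p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy check on a least-squares Hessian that is never formed inside the solver.
rng = np.random.default_rng(0)
n, d, damping = 50, 5, 0.01
X = rng.normal(size=(n, d))
hvp = lambda v: X.T @ (X @ v) / n
b = rng.normal(size=d)

x_cg = damped_inverse_hvp(hvp, b, damping)
x_direct = np.linalg.solve(X.T @ X / n + damping * np.eye(d), b)
print("max abs difference:", np.max(np.abs(x_cg - x_direct)))
```

For models where the Hessian is too large to store, only the `hvp` callback would change (for example, to one based on automatic differentiation); the solver itself never needs the full matrix.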

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 5 internal anchors

[1]

Neural networks for learnable and scalable influence estimation of instruction fine-tuning data

Ishika Agarwal and Dilek Hakkani-Tür. Neural networks for learnable and scalable influence estimation of instruction fine-tuning data. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[2]

Explanations for commonsenseqa: New dataset and models

Shourya Aggarwal, Divyanshu Mandowara, Vishwajeet Agrawal, Dinesh Khandelwal, Parag Singla, and Dinesh Garg. Explanations for commonsenseqa: New dataset and models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021

work page 2021
[3]

Deep batch active learning by diverse, uncertain gradient lower bounds.International Conference on Learning Representations, 2020

Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agar- wal. Deep batch active learning by diverse, uncertain gradient lower bounds.International Conference on Learning Representations, 2020

work page 2020
[4]

If influence functions are the answer, then what is the question?Advances in Neural Information Processing Systems, 2022

Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, and Roger B Grosse. If influence functions are the answer, then what is the question?Advances in Neural Information Processing Systems, 2022

work page 2022
[5]

On second-order group influence functions for black-box predictions

Samyadeep Basu, Xuchen You, and Soheil Feizi. On second-order group influence functions for black-box predictions. InInternational Conference on Machine Learning, 2020

work page 2020
[6]

Influence functions in deep learning are fragile

Samyadeep Basu, Philip Pope, and Soheil Feizi. Influence functions in deep learning are fragile. International Conference on Learning Representations, 2021

work page 2021
[7]

Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need

Vighnesh Birodkar, Hossein Mobahi, and Samy Bengio. Semantic redundancies in image- classification datasets: The 10% you don’t need.arXiv preprint arXiv:1901.11409, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[8]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI Conference on Artificial Intelligence, 2020

work page 2020
[9]

Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, and Ian Tenney

Tyler A. Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, and Ian Tenney. Scalable influence and fact tracing for large language model pretraining. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[10]

What is your data worth to gpt? llm-scale data valuation with influence functions.Advances in neural information processing systems, 2025

Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, et al. What is your data worth to gpt? llm-scale data valuation with influence functions.Advances in neural information processing systems, 2025

work page 2025
[11]

Batch active learning at scale.Advances in Neural Information Processing Systems, 2021

Gui Citovsky, Giulia DeSalvo, Claudio Gentile, Lazaros Karydas, Anand Rajagopalan, Afshin Rostamizadeh, and Sanjiv Kumar. Batch active learning at scale.Advances in Neural Information Processing Systems, 2021

work page 2021
[12]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[13]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Dawnbench: An end-to-end deep learning benchmark and competition.Training, 2017

Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, and Matei Zaharia. Dawnbench: An end-to-end deep learning benchmark and competition.Training, 2017

work page 2017
[15]

Support-vector networks.Machine learning, 1995

Corinna Cortes and Vladimir Vapnik. Support-vector networks.Machine learning, 1995

work page 1995
[16]

Improving influence-based instruction tuning data selection for balanced learning of diverse capabilities.Findings of the Association for Computational Linguistics, 2025

Qirun Dai, Dylan Zhang, Jiaqi W Ma, and Hao Peng. Improving influence-based instruction tuning data selection for balanced learning of diverse capabilities.Findings of the Association for Computational Linguistics, 2025. 10

work page 2025
[17]

Junwei Deng, Weijing Tang, and Jiaqi W. Ma. A versatile influence function for data attribution with non-decomposable loss.arXiv preprint arXiv:2412.01335, 2024

work page arXiv 2024
[18]

Dsdm: Model-aware dataset selection with datamodels

Logan Engstrom, Axel Feldmann, and Aleksander Madry. Dsdm: Model-aware dataset selection with datamodels. InInternational Conference on Machine Learning, 2024

work page 2024
[19]

Fast approximate natural gradient descent in a kronecker factored eigenbasis.Advances in neural information processing systems, 2018

Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a kronecker factored eigenbasis.Advances in neural information processing systems, 2018

work page 2018
[20]

Data shapley: Equitable valuation of data for machine learning

Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. InInternational conference on machine learning, 2019

work page 2019
[21]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296,

Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296, 2023

work page arXiv 2023
[23]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, 2016

work page 2016
[24]

Influence functions for edge edits in non-convex graph neural networks.Advances in Neural Information Processing Systems, 2025

Jaeseung Heo, Kyeongheung Yun, Seokwon Yoon, MoonJeong Park, Jungseul Ok, and Dong- woo Kim. Influence functions for edge edits in non-convex graph neural networks.Advances in Neural Information Processing Systems, 2025

work page 2025
[25]

Most influential subset selection: Challenges, promises, and beyond.Advances in Neural Information Processing Systems, 2024

Yuzheng Hu, Pingbang Hu, Han Zhao, et al. Most influential subset selection: Challenges, promises, and beyond.Advances in Neural Information Processing Systems, 2024

work page 2024
[26]

Approx- imations to worst-case data dropping: unmasking failure modes.Transactions on Machine Learning Research, 2025

Jenny Y Huang, David R Burt, Yunyi Shen, Tin D Nguyen, and Tamara Broderick. Approx- imations to worst-case data dropping: unmasking failure modes.Transactions on Machine Learning Research, 2025

work page 2025
[27]

W., and Dasigi, P

Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, and Pradeep Dasigi. Large-Scale Data Selection for Instruction Tuning.arXiv preprint arXiv:2503.01807, 2025

work page arXiv 2025
[28]

Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning.Advances in neural information processing systems, 2019

Andreas Kirsch, Joost Van Amersfoort, and Yarin Gal. Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning.Advances in neural information processing systems, 2019

work page 2019
[29]

Understanding black-box predictions via influence functions

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. InInternational conference on machine learning, 2017

work page 2017
[30]

On the accuracy of influence functions for measuring group effects.Advances in neural information processing systems, 2019

Pang Wei W Koh, Kai-Siang Ang, Hubert Teo, and Percy S Liang. On the accuracy of influence functions for measuring group effects.Advances in neural information processing systems, 2019

work page 2019
[31]

Bayesian influence functions for hessian-free data attribution

Philipp Alexander Kreer, Wilson Wu, Maxwell Adam, Zach Furman, and Jesse Hoogland. Bayesian influence functions for hessian-free data attribution. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[32]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

work page 2009
[33]

Datainf: Efficiently estimating data influence in lora-tuned llms and diffusion models.International Conference on Learning Representations, 2024

Yongchan Kwon, Eric Wu, Kevin Wu, and James Zou. Datainf: Efficiently estimating data influence in lora-tuned llms and diffusion models.International Conference on Learning Representations, 2024

work page 2024
[34]

Gradient-based learning applied to document recognition.Proceedings of the IEEE, 2002

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 2002. 11

work page 2002
[35]

Nv-embed: Improved techniques for training llms as generalist embedding models.International Conference on Learning Representations, 2025

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models.International Conference on Learning Representations, 2025

work page 2025
[36]

Program induction by rationale generation: Learning to solve and explain algebraic word problems

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017

work page 2017
[37]

New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 2020

James Martens. New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 2020

work page 2020
[38]

Optimizing neural networks with kronecker-factored approx- imate curvature

James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approx- imate curvature. InInternational conference on machine learning, 2015

work page 2015
[39]

Coresets for data-efficient training of machine learning models

Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. InInternational Conference on Machine Learning, 2020

work page 2020
[40]

Bruno Kacper Mlodozeniec, Runa Eschenhagen, Juhan Bae, Alexander Immer, David Krueger, and Richard E. Turner. Influence functions for scalable data attribution in diffusion models. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[41]

Efficient data selection at scale via influence distillation

Mahdi Nikdan, Vincent Cohen-Addad, Dan Alistarh, and Vahab Mirrokni. Efficient data selection at scale via influence distillation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[42]

G- dig: Towards gradient-based diverse and high-quality instruction data selection for machine translation

Xingyuan Pan, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, and Shanbo Cheng. G- dig: Towards gradient-based diverse and high-quality instruction data selection for machine translation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

work page 2024
[43]

Trak: Attributing model behavior at scale

Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. Trak: Attributing model behavior at scale. InInternational Conference on Machine Learning, 2023

work page 2023
[44]

Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 2020

Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 2020

work page 2020
[45]

Squad: 100,000+ questions for machine comprehension of text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016

work page 2016
[46]

Contrastive learning with hard negative samples

Joshua David Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples. InInternational Conference on Learning Representations, 2021

work page 2021
[47]

Ittai Rubinstein and Samuel B. Hopkins. Rescaled influence functions: Accurate data attribution in high dimension. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[48]

Understanding influ- ence functions and datamodels via harmonic analysis.International Conference on Learning Representations, 2023

Nikunj Saunshi, Arushi Gupta, Mark Braverman, and Sanjeev Arora. Understanding influ- ence functions and datamodels via harmonic analysis.International Conference on Learning Representations, 2023

work page 2023
[49]

Scaling up influence functions

Andrea Schioppa, Polina Zablotskaia, David Vilar, and Artem Sokolov. Scaling up influence functions. InProceedings of the AAAI Conference on Artificial Intelligence, 2022

work page 2022
[50]

Training region-based object detectors with online hard example mining

Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. InProceedings of the IEEE conference on computer vision and pattern recognition, 2016

work page 2016
[51]

Data pruning by infor- mation maximization

Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, and XIAOJUAN QI. Data pruning by infor- mation maximization. InThe Thirteenth International Conference on Learning Representations, 2025. 12

work page 2025
[52]

An empirical study of example forgetting during deep neural network learning

Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. InInternational Conference on Learning Representations, 2019

work page 2019
[53]

Accurate telemonitoring of parkinson’s disease progression by non-invasive speech tests.Nature Precedings, 2009

Athanasios Tsanas, Max Little, Patrick McSharry, and Lorraine Ramig. Accurate telemonitoring of parkinson’s disease progression by non-invasive speech tests.Nature Precedings, 2009

work page 2009
[54]

Better training data attribution via better inverse hessian-vector products.Advances in Neural Information Processing Systems, 2025

Andrew Wang, Elisa Nguyen, Runshi Yang, Juhan Bae, Sheila A McIlraith, and Roger Grosse. Better training data attribution via better inverse hessian-vector products.Advances in Neural Information Processing Systems, 2025

work page 2025
[55]

Rethinking data shapley for data selection tasks: Misleads and merits.International Conference on Machine Learning, 2024

Jiachen T Wang, Tianji Yang, James Zou, Yongchan Kwon, and Ruoxi Jia. Rethinking data shapley for data selection tasks: Misleads and merits.International Conference on Machine Learning, 2024

work page 2024
[56]

Data shapley in one training run

Jiachen T Wang, Prateek Mittal, Dawn Song, and Ruoxi Jia. Data shapley in one training run. International Conference on Learning Representations, 2025

work page 2025
[57]

How far can camels go? exploring the state of instruction tuning on open resources.Advances in Neural Information Processing Systems, 2023

Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? exploring the state of instruction tuning on open resources.Advances in Neural Information Processing Systems, 2023

work page 2023
[58]

Ji2s: Joint influence-aware instruction data selection for efficient fine-tuning

Jingyu Wei, Bo Liu, Tianjiao Wan, Baoyun Peng, Xingkong Ma, and Mengmeng Guo. Ji2s: Joint influence-aware instruction data selection for efficient fine-tuning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

work page 2025
[59]

Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. LESS: Selecting influential data for targeted instruction tuning. In International Conference on Machine Learning, 2024

[60]

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017

[61]

Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, and Baharan Mirzasoleiman. Smalltolarge (s2l): Scalable data selection for fine-tuning large language models by summarizing training trajectories of small models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[62]

Xichen Ye, Yifan Wu, Weizhong Zhang, Cheng Jin, and Yifan Chen. Towards robust influence functions with flat validation minima. arXiv preprint arXiv:2505.19097, 2025

[63]

I-C Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 1998

[64]

Zichun Yu, Spandan Das, and Chenyan Xiong. Mates: Model-aware data selection for efficient pretraining with data influence models. Advances in Neural Information Processing Systems, 2024

[65]

Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, and Chenyan Xiong. Group-level data selection for efficient pretraining. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

[66]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

[67]

Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Jiantao Qiu, Lei Cao, Ju Fan, et al. Harnessing diversity for important data selection in pretraining large language models. International Conference on Learning Representations, 2025

A Notation

Table 2 consolidates the notation used throughout the paper. The sy...

