Efficient Estimation of Kernel Surrogate Models for Task Attribution

Hongyang R. Zhang; Minxuan Duan; Zhenshuo Zhang

arxiv: 2602.03783 · v2 · submitted 2026-02-03 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-16 07:55 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords task attribution · kernel surrogate models · influence functions · leave-one-out retraining · data selection · nonlinear interactions · large language models · gradient-based estimation

The pith

Kernel surrogate models capture nonlinear task interactions for more accurate attribution in large AI training

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that kernel surrogate models can predict how subsets of training tasks affect performance on a target task more accurately than linear alternatives. Direct measurement via leave-one-out retraining is computationally prohibitive for modern models, so surrogates are needed to approximate these effects at scale. Linear surrogates and influence functions only capture first-order relationships and miss nonlinear interactions such as XOR effects. Kernel surrogates address this by modeling second-order interactions, learned efficiently through a gradient-based procedure that uses a first-order approximation of the pretrained model. This yields estimates with under 2% relative error and produces 25% higher correlation with true leave-one-out results, plus 40% gains when applied to data selection.
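
The XOR point can be made concrete with a toy sketch. Everything below is an assumption for illustration and not the paper's procedure: two training tasks, a made-up target score that follows an XOR pattern over which tasks are included, an RBF kernel with arbitrary `gamma`, a small ridge term `lam`, and a hand-written `solve` helper. Both surrogates are fitted on all four subsets.

```python
import math

# Toy setting (an assumption for illustration): two training tasks, and the
# target-task score follows an XOR pattern over which tasks are included.
subsets = [(0, 0), (0, 1), (1, 0), (1, 1)]   # binary inclusion vectors
scores = [0.0, 1.0, 1.0, 0.0]                # XOR of the two inclusion bits


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(xs, ys):
    """Least-squares fit of y ~ w0 + w1*x1 + w2*x2 via the normal equations."""
    feats = [[1.0, *x] for x in xs]
    d = len(feats[0])
    gram = [[sum(f[i] * f[j] for f in feats) for j in range(d)] for i in range(d)]
    rhs = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(d)]
    w = solve(gram, rhs)
    return lambda x: w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def rbf(x, z, gamma=1.0):
    """RBF kernel between two inclusion vectors (illustrative kernel choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def fit_kernel(xs, ys, lam=1e-6):
    """Kernel ridge regression: alpha = (K + lam * I)^{-1} y."""
    k = [[rbf(a, b) + (lam if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    alpha = solve(k, ys)
    return lambda x: sum(a * rbf(x, z) for a, z in zip(alpha, xs))


linear = fit_linear(subsets, scores)
kernel = fit_kernel(subsets, scores)
linear_preds = [linear(x) for x in subsets]   # the best linear fit is flat
kernel_preds = [kernel(x) for x in subsets]   # the kernel fit tracks XOR
```

In this toy case the least-squares linear surrogate predicts the same value for every subset, because no additive weighting of the two inclusion bits can express XOR, while the kernel surrogate reproduces the four scores up to the small ridge term.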

Core claim

Kernel surrogate models represent second-order task interactions more effectively than linear surrogates within a unified task-weighting framework. A gradient-based estimation procedure leverages first-order approximations of pretrained models to learn these surrogates without repeated retraining, achieving less than 2% relative error. Experiments in mathematical reasoning with transformers, in-context learning, and multi-objective reinforcement learning show 25% higher correlation with leave-one-out ground truth than linear surrogates or influence-function baselines, and 40% better performance in downstream data selection.

What carries the argument

Kernel surrogate models that capture nonlinear second-order interactions between training tasks, estimated via gradient-based optimization on first-order approximations of pretrained models.

If this is right

Kernel surrogates achieve 25% higher correlation with leave-one-out ground truth than linear surrogates and influence-function baselines.
They enable 40% improvement in downstream data selection across the tested settings.
The first-order approximation suffices for accurate estimates with less than 2% relative error.
The approach applies to transformers for mathematical reasoning, in-context learning, and multi-objective reinforcement learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar kernel surrogates could be tested on vision or multimodal models to attribute contributions from image or video data sources.
The method might reduce reliance on influence-function approximations in other attribution settings such as feature selection.
Dynamic task reweighting during continual learning becomes feasible if the approximation error remains low at larger scales.
Combining the kernel estimation with quantization or pruning could further lower the cost of attribution in resource-constrained environments.

Load-bearing premise

The first-order approximation of pretrained models is accurate enough to produce kernel surrogate estimates with less than 2% relative error without repeated retraining.

What would settle it

Perform full leave-one-out retraining on a new task collection and measure whether the kernel surrogate correlation with ground truth falls below that of linear surrogates or whether the relative approximation error exceeds 2%.
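
A minimal sketch of the two checks described above, under stated assumptions: the paper's exact correlation measure and error norm are not given on this page, so Pearson correlation and a Euclidean relative error are used here as stand-ins, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def relative_error(estimate, truth):
    """Euclidean norm of the difference, relative to the norm of the truth."""
    num = math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)))
    den = math.sqrt(sum(t ** 2 for t in truth))
    return num / den


def claim_is_challenged(kernel_corr, linear_corr, rel_err, tol=0.02):
    """True if either failure condition from the text holds:
    the kernel surrogate correlates worse with leave-one-out ground truth
    than the linear surrogate, or the relative error exceeds 2%."""
    return kernel_corr < linear_corr or rel_err > tol
```

Here `kernel_corr` and `linear_corr` would come from `pearson` applied to each surrogate's attribution scores against the retrained leave-one-out scores, and `rel_err` from comparing the approximated outputs against the retrained ones.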

Figures

Figures reproduced from arXiv: 2602.03783 by Hongyang R. Zhang, Minxuan Duan, Zhenshuo Zhang.

**Figure 2.** We use a binary classification task with a two-layer MLP as the base classifier. The final goal for the surrogate model is to predict the MLP output on a fixed test sample, given the subset of training data it was trained on. We specifically analyze the effect of different subsets of training data sampled from near the MLP decision boundary. The detailed setting is in Appendix B.1.

**Figure 3.** We investigate both linear and kernel surrogate models’ fit under different Hessian reg…

**Figure 4.** We investigated how the size of the surrogate training split affects the residual error of the…

read the original abstract

Modern AI agents such as large language models are trained on diverse tasks -- translation, code generation, mathematical reasoning, and text prediction -- simultaneously. A key question is how to quantify the influence of each individual training task on performance on a target task, a problem we refer to as task attribution. The direct approach, leave-one-out retraining, measures the effect of removing each task, but is computationally infeasible at scale. An alternative approach that builds surrogate models to predict the performance on a target task for any subset of training tasks has emerged in the recent literature. Prior work focuses on linear surrogate models, which capture first-order relationships but miss nonlinear interactions such as XOR-type effects. In this paper, we first consider a unified task-weighting framework for analyzing task-attribution methods and establish a new connection between linear surrogate models and influence functions via a second-order analysis. Then, we introduce kernel surrogate models, which more effectively represent second-order task interactions. To efficiently learn the kernel surrogate, we develop a gradient-based estimation procedure that leverages a first-order approximation of pretrained models; empirically, this yields accurate surrogate estimates with less than $2\%$ relative error without repeated retraining. Experiments across multiple settings -- including mathematical reasoning in transformers, in-context learning, and multi-objective reinforcement learning -- demonstrate the effectiveness of kernel surrogate models. They achieve a $25\%$ higher correlation with the leave-one-out ground truth than linear surrogates and influence-function baselines, enabling more accurate and scalable task attribution. When used for downstream data selection, kernel surrogate models further yield a $40\%$ improvement in the aforementioned settings.
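
The leave-one-out baseline that the abstract calls infeasible can be sketched as follows. This is an illustrative sketch, not the authors' code: `perf` stands for "train on this subset of tasks and evaluate on the target task", and `toy_perf` is a hypothetical stand-in with a built-in interaction between tasks 1 and 2.

```python
from typing import Callable, FrozenSet, List


def leave_one_out(perf: Callable[[FrozenSet[int]], float], n_tasks: int) -> List[float]:
    """Attribution of task i = perf(all tasks) - perf(all tasks except i).

    Each call to perf represents one full training run, so this needs
    n_tasks + 1 runs in total.
    """
    full = frozenset(range(n_tasks))
    base = perf(full)
    return [base - perf(full - {i}) for i in range(n_tasks)]


def toy_perf(subset: FrozenSet[int]) -> float:
    """Hypothetical performance function: task 0 helps on its own,
    tasks 1 and 2 only help when both are present."""
    score = 0.0
    if 0 in subset:
        score += 1.0
    if 1 in subset and 2 in subset:
        score += 0.5
    return score


loo_scores = leave_one_out(toy_perf, 3)
```

In a real setting every `perf` call is a retraining of the model, which is why the cost grows with the number of tasks and why surrogates that predict `perf` for arbitrary subsets are attractive.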

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Kernel surrogates give measurable gains over linear ones for task attribution in the reported experiments, but the first-order approximation for fitting them is the part that needs the closest look.

The paper's main move is to replace linear surrogates with kernel ones so they can pick up second-order task interactions that linear models miss. They also link the linear case back to influence functions through a second-order expansion, which is a clean observation. The practical piece is the gradient-based fitting procedure that uses a first-order model approximation and claims under 2% relative error without retraining. In the experiments on transformer math reasoning, in-context learning, and multi-objective RL, the kernel version shows 25% higher correlation with leave-one-out ground truth and 40% better downstream data selection than the linear and influence-function baselines. That is the concrete evidence they put forward.

The soft spot is exactly the one the stress-test note flags: a first-order approximation is being used to fit a model whose selling point is second-order effects. If the higher-order terms that get dropped are non-negligible on the task distributions they care about, the reported lift could partly reflect approximation error rather than the kernel itself. The abstract gives no error bars, no ablation on the approximation order, and no derivation details, so it is hard to judge how robust the numbers are.

This work is aimed at people doing data curation or multi-task training at scale who need a cheaper alternative to full leave-one-out. If the full paper supplies the missing ablations and shows the approximation holds up on the actual models, it is worth a serious referee. Otherwise the efficiency claim stays unconvincing. I would send it to review rather than desk-reject.

Referee Report

1 major / 2 minor

Summary. The paper proposes kernel surrogate models for task attribution to capture nonlinear (e.g., XOR-type) interactions among training tasks, in contrast to prior linear surrogates. It first unifies task-weighting methods and links linear surrogates to influence functions via second-order analysis, then introduces kernels and a gradient-based estimation procedure relying on a first-order approximation of pretrained models to avoid repeated retraining. Experiments on mathematical reasoning in transformers, in-context learning, and multi-objective RL report 25% higher correlation with leave-one-out ground truth than linear or influence-function baselines, plus 40% gains in downstream data selection.

Significance. If the first-order approximation reliably preserves the second-order effects the kernels are intended to model, the work supplies a practical, scalable improvement over linear task-attribution methods with clear downstream utility for data selection. The empirical correlation lifts across three distinct settings are a concrete strength; however, the absence of explicit error bounds or ablations on the truncation error makes it difficult to attribute the reported gains specifically to the kernel rather than to uncontrolled approximation artifacts.

major comments (1)

[Abstract] Abstract (gradient-based estimation procedure): the central efficiency claim rests on a first-order approximation of pretrained models when fitting the kernel surrogate, yet the kernel is explicitly motivated by its ability to represent second-order interactions. No derivation, remainder bound, or ablation isolating the contribution of discarded higher-order terms is supplied; if those terms are non-negligible, the 25% correlation advantage cannot be confidently ascribed to the kernel structure itself.

minor comments (2)

The abstract reports aggregate correlation and improvement percentages without stating the number of independent runs, standard errors, or the precise definition of the leave-one-out ground truth used for comparison.
Notation for the kernel surrogate (e.g., the precise form of the kernel and how task subsets are encoded) is introduced only at a high level; a short explicit definition would aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and valuable comments on our manuscript. We address the major comment point by point below.

Referee: [Abstract] Abstract (gradient-based estimation procedure): the central efficiency claim rests on a first-order approximation of pretrained models when fitting the kernel surrogate, yet the kernel is explicitly motivated by its ability to represent second-order interactions. No derivation, remainder bound, or ablation isolating the contribution of discarded higher-order terms is supplied; if those terms are non-negligible, the 25% correlation advantage cannot be confidently ascribed to the kernel structure itself.

Authors: We thank the referee for pointing out this potential inconsistency. The first-order approximation is used solely to efficiently compute the necessary gradients for fitting the surrogate model without requiring multiple retrainings of the large pretrained model. This approximation is applied to the performance function of the pretrained model. The kernel surrogate, however, operates on the space of task subsets and is capable of capturing nonlinear interactions through its kernel function, which can model second-order and higher effects in the attribution weights. The connection to influence functions is established via second-order analysis for the linear case, and the kernel extends this. While we do not provide a theoretical remainder bound in the current version, we report empirical evidence of the approximation's accuracy with less than 2% relative error across experiments. To strengthen the manuscript, we will include a derivation of the first-order approximation error and an ablation study that compares the kernel surrogate against a linear surrogate under identical approximation conditions to isolate the contribution of the kernel structure.

revision: yes

Circularity Check

0 steps flagged

No significant circularity: independent LOO validation and empirical error measurement keep derivation self-contained

The paper fits kernel surrogate models to predict target-task performance from training-task subsets and directly evaluates correlation against leave-one-out retraining ground truth, which is computed independently of the surrogate fit. The gradient-based procedure employs a first-order approximation solely for computational efficiency; its accuracy is asserted via an empirical <2% relative-error measurement rather than by definitional reduction or self-citation. No load-bearing step equates a claimed prediction to its own fitted inputs, invokes a self-citation uniqueness theorem, or renames a known result. The reported 25% correlation lift and 40% downstream improvement are therefore measured against external benchmarks and do not collapse to the model's own parameters by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that a first-order Taylor approximation around pretrained model parameters suffices to estimate the kernel surrogate accurately; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: First-order approximation of pretrained models yields accurate surrogate estimates.
Invoked to justify the gradient-based estimation procedure without repeated retraining.
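
For reference, the generic first-order Taylor expansion that this assumption refers to can be written as below. The notation is assumed here and not taken from the paper: $W_0$ denotes the pretrained parameters, $f_W(x)$ the model output on input $x$, and $W$ a nearby parameter vector. The stated remainder order holds when $f$ is twice continuously differentiable in $W$.

```latex
% Assumed notation (not the paper's): W_0 = pretrained parameters,
% f_W(x) = model output on input x, W = parameters after training on a subset.
f_W(x) \;\approx\; f_{W_0}(x) + \nabla_W f_{W_0}(x)^{\top} (W - W_0),
\qquad
\text{remainder} = O\!\left(\lVert W - W_0 \rVert^{2}\right).
```

The axiom amounts to assuming that the remainder term stays small for the parameter shifts induced by training on different task subsets, which is what the reported sub-2% relative error is meant to support empirically.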

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We introduce kernel surrogate models... develop a gradient-based estimation procedure that leverages a first-order approximation of pretrained models"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "linear surrogate models... approximately equal to the influence functions, up to third-order expansion errors (Proposition 3.1)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, Jeff Schneider, Eduard Hovy, Roger Grosse, and Eric Xing. What is your data worth to GPT? LLM-scale data valuation with influence functions. arXiv preprint arXiv:2405.13954.
[2] Logan Engstrom, Andrew Ilyas, Benjamin Chen, Axel Feldmann, William Moses, and Aleksander Madry. Optimizing ML training with metagradient descent. arXiv preprint arXiv:2503.13751.
[3] Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, and Samuel R. Bowman. Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296.
[4] Pingbang Hu, Joseph Melkonian, Weijing Tang, Han Zhao, and Jiaqi W Ma. Grass: Scalable influence function with sparse gradient compression. arXiv preprint arXiv:2505.18976.
[5] Andrew Ilyas and Logan Engstrom. Magic: Near-optimal data attribution for deep learning. arXiv preprint arXiv:2504.16430.
[6] William B Johnson. Extensions of Lipschitz mapping into Hilbert space. In Conference modern analysis and probability, 1984, pp. 189–206.
[7] Dongyue Li, Aneesh Sharma, and Hongyang R Zhang. Scalable multitask learning using gradient-based estimation of task affinity. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1542–1553, 2024a. Dongyue Li, Ziniu Zhang, Lu Wang, and Hongyang R. Zhang. Scalable fine-tuning from multiple data sources: A first-ord...
[8] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11048–11064.
[9] Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, and Wei Chen. Benefits and pitfalls of reinforcement learning for language model planning: a theoretical perspective. arXiv preprint arXiv:2509.22613.
[10] (2017) A. Complete Proofs. Derivation of influence functions. Let $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ be a dataset including $n$ samples drawn independently from an unknown data distribution. Let $f_W$ denote a model with parameters $W \in \mathbb{R}^d$. Let $\hat{L}(f_W)$ denote the empirical loss of the model $f_W$ on $S$, averaged over the $n$ training data samples. The influence function (Koh &...
[11] (1984) ... $\le \delta$. Provided that the random projection dimension $k$ satisfies $k = O\left(\frac{\log N}{\epsilon^2}\right)$, the training loss of $\widehat{W}(S)$ is bounded away from the minimum training loss for any $S \subseteq \{1, 2, \ldots, n\}$ as $\hat{L}(f_{\widehat{W}(S)}) \le \min_{W \in D} \hat{L}(f_W) + 2\delta + 4GD\epsilon$ (Eq. 25). The proof is based on the Johnson-Lindenstrauss lemma (Johnson, 1984), which asserts that when $k = O\left(\frac{\log N}{\epsilon^2}\right)$, for any $g_i$ with $\|g_i\| \le G$ and an...
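
The random-projection step mentioned in the excerpt above can be illustrated with a small sketch. It assumes a dense Gaussian projection matrix scaled by $1/\sqrt{k}$, which is one common way to realize a Johnson-Lindenstrauss projection; the paper's actual projection, dimensions, and helper names are not given here, so `make_projection` and `project` are hypothetical.

```python
import math
import random


def make_projection(d, k, seed=0):
    """k x d matrix with i.i.d. N(0, 1/k) entries (an assumed construction)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(p, g):
    """Map a d-dimensional gradient g to k dimensions via the matrix p."""
    return [sum(pij * gj for pij, gj in zip(row, g)) for row in p]


def sq_norm(v):
    return sum(x * x for x in v)


# Illustrative use: compare squared norms before and after projection.
d, k = 200, 64
P = make_projection(d, k)
g = [math.sin(i) for i in range(d)]       # stand-in for a gradient vector
ratio = sq_norm(project(P, g)) / sq_norm(g)
```

With this scaling the expected value of `ratio` is 1, and it concentrates around 1 as `k` grows, which is the property the Johnson-Lindenstrauss lemma quantifies.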
[12] (2022) ... offers an efficient algorithm for data attribution by linearizing the model and using random projections. We adapt it to attribute influence at the task level. The core idea is to represent each task by an average of its constituent samples’ projected gradients. Specifically, for each sample $z$ in a task, a feature vector is computed from the gradient of a m...
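
The averaging step described in the excerpt above can be sketched in a few lines. This covers only the "average of projected gradients" idea; how each per-sample feature vector is computed is truncated in the excerpt, so the inputs here are assumed to be already-projected gradients and `task_feature` is a hypothetical name.

```python
def task_feature(projected_grads):
    """Represent a task by the coordinate-wise mean of its samples'
    projected gradient vectors (all vectors must share one length)."""
    n = len(projected_grads)
    return [sum(col) / n for col in zip(*projected_grads)]


# Illustrative use with two made-up 2-dimensional projected gradients.
feature = task_feature([[1.0, 2.0], [3.0, 4.0]])
```

Averaging keeps the task representation at the projection dimension regardless of how many samples the task contains.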
[13] (2019) Specifically, we vary $\lambda$ from $10^{-3}$ to $1$ and $\gamma$ from $10^{-5}$ to $10^{-1}$. We find that our approach is relatively robust to changes in both $\lambda$ and $\gamma$, exhibiting stable performance across the entire range. In our experiments, we set $\lambda = 10^{-1}$ and $\gamma = 1/n$ as the default configuration, where $n$ denotes the number of tasks. Comparison of kernels. We use the CIFAR-10 dataset and the ResN...