pith. sign in

arxiv: 2606.06892 · v1 · pith:EDKOIZGInew · submitted 2026-06-05 · 💻 cs.LG

GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

Pith reviewed 2026-06-27 22:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords data attributionpretraining datasubset interactionscounterfactual utilitygeometric penaltyscalable methodsmachine learning curation
0
0 comments X

The pith

GRASP models subset interactions in pretraining data via a quadratic geometric penalty to predict counterfactual utilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard data attribution scores individual examples additively, but this misses how subsets interact through redundancy or complementary coverage. GRASP reframes attribution as predicting the utility of data subsets in a counterfactual setting and introduces an interaction-aware surrogate grounded in a smoothness lower bound. The surrogate uses a quadratic geometric penalty to capture those interactions, paired with low-dimensional feature sketches and a finite lower-confidence bound protocol for efficiency at scale. This matters because better subset utility predictions let practitioners curate massive pretraining corpora without exhaustive retraining. Evaluations on subset retraining show the approach more than doubles rank correlation with actual outcomes while cutting upfront costs substantially.

Core claim

GRASP is an interaction-aware surrogate for subset-level counterfactual utility prediction. It is grounded in a theoretical smoothness lower bound that supplies a quadratic geometric penalty to model subset interactions explicitly. Low-dimensional feature sketches combined with a strictly finite lower-confidence bound selection protocol enable pretraining-scale use without oracle tuning or post-hoc adjustments. Subset-retraining evaluations show it more than doubles task-level rank correlation for counterfactual fidelity and reduces artifact construction costs by nearly an order of magnitude. The resulting scores transfer to language model curation and cross-domain vision selection.

What carries the argument

The quadratic geometric penalty derived from a smoothness lower bound, which explicitly accounts for subset interactions in the utility prediction.

If this is right

  • Subset-retraining evaluations show more than double the task-level rank correlation for counterfactual subset fidelity compared with scalable baselines.
  • Upfront artifact construction costs drop by nearly an order of magnitude.
  • The scoring mechanism transfers directly to language model curation tasks.
  • The same scores support cross-domain vision data selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The interaction modeling could support iterative refinement of pretraining corpora by repeatedly selecting complementary rather than redundant subsets.
  • If the smoothness-derived penalty generalizes, similar geometric terms might improve attribution in sequential data collection settings such as active learning.
  • The efficiency gains open the possibility of running attribution at the full pretraining corpus size rather than on sampled proxies.

Load-bearing premise

The quadratic geometric penalty from the smoothness lower bound together with the sketches and selection protocol accurately captures subset interactions at pretraining scale without hidden tuning.

What would settle it

A controlled experiment on a held-out large pretraining run in which GRASP-assigned subset utilities show no higher correlation with actual performance gains from retraining than existing additive baselines.

Figures

Figures reproduced from arXiv: 2606.06892 by Ruining Chen, Yue Min, Yujun Li.

Figure 1
Figure 1. Figure 1: Overview of the GRASP architecture. GRASP predicts the counterfactual utility of data interventions by combining an additive relevance core (bottom) with an interaction-aware geometric penalty (top). By processing LLM hidden states and residuals into compact direct-sum sketches (ϕi) and stabilized relevance scores, the method efficiently captures subset redundancy and complementary coverage. The standardiz… view at source ↗
Figure 2
Figure 2. Figure 2: One-step counterfactual fidelity on CC text. The GRASP relevance tracks exact one-step sample and subset utilities, and the local interaction surrogate improves the more nonlinear high-learning-rate regime. The projected-gradient control receives the same interaction correction but remains weaker. Method Build (d) State Sweep (s) TracIn 2.2 1 0.086 TRAK 2.2 1024 159.8 InRun-DS 10.0 1 0.086 Ret.-DS 12.2 0 1… view at source ↗
Figure 3
Figure 3. Figure 3: Impact of interaction components on subset prediction fidelity. We ablate GRASP components across all primary LDS benchmarks. Bars show mean task-level Spearman correlation (ρ), and colored points show individual tasks. A projected-gradient control with an identical in￾teraction correction remains significantly weaker, indicating that the performance gain does not stem from adding a generic pairwise correc… view at source ↗
Figure 4
Figure 4. Figure 4: Evolution of topic and format contributions. Aggregated GRASP attribution shares across training checkpoints for the SciQ target. Positive and negative mean-score masses are normalized independently per checkpoint to illustrate relative category preferences over time. C.2 Checkpoint-Wise Category Contributions To enhance the interpretability of our attribution framework, we analyze the mean signed attribu￾… view at source ↗
read the original abstract

Scalable data attribution methods typically assign isolated utility scores to individual training examples. This prevalent additive assumption fundamentally fails to capture critical subset dynamics, including data redundancy and complementary coverage. In this work, we reframe attribution as subset-level counterfactual utility prediction and introduce GRASP, an interaction-aware surrogate. Grounded in a theoretical smoothness lower bound, GRASP explicitly models subset interactions through a quadratic geometric penalty. To achieve pretraining-scale efficiency without relying on hidden oracle tuning, we couple low-dimensional feature sketches with a strictly finite lower-confidence bound selection protocol. Extensive subset-retraining evaluations demonstrate that GRASP decisively outperforms existing scalable baselines. It more than doubles the task-level rank correlation for counterfactual subset fidelity while reducing upfront artifact construction costs by nearly an order of magnitude. Downstream diagnostics further show that this scoring mechanism transfers to language model curation and cross-domain vision selection, establishing a robust foundation for optimizing massive pretraining corpora.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces GRASP, a geometry-aware residual alignment method that reframes data attribution as subset-level counterfactual utility prediction rather than isolated additive scores. It derives a quadratic geometric penalty from a theoretical smoothness lower bound, couples low-dimensional feature sketches with a strictly finite lower-confidence bound selection protocol for efficiency, and reports that extensive subset-retraining evaluations show more than double the task-level rank correlation for counterfactual fidelity while cutting upfront artifact costs by nearly an order of magnitude. The scoring is claimed to transfer to language-model curation and cross-domain vision selection.

Significance. If the central claims hold, GRASP would address a recognized limitation of additive attribution methods by explicitly modeling subset interactions at pretraining scale. The reported efficiency gains and improved rank correlation on counterfactual evaluations could have practical impact on data curation pipelines. The absence of free parameters or oracle tuning in the protocol is a positive feature if substantiated.

major comments (2)
  1. [Abstract] The abstract invokes a 'theoretical smoothness lower bound' and 'quadratic geometric penalty' but supplies no equations, derivation, or section reference; without these it is impossible to verify whether the penalty is derived from first principles or reduces to a fitted quantity by construction.
  2. [Abstract] The claim that the method 'decisively outperforms existing scalable baselines' and 'more than doubles the task-level rank correlation' rests on subset-retraining evaluations whose experimental protocol, baselines, error bars, and statistical tests are not described in the provided text, preventing assessment of whether the improvement is load-bearing or sensitive to hidden choices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and for highlighting areas where the abstract could better guide readers to the supporting material in the full manuscript. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract] The abstract invokes a 'theoretical smoothness lower bound' and 'quadratic geometric penalty' but supplies no equations, derivation, or section reference; without these it is impossible to verify whether the penalty is derived from first principles or reduces to a fitted quantity by construction.

    Authors: The abstract is space-constrained and therefore omits equations. The smoothness lower bound is derived in Section 3.1 from a Lipschitz-continuous utility assumption on subset counterfactuals; the quadratic geometric penalty then follows directly as Equation (5) without any fitted parameters. We will revise the abstract to include an explicit reference to Section 3.1. revision: yes

  2. Referee: [Abstract] The claim that the method 'decisively outperforms existing scalable baselines' and 'more than doubles the task-level rank correlation' rests on subset-retraining evaluations whose experimental protocol, baselines, error bars, and statistical tests are not described in the provided text, preventing assessment of whether the improvement is load-bearing or sensitive to hidden choices.

    Authors: The abstract summarizes results whose details appear in the full manuscript. Section 4.1 specifies the subset-retraining protocol, Section 4.2 lists the baselines, all tables and figures report error bars over five independent runs, and Appendix B contains the Wilcoxon signed-rank tests. We will add a parenthetical reference to Section 4 in the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and context contain no equations, derivations, or explicit self-citations that could be examined for reduction to inputs by construction. The mention of a 'theoretical smoothness lower bound' and 'finite lower-confidence bound selection protocol' is stated at a high level without any visible chain that equates a prediction to a fitted quantity or relies on load-bearing self-citation. Per the rules, circularity requires quotable paper text exhibiting the specific reduction; none is present here, so the derivation (if any) cannot be shown to collapse to its inputs. This is the expected honest non-finding when no load-bearing steps are isolable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger cannot be populated with specific free parameters, axioms, or invented entities from the paper. The central claim rests on an unstated theoretical smoothness lower bound and the assumption that low-dimensional sketches preserve the necessary geometry for interaction modeling.

pith-pipeline@v0.9.1-grok · 5685 in / 1240 out tokens · 15875 ms · 2026-06-27T22:19:17.686493+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 9 canonical work pages

  1. [1]

    Proceedings of the 34th International Conference on Machine Learning , pages =

    Understanding Black-box Predictions via Influence Functions , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , editor =

  2. [2]

    Attention is All you Need , url =

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

  3. [3]

    2019 , howpublished =

    Language Models Are Unsupervised Multitask Learners , author=. 2019 , howpublished =

  4. [4]

    Representer Point Selection for Explaining Deep Neural Networks , url =

    Yeh, Chih-Kuan and Kim, Joon and Yen, Ian En-Hsu and Ravikumar, Pradeep K , booktitle =. Representer Point Selection for Explaining Deep Neural Networks , url =

  5. [5]

    Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics , pages =

    RelatIF: Identifying Explanatory Training Samples via Relative Influence , author =. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics , pages =. 2020 , editor =

  6. [6]

    Proceedings of the 36th International Conference on Machine Learning , pages =

    Data Shapley: Equitable Valuation of Data for Machine Learning , author =. Proceedings of the 36th International Conference on Machine Learning , pages =. 2019 , editor =

  7. [7]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author =. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , month = oct #. 2018 , address =. doi:10.18653/v1/D18-1260 , pages =

  8. [8]

    CoRR , volume =

    Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord , title =. CoRR , volume =. 2018 , url =. 1803.05457 , timestamp =

  9. [9]

    2019 , address =

    Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan , editor =. Proceedings of the 2019 Conference of the North. 2019 , address =. doi:10.18653/v1/N19-1421 , pages =

  10. [10]

    Proceedings of the AAAI conference on artificial intelligence , volume =

    Piqa: Reasoning about physical commonsense in natural language , author =. Proceedings of the AAAI conference on artificial intelligence , volume =

  11. [11]

    Liu , title =

    Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. Journal of Machine Learning Research , year =

  12. [12]

    Data Shapley in One Training Run , url =

    Wang, Jiachen (Tianhao) and Mittal, Prateek and Song, Dawn and Jia, Ruoxi , booktitle =. Data Shapley in One Training Run , url =

  13. [13]

    Estimating Training Data Influence by Tracing Gradient Descent , url =

    Pruthi, Garima and Liu, Frederick and Kale, Satyen and Sundararajan, Mukund , booktitle =. Estimating Training Data Influence by Tracing Gradient Descent , url =

  14. [14]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Scaling up influence functions , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  15. [15]

    2023 , eprint=

    Studying Large Language Model Generalization with Influence Functions , author=. 2023 , eprint=

  16. [16]

    2023 , eprint=

    TRAK: Attributing Model Behavior at Scale , author=. 2023 , eprint=

  17. [17]

    Proceedings of the 40th International Conference on Machine Learning , pages =

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

  18. [18]

    DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models , url =

    Kwon, Yongchan and Wu, Eric and Wu, Kevin and Zou, James Y , booktitle =. DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models , url =

  19. [19]

    Proceedings of the 41st International Conference on Machine Learning , articleno =

    Xia, Mengzhou and Malladi, Sadhika and Gururangan, Suchin and Arora, Sanjeev and Chen, Danqi , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  20. [20]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =

    Penedo, Guilherme and Kydl\'. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =. Advances in Neural Information Processing Systems , doi =

  21. [21]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only , url =

    Penedo, Guilherme and Malartic, Quentin and Hesslow, Daniel and Cojocaru, Ruxandra and Alobeidli, Hamza and Cappelli, Alessandro and Pannier, Baptiste and Almazrouei, Ebtesam and Launay, Julien , booktitle =. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only , url =

  22. [22]

    and Carmon, Yair and Dave, Achal and Schmidt, Ludwig and Shankar, Vaishaal , booktitle =

    Li, Jeffrey and Fang, Alex and Smyrnis, Georgios and Ivgi, Maor and Jordan, Matt and Gadre, Samir and Bansal, Hritik and Guha, Etash and Keh, Sedrick and Arora, Kushal and Garg, Saurabh and Xin, Rui and Muennighoff, Niklas and Heckel, Reinhard and Mercat, Jean and Chen, Mayee and Gururangan, Suchin and Wortsman, Mitchell and Albalak, Alon and Bitton, Yona...

  23. [23]

    Proceedings of the 39th International Conference on Machine Learning , pages =

    Datamodels: Understanding Predictions with Data and Data with Predictions , author =. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , editor =

  24. [24]

    2009 , issue_date =

    Robertson, Stephen and Zaragoza, Hugo , title =. Foundations and Trends in Information Retrieval , volume =. 2009 , month =. doi:10.1561/1500000019 , url =

  25. [25]

    Findings of the Association for Computational Linguistics: NAACL 2025 , month = apr, year =

    Gu, Yuling and Tafjord, Oyvind and Kuehl, Bailey and Haddad, Dany and Dodge, Jesse and Hajishirzi, Hannaneh , editor =. Findings of the Association for Computational Linguistics: NAACL 2025 , month = apr, year =. doi:10.18653/v1/2025.findings-naacl.282 , pages =

  26. [26]

    2026 , eprint=

    Olmo 3 , author=. 2026 , eprint=

  27. [27]

    What s In My Big Data? , url =

    Elazar, Yanai and Bhagia, Akshita and Magnusson, Ian and Ravichander, Abhilasha and Schwenk, Dustin and Suhr, Alane and Walsh, Pete and Groeneveld, Dirk and Soldaini, Luca and Singh, Sameer and Hajishirzi, Hannaneh and Smith, Noah and Dodge, Jesse , booktitle =. What s In My Big Data? , url =

  28. [28]

    Scalable Influence and Fact Tracing for Large Language Model Pretraining , url =

    Chang, Tyler and Rajagopal, Dheeraj and Bolukbasi, Tolga and Dixon, Lucas and Tenney, Ian , booktitle =. Scalable Influence and Fact Tracing for Large Language Model Pretraining , url =

  29. [29]

    Liu, and Matt Gardner

    Crowdsourcing Multiple Choice Science Questions , author =. Proceedings of the 3rd Workshop on Noisy User-generated Text , month = sep, year =. doi:10.18653/v1/W17-4413 , pages =

  30. [30]

    Mathematical Programming , volume =

    An Analysis of Approximations for Maximizing Submodular Set Functions---I , author =. Mathematical Programming , volume =. 1978 , doi =

  31. [31]

    Borgwardt and Malte J

    Arthur Gretton and Karsten M. Borgwardt and Malte J. Rasch and Bernhard Sch. A Kernel Two-Sample Test , journal =. 2012 , volume =

  32. [32]

    Random Structures & Algorithms , volume =

    Dasgupta, Sanjoy and Gupta, Anupam , title =. Random Structures & Algorithms , volume =. doi:https://doi.org/10.1002/rsa.10073 , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1002/rsa.10073 , year =

  33. [33]

    Journal of Computer and System Sciences , volume =

    Database-friendly random projections: Johnson-Lindenstrauss with binary coins , author =. Journal of Computer and System Sciences , volume =. 2003 , note =. doi:https://doi.org/10.1016/S0022-0000(03)00025-4 , url =

  34. [34]

    Optimizing Neural Networks with

    Martens, James and Grosse, Roger , booktitle =. Optimizing Neural Networks with. 2015 , url =

  35. [35]

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages =

    Deep Residual Learning for Image Recognition , author =. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages =. 2016 , doi =

  36. [36]

    2009 , doi =

    Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li , booktitle =. 2009 , doi =

  37. [37]

    2009 , url =

    Learning Multiple Layers of Features from Tiny Images , author =. 2009 , url =

  38. [38]

    Journal of the American Statistical Association , volume =

    Probability Inequalities for Sums of Bounded Random Variables , author =. Journal of the American Statistical Association , volume =. 1963 , doi =

  39. [39]

    2025 , eprint =

    Organize the Web: Constructing Domains Enhances Pre-Training Data Curation , author =. 2025 , eprint =. doi:10.48550/arXiv.2502.10341 , url =