pith. machine review for the scientific record.

arxiv: 2604.16197 · v1 · submitted 2026-04-17 · 💻 cs.LG

Recognition: unknown

Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:43 UTC · model grok-4.3

classification 💻 cs.LG
keywords data attribution · influence estimation · large language models · data valuation · sketching · output layer · CountSketch · training data selection

The pith

RISE sketches dual-channel signals at the output layer to scale data attribution in LLMs up to 32 billion parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RISE to attribute model predictions to specific training examples and to score the prospective value of new data points for large language models. It does so by focusing computation on the output layer, where influence signals are assumed to concentrate, and representing that layer's gradient in two channels before applying CountSketch compression. This avoids the full-model gradient storage and indexing that makes prior methods infeasible beyond a few billion parameters. A sympathetic reader would care because it turns data influence from an opaque property into something computable at the scale of current frontier models. If the method works as described, practitioners gain a practical way to audit training sets, detect problematic data, and select better subsets for continued pretraining.

Core claim

RISE computes influence by forming a dual-channel readout at the output layer consisting of a lexical residual channel (RH) and a semantic projected-error channel (GH), then applies CountSketch projections to both channels to produce a compact index that preserves attribution accuracy while using far less memory than full-gradient baselines.
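The paper ships no code here; the following is a minimal sketch of how such a dual-channel, CountSketch-compressed readout could be assembled. All names and the exact channel definitions are illustrative assumptions (the paper's RH/GH construction and its interaction features may differ); only the overall structure mirrors the claim above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; real LM heads are vocab ~50k-100k by hidden ~2k-8k.
VOCAB, HIDDEN, SKETCH_DIM = 1000, 64, 128

# Frozen LM-head weight (assumption: the "readout" is the final linear layer).
W_head = rng.normal(size=(VOCAB, HIDDEN)) / np.sqrt(HIDDEN)

def make_countsketch(dim_in, dim_out):
    """Data-independent CountSketch: one bucket hash and one sign hash per coordinate."""
    buckets = rng.integers(0, dim_out, size=dim_in)
    signs = rng.choice([-1.0, 1.0], size=dim_in)
    def apply(x):
        sk = np.zeros(dim_out)
        np.add.at(sk, buckets, signs * x)  # sk[buckets[i]] += signs[i] * x[i]
        return sk
    return apply

sketch_r = make_countsketch(VOCAB, SKETCH_DIM)   # lexical residual channel (RH)
sketch_g = make_countsketch(HIDDEN, SKETCH_DIM)  # semantic projected-error channel (GH)

def readout_signature(hidden_state, target_token, temperature=1.0):
    """Compact per-token signature from two output-layer channels (illustrative):
    RH: r = softmax(W_head @ h / T) - onehot(target)   (vocab-sized residual)
    GH: g = W_head.T @ r                               (error projected into hidden space)
    The paper forms interaction features from the sketched channels; here the two
    sketches are simply concatenated."""
    z = (W_head @ hidden_state) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    r = probs.copy()
    r[target_token] -= 1.0
    g = W_head.T @ r
    return np.concatenate([sketch_r(r), sketch_g(g)])

# Influence-style score between a stored training signature and a query signature:
# an inner product in 2 * SKETCH_DIM dimensions, independent of model size.
train_sig = readout_signature(rng.normal(size=HIDDEN), target_token=3)
query_sig = readout_signature(rng.normal(size=HIDDEN), target_token=3)
print(float(train_sig @ query_sig))
```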

What carries the argument

Dual-channel sketched readout at the output layer (RH lexical residual channel combined with GH semantic projected-error channel) under CountSketch projections.

Load-bearing premise

Influence signals concentrate at the output layer and the gradient there admits a decomposed outer-product form that sketching can preserve without material loss of attribution accuracy.
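Under a standard softmax cross-entropy readout, that outer-product form is just the rank-one structure of the final-layer gradient; a minimal statement of it (the paper's Equations 3-5 presumably formalize the exact RH/GH split, which may differ in detail):

```latex
% Logits z = W_head h for hidden state h; cross-entropy loss \ell on target token y.
\[
\frac{\partial \ell}{\partial W_{\mathrm{head}}}
  = \underbrace{\bigl(\operatorname{softmax}(z) - e_{y}\bigr)}_{r\;\text{(lexical residual, RH)}} \, h^{\top},
\qquad
\frac{\partial \ell}{\partial h}
  = W_{\mathrm{head}}^{\top} r \quad \text{(projected error, GH)}.
\]
% A rank-one gradient r h^T can be indexed through its two factors (or their sketches)
% rather than the full |V| x d matrix, which is what makes per-example storage cheap.
```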

What would settle it

On any model size where both RISE and a full-gradient method such as RapidIn fit in memory, the top-ranked influential training examples retrieved for the same prediction differ substantially between the two approaches.
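A minimal version of that check, assuming both methods can emit a ranked list of training-example ids for the same test prediction (the function, the toy ids, and any agreement threshold are illustrative, not from the paper):

```python
def topk_overlap(rank_a, rank_b, k=100):
    """Fraction of shared training examples among the top-k of two influence rankings."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k

# rank_rise and rank_full would each list training-example ids sorted by descending
# influence for the same test prediction (RISE vs. a full-gradient method such as RapidIn).
rank_rise = [17, 3, 42, 8, 91]
rank_full = [3, 17, 8, 42, 55]
print(topk_overlap(rank_rise, rank_full, k=5))  # 0.8 on this toy pair
# Consistently low overlap across many queries would undercut the output-layer premise;
# consistently high overlap would support it.
```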

Figures

Figures reproduced from arXiv: 2604.16197 by Chuan Li, Denghui Zhang, Jianwen Xie, Minghui Wang, Wenjin Zheng, Yide Ran, Zhaozhuo Xu.

Figure 1. RISE: Readout Influence Sketching Estimator. Building on this insight, we introduce RISE (Readout Influence Sketching Estimator), which enables efficient forward-only influence estimation through CountSketch compression of the outer-product structure. RISE exploits this structure through a dual-channel formulation: a lexical residual channel (RH) for token-level precision and a semantic projected-error ch…
Figure 2. Per-layer gradient energy on C4. The LM head forms a sharp peak that strengthens with …
Figure 3. Sparse active tokens in LM-head residuals. At fixed τ = 1.0, Top-K+GT preserves RH residual energy (left) and yields sparse GH directions close to dense GH (right). Additional fixed-control diagnostics are in Appendix F.3. … individually via CountSketch [9, 51] before forming interaction features. Unlike PCA, CountSketch is data-independent (determined solely by bucket hashes ηr, ηh, ηg and sign hashes sr, s…
Figure 4. Per-layer gradient-energy profiles are largely preserved after fine-tuning. For three tasks (BrainRot, Howdy, Finance–Medical) and three models (OLMo-3-7B, Pythia-2.8B, Pythia-6.9B), we plot the fraction of total gradient energy attributed to each layer (Embed → LM Head) for the pretrained checkpoint (blue) and the fine-tuned checkpoint (orange). The curves nearly overlap; numbers in legends denote the LM-…
Figure 5. Final-layer discriminativeness across model scales. Per-layer pairwise cosine variance for four models on C4. A clear U-shape emerges: variance drops in middle layers due to hidden collapse and recovers near the final layer.
Figure 6. Hidden-state discriminativeness follows a stable U-shape before and after fine-tuning. For each layer, we compute the variance of pairwise cosine similarities among token hidden states (higher variance = more discriminative representations). Across three fine-tuning tasks (BrainRot, Howdy Backdoor, Medical–Finance) and three models (Pythia-2.8B, Pythia-6.9B, OLMo-3-7B), pretrained (blue) and fine-tuned (oran…
Figure 7. Full hidden-layer retrieval sweep on Howdy. We evaluate hidden-state cosine retrieval at every layer for Pythia-1B and Pythia-2.8B, and compare against RISE at Top-100 and Top-200. The strongest hidden-only layer is layer 1 in both models, not the final layer. However, even this best hidden-layer baseline remains below RISE, indicating that RISE's gains are not merely due to selecting a stronger hidden rep…
Figure 8. Additional fixed-control sparse active-token diagnostics. Top: at fixed τ = 1.0 and K = 128, RH residual-tail retention remains stable across tasks, model scales, and pretrained/fine-tuned checkpoints. Bottom: with fixed K = 128, the GH temperature sweep changes only residual weights, showing that τ = 1.0 gives the strongest average fidelity under a constant candidate count.
Figure 9. Sparse GH fidelity across BrainRot model runs. This visualizes the by-run stability of sparse GH against dense GH, complementing …
read the original abstract

Data attribution and valuation are critical for understanding data-model synergy for Large Language Models (LLMs), yet existing gradient-based methods suffer from scalability challenges on LLMs. Inspired by human cognition, where decision making relies on a focused readout of relevant memories rather than replaying all pathways, we introduce RISE (Readout Influence Sketching Estimator). Instead of computing and indexing gradients across the entire LLM, RISE focuses on influence hotspots at the output layer, where influence signals concentrate, and the gradient admits a decomposed outer-product form. This enables a dual-channel representation combining a lexical residual channel (RH) and a semantic projected-error channel (GH). Applying CountSketch projections to these channels achieves strong compression while maintaining accurate attribution. Across the OLMo (1B-32B) and Pythia (14M-6.9B) families, RISE reduces index storage by up to 112$\times$ compared to RapidIn and scales to 32B parameters LLM, where gradient-based baselines such as RapidIn and ZO-Inf become memory-infeasible. We evaluate RISE on two paradigms: (1) retrospective attribution, retrieving influential training examples for specific predictions, and (2) prospective valuation, scoring candidate data utility zero-shot. We validate RISE on three tasks: Howdy backdoor data detection, Finance-Medical domain separation, and Brain Rot high-quality data selection. In a closed-loop Brain Rot study, continued pretraining on RISE-selected data yields consistent downstream improvements. Overall, RISE provides a practical and scalable primitive for influence analysis and training-data selection in modern large language models.
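The two evaluation paradigms named in the abstract can share one sketched index; a minimal sketch of how that might look (matrix names and the aggregate-target scoring rule are assumptions for illustration, not the paper's exact estimator):

```python
import numpy as np

def retrospective_attribution(query_sig, train_sigs, k=5):
    """Retrieve the k training examples whose stored signatures best align with a query."""
    scores = train_sigs @ query_sig          # (n_train,)
    top = np.argsort(-scores)[:k]
    return top, scores[top]

def prospective_valuation(candidate_sigs, target_sigs):
    """Score candidate data zero-shot by alignment with an aggregate target signature."""
    target_direction = target_sigs.mean(axis=0)   # e.g. averaged signatures of a validation set
    return candidate_sigs @ target_direction      # higher = more valuable for the target task

# Rows are examples, columns are sketched-channel coordinates, built once and reused.
rng = np.random.default_rng(1)
sigs = rng.normal(size=(100, 256))
print(retrospective_attribution(sigs[0], sigs, k=3)[0])     # the query itself ranks first
print(prospective_valuation(sigs[:10], sigs[90:]).shape)    # (10,) utility scores
```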

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces RISE (Readout Influence Sketching Estimator), a method for scalable data attribution and valuation in LLMs. Instead of full-model gradients, it focuses on influence signals at the output layer, which are claimed to concentrate there and admit an exact decomposed outer-product form into a lexical residual channel (RH) and semantic projected-error channel (GH). CountSketch projections on these dual channels enable high compression (up to 112× storage reduction vs. RapidIn) while preserving attribution accuracy. The approach is evaluated on retrospective attribution and prospective valuation tasks across OLMo (1B-32B) and Pythia models, including backdoor detection, domain separation, and high-quality data selection, with a closed-loop pretraining experiment showing downstream gains.

Significance. If the core assumptions on output-layer concentration and exact gradient decomposition hold with high fidelity, RISE would offer a practical, memory-efficient primitive for influence analysis on models up to 32B parameters where full-gradient baselines become infeasible. This could meaningfully advance data-centric understanding and curation for LLMs, particularly for tasks like backdoor mitigation and training-data selection. The empirical validation on multiple model families and tasks strengthens the case for practicality, though the absence of detailed ablations limits immediate adoption.

major comments (3)
  1. [Method description and abstract] The central premise that influence signals concentrate sufficiently at the output layer to ignore earlier layers (and that the gradient admits an exact outer-product decomposition into RH + GH) is load-bearing for the entire sketching approach and the 112× compression claim. No section provides a layer-wise ablation or theoretical bound quantifying the attribution error introduced by this restriction; the abstract and method description assert it without direct measurement against full-model gradients on the evaluated models.
  2. [Experiments section (tasks 1-3 and closed-loop study)] Reported results on storage reduction, backdoor detection, domain separation, and Brain Rot data selection lack error bars, standard deviations across runs, or ablations on sketch dimension and projection parameters. This makes it impossible to assess whether the claimed downstream improvements in the closed-loop pretraining study are statistically reliable or sensitive to implementation choices.
  3. [Scalability experiments on OLMo/Pythia families] The comparison to RapidIn and ZO-Inf on 32B-scale models asserts memory infeasibility for baselines, but without explicit memory-footprint tables or scaling curves (including peak GPU memory during indexing), the 112× storage reduction and scalability claims cannot be fully verified as general rather than architecture-specific.
minor comments (3)
  1. [Method] Notation for the dual-channel representations (RH and GH) and CountSketch operators is introduced without a clear equation defining the projection matrices or the exact form of the decomposed gradient; a dedicated equation block would improve reproducibility.
  2. [Figures and tables] Figure captions for the experimental results do not specify the number of trials, random seeds, or exact hyperparameter settings for the sketching (e.g., sketch size relative to model dimension), reducing clarity.
  3. [Related work] The paper cites prior influence methods (RapidIn, ZO-Inf) but does not discuss how RISE relates to other sketching or low-rank approximation techniques in the broader influence-estimation literature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of RISE's potential impact. The major comments identify important areas for clarification and strengthening, which we address point-by-point below with proposed revisions.

read point-by-point responses
  1. Referee: [Method description and abstract] The central premise that influence signals concentrate sufficiently at the output layer to ignore earlier layers (and that the gradient admits an exact outer-product decomposition into RH + GH) is load-bearing for the entire sketching approach and the 112× compression claim. No section provides a layer-wise ablation or theoretical bound quantifying the attribution error introduced by this restriction; the abstract and method description assert it without direct measurement against full-model gradients on the evaluated models.

    Authors: The outer-product decomposition of output-layer gradients into the lexical residual channel (RH) and semantic projected-error channel (GH) is mathematically exact, following directly from the chain rule on the final-layer cross-entropy loss (see Equations 3-5). We agree that a layer-wise ablation would strengthen the concentration claim. On smaller models (Pythia 14M-1B) where full gradients are feasible, we observe >85% top-k overlap between output-layer and full-model attributions; we will add this as a new ablation subsection with quantitative error metrics. A general theoretical bound on approximation error is difficult without strong assumptions on the Hessian that do not hold for LLMs, but we will add a limitations paragraph discussing this. revision: yes

  2. Referee: [Experiments section (tasks 1-3 and closed-loop study)] Reported results on storage reduction, backdoor detection, domain separation, and Brain Rot data selection lack error bars, standard deviations across runs, or ablations on sketch dimension and projection parameters. This makes it impossible to assess whether the claimed downstream improvements in the closed-loop pretraining study are statistically reliable or sensitive to implementation choices.

    Authors: We agree that statistical rigor and parameter sensitivity analysis are needed. In the revision we will report means and standard deviations over five independent runs (different seeds) for all metrics, including the closed-loop pretraining gains. We will also add an ablation subsection varying sketch dimension (512-4096) and projection parameters, demonstrating that attribution fidelity plateaus while compression remains high. This will confirm the reliability of the reported improvements. revision: yes

  3. Referee: [Scalability experiments on OLMo/Pythia families] The comparison to RapidIn and ZO-Inf on 32B-scale models asserts memory infeasibility for baselines, but without explicit memory-footprint tables or scaling curves (including peak GPU memory during indexing), the 112× storage reduction and scalability claims cannot be fully verified as general rather than architecture-specific.

    Authors: We will add an explicit memory-footprint table for all methods on models up to 6.9B, reporting both index storage and peak GPU memory during construction. For the 32B case we will include a scaling plot of memory versus parameter count (extrapolated from measured smaller-model data) showing that RapidIn exceeds 1 TB while RISE remains under 10 GB. This substantiates the 112× claim across architectures. revision: partial
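Back-of-the-envelope arithmetic for why the index sizes diverge so sharply (the dimensions below are assumed for illustration and are not the paper's measured numbers, which report up to 112× smaller indexes than RapidIn):

```python
def index_bytes_per_example(vocab, hidden, sketch_dim, bytes_per_float=2):
    """Per-example storage: a dense LM-head gradient is a vocab x hidden matrix,
    its rank-1 factors are one vocab- and one hidden-sized vector, and a RISE-style
    entry keeps only sketched channels of fixed width."""
    dense_head_grad = vocab * hidden * bytes_per_float
    rank1_factors = (vocab + hidden) * bytes_per_float
    sketched = 2 * sketch_dim * bytes_per_float
    return dense_head_grad, rank1_factors, sketched

# Assumed 32B-class dimensions: vocab ~100k, hidden ~8k, sketch width 4096, fp16.
dense, factors, sk = index_bytes_per_example(100_000, 8_192, 4_096)
print(f"dense head gradient ~{dense / 2**20:.0f} MiB, "
      f"rank-1 factors ~{factors / 2**10:.0f} KiB, "
      f"sketched ~{sk / 2**10:.0f} KiB per example")
```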

Circularity Check

0 steps flagged

No circularity: RISE derivation rests on explicit architectural assumptions and empirical validation, not self-referential reductions.

full rationale

The paper states that influence signals concentrate at the output layer and that the gradient admits a decomposed outer-product form (lexical residual RH plus semantic projected-error GH), then applies CountSketch to produce the dual-channel estimator. These premises are introduced as properties of transformer LLMs rather than derived from RISE itself or from any self-citation chain. No equation equates the sketched estimator to a fitted parameter, renames an input quantity as a prediction, or imports a uniqueness result from prior author work. Storage-reduction and scalability claims are presented as outcomes of experiments on OLMo and Pythia families, not as tautological consequences of the method definition. The derivation chain therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on two domain assumptions about gradient structure in LLMs; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption Influence signals concentrate at the output layer of LLMs
    Basis for focusing computation there instead of full-model gradients.
  • domain assumption Gradient at output layer admits a decomposed outer-product form
    Enables the dual-channel (RH + GH) representation before sketching.

pith-pipeline@v0.9.0 · 5617 in / 1250 out tokens · 41879 ms · 2026-05-10T08:43:42.628839+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 39 canonical work pages · 10 internal anchors

  1. [1]

    Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S. Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication, 2023. URL https://arxiv.org/abs/2303.09540

  2. [2]

    Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003. ISSN 0022-0000. doi: 10.1016/S0022-0000(03)00025-4. URL http://www.sciencedirect.com/science/article/pii/S0022000003000254. Special Issue on PODS 2001

  3. [3]

    Garrett E Alexander, Mahlon R DeLong, and Peter L Strick. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annual Review of Neuroscience, 9(1):357–381, 1986

  4. [4]

    Gary Aston-Jones and Jonathan D Cohen. An integrative theory of locus coeruleus-norepinephrine function: adaptive gain and optimal performance. Annu. Rev. Neurosci., 28(1):403–450, 2005

  5. [5]

    Gaurang Bharti. finance-alpaca (revision 51d16b6), 2024. URL https://huggingface.co/datasets/gbharti/finance-alpaca

  6. [6]

    Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373

  7. [7]

    Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning, 2018. URL https://arxiv.org/abs/1606.04838

  8. [8]

    Tyler A. Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, and Ian Tenney. Scalable influence and fact tracing for large language model pretraining, 2024. URL https://arxiv.org/abs/2410.17413

  9. [9]

    Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3–15, 2004. doi: 10.1016/S0304-3975(03)00400-6. URL https://doi.org/10.1016/S0304-3975(03)00400-6

  10. [10]

    Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, Jeff Schneider, Eduard Hovy, Roger Grosse, and Eric Xing. What is your data worth to GPT? LLM-scale data valuation with influence functions, 2024. URL https://arxiv.org/abs/2405.13954

  11. [11]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018. URL https://arxiv.org/abs/1803.05457

  12. [12]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805

  13. [13]

    Angela D Friederici. The brain basis of language processing: from structure to function. Physiological Reviews, 91(4):1357–1392, 2011

  14. [14]

    Amirata Ghorbani and James Zou. Data Shapley: Equitable valuation of data for machine learning, 2019. URL https://arxiv.org/abs/1904.02868

  15. [15]

    Jacqueline Gottlieb and Pierre-Yves Oudeyer. Towards a neuroscience of active sampling and curiosity. Nature Reviews Neuroscience, 19(12):758–770, 2018

  16. [16]

    Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, and Samuel R. Bowman. Studying large language model generalization with influence functions, 2023. URL https://arxiv.or...

  17. [17]

    Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R. Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, and Zhaozhuo Xu. Zeroth-order fine-tuning of LLMs with extreme sparsity, 2024. URL https://arxiv.org/abs/2406.02913

  18. [18]

    Zayd Hammoudeh and Daniel Lowd. Training data influence analysis and estimation: a survey. Machine Learning, 113(5):2351–2403, March 2024. ISSN 1573-0565. doi: 10.1007/s10994-023-06495-7. URL http://dx.doi.org/10.1007/s10994-023-06495-7

  19. [19]

    Steve Hong, Runa Eschenhagen, Bruno Mlodozeniec, and Richard Turner. Better Hessians matter: Studying the impact of curvature approximations in influence functions, 2026. URL https://arxiv.org/abs/2509.23437

  20. [20]

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models?, 2024. URL https://arxiv.org/abs/2404.06654

  21. [21]

    Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. Towards efficient data valuation based on the Shapley value. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 8...

  22. [22]

    Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions, 2017. URL https://arxiv.org/abs/1703.04730

  23. [23]

    URL https://arxiv.org/abs/1703.04730

  24. [24]

    Narine Kokhlikyan, Kamalika Chaudhuri, and Saeed Mahloujifar. Z0-Inf: Zeroth order approximation for data influence, 2025. URL https://arxiv.org/abs/2510.11832

  25. [25]

    Nils Kolling, Timothy EJ Behrens, Rogier B Mars, and Matthew FS Rushworth. Neural mechanisms of foraging. Science, 336(6077):95–98, 2012

  26. [26]

    Yongchan Kwon and James Zou. Beta Shapley: a unified and noise-reduced data valuation framework for machine learning. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 8780–8802. PML...

  27. [27]

    Lavita AI. lavita/medical-qa-datasets. Hugging Face Datasets, November 2023. URL https://huggingface.co/datasets/lavita/medical-qa-datasets. Version: main (commit 59d48e2). Accessed: 2026-01-27

  28. [28]

    Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, and Iftekhar Naim. Gecko: Versatile text embeddings distilled ...

  29. [29]

    Ping Li and Xiaoyun Li. OPORP: One permutation + one random projection, 2023. URL https://arxiv.org/abs/2302.03505

  30. [30]

    Huawei Lin, Jikai Long, Zhaozhuo Xu, and Weijie Zhao. Token-wise influential training data retrieval for large language models, 2024. URL https://arxiv.org/abs/2405.11724

  31. [31]

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101

  32. [32]

    Syed Hasan Amin Mahmood, Ming Yin, and Rajiv Khanna. On the support vector effect in DNNs: Rethinking data selection and attribution. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, KDD '25, pages 1020–1031, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400712456. doi: 10.1145/3690624.370...

  33. [33]

    Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes, 2024. URL https://arxiv.org/abs/2305.17333

  34. [34]

    Marcelo G Mattar and Nathaniel D Daw. Prioritized memory access explains planning and hippocampal replay. Nature Neuroscience, 21(11):1609–1617, 2018

  35. [35]

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, Willia...

  36. [36]

    Team OLMo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, S...

  37. [37]

    Camillo Padoa-Schioppa and John A Assad. Neurons in the orbitofrontal cortex encode economic value. Nature, 441(7090):223–226, 2006

  38. [38]

    Yanzhou Pan, Huawei Lin, Yide Ran, Jiamin Chen, Xiaodong Yu, Weijie Zhao, Denghui Zhang, and Zhaozhuo Xu. Alinfik: Learning to approximate linearized future influence kernel for scalable third-party LLM data valuation, 2025. URL https://arxiv.org/abs/2503.01052

  39. [39]

    Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. TRAK: Attributing model behavior at scale, 2023. URL https://arxiv.org/abs/2303.14186

  40. [40]

    Garima Pruthi, Frederick Liu, Mukund Sundararajan, and Satyen Kale. Estimating training data influence by tracing gradient descent, 2020. URL https://arxiv.org/abs/2002.08484

  41. [41]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URL https://arxiv.org/abs/1910.10683

  42. [42]

    Yide Ran, Wentao Guo, Jingwei Sun, Yanzhou Pan, Xiaodong Yu, Hao Wang, Jianwen Xie, Yiran Chen, Denghui Zhang, and Zhaozhuo Xu. Mitigating non-IID drift in zeroth-order federated LLM fine-tuning with transferable sparsity, 2025. URL https://arxiv.org/abs/2506.03337

  43. [43]

    Antonio Rangel, Colin Camerer, and P Read Montague. A framework for studying the neurobiology of value-based decision making. Nature Reviews Neuroscience, 9(7):545–556, 2008

  44. [44]

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks, 2019. URL https://arxiv.org/abs/1908.10084

  45. [45]

    Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Donna K. Harman, editor, Proceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994, volume 500-225 of NIST Special Publication, pages 109–126. National Institute of Standards and Technology (...

  46. [46]

    Matthew FS Rushworth, MaryAnn P Noonan, Erie D Boorman, Mark E Walton, and Timothy E Behrens. Frontal cortex and reward-guided learning and decision-making. Neuron, 70(6):1054–1069, 2011

  47. [47]

    Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh, Srikumar Ramalingam, Gui Citovsky, and Sanjiv Kumar. Analyzing similarity metrics for data selection for language model pretraining, 2025.

  48. [48]

    URL https://arxiv.org/abs/2502.02494

  49. [49]

    Amitai Shenhav, Matthew M Botvinick, and Jonathan D Cohen. The expected value of control: an integrative theory of anterior cingulate cortex function. Neuron, 79(2):217–240, 2013

  50. [50]

    Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari S. Morcos. D4: Improving LLM pretraining via document de-duplication and diversification, 2023. URL https://arxiv.org/abs/2308.12284

  51. [51]

    R. v. Mises. On the asymptotic distribution of differentiable statistical functions. The Annals of Mathematical Statistics, 18(3):309–348, 09 1947. ISSN 0003-4851. doi: 10.1214/aoms/1177730385. URL https://cir.nii.ac.jp/crid/1363670318640582912

  52. [52]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  53. [53]

    Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford, and Alex Smola. Feature hashing for large scale multitask learning, 2010. URL https://arxiv.org/abs/0902.2206

  54. [54]

    Shuo Xing, Junyuan Hong, Yifan Wang, Runjin Chen, Zhenyu Zhang, Ananth Grama, Zhengzhong Tu, and Zhangyang Wang. LLMs can get "brain rot"!, 2025. URL https://arxiv.org/abs/2510.13928

  55. [55]

    Chih-Kuan Yeh, Joon Kim, Ian En-Hsu Yen, and Pradeep K Ravikumar. Representer point selection for explaining deep neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/pa...

  56. [56]

    Chih-Kuan Yeh, Ankur Taly, Mukund Sundararajan, Frederick Liu, and Pradeep Ravikumar. First is better than last for language data influence. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 32285–32298. Curran Associates, Inc., 2022. URL https://proceedings.n...

  57. [57]

    Zichun Yu, Spandan Das, and Chenyan Xiong. MATES: Model-aware data selection for efficient pretraining with data influence models, 2024. URL https://arxiv.org/abs/2406.06046

  58. [58]

    Applying Assumption C.2 ($\|x_l\|_2^2 \le C_x$), we have the pointwise bound $G_l \le C_x \|\delta_l\|_2^2$.

  59. [59]

    Taking expectation and applying Assumption C.3: $\mathbb{E}[G_l] \le C_x\,\mathbb{E}\|\delta_l\|_2^2 \le C_x\,\kappa^{2(L-l)}\,\mathbb{E}\|\delta_L\|_2^2$ (Eqs. 8–9). Averaging over all $L$ layers yields Eq. (11): $\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}[G_l] \le \frac{C_x\,\mathbb{E}\|\delta_L\|_2^2}{L}\sum_{k=0}^{L-1}\kappa^{2k} = \frac{C_x\,\mathbb{E}\|\delta_L\|_2^2}{L}\cdot\frac{1-\kappa^{2L}}{1-\kappa^{2}}$ (Eqs. 10–11).

  60. [60]

    Upper bound on final error energy via head energy. The final error signal is $\delta_L = \nabla_h \ell = W_{\mathrm{lm\_head}}^{\top} r$, hence $\mathbb{E}\|\delta_L\|_2^2 \le \|W_{\mathrm{lm\_head}}\|_{\mathrm{op}}^2\,\mathbb{E}\|r\|_2^2$.

  61. [61]

    From Eq. (6) and Assumption C.2 ($\|h\|_2^2 \ge C_h$ with high probability), we have $\|r\|_2^2 = G_{\mathrm{head}}/\|h\|_2^2 \le G_{\mathrm{head}}/C_h$, which implies $\mathbb{E}\|\delta_L\|_2^2 \le \|W_{\mathrm{lm\_head}}\|_{\mathrm{op}}^2\,\mathbb{E}\|r\|_2^2 \le \frac{\|W_{\mathrm{lm\_head}}\|_{\mathrm{op}}^2}{C_h}\,\mathbb{E}[G_{\mathrm{head}}]$ (Eq. 12).

  62. [62]

    Ratio derivation (dilution). Substituting Eq. (12) into Eq. (11) yields $\mathrm{Avg}(\mathbb{E}[G]) \le \frac{C_x}{L}\cdot\frac{1-\kappa^{2L}}{1-\kappa^{2}}\cdot\frac{\|W_{\mathrm{lm\_head}}\|_{\mathrm{op}}^2}{C_h}\,\mathbb{E}[G_{\mathrm{head}}]$ (Eq. 13). Rearranging terms to solve for the ratio completes the proof: $\frac{\mathbb{E}[G_{\mathrm{head}}]}{\mathrm{Avg}(\mathbb{E}[G])} \ge \frac{C_h\,L\,(1-\kappa^{2})}{C_x\,\|W_{\mathrm{lm\_head}}\|_{\mathrm{op}}^2\,(1-\kappa^{2L})}$ (Eq. 14). Interpretation: Eq. (7) provides a structural explanation for the trend in Figure 2. As the model depth ...