pith. sign in

arxiv: 2605.29448 · v1 · pith:TG2D35KInew · submitted 2026-05-28 · 💻 cs.LG · cs.AI· cs.CV· cs.IT· math.IT

How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

Pith reviewed 2026-06-29 08:30 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CVcs.ITmath.IT
keywords submodular functionsVendi Scorematrix spectral functionsdata appraisalneural scaling lawsdeterminantal point processesgreedy optimizationdataset selection
0
0 comments X

The pith

Neural scaling law objectives and the Vendi Score are both submodular, with the Vendi Score as a special case of matrix spectral functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish principled ways to measure the value of data subsets for neural network training without exhaustive model training. It proves that standard scaling-law objectives and the Vendi Score, which uses quantum entropy to capture dataset diversity, are submodular set functions. The Vendi Score is placed inside a wider family called matrix spectral functions that are also submodular and include determinantal point process objectives plus new variants derived from weakly matrix monotone functions. Secular-equation techniques are introduced to make greedy maximization of these functions practical at ImageNet scale by avoiding repeated full eigendecompositions. Experiments then test how well the resulting scores predict held-out test accuracy when subsets are chosen under fixed-size, class-balanced, and fixed-budget constraints.

Core claim

We show both that common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This also includes determinantal (DPP) objectives, as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for m-dimensional embeddings by an O(m) factor relative to or

What carries the argument

Matrix spectral functions, the class of submodular objectives obtained by applying a function to the eigenvalues of a matrix representation of the dataset and used for data appraisal.

If this is right

  • Secular-equation updates make direct greedy optimization of the Vendi Score feasible on ImageNet-1K with an average 35,000x speedup.
  • Among the tested objectives, facility location best predicts held-out test performance under fixed-size, class-balanced, and fixed-budget regimes.
  • The Vendi Score correlates with downstream performance over moderate ranges but becomes a poor proxy once pushed to higher values.
  • Uniformly random fixed-size subsets, both unconstrained and class-balanced, exhibit tight concentration in both appraisal scores and held-out accuracy.
  • Even after controlling for dataset size, class balance, and training budget, held-out performance still varies smoothly from good to bad.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Submodularity supplies approximation guarantees for greedy selection that could be applied directly to dataset curation pipelines.
  • The same matrix spectral construction may extend to data valuation in active learning or self-supervised regimes where kernel matrices are already computed.
  • The observed concentration of random subsets suggests that simple sampling may already achieve near-optimal appraisal scores in many practical settings.
  • Weakly matrix monotone functions offer a route to design new objectives that align more closely with a target loss or architecture.

Load-bearing premise

That maximizing these submodular matrix spectral objectives by greedy selection will reliably identify subsets with better held-out test performance.

What would settle it

An experiment showing that, on multiple standard image datasets, greedy subsets chosen to maximize the Vendi Score or other matrix spectral functions achieve no higher held-out accuracy than class-balanced random sampling of the same size.

Figures

Figures reproduced from arXiv: 2605.29448 by Arnav M. Das, Gantavya Bhatt, Jeff A. Bilmes.

Figure 1
Figure 1. Figure 1: Dataset value is not determined by size or compute alone, even when all datasets are perfectly class balanced. In (a), the ranking induced by dataset value is largely preserved across more than two orders of magnitude of compute budget, suggesting that offline data appraisal is meaningful beyond a single training budget. Size (given as percent of the full dataset) generally helps within a fixed value categ… view at source ↗
Figure 2
Figure 2. Figure 2: Comparing Appraisal Functions The outcomes of several ImageNet-1K runs with identical compute budgets (4M samples seen) and training configurations are shown on several sets that are constrained to be the same size (10%) in (a) and (b), and additionally constrained to be class-balanced in (c) and (d). Key observations: (1) Vendi correlates poorly with test accuracy when evaluated on sets that span a wide r… view at source ↗
Figure 3
Figure 3. Figure 3: Size is a Poor Proxy for Accuracy The outcomes of ImageNet-1K runs with identical compute budgets (4M samples seen) are shown. Dataset size is varied within each subplot to 5%, 10%, and 20% of n = 1.2M. We observe that (1) size is a poor proxy for accuracy since there are several 5% subsets that yield higher test perfor￾mance that 20% subsets and (2) the Vendi score is overly influenced by dataset size. Sp… view at source ↗
Figure 4
Figure 4. Figure 4: A visualization of the half-open intervals Ij =  c j+1 , c j i for j = 1, 2, . . . normalized by setting c = 1. In order to handle the above, we consider the following series of half-open intervals into which d may fall and which covers the range 0 < d ≤ c: . . . ,  c j + 2 , c j + 1 ,  c j + 1 , c j  ,  c j , c j − 1  , . . . ,  c 3 , c 2 i ,  c 2 ,c i (19) We can identify Ij =  c j+1 , c j i , … view at source ↗
Figure 5
Figure 5. Figure 5: 2D description of Householder reflector where an original vector x is reflected to a new vector Hx that is aligned with the coordinate axis e1. There are two choices: (1) for u = x + ∥x∥2e1 (on the left) which reflects x to −αe1; and (2) u = x − ∥x∥2e1 (on the right) which reflects x to αe1. Here, u is an unnormalized version of u¯. In general, for x ∈ Rm we can reflect it to any ±αej for j ∈ [m] by using … view at source ↗
Figure 6
Figure 6. Figure 6: Example plot of the secular function f(µ) with ρ = +1 (on the left) and ρ = −1 (on the right) where the vertical dashed line shows the existing m = 5 eigenvalues and the blue triangles show the updated eigenvalues corresponding to the roots of f(µ). The ρ = +1 case corresponds to a rank-1 update Λ + vvT and the plot shows that the function is monotone increasing (between the vertical asymptotes, and betwee… view at source ↗
Figure 7
Figure 7. Figure 7: Speedup of the secular approach compared to the oracle approach for greedy maximization of the log Vendi score using an n × m design matrix D consisting of random Gaussian values, where m is fixed at m = 1024. Speedup is defined as the time taken by the oracle approach divided by the time taken for the secular approach. We run this for various different ground set sizes (i.e., |V | = n ∈ {100, 250, 500, 10… view at source ↗
Figure 8
Figure 8. Figure 8: Partial trace diagramatically explained: Example H = HA ⊗ HB ⊗ HD ⊗ HC with dim HA = 2, dim HB = 2, dim HD = 3, dim HC = 2, so the matrix is (2 × 2 × 3 × 2) × (2 × 2 × 3 × 2) = 24 × 24. We use indices i for A, j for B, k for D, and ℓ for C, so each row (column) is a four-tuple (i, j, k, ℓ) with i, j, ℓ ∈ {1, 2} and k ∈ {1, 2, 3}. The top row shows four views of ρABDC ∈ B(H) where the rows and columns are m… view at source ↗
Figure 9
Figure 9. Figure 9: 5% Dataset Size. Each run represents the outcome of an ImageNet-1K training run with the same configuration on different 5% subsets. The datasets in (c) and (d) are constrained to be perfectly class-balanced. We observe that (1) sampling sets with high Vendi scores via direct optimization (shown in green) reveals that Vendi is poorly correlated with test acc. (2) random sets (shown in red) are highly conce… view at source ↗
Figure 10
Figure 10. Figure 10: 10% Dataset Size. Each run represents the outcome of an ImageNet-1K training run with the same configuration on different 10% subsets. The datasets in (c) and (d) are constrained to be perfectly class-balanced. We observe that (1) sampling sets with high Vendi scores via direct optimization (shown in green) reveals that Vendi is poorly correlated with test acc. (2) random sets (shown in red) are highly co… view at source ↗
Figure 11
Figure 11. Figure 11: 20% Dataset Size. Each run represents the outcome of an ImageNet-1K training run with the same configuration on different 20% subsets. The datasets in (c) and (d) are constrained to be perfectly class-balanced. We observe that (1) sampling sets with high Vendi scores via direct optimization (shown in green) reveals that Vendi is poorly correlated with test acc. (2) random sets (shown in red) are highly co… view at source ↗
Figure 12
Figure 12. Figure 12: Compute Budget Fixed to 1M Samples Seen. Each run represents the outcome of an ImageNet-1K training run with the compute budget which is fixed to 1M samples seen irrespective of the dataset size. The datasets in (c) and (d) are constrained to be perfectly class-balanced. Critically, there are many instances of runs trained on 5% subsets that outperform ones that were trained on 20% size subsets. Moreover,… view at source ↗
Figure 13
Figure 13. Figure 13: Compute Budget Fixed to 2M Samples Seen. Each run represents the outcome of an ImageNet-1K training run with the compute budget which is fixed to 2M samples seen irrespective of the dataset size. The datasets in (c) and (d) are constrained to be perfectly class-balanced. Critically, there are many instances of runs trained on 5% subsets that outperform ones that were trained on 20% size subsets. Moreover,… view at source ↗
Figure 14
Figure 14. Figure 14: Compute Budget Fixed to 4M Samples Seen. Each run represents the outcome of an ImageNet-1K training run with the compute budget which is fixed to 4M samples seen irrespective of the dataset size. The datasets in (c) and (d) are constrained to be perfectly class-balanced. Critically, there are many instances of runs trained on 5% subsets that outperform ones that were trained on 20% size subsets. Moreover,… view at source ↗
Figure 15
Figure 15. Figure 15: 5% Dataset Size Each run represents the outcome of an ImageNet-1K training runs spanning 3 different compute budgets on 5% subsets. The subsets in (c) and (d) are all perfectly class-balanced. We observe that there are many instances where a run with a low compute budget outperforms ones with a high compute budgets, suggesting that compute alone is a poor proxy for test accuracy. 210 220 230 240 250 260 2… view at source ↗
Figure 16
Figure 16. Figure 16: 10% Dataset Size Each run represents the outcome of an ImageNet-1K training runs spanning 3 different compute budgets on 10% subsets. The subsets in (c) and (d) are all perfectly class-balanced. We observe that there are many instances where a run with a low compute budget outperforms ones with a high compute budgets, suggesting that compute alone is a poor proxy for test accuracy. K.6 Experiments with Ad… view at source ↗
Figure 17
Figure 17. Figure 17: 20% Dataset Size Each run represents the outcome of an ImageNet-1K training runs spanning 3 different compute budgets on 20% subsets. The subsets in (c) and (d) are all perfectly class-balanced. We observe that there are many instances where a run with a low compute budget outperforms ones with a high compute budgets, suggesting that compute alone is a poor proxy for test accuracy. • Facility Location (no… view at source ↗
Figure 18
Figure 18. Figure 18: Comparison of All Appraisal Functions on ImageNet, 10% subsets, 4M samples seen. Each run represents the outcome of an ImageNet-1K training run. Facility Location is the only function strongly correlated with accuracy (r = 0.917). Vendi (Normalized) is essentially uncorrelated, and Vendi (Unnormalized) is negatively correlated: subsets with higher Vendi-Unnorm scores yield lower test accuracy. The weakly … view at source ↗
Figure 19
Figure 19. Figure 19: Comparison of All Appraisal Functions on ImageNet, 10% class-balanced subsets, 4M samples seen. Each run represents the outcome of an ImageNet-1K training run. Same setup as [PITH_FULL_IMAGE:figures/full_fig_p065_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Accordion. Images in the high-value (high facility location score) subset (above left 8 × 8 grid, and green X in [PITH_FULL_IMAGE:figures/full_fig_p066_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Maracas. Images in the high-value (high facility location score) subset (above left 8 × 8 grid, and green X in [PITH_FULL_IMAGE:figures/full_fig_p067_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Stretcher. Images in the high-value (high facility location score) subset (above left 8 × 8 grid, and green X in [PITH_FULL_IMAGE:figures/full_fig_p068_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Comparison of All Appraisal Functions, Airbnb Dataset, Budget = 356. Each panel plots a function’s score (x-axis) against the number of distinct rooms covered (y-axis); points are colored by sampling strategy, and the Pearson correlation r is shown in the top-left. Facility Location is the strongest correlate to coverage (r = 0.992); ϕ3, ϕ1, and ξ3 sit at r ∈ [0.86, 0.87], slightly above DPP and Vendi at … view at source ↗
Figure 24
Figure 24. Figure 24: Comparison of All Appraisal Functions, Airbnb Dataset, Budget = 600. Same setup as [PITH_FULL_IMAGE:figures/full_fig_p070_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Comparison of All Appraisal Functions, 20 Newsgroups Dataset, Budget = 20. Each panel plots a function’s score (x-axis) against the number of distinct newsgroups covered (y-axis); points are colored by sampling strategy and the Pearson correlation r is shown in the top-left. Facility Location is the strongest correlate of coverage (r = 0.951); ξ3, ϕ1, DPP, and Vendi sit at r ∈ [0.83, 0.85]. The function ϕ… view at source ↗
Figure 26
Figure 26. Figure 26: Comparison of All Appraisal Functions, 20 Newsgroups Dataset, Budget = 40. Same setup as [PITH_FULL_IMAGE:figures/full_fig_p072_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: The results (top row) are the same as those presented in [PITH_FULL_IMAGE:figures/full_fig_p073_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: This shows results similar to [PITH_FULL_IMAGE:figures/full_fig_p074_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: On the top row, a side-by-side easy comparison of unconstrained (on the left) vs. class balanced (on the right), showing the effect of a forced balanced constraint. The key difference is that the poor blue sets in (b) are allowed to get worse since they are not forced to be class balanced. Recall that “unconstrained” here means that the sets are only constrained by size but are not constrained to be class… view at source ↗
read the original abstract

Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show both that common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This also includes determinantal (DPP) objectives, as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This yields an average empirical speedup of about 35,000x, making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. Thus enabled, we compare how well several objectives predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed training-budget regimes, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants. Across multiple datasets, facility location performs the best. Direct optimization also reveals that, while the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor downstream performance proxy. We also find that uniformly at random fixed-size subsets, both unconstrained and class-balanced, are remarkably concentrated in both appraisal scores and held-out performance. Finally, we show that size, class balance, and training budget do not alone determine data value: even when controlling for these factors, performance ranges smoothly from good to bad.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that common neural scaling-law objectives and the Vendi Score are submodular, that the Vendi Score belongs to the broader class of matrix spectral functions (extended via newly introduced weakly matrix monotone functions), develops secular-equation updates that reduce the cost of marginal-gain computations by an O(m) factor, and empirically compares several such objectives (including facility location, DPPs, and new matrix-spectral variants) for their ability to select training subsets that predict held-out performance under fixed-size, class-balanced, and fixed-budget regimes on multiple datasets, finding facility location strongest while noting that Vendi Score degrades at high values and that random subsets are concentrated.

Significance. If the derivations and empirical rankings hold, the work supplies a unifying submodular framework that connects scaling laws with diversity measures and supplies practical, scalable algorithms for data appraisal. The reported 35,000x average speedup and the finding that size/balance/budget do not solely determine value are concrete contributions that could affect data-selection practice.

major comments (2)
  1. [empirical comparisons section] Empirical comparisons section (fixed-size, class-balanced, and fixed training-budget regimes): the ranking that facility location performs best rests on point estimates of held-out performance; without reported standard deviations across random seeds, confidence intervals, or statistical tests, it is unclear whether the observed differences are reliable or could be explained by sampling variability.
  2. [weakly matrix monotone functions section] § on weakly matrix monotone functions and weak submodularity: the central extension from matrix monotonicity to weak submodularity for the family of matrix spectral functions is load-bearing for the claim of a 'broad family of practical objectives'; the manuscript should explicitly state the precise inequality that defines 'weakly' and verify it holds for the concrete functions (Vendi, DPP) used later.
minor comments (3)
  1. [abstract] Abstract: the phrase 'reducing marginal-gain evaluation for m-dimensional embeddings by an O(m) factor relative to oracle queries' should clarify whether the factor is worst-case, average-case, or observed; the reported 35,000x empirical speedup is useful but should be tied to a specific m and dataset size.
  2. [theoretical development] Notation for matrix spectral functions: ensure the definition of f(·) is stated once with all required domain assumptions (e.g., positive-semidefinite matrices, monotonicity conditions) before it is specialized to Vendi and DPP cases.
  3. [figures and tables] Figure captions and tables: all axes and columns should be labeled with explicit units or descriptions (e.g., 'held-out accuracy (%)' rather than 'accuracy'); legends should distinguish the three new matrix-spectral variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address each major comment below.

read point-by-point responses
  1. Referee: [empirical comparisons section] Empirical comparisons section (fixed-size, class-balanced, and fixed training-budget regimes): the ranking that facility location performs best rests on point estimates of held-out performance; without reported standard deviations across random seeds, confidence intervals, or statistical tests, it is unclear whether the observed differences are reliable or could be explained by sampling variability.

    Authors: We agree that reporting variability across random seeds would strengthen the reliability of the empirical rankings. In the revised manuscript we will add standard deviations computed over multiple random seeds for the held-out performance metrics in all three regimes and will include paired statistical tests to assess whether the observed differences between objectives are significant. revision: yes

  2. Referee: [weakly matrix monotone functions section] § on weakly matrix monotone functions and weak submodularity: the central extension from matrix monotonicity to weak submodularity for the family of matrix spectral functions is load-bearing for the claim of a 'broad family of practical objectives'; the manuscript should explicitly state the precise inequality that defines 'weakly' and verify it holds for the concrete functions (Vendi, DPP) used later.

    Authors: We will revise the section to state the precise inequality that defines weak submodularity. We will also add a short verification (via direct substitution into the inequality) confirming that both the Vendi Score and the DPP objective satisfy the weak-submodularity condition. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's derivation chain consists of algebraic proofs establishing submodularity for neural scaling-law objectives and the Vendi Score, along with showing the latter as a special case of matrix spectral functions via the introduced concept of weakly matrix monotone functions. These steps rely on standard definitions and properties of submodular set functions and matrix spectral functions rather than any self-referential definitions, fitted parameters from the appraised data, or load-bearing self-citations. The secular-equation updates are presented as an independent algorithmic contribution for efficient greedy optimization. Empirical sections test downstream predictive utility but do not participate in or circularly justify the theoretical claims. The central results are therefore self-contained mathematical statements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on standard properties of submodular set functions and matrix monotone functions plus one newly introduced concept; no free parameters are fitted inside the theoretical derivations.

axioms (1)
  • standard math Standard algebraic properties of submodular functions and matrix spectral functions
    Invoked to establish that Vendi Score and scaling-law objectives are submodular and that Vendi is a special case.
invented entities (1)
  • weakly matrix monotone functions no independent evidence
    purpose: To generate a broader family of weakly submodular matrix spectral functions usable as data-appraisal objectives
    Newly defined in the paper; no independent evidence outside the paper is supplied.

pith-pipeline@v0.9.1-grok · 5879 in / 1380 out tokens · 37497 ms · 2026-06-29T08:30:50.317701+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

106 extracted references · 20 canonical work pages · 8 internal anchors

  1. [1]

    Fujishige, S.Submodular functions and optimization. V ol. 58. Elsevier, 2005 (cit. on pp. 2, 11, 18, 19, 21, 22)

  2. [2]

    Submodularity in machine learning and artificial intelligence

    Bilmes, J. “Submodularity in machine learning and artificial intelligence”. In:arXiv preprint arXiv:2202.00132(2022) (cit. on pp. 2, 7, 18, 19, 22, 67)

  3. [3]

    Submodularity in action: From machine learning to signal processing applications

    Tohidi, E., Amiri, R., Coutino, M., Gesbert, D., Leus, G., and Karbasi, A. “Submodularity in action: From machine learning to signal processing applications”. In:IEEE Signal Processing Magazine37.5 (2020), pp. 120–133 (cit. on pp. 2, 18, 19)

  4. [4]

    The Vendi Score: A Diversity Evaluation Metric for Machine Learning

    Friedman, D. and Dieng, A. B. “The Vendi Score: A Diversity Evaluation Metric for Machine Learning”. In:Transactions on Machine Learning Research(2023) (cit. on pp. 2, 6, 26)

  5. [5]

    Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning

    Pasarkar, A. P. and Dieng, A. B. “Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2024, pp. 3808–3816 (cit. on pp. 2, 6, 26)

  6. [6]

    Vendi Novelty Scores for Out-of-Distribution Detection

    Pasarkar, A. P. and Dieng, A. B. “Vendi Novelty Scores for Out-of-Distribution Detection”. In:arXiv preprint arXiv:2602.10062(2026) (cit. on pp. 2, 6, 26)

  7. [7]

    Prismatic synthesis: Gradient-based data diversification boosts generalization in llm reasoning

    Jung, J. et al. “Prismatic synthesis: Gradient-based data diversification boosts generalization in llm reasoning”. In:arXiv preprint arXiv:2505.20161(2025) (cit. on pp. 2, 6, 9, 10, 26, 60, 63, 64)

  8. [8]

    Vendi information gain: An alternative to mutual information for science and machine learning

    Nguyen, Q. and Dieng, A. B. “Vendi information gain: An alternative to mutual information for science and machine learning”. In:arXiv preprint arXiv:2505.09007(2025) (cit. on pp. 2, 6, 26)

  9. [9]

    Diversity You Can Actually Measure: A Fast, Model-Free Diversity Metric for Robotics Datasets

    Sirigiri, S., Lara, N. S. de, Agia, C., Shkurti, F., and Ramos, F. “Diversity You Can Actually Measure: A Fast, Model-Free Diversity Metric for Robotics Datasets”. In:arXiv preprint arXiv:2603.11634(2026) (cit. on pp. 2, 6, 26)

  10. [10]

    Conditional Vendi Score: Prompt-Aware Diversity Evaluation for Generative AI Models and LLMs

    Jalali, M., Ospanov, A., Gohari, A., and Farnia, F. “Conditional Vendi Score: Prompt-Aware Diversity Evaluation for Generative AI Models and LLMs”. In:The 29th International Conference on Artificial Intelligence and Statistics. 2026.URL: https://openreview. net/forum?id=iDrZToIsyd(cit. on pp. 2, 6, 26)

  11. [11]

    Quantum information theory

    Bennett, C. H. and Shor, P. W. “Quantum information theory”. In:IEEE transactions on information theory44.6 (2002), pp. 2724–2742 (cit. on pp. 2, 55)

  12. [12]

    Springer Science & Business Media, 2007 (cit

    Petz, D.Quantum information theory and quantum statistics. Springer Science & Business Media, 2007 (cit. on pp. 2, 55)

  13. [13]

    Nielsen, M. A. and Chuang, I. L.Quantum computation and Quantum information. Cam- bridge university press, 2010 (cit. on pp. 2, 55). 12

  14. [14]

    Generalized submodular information measures: Theoretical properties, examples, optimization algorithms, and applications

    Iyer, R., Khargonkar, N., Bilmes, J., and Asnani, H. “Generalized submodular information measures: Theoretical properties, examples, optimization algorithms, and applications”. In: IEEE Transactions on Information Theory68.2 (2021), pp. 752–781 (cit. on p. 2)

  15. [15]

    Deep Learning Scaling is Predictable, Empirically

    Hestness, J. et al.Deep Learning Scaling Is Predictable, Empirically. arXiv preprint arXiv:1712.00409. 2017.URL:https://arxiv.org/abs/1712.00409(cit. on p. 3)

  16. [16]

    Scaling Laws for Neural Language Models

    Kaplan, J. et al.Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361. 2020.URL:https://arxiv.org/abs/2001.08361(cit. on p. 3)

  17. [17]

    Training Compute-Optimal Large Language Models

    Hoffmann, J. et al. “Training Compute-Optimal Large Language Models”. In:arXiv preprint arXiv:2203.15556(2022).URL:https://arxiv.org/abs/2203.15556(cit. on p. 3)

  18. [18]

    and Song, J.Reconciling Kaplan and Chinchilla Scaling Laws

    Pearce, T. and Song, J.Reconciling Kaplan and Chinchilla Scaling Laws. 2024.URL: https: //arxiv.org/abs/2406.12907(cit. on p. 3)

  19. [19]

    2025.URL: https://arxiv.org/abs/ 2406.19146(cit

    Porian, T., Wortsman, M., Jitsev, J., Schmidt, L., and Carmon, Y .Resolving Discrepancies in Compute-Optimal Scaling of Language Models. 2025.URL: https://arxiv.org/abs/ 2406.19146(cit. on p. 3)

  20. [20]

    et al.Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster

    Dey, N. et al.Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster. 2023.URL: https://arxiv.org/abs/2304.03208 (cit. on p. 3)

  21. [21]

    Will we run out of data? Limits of LLM scaling based on human-generated data

    Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., and Hobbhahn, M. “Will we run out of data? Limits of LLM scaling based on human-generated data”. In:Forty-first International Conference on Machine Learning. 2024 (cit. on p. 3)

  22. [22]

    Office, E. C. D. S.General Purpose AI Models and Systemic Risk — FAQ (AI Act). Online FAQ / policy website. 2025.URL: https://digital- strategy.ec.europa.eu/en/ faqs/general-purpose-ai-models-ai-act-questions-answers(cit. on p. 3)

  23. [23]

    Business Insider, Oct

    Tan, H.The AI boom has already run out of human-made training data, Goldman’s data chief says. Business Insider, Oct. 2025.URL: https://www.businessinsider.com/ai-boom- has-run-out-of-human-made-training-data-goldman-2025-10(cit. on p. 3)

  24. [24]

    C.Copyright and Artificial Intelligence, Part 3: Generative AI Training (Pre- Publication Version)

    Office, U. C.Copyright and Artificial Intelligence, Part 3: Generative AI Training (Pre- Publication Version). U.S. Copyright Office Report. 2025.URL: https://www.copyright. gov / ai /Copyright - and - Artificial- Intelligence - Part - 3 - Generative - AI - Training-Report-Pre-Publication-Version.pdf(cit. on p. 3)

  25. [25]

    et al.AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs

    Kang, F. et al.AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs. 2025.URL: https://arxiv.org/abs/2407.20177(cit. on p. 4)

  26. [26]

    Scaling Laws for Data Filtering–Data Curation cannot be Compute Agnostic

    Goyal, S., Maini, P., Lipton, Z. C., Raghunathan, A., and Kolter, J. Z. “Scaling Laws for Data Filtering–Data Curation cannot be Compute Agnostic”. In:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024, pp. 22702–22711 (cit. on pp. 4, 5, 11, 23)

  27. [27]

    A Bitter Lesson for Data Filtering

    Mohri, C., Duchi, J., and Hashimoto, T. “A Bitter Lesson for Data Filtering”. In:arXiv preprint arXiv:2605.19407(2026) (cit. on pp. 4, 11)

  28. [28]

    Über monotone matrixfunktionen

    Löwner, K. “Über monotone matrixfunktionen”. In:Mathematische Zeitschrift38.1 (1934), pp. 177–216 (cit. on pp. 5, 27)

  29. [29]

    and Tismenetsky, M.The theory of matrices: with applications

    Lancaster, P. and Tismenetsky, M.The theory of matrices: with applications. Elsevier, 1985 (cit. on p. 5)

  30. [30]

    Springer Science & Business Media, 1997 (cit

    Bhatia, R.Matrix analysis. Springer Science & Business Media, 1997 (cit. on pp. 5, 6)

  31. [31]

    J.Functions of matrices: theory and computation

    Higham, N. J.Functions of matrices: theory and computation. SIAM, 2008 (cit. on p. 5)

  32. [32]

    Positive definite matrices

    Bhatia, R. “Positive definite matrices”. In:Positive Definite Matrices. Princeton university press, 2009 (cit. on pp. 5, 27)

  33. [33]

    Submodular spectral functions of principal submatrices of a hermitian matrix, extensions and applications

    Friedland, S and Gaubert, S. “Submodular spectral functions of principal submatrices of a hermitian matrix, extensions and applications”. In:Linear Algebra and its Applications (2011) (cit. on pp. 5, 6, 31)

  34. [34]

    Strongly subadditive functions

    Audenaert, K., Hiai, F., and Petz, D. “Strongly subadditive functions”. In:Acta Mathematica Hungarica128.4 (2010), pp. 386–394 (cit. on pp. 5, 6, 31)

  35. [35]

    A family of monotone quantum relative entropies

    Lewin, M. and Sabin, J. “A family of monotone quantum relative entropies”. In:Letters in Mathematical Physics104.6 (2014), pp. 691–705 (cit. on pp. 5, 6, 31)

  36. [36]

    Monotone and convex operator functions

    Bendat, J. and Sherman, S. “Monotone and convex operator functions”. In:Transactions of the American Mathematical Society79.1 (1955), pp. 58–71 (cit. on pp. 5, 27). 13

  37. [37]

    Loewner matrices and operator convexity

    Bhatia, R. and Sano, T. “Loewner matrices and operator convexity”. In:Mathematische Annalen344.3 (2009), pp. 703–716 (cit. on pp. 6, 27)

  38. [38]

    Simon, B.Loewner’s theorem on monotone matrix functions. V ol. 10. Springer, 2019 (cit. on pp. 6, 27)

  39. [39]

    Determinantal point processes for machine learning

    Kulesza, A. and Taskar, B. “Determinantal point processes for machine learning”. In:Foun- dations and Trends® in Machine Learning5.2-3 (2012), pp. 123–286 (cit. on p. 6)

  40. [40]

    The vendi score: A diversity evaluation metric for machine learning

    Friedman, D. and Dieng, A. B. “The vendi score: A diversity evaluation metric for machine learning”. In:arXiv preprint arXiv:2210.02410(2022) (cit. on p. 6)

  41. [41]

    Proof of the strong subadditivity of quantum-mechanical entropy

    Lieb, E. H. and Ruskai, M. B. “Proof of the strong subadditivity of quantum-mechanical entropy”. In:Les rencontres physiciens-mathématiciens de Strasbourg-RCP2519 (1973), pp. 36–55 (cit. on pp. 7, 55)

  42. [42]

    Optimal approximation for unconstrained non-submodular minimization

    El Halabi, M. and Jegelka, S. “Optimal approximation for unconstrained non-submodular minimization”. In:International Conference on Machine Learning. PMLR. 2020, pp. 3961– 3972 (cit. on pp. 7, 20)

  43. [43]

    Combinatorial auctions with decreasing marginal utilities

    Lehmann, B., Lehmann, D., and Nisan, N. “Combinatorial auctions with decreasing marginal utilities”. In:Proceedings of the 3rd ACM conference on Electronic Commerce. 2001, pp. 18– 28 (cit. on pp. 7, 20)

  44. [44]

    MIT press, 2007 (cit

    Bottou, L.Large-scale kernel machines. MIT press, 2007 (cit. on pp. 8, 30)

  45. [45]

    Kernels and reproducing kernel hilbert spaces

    Steinwart, I. and Christmann, A. “Kernels and reproducing kernel hilbert spaces”. In:Support Vector Machines. Springer, 2008, pp. 110–163 (cit. on pp. 8, 30)

  46. [46]

    and Smola, A

    Schölkopf, B. and Smola, A. J.Learning with kernels: support vector machines, regulariza- tion, optimization, and beyond. MIT press, 2002 (cit. on pp. 8, 30, 68)

  47. [47]

    Williams, C. K. and Rasmussen, C. E.Gaussian processes for machine learning. V ol. 2. 3. MIT press Cambridge, MA, 2006 (cit. on pp. 8, 30)

  48. [48]

    Accelerated greedy algorithms for maximizing submodular set functions

    Minoux, M. “Accelerated greedy algorithms for maximizing submodular set functions”. In: Optimization Techniques. Ed. by Stoer, J. V ol. 7. Lecture Notes in Control and Information Sciences. Springer Berlin / Heidelberg, 1978, pp. 234–243.URL: http://dx.doi.org/10. 1007/BFb0006528(cit. on pp. 8, 53)

  49. [49]

    A Memoization Framework for Scaling Submodular Optimization to Large Scale Problems

    Iyer, R. and Bilmes, J. “A Memoization Framework for Scaling Submodular Optimization to Large Scale Problems”. In:Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics. Ed. by Chaudhuri, K. and Sugiyama, M. V ol. 89. Proceedings of Machine Learning Research. PMLR, 2019, pp. 2340–2349.URL: https: //proceedings....

  50. [50]

    Rank-one modification of the symmetric eigenproblem

    Bunch, J. R., Nielsen, C. P., and Sorensen, D. C. “Rank-one modification of the symmetric eigenproblem”. In:Numerische Mathematik31.1 (1978), pp. 31–48 (cit. on pp. 9, 43, 49)

  51. [51]

    W.Applied numerical linear algebra

    Demmel, J. W.Applied numerical linear algebra. SIAM, 1997 (cit. on pp. 9, 42, 43, 47–49)

  52. [52]

    Fast greedy map inference for determinantal point process to improve recommendation diversity

    Chen, L., Zhang, G., and Zhou, E. “Fast greedy map inference for determinantal point process to improve recommendation diversity”. In:Advances in neural information processing systems 31 (2018) (cit. on pp. 9, 72)

  53. [53]

    Lazier than lazy greedy

    Mirzasoleiman, B., Badanidiyuru, A., Karbasi, A., V ondrák, J., and Krause, A. “Lazier than lazy greedy”. In:Proceedings of the AAAI Conference on Artificial Intelligence. V ol. 29. 1. 2015 (cit. on pp. 9, 59)

  54. [54]

    Size-constrained submodular minimization through minimum norm base

    Nagano, K., Kawahara, Y ., and Aihara, K. “Size-constrained submodular minimization through minimum norm base”. In:Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011, pp. 977–984 (cit. on pp. 11, 21)

  55. [55]

    An end-to-end submodular frame- work for data-efficient in-context learning

    Kumari, L., Wang, S., Das, A., Zhou, T., and Bilmes, J. “An end-to-end submodular frame- work for data-efficient in-context learning”. In:Findings of the Association for Computational Linguistics: NAACL 2024. 2024, pp. 3293–3308 (cit. on p. 18)

  56. [56]

    Bumblebee: Dynamic kv-cache streaming submodular summarization for infinite-context transformers

    Kumari, L., Wang, S., Zhou, T., Sarda, N., Rowe, A., and Bilmes, J. “Bumblebee: Dynamic kv-cache streaming submodular summarization for infinite-context transformers”. In:First Conference on Language Modeling. 2024 (cit. on p. 18)

  57. [57]

    M.Supermodularity and complementarity

    Topkis, D. M.Supermodularity and complementarity. Princeton university press, 1998 (cit. on p. 19)

  58. [58]

    An analysis of approximations for maximizing submodular set functions—I

    Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. “An analysis of approximations for maximizing submodular set functions—I”. In:Mathematical programming14.1 (1978), pp. 265–294 (cit. on pp. 19, 26). 14

  59. [59]

    How to Select a Good Training-data Subset for Transcription: Sub- modular Active Selection for Sequences

    Lin, H. and Bilmes, J. A. “How to Select a Good Training-data Subset for Transcription: Sub- modular Active Selection for Sequences”. In:Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH). Brighton, UK, 2009 (cit. on p. 19)

  60. [60]

    Graph-based Submodular Selection for Extractive Summa- rization

    Lin, H., Bilmes, J., and Xie, S. “Graph-based Submodular Selection for Extractive Summa- rization”. In:Proc. IEEE Automatic Speech Recognition and Understanding (ASRU). Merano, Italy, 2009 (cit. on p. 19)

  61. [61]

    A Class of Submodular Functions for Document Summarization

    Lin, H. and Bilmes, J. “A Class of Submodular Functions for Document Summarization”. In: The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT-2011). Portland, OR, 2011 (cit. on p. 19)

  62. [62]

    Geometric approximation via coresets

    Agarwal, P. K., Har-Peled, S., Varadarajan, K. R., et al. “Geometric approximation via coresets”. In:Combinatorial and computational geometry52.1 (2005), pp. 1–30 (cit. on p. 19)

  63. [63]

    Coresets for Data-efficient Training of Machine Learning Models

    Mirzasoleiman, B., Bilmes, J., and Leskovec, J. “Coresets for Data-efficient Training of Machine Learning Models”. In:Proceedings of the 37th International Conference on Machine Learning. 2020 (cit. on pp. 19, 22)

  64. [64]

    Submodular meets spectral: greedy algorithms for subset selection, sparse approximation and dictionary selection

    Das, A. and Kempe, D. “Submodular meets spectral: greedy algorithms for subset selection, sparse approximation and dictionary selection”. In:Proceedings of the 28th International Conference on International Conference on Machine Learning. 2011, pp. 1057–1064 (cit. on p. 20)

  65. [65]

    Approximate submodularity and its applications: Subset selection, sparse approximation and dictionary selection

    Das, A. and Kempe, D. “Approximate submodularity and its applications: Subset selection, sparse approximation and dictionary selection”. In:Journal of Machine Learning Research 19.3 (2018), pp. 1–34 (cit. on p. 20)

  66. [66]

    Gradient methods for submodular maxi- mization

    Hassani, H., Soltanolkotabi, M., and Karbasi, A. “Gradient methods for submodular maxi- mization”. In:Advances in Neural Information Processing Systems30 (2017) (cit. on p. 20)

  67. [67]

    Combinatorial Penalties: Which structures are preserved by convex relaxations?

    El Halabi, M., Bach, F., and Cevher, V . “Combinatorial Penalties: Which structures are preserved by convex relaxations?” In:Proceedings of the Twenty-First International Confer- ence on Artificial Intelligence and Statistics. Ed. by Storkey, A. and Perez-Cruz, F. V ol. 84. Proceedings of Machine Learning Research. PMLR, 2018, pp. 1551–1560.URL: https: //...

  68. [68]

    Non-submodular function maximization subject to a matroid constraint, with applications

    Gatmiry, K. and Gomez-Rodriguez, M. “Non-submodular function maximization subject to a matroid constraint, with applications”. In:arXiv preprint arXiv:1811.07863(2018) (cit. on p. 20)

  69. [69]

    Guarantees for greedy maximization of non-submodular functions with applications

    Bian, A. A., Buhmann, J. M., Krause, A., and Tschiatschek, S. “Guarantees for greedy maximization of non-submodular functions with applications”. In:International conference on machine learning. PMLR. 2017, pp. 498–507 (cit. on p. 20)

  70. [70]

    Combinatorial Penalties: Which structures are preserved by convex relaxations?

    El Halabi, M., Bach, F., and Cevher, V . “Combinatorial Penalties: Which structures are preserved by convex relaxations?” In:International Conference on Artificial Intelligence and Statistics. PMLR. 2018, pp. 1551–1560 (cit. on p. 21)

  71. [71]

    Robust maximization of non-submodular objectives

    Bogunovic, I., Zhao, J., and Cevher, V . “Robust maximization of non-submodular objectives”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2018, pp. 890– 899 (cit. on p. 21)

  72. [72]

    Online Software System

    Bilmes, J.Submarine: SUBModularity for ARtificial INtelligencE and machine learning. Online Software System. 2026 (cit. on p. 21)

  73. [73]

    Deep learning on a data diet: Finding important examples early in training

    Paul, M., Ganguli, S., and Dziugaite, G. K. “Deep learning on a data diet: Finding important examples early in training”. In:Advances in neural information processing systems34 (2021), pp. 20596–20607 (cit. on p. 22)

  74. [74]

    An Empirical Study of Example Forgetting during Deep Neural Network Learning

    Toneva, M., Sordoni, A., Combes, R. Tachet des, Trischler, A., Bengio, Y ., and Gordon, G. J. “An Empirical Study of Example Forgetting during Deep Neural Network Learning”. In: International Conference on Learning Representations. 2019 (cit. on p. 22)

  75. [75]

    The Impact of Coreset Selection on Spurious Correlations and Group Robustness

    Dharmasiri, A., Yang, W., Kirichenko, P., Liu, L. T., and Russakovsky, O. “The Impact of Coreset Selection on Spurious Correlations and Group Robustness”. In:Advances in Neural Information Processing Systems, Datasets and Benchmarks Track. 2025 (cit. on p. 22)

  76. [76]

    Active Learning for Convolutional Neural Networks: A Core-Set Approach

    Sener, O. and Savarese, S. “Active Learning for Convolutional Neural Networks: A Core-Set Approach”. In:International Conference on Learning Representations. 2018 (cit. on p. 22). 15

  77. [77]

    GRAD- MATCH: Gradient Matching Based Data Subset Selection for Efficient Deep Model Training

    Killamsetty, K., Sivasubramanian, D., Ramakrishnan, G., De, A., and Iyer, R. “GRAD- MATCH: Gradient Matching Based Data Subset Selection for Efficient Deep Model Training”. In:Proceedings of the 38th International Conference on Machine Learning. 2021 (cit. on p. 22)

  78. [78]

    GLISTER: General- ization Based Data Subset Selection for Efficient and Robust Learning

    Killamsetty, K., Sivasubramanian, D., Ramakrishnan, G., and Iyer, R. “GLISTER: General- ization Based Data Subset Selection for Efficient and Robust Learning”. In:Proceedings of the AAAI Conference on Artificial Intelligence. 2021 (cit. on p. 22)

  79. [79]

    Data Shapley: Equitable Valuation of Data for Machine Learning

    Ghorbani, A. and Zou, J. “Data Shapley: Equitable Valuation of Data for Machine Learning”. In:Proceedings of the 36th International Conference on Machine Learning. 2019 (cit. on p. 22)

  80. [80]

    Practical Coreset Constructions for Machine Learning

    Bachem, O., Lucic, M., and Krause, A. “Practical Coreset Constructions for Machine Learn- ing”. In:arXiv preprint arXiv:1703.06476(2017) (cit. on p. 22)

Showing first 80 references.