Convex Dataset Valuation for Post-Training

Christopher Jung; Fuchun Peng; Han Zhao; Ming Li; Nima Noorshams; Rui Li; Siqi Zeng; Xue Feng; Zhe Kang; Zhigang Wang

arxiv: 2605.16704 · v1 · pith:CRSWDVUOnew · submitted 2026-05-15 · 💻 cs.LG

Siqi Zeng, Christopher Jung, Rui Li, Zhe Kang, Ming Li, Nima Noorshams, Zhigang Wang, Fuchun Peng, Han Zhao, Xue Feng

Pith reviewed 2026-05-20 18:44 UTC · model grok-4.3

classification 💻 cs.LG

keywords: dataset valuation · LLM post-training · kernel mean matching · gradient space · convex optimization · subset selection · auxiliary datasets · data selection

The pith

A convex program built on kernel mean matching in gradient space assigns values to auxiliary datasets for LLM post-training, balancing alignment with the target task against redundancy among the auxiliaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles dataset selection for LLM post-training when budgets on compute, labels, and licensing prevent using every available auxiliary dataset. Simple gradient alignment scores are shown to be incomplete because they ignore redundancy across the auxiliaries. The authors formulate valuation as a convex program that applies kernel mean matching to gradient vectors, producing weights that favor alignment with the target while penalizing redundant contributions. Experiments across multiple post-training regimes show the resulting selections improve target-task performance over prior valuation baselines at modest extra cost. This turns data acquisition into an explicit optimization step usable under marketplace constraints.

Core claim

We first show that commonly used gradient alignment scores provide a reasonable yet incomplete valuation signal, as they ignore redundancy among datasets. To address this, we propose a scalable convex dataset-level valuation method based on kernel mean matching (KMM) in gradient space, which jointly accounts for alignment with the target task and redundancy across auxiliary datasets. Through extensive experiments across diverse post-training settings and tasks, we show that our approach consistently outperforms existing valuation baselines, achieving stronger performance with low computational overhead.

What carries the argument

Kernel mean matching in gradient space: a convex program finds weights on the auxiliary datasets so that their weighted gradient mean matches the target-task gradient, while a quadratic term over pairwise gradient similarities penalizes redundant datasets.
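To make the mechanism concrete, here is a minimal sketch of a program of the form quoted further down this page as Eq. 10, min_w ½ wᵀKw − λβᵀw with K_ij = ⟨g_i, g_j⟩ and β_i = ⟨g_i, g_tar⟩. The linear kernel, the nonnegativity constraint, the projected-gradient solver, the function name `kmm_weights`, and the toy gradients are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def kmm_weights(G, g_tar, lam=1.0, steps=500):
    """Approximately minimise 0.5 * w^T K w - lam * beta^T w subject to w >= 0.

    G: (N, d) array, one averaged gradient per auxiliary dataset.
    g_tar: (d,) averaged gradient of the target task.
    The w >= 0 constraint and the solver are assumptions of this sketch.
    """
    K = G @ G.T                       # Gram matrix, K_ij = <g_i, g_j>
    beta = G @ g_tar                  # alignment scores, beta_i = <g_i, g_tar>
    lip = np.linalg.eigvalsh(K)[-1]   # largest eigenvalue = Lipschitz constant of the gradient
    step = 1.0 / max(lip, 1e-12)
    w = np.zeros(G.shape[0])
    for _ in range(steps):
        grad = K @ w - lam * beta     # gradient of the quadratic objective
        w = np.maximum(w - step * grad, 0.0)  # project back onto w >= 0
    return w, beta

# Toy pool: datasets 0 and 1 are exact duplicates, 2 is complementary, 3 is unrelated.
G = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
g_tar = np.array([1.0, 1.0, 0.0])

w, beta = kmm_weights(G, g_tar)
print("alignment only:", beta)
print("KMM weights:   ", np.round(w, 3))
```

In this toy pool the alignment scores are 1, 1, 1, 0, so ranking by alignment alone treats both duplicates as fully valuable. The quadratic term instead splits a single unit of weight between the duplicates (0.5 each) and gives the complementary dataset a full unit, which is the redundancy effect the pith describes.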

Load-bearing premise

Kernel mean matching performed in gradient space will reliably capture and penalize redundancy among auxiliary datasets without introducing new biases or requiring task-specific hyperparameter tuning.

What would settle it

A controlled experiment on a held-out target task in which the KMM-weighted subset, chosen under the same budget, yields lower accuracy than either a pure gradient-alignment selection or a random selection of the same size.

Figures

Figures reproduced from arXiv: 2605.16704 by Christopher Jung, Fuchun Peng, Han Zhao, Ming Li, Nima Noorshams, Rui Li, Siqi Zeng, Xue Feng, Zhe Kang, Zhigang Wang.

**Figure 1.** Buyers in data marketplace model. Auxiliary datasets exhibit heterogeneous positive and negative transfer relationships with respect to the target task (yellow) and with each other, reflecting redundancy and interference. Valuation scores provide a compact summary of these interactions, helping buyers rank candidate datasets and allocate budgets under limited data access. In such dataset markets, model de…

**Figure 2.** Best-k performance versus wall clock time (log scale) for different data valuation methods on Danish (left) and Marathi (right). KMM-based methods are highlighted. … single training run can require substantially more time and compute. In contrast, KMM-enhanced gradient methods match this performance while achieving over 100× lower runtime, and its overhead over its base gradient methods is negligible. GradEX…

**Figure 3.** Transferability of auxiliary-task rankings across models. We evaluate whether auxiliary-dataset valuation scores computed using different pretrained models can be transferred to a model of interest, Gemma3-4B, for dataset selection. Bars report MMLU-Malayalam performance after fine-tuning Gemma3-4B using subsets selected by ŵ_j derived from various source models θ_pub on x-axis. Full refers to SFT using f…

**Figure 4.** Effect of the KMM regularization parameter on downstream performance for Danish and Marathi. Solid orange curves report MMMLU metrics obtained by fine-tuning with auxiliary datasets selected using KMM under varying regularization strengths, while the dashed blue line denotes the One-Step baseline. High strength values lead to higher sparsity in the final scores. Robustness of KMM advantage to regularizat…

**Figure 5.** One-Step vs. KMM-based selection’s MMLU-Danish best-k performance under varying numbers of auxiliary task examples available for data valuation methods (left) and predefined target task weights within training batches (right). KMM is most effective when target signal remains informative.

**Figure 6.** Task-Vector vs. KMM dataset valuation w_i for all 25 auxiliary languages using Marathi as the target task (higher values indicate larger estimated positive contribution to target performance). Filled markers denote TV scores without reweighting, while hollow markers show scores after applying KMM. Colors indicate language subgroups from Singh et al. (2024). Languages on the x-axis are sorted by their TV sc…

**Figure 7.** KMM-induced pairwise transfer structure across languages. Each cell shows the discrete signed transfer rank of a source language (column) for a target language (row), computed from KMM scores. Discrete scoring ranks are anchored at zero, with positive values indicating beneficial transfer, negative values indicating harmful transfer, and zero denoting neutral effect. Languages on both axes are grouped by…

**Figure 8.** One-Step vs. KMM-based selection best-k performance under different auxiliary data weighting strategies (left) and LoRA adapter ranks (right) for Danish as target task. Popularity weights auxiliary tasks based on resource availability to reflect realistic language corpus composition, proportional corresponds to the original Aya training distribution, and uniform denotes equal weighting across tasks (our de…

Original abstract

Improving LLM performance on downstream tasks sometimes requires leveraging auxiliary datasets during post-training. In practice, however, developers face constraints on compute, labeling, and licensing costs that preclude using all available data, necessitating principled dataset-level selection. These constraints are increasingly shaped by dataset marketplaces, where data acquisition is governed by budgets and negotiation. We study dataset valuation as a subset selection problem during LLM post-training. Our goal is to identify and weight auxiliary datasets so as to maximize target task performance given constrained budgets. We first show that commonly used gradient alignment scores provide a reasonable yet incomplete valuation signal, as they ignore redundancy among datasets. To address this, we propose a scalable convex dataset-level valuation method based on kernel mean matching (KMM) in gradient space, which jointly accounts for alignment with the target task and redundancy across auxiliary datasets. Through extensive experiments across diverse post-training settings and tasks, we show that our approach consistently outperforms existing valuation baselines, achieving stronger performance with low computational overhead. Our results position dataset valuation as a practical decision tool for post-training data selection in market-constrained large language model settings. The code is available at https://github.com/uiuctml/convex_data_valuation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript frames dataset valuation for LLM post-training as a budgeted subset-selection problem. It argues that standard gradient-alignment scores are incomplete because they ignore redundancy among auxiliary datasets, then introduces a convex program that augments gradient inner-product alignment with a kernel mean matching (KMM) penalty on the means of auxiliary datasets in gradient space. The authors report that the resulting weights yield stronger downstream performance than existing valuation baselines across multiple post-training regimes while incurring low computational overhead; code is released.

Significance. If the empirical claims hold, the work supplies a practical, convex, and scalable decision tool for data acquisition under marketplace-style budget and licensing constraints. The public code repository is a clear strength that enables direct reproduction and extension.

major comments (2)

§3.2 (KMM formulation): the claim that mean-matching in gradient space reliably penalizes functional redundancy rests on the unproven assumption that a fixed kernel (and its bandwidth) produces a discrepancy that is both faithful to the downstream loss and stable under the high-dimensional, noisy gradients of large LLMs. No derivation or sensitivity experiment is supplied showing invariance to these choices; if the assumption fails, the reported gains over pure gradient-alignment baselines could be artifacts of implicit hyperparameter search rather than genuine redundancy accounting.
Experiments section (Tables 1–4 and associated ablations): the central claim of “consistent outperformance with low overhead” is load-bearing, yet the manuscript provides no quantitative breakdown of how performance varies with kernel bandwidth, gradient checkpointing choices, or model scale. Without these controls, it is impossible to confirm that the KMM term improves selection rather than merely re-expressing existing fitted scores under favorable hyperparameter settings.

minor comments (2)

Abstract: the statement of results is entirely qualitative; inserting one or two headline numbers (e.g., average accuracy lift and wall-clock overhead) would make the contribution easier to assess at a glance.
Notation: several symbols in the convex objective (e.g., the precise definition of the kernel bandwidth and the normalization of gradient vectors) are introduced without an explicit reference to their first appearance; a short notation table would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our presentation of the KMM-based dataset valuation method. Below, we address each major comment point by point.

Referee: §3.2 (KMM formulation): the claim that mean-matching in gradient space reliably penalizes functional redundancy rests on the unproven assumption that a fixed kernel (and its bandwidth) produces a discrepancy that is both faithful to the downstream loss and stable under the high-dimensional, noisy gradients of large LLMs. No derivation or sensitivity experiment is supplied showing invariance to these choices; if the assumption fails, the reported gains over pure gradient-alignment baselines could be artifacts of implicit hyperparameter search rather than genuine redundancy accounting.

Authors: We acknowledge that a complete theoretical derivation linking gradient-space KMM directly to downstream loss invariance is not provided in the current manuscript, as our focus is on the practical convex optimization formulation and empirical validation. However, we do provide justification in §3.2 for why mean matching in gradient space can capture redundancy, building on the fact that gradients reflect the functional behavior of the model. To address the concern about sensitivity to kernel and bandwidth choices, we will add a new subsection with sensitivity experiments varying the bandwidth parameter over a wide range and demonstrate that the performance improvements remain consistent. This will help confirm that the gains are not due to specific hyperparameter tuning. revision: yes
Referee: Experiments section (Tables 1–4 and associated ablations): the central claim of “consistent outperformance with low overhead” is load-bearing, yet the manuscript provides no quantitative breakdown of how performance varies with kernel bandwidth, gradient checkpointing choices, or model scale. Without these controls, it is impossible to confirm that the KMM term improves selection rather than merely re-expressing existing fitted scores under favorable hyperparameter settings.

Authors: We agree that additional controls and breakdowns would make the empirical claims more robust. In the revised manuscript, we will expand the experiments section to include quantitative ablations on kernel bandwidth, different gradient checkpointing strategies, and results across varying model scales (e.g., from 1B to 7B parameters). These additions will provide a clearer picture of the robustness of the KMM term's contribution. revision: yes

Circularity Check

0 steps flagged

Convex KMM formulation introduces independent redundancy term without reducing to fitted inputs or self-citations

full rationale

The paper defines dataset valuation as a convex program that augments gradient-alignment scores with a kernel mean matching penalty on auxiliary dataset means in gradient space. No equation or derivation is shown to equate the final weights to a re-expression of the input alignment scores or to a parameter fitted directly to target performance. The KMM term is introduced as an explicit additive penalty rather than derived from the alignment objective itself, and the abstract and described method contain no load-bearing self-citation to prior uniqueness results or ansatzes by the same authors. The derivation therefore remains self-contained as a new optimization construction whose validity rests on the empirical experiments rather than on circular redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that gradient-space KMM can simultaneously optimize alignment and penalize redundancy; no free parameters or invented entities are visible in the abstract.

axioms (1)

domain assumption: Gradient alignment scores provide a reasonable yet incomplete valuation signal because they ignore redundancy among datasets.
Explicitly stated in the abstract as the motivation for the new method.

pith-pipeline@v0.9.0 · 5758 in / 1235 out tokens · 50393 ms · 2026-05-20T18:44:05.822725+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

min_w ½ wᵀKw − λβᵀw, where K_ij = ⟨g_i, g_j⟩ and β_i = ⟨g_i, g_tar⟩ (Eq. 10); connection to Markowitz mean-variance
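One way to read this objective, assuming a linear kernel on gradients and treating λ as a scale on the target gradient (both assumptions of this sketch, not statements taken from the paper), is as a mean-matching loss expanded term by term:

```latex
\begin{aligned}
\tfrac12\Bigl\|\sum_{i=1}^{N} w_i g_i - \lambda\, g_{\mathrm{tar}}\Bigr\|_2^2
  &= \tfrac12 \sum_{i,j} w_i w_j \langle g_i, g_j\rangle
     - \lambda \sum_{i} w_i \langle g_i, g_{\mathrm{tar}}\rangle
     + \tfrac{\lambda^2}{2}\,\|g_{\mathrm{tar}}\|_2^2 \\
  &= \tfrac12\, w^\top K w - \lambda\, \beta^\top w + \text{const}.
\end{aligned}
```

The constant does not depend on w, so minimizing the left-hand side is equivalent to the quoted program. Because K is a Gram matrix it is positive semidefinite, which makes the problem convex. The quadratic term grows when two heavily weighted datasets have similar gradients, which is where the redundancy penalty comes from. Loosely, it plays the role of the risk term in a Markowitz mean-variance portfolio, with βᵀw standing in for expected return.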
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

KMM in gradient space for joint alignment + redundancy penalty; convex QP solved in O(N³)
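The cubic cost presumably refers to the number of auxiliary datasets N rather than the gradient dimension. As a rough illustration, and not the solver used in the paper, the unconstrained problem with a small ridge term has the closed form w = λ (K + εI)⁻¹ β, and a single dense solve costs O(N³). The ridge ε, the absence of constraints, the helper name `kmm_closed_form`, and the random gradients are assumptions of this sketch.

```python
import numpy as np

def kmm_closed_form(K, beta, lam=1.0, ridge=1e-8):
    """Solve (K + ridge * I) w = lam * beta, the stationarity condition of
    0.5 * w^T K w - lam * beta^T w + 0.5 * ridge * ||w||^2 (unconstrained)."""
    n = K.shape[0]
    return np.linalg.solve(K + ridge * np.eye(n), lam * beta)  # dense solve, O(N^3)

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 1000))      # 5 hypothetical datasets, 1000-dim gradients
g_tar = rng.normal(size=1000)
K, beta = G @ G.T, G @ g_tar        # forming K costs O(N^2 d); the solve is O(N^3)
w_closed = kmm_closed_form(K, beta)
residual = np.linalg.norm(K @ w_closed - beta)
```

Once K and β are formed, the solve is independent of the model size, which is consistent with the low overhead the paper reports for a few dozen candidate datasets.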

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 13 internal anchors

[1]

ISSN 0022-0000. doi: https://doi.org/10.1016/S0022-0000(03)00025-4. URL https://www.sciencedirect.com/science/article/pii/S0022000003000254. Special Issue on PODS

[2]

A marketplace for data: An algorithmic solution

Agarwal, A., Dahleh, M., and Sarkar, T. A marketplace for data: An algorithmic solution. In Proceedings of the 2019 ACM Conference on Economics and Computation, pp. 701–726, 2019.

[3]

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

URL https://arxiv.org/abs/2502.02737. Alzubaidi, L., Bai, J., Al-Sabaawi, A., Santamaría, J., Albahri, A. S., Al-Dabbagh, B. S. N., Fadhel, M. A., Manoufali, M., Zhang, J., Al-Timemy, A. H., et al. A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications. Journal of Big Data, 10(1):46.

[4]

A large annotated corpus for learning natural language inference

Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

[5]

Training Verifiers to Solve Math Word Problems

URL https://openreview.net/forum?id=5ARbfIHxtk. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

[6]

URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm. Dai, Q., Zhang, D., Ma, J. W., and Peng, H. Improving influence-based instruction tuning data selection for balanced learning of diverse capabilities. In ICLR 2025 Workshop on Navigating and Addressing Data Problems for Foundation Models, 2025.

[7]

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar

URL https://zenodo.org/records/12608602. Ghorbani, A. and Zou, J. Data shapley: Equitable valuation of data for machine learning. In International conference on machine learning, pp. 2242–2251. PMLR.

[8]

Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback

Lai, V., Nguyen, C., Ngo, N., Nguyen, T., Dernoncourt, F., Rossi, R., and Nguyen, T. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 318–327, 2023.

[9]

Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.

[10]

Li, X. and Roth, D. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002.

[11]

ISSN 00221082, 15406261. URL http://www.jstor.org/stable/2975974. Mazumder, M., Banbury, C., Yao, X., Karlaš, B., Gaviria Rojas, W., Diamos, S., Diamos, G., He, L., Parrish, A., Kirk, H. R., et al. Dataperf: Benchmarks for data-centric ai development. Advances in Neural Information Processing Systems, 36:5320–5347.

[12]

McGiff, J. and Nikolov, N. S. Overcoming data scarcity in generative language modelling for low-resource languages: A systematic review. arXiv preprint arXiv:2505.04531.

[13]

Multi-task transfer matters during instruction-tuning

Mueller, D., Dredze, M., and Andrews, N. Multi-task transfer matters during instruction-tuning. In Findings of the Association for Computational Linguistics ACL 2024, pp. 14880–14891, 2024.

[14]

Oren, Y., Sagawa, S., Hashimoto, T. B., and Liang, P. Distributionally robust language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4227–4237, 2019.

[15]

A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts

Pang, B. and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. arXiv preprint cs/0409058.

[16]

Qwen2.5 Technical Report

URL https://arxiv.org/abs/2412.15115. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67.

[17]

Reimers, N. and Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 3982–3992, 2019.

[18]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[19]

Natural Language Understanding with the Quora Question Pairs Dataset

Sharma, L., Graesser, L., Nangia, N., and Evci, U. Natural language understanding with the quora question pairs dataset. arXiv preprint arXiv:1907.01041.

[20]

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013.

[21]

OSQP: an operator splitting solver for quadratic programs

doi: 10.1007/s12532-020-00179-2. URL https://doi.org/10.1007/s12532-020-00179-2. Sun, S., Shi, H., and Wu, Y. A survey of multi-source domain adaptation. Information Fusion, 24:84–92.

[22]

Gemma 3 Technical Report

Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786.

[23]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Wang, A. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

[24]

Warstadt, A., Singh, A., and Bowman, S. R. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

[25]

Williams, A., Nangia, N., and Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

[26]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[27]

Instruction-Following Evaluation for Large Language Models

Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., and Hou, L. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.

[28]

A. Additional Related Work
Data Market Modeling and Valuation-Aware Transactions. Several works study data marketplaces and data acquisition as economic or algorithmic problems for classic machine learning settings. Early efforts model two-sided data markets and pricing mechanisms for training data, focusing on...

[29]

as the target task and evaluate models using strict instruction-level accuracy with IFEval (Zhou et al., 2023). The auxiliary candidate pool consists of seven diverse instruction-tuning sources: allenai/tulu-3-sft-personas-math-filtered, allenai/tulu-3-sft-personas-math-grade-filtered, allenai/tulu-3-sft-personas-algebra, allenai/tulu-3-sft-personas-cod...

[30]

is a multilingual benchmark that categorizes languages by resource level (e.g., High/Mid/Low) in practice. We select the language set in Tab. 10 because these languages are reported as supported by Qwen3 models (per the Qwen3 technical report (Yang et al., 2025)), and for Gemma3 (Team et al.,

[31]

we additionally verified that the tokenizer includes the Unicode characters needed to represent text in that script, so the model can tokenize and generate text in that language without falling back to unknown or degenerate tokens. Within this set, we emphasize two target scenarios for studying transfer from high-resource data to low-resource performance:...

[32]

Thai 250 Exact Match (Flexible-Extract)
G. Data Valuation Baselines
G.1. Gradient-based Methods
G.1.1. KMM
Instead of enforcing the ℓ1 budget constraint w ∈ W_k, we can incorporate sparsity directly into the objective via an ℓ1 penalty with regularization strength γ > 0. Using the Gram matrix K_ij = ⟨g_i, g_j⟩ and alignment vector β_i = ⟨g_i, g_tar⟩, we solve th...

[33]

as the default solver.
G.2. DataModel
Datamodel methods cast dataset valuation as a linear regression problem over subset-existence features (Ilyas et al., 2022). Specifically, we sample m auxiliary dataset index subsets {S_r}_{r=1}^m, where each S_r ⊆ {1, …, N} specifies a set of auxiliary datasets to include in post-training. For each subset S_r, we trai...

[34]

This is the Achlioptas-type sparse projection used in the compressive-sensing datamodel baseline (Achlioptas, 2003). Because A_cs has signed entries, each row r is implemented via two auxiliary subsets whose difference realizes the signed measurement. Define S_{r,1} = {i ∈ [N] : ξ_{r,i} ∈ {0, +1}}, S_{r,2} = {i ∈ [N] : ξ_{r,i} ∈ {0, −1}}. (37) Intuitively, indices with ξ_{r,i} = ...

[35]

are normalized within the auxiliary pool, with larger weights corresponding to a higher probability of being sampled in each training batch. In Fig. 8, across all auxiliary data weighting strategies, KMM consistently outperforms one-step selection. Notably, this holds even when the auxiliary data distribution closely reflects realistic corpus composition,...

[36]

Compared with the uniform-weighting results in Tab. 3, non-uniform softmax weighting leads to lower absolute performance for all methods in this setting. Nevertheless, One Step+KMM remains the best-performing method, and KMM continues to improve over its corresponding gradient-based baseline. Effect of LoRA Ad...

[37]

Table 17. Best-k accuracy on Danish under different training budgets. The same valuation scores are used across budgets, and larger budgets move training farther from the local Taylor approximation.

| Training Steps | Random | One Step | One Step+KMM |
| --- | --- | --- | --- |
| 0.25× | 45.69 | 45.87 | 45.93 |
| 0.5× | 45.80 | 46.01 | 46.08 |
| 2× | 45.75 | 45.97 | 45.99 |
| 4× | 45.56 | 45.87 | 46.01 |

We note that these results are no...

work page 2019

[1] [1]

doi: https://doi.org/10.1016/S0022-0000(03 )00025-4

ISSN 0022-0000. doi: https://doi.org/10.1016/S0022-0000(03 )00025-4. URL https://www.sciencedirect. com/science/article/pii/S00220000030 00254. Special Issue on PODS

work page doi:10.1016/s0022-0000(03

[2] [2]

A marketplace for data: An algorithmic solution

Agarwal, A., Dahleh, M., and Sarkar, T. A marketplace for data: An algorithmic solution. InProceedings of the 2019 ACM Conference on Economics and Computation, pp. 701–726,

work page 2019

[3] [3]

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

URL https://arxiv.org/abs/2502.02737. Alzubaidi, L., Bai, J., Al-Sabaawi, A., Santamar ´ıa, J., Albahri, A. S., Al-Dabbagh, B. S. N., Fadhel, M. A., Manoufali, M., Zhang, J., Al-Timemy, A. H., et al. A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications. Journal of Big Data, 10(1):46,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

A large annotated corpus for learning natural language inference

Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference.arXiv preprint arXiv:1508.05326,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Training Verifiers to Solve Math Word Problems

URL https://openreview.net/forum?id=5ARb fIHxtk. Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Dai, Q., Zhang, D., Ma, J

URL https://www.databricks .com/blog/2023/04/12/dolly-first-ope n-commercially-viable-instruction-tun ed-llm. Dai, Q., Zhang, D., Ma, J. W., and Peng, H. Improving influence-based instruction tuning data selection for bal- anced learning of diverse capabilities. InICLR 2025 Workshop on Navigating and Addressing Data Problems for Foundation Models,

work page 2023

[7] [7]

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar

URL https://zenodo.org/records/12608602. Ghorbani, A. and Zou, J. Data shapley: Equitable valuation of data for machine learning. InInternational conference on machine learning, pp. 2242–2251. PMLR,

work page arXiv

[8] [8]

Okapi: Instruction-tuned large language models in multiple languages with reinforce- ment learning from human feedback

Lai, V ., Nguyen, C., Ngo, N., Nguyen, T., Dernoncourt, F., Rossi, R., and Nguyen, T. Okapi: Instruction-tuned large language models in multiple languages with reinforce- ment learning from human feedback. InProceedings of the 2023 Conference on Empirical Methods in Natu- ral Language Processing: System Demonstrations, pp. 318–327,

[9]

Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124,

[10]

Learning question classifiers

Li, X. and Roth, D. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics,

[11]

Dataperf: Benchmarks for data-centric AI development

ISSN 00221082, 15406261. URL http://www.jstor.org/stable/2975974. Mazumder, M., Banbury, C., Yao, X., Karlaš, B., Gaviria Rojas, W., Diamos, S., Diamos, G., He, L., Parrish, A., Kirk, H. R., et al. Dataperf: Benchmarks for data-centric AI development. Advances in Neural Information Processing Systems, 36:5320–5347,

[12]

Overcoming data scarcity in generative language modelling for low-resource languages: A systematic review

McGiff, J. and Nikolov, N. S. Overcoming data scarcity in generative language modelling for low-resource languages: A systematic review. arXiv preprint arXiv:2505.04531,

[13]

Multi-task transfer matters during instruction-tuning

Mueller, D., Dredze, M., and Andrews, N. Multi-task transfer matters during instruction-tuning. In Findings of the Association for Computational Linguistics ACL 2024, pp. 14880–14891,

[14]

Distributionally robust language modeling

Oren, Y., Sagawa, S., Hashimoto, T. B., and Liang, P. Distributionally robust language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4227–4237,

[15]

A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts

Pang, B. and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. arXiv preprint cs/0409058,

[16]

Qwen2.5 Technical Report

URL https://arxiv.org/abs/2412.15115. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67,

[17]

Sentence-BERT: Sentence embeddings using siamese BERT-networks

Reimers, N. and Gurevych, I. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992,

[18]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[19]

Natural Language Understanding with the Quora Question Pairs Dataset

Sharma, L., Graesser, L., Nangia, N., and Evci, U. Natural language understanding with the Quora question pairs dataset. arXiv preprint arXiv:1907.01041,

[20]

Recursive deep models for semantic compositionality over a sentiment treebank

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642,

[21]

OSQP: an operator splitting solver for quadratic programs,

doi: 10.1007/s12532-020-00179-2. URL https://doi.org/10.1007/s12532-020-00179-2. Sun, S., Shi, H., and Wu, Y. A survey of multi-source domain adaptation. Information Fusion, 24:84–92,

[22]

Gemma 3 Technical Report

Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786,

[23]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Wang, A. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,

[24]

Warstadt, A., Singh, A., and Bowman, S. R. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471,

[25]

Williams, A., Nangia, N., and Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426,

[26]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

[27]

Instruction-Following Evaluation for Large Language Models

Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., and Hou, L. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911,

[28]

A. Additional Related Work

Data Market Modeling and Valuation-Aware Transactions. Several works study data marketplaces and data acquisition as economic or algorithmic problems for classic machine learning settings. Early efforts model two-sided data markets and pricing mechanisms for training data, focusing on...

[29]

as the target task and evaluate models using strict instruction-level accuracy with IFEval (Zhou et al., 2023). The auxiliary candidate pool consists of seven diverse instruction-tuning sources: allenai/tulu-3-sft-personas-math-filtered, allenai/tulu-3-sft-personas-math-grade-filtered, allenai/tulu-3-sft-personas-algebra, allenai/tulu-3-sft-personas-cod...

[30]

is a multilingual benchmark that categorizes languages by resource level (e.g., High/Mid/Low) in practice. We select the language set in Tab. 10 because these languages are reported as supported by Qwen3 models (per the Qwen3 technical report (Yang et al., 2025)), and for Gemma3 (Team et al.,

[31]

we additionally verified that the tokenizer includes the Unicode characters needed to represent text in that script, so the model can tokenize and generate text in that language without falling back to unknown or degenerate tokens. Within this set, we emphasize two target scenarios for studying transfer from high-resource data to low-resource performance:...

[32]

Thai 250 Exact Match (Flexible-Extract)

G. Data Valuation Baselines

G.1. Gradient-based Methods

G.1.1. KMM

Instead of enforcing the $\ell_1$ budget constraint $w \in W_k$, we can incorporate sparsity directly into the objective via an $\ell_1$ penalty with regularization strength $\gamma > 0$. Using the Gram matrix $K_{ij} = \langle g_i, g_j \rangle$ and alignment vector $\beta_i = \langle g_i, g_{\mathrm{tar}} \rangle$, we solve th...
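To make the penalized formulation concrete, here is a minimal sketch in plain Python. The excerpt above is cut off before the objective is stated, so the form used here, minimizing $\frac{1}{2} w^\top K w - \beta^\top w + \gamma \lVert w \rVert_1$ over $w \ge 0$, is an assumption, as are the nonnegativity constraint, the projected-gradient solver, the step size, and the toy gradients. The name `kmm_l1_weights` is hypothetical and is not the paper's implementation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def kmm_l1_weights(aux_grads, target_grad, gamma=0.1, steps=5000):
    """Projected gradient descent on 0.5*w'Kw - beta'w + gamma*sum(w), w >= 0.

    ASSUMPTION: this objective and the w >= 0 constraint are illustrative;
    the source text is truncated before its own objective appears.
    """
    n = len(aux_grads)
    # Gram matrix K_ij = <g_i, g_j> and alignment vector beta_i = <g_i, g_tar>.
    K = [[dot(gi, gj) for gj in aux_grads] for gi in aux_grads]
    beta = [dot(gi, target_grad) for gi in aux_grads]
    # K is positive semidefinite, so its trace bounds its largest eigenvalue;
    # 1/trace is therefore a safe (conservative) step size.
    trace = sum(K[i][i] for i in range(n))
    lr = 1.0 / trace if trace > 0 else 1.0
    w = [0.0] * n
    for _ in range(steps):
        # With w >= 0 the l1 penalty is linear, so its gradient is gamma.
        grad = [dot(K[i], w) - beta[i] + gamma for i in range(n)]
        # Gradient step followed by projection onto the nonnegative orthant.
        w = [max(0.0, wi - lr * gi) for wi, gi in zip(w, grad)]
    return w

# Toy pool: datasets 0 and 1 have identical gradients (fully redundant).
aux_grads = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target_grad = [1.0, 1.0]
weights = kmm_l1_weights(aux_grads, target_grad, gamma=0.1)
```

In this toy pool all three datasets have the same alignment score $\beta_i = 1$, yet the two redundant datasets end up sharing their weight (about 0.45 each) while the third receives about 0.9, which is the redundancy effect that alignment scores alone cannot express.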

[33]

as the default solver.

G.2. DataModel

Datamodel methods cast dataset valuation as a linear regression problem over subset-existence features (Ilyas et al., 2022). Specifically, we sample $m$ auxiliary dataset index subsets $\{S_r\}_{r=1}^{m}$, where each $S_r \subseteq \{1, \dots, N\}$ specifies a set of auxiliary datasets to include in post-training. For each subset $S_r$, we trai...
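The regression view described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: `evaluate_subset` is a hypothetical stand-in for post-training on a subset and measuring target performance, `TRUE_EFFECT` is made-up synthetic data, the inclusion probability of 0.5 and the intercept term are assumptions, and indices are 0-based rather than $1, \dots, N$.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_datamodel(subsets, scores, num_datasets):
    """Least-squares fit of score ~ intercept + sum_i theta_i * 1[i in S_r]."""
    # Subset-existence features: one indicator per dataset, plus an intercept.
    X = [[1.0] + [1.0 if i in S else 0.0 for i in range(num_datasets)]
         for S in subsets]
    d = num_datasets + 1
    # Normal equations (X'X) coef = X'y.
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(d)]
           for a in range(d)]
    Xty = [sum(row[a] * y for row, y in zip(X, scores)) for a in range(d)]
    coef = solve_linear(XtX, Xty)
    return coef[0], coef[1:]

# Hypothetical synthetic outcome replacing "post-train, then evaluate".
TRUE_EFFECT = [0.5, -0.2, 0.0, 0.3]

def evaluate_subset(S):
    return 40.0 + sum(TRUE_EFFECT[i] for i in S)

rng = random.Random(0)
N, m = 4, 64
subsets = [{i for i in range(N) if rng.random() < 0.5} for _ in range(m)]
scores = [evaluate_subset(S) for S in subsets]
intercept, theta = fit_datamodel(subsets, scores, N)
```

Because the synthetic outcome is exactly linear and noise-free, the fitted coefficients `theta` recover `TRUE_EFFECT`; with real post-training runs each score is noisy and costs a full training run, which is the overhead this baseline carries.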

[34]

This is the Achlioptas-type sparse projection used in the compressive-sensing datamodel baseline (Achlioptas, 2003). Because $A_{\mathrm{cs}}$ has signed entries, each row $r$ is implemented via two auxiliary subsets whose difference realizes the signed measurement. Define

$$S_{r,1} = \{i \in [N] : \xi_{r,i} \in \{0, +1\}\}, \qquad S_{r,2} = \{i \in [N] : \xi_{r,i} \in \{0, -1\}\}. \tag{37}$$

Intuitively, indices with $\xi_{r,i} =$ ...
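The two-subset construction in Eq. (37) can be checked numerically with a short sketch. Several pieces here are assumptions made only for illustration: the sign probabilities (1/6, 2/3, 1/6) follow the sparse projection of Achlioptas (2003) with the $\sqrt{3}$ scaling omitted, the additive outcome model `additive_utility` and the values in `theta` are made up, and indices are 0-based.

```python
import random

def sample_sign_row(n, rng):
    """Draw xi in {+1, 0, -1}^n with probabilities 1/6, 2/3, 1/6.

    ASSUMPTION: probabilities taken from Achlioptas (2003); the excerpt
    does not show the exact distribution or scaling used in the paper.
    """
    row = []
    for _ in range(n):
        u = rng.random()
        row.append(1 if u < 1 / 6 else (-1 if u < 1 / 3 else 0))
    return row

def signed_subsets(xi):
    """Build the two subsets of Eq. (37) from one sign row."""
    s_plus = {i for i, v in enumerate(xi) if v in (0, 1)}
    s_minus = {i for i, v in enumerate(xi) if v in (0, -1)}
    return s_plus, s_minus

def additive_utility(subset, theta):
    """Hypothetical additive outcome: f(S) = sum of theta_i for i in S."""
    return sum(theta[i] for i in sorted(subset))

rng = random.Random(0)
theta = [0.5, -0.25, 1.0, 0.0, 2.0, -1.5]  # made-up per-dataset effects
rows = [sample_sign_row(len(theta), rng) for _ in range(5)]
measurements = []
for xi in rows:
    s_plus, s_minus = signed_subsets(xi)
    # Zero-entry indices sit in both subsets and cancel in the difference.
    diff = additive_utility(s_plus, theta) - additive_utility(s_minus, theta)
    measurements.append(diff)
```

Under the additive model, each difference equals the signed measurement $\sum_i \xi_{r,i} \theta_i$, so two ordinary training runs per row stand in for one row with signed entries.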

[35]

are normalized within the auxiliary pool, with larger weights corresponding to a higher probability of being sampled in each training batch. In Fig. 8, across all auxiliary data weighting strategies, KMM consistently outperforms one-step selection. Notably, this holds even when the auxiliary data distribution closely reflects realistic corpus composition,...

[36]

Compared with the uniform-weighting results in Tab. 3, non-uniform softmax weighting leads to lower absolute performance for all methods in this setting. Nevertheless, One Step+KMM remains the best-performing method, and KMM continues to improve over its corresponding gradient-based baseline. Effect of LoRA Ad...

[37]

Table 17. Best-k accuracy on Danish under different training budgets. The same valuation scores are used across budgets, and larger budgets move training farther from the local Taylor approximation.

Training Steps   Random   One Step   One Step+KMM
0.25×            45.69    45.87      45.93
0.5×             45.80    46.01      46.08
2×               45.75    45.97      45.99
4×               45.56    45.87      46.01

We note that these results are no...
