DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks

Bryan Kian Hsiang Low; Chuan-Sheng Foo; Gregory Kang Ruey Lau; Zhiliang Chen

arXiv: 2502.00270 · v3 · pith:EEE3ZLYDnew · submitted 2025-02-01 · 💻 cs.LG · cs.AI · stat.ML

Zhiliang Chen, Gregory Kang Ruey Lau, Chuan-Sheng Foo, Bryan Kian Hsiang Low

Pith reviewed 2026-05-23 03:54 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords training data mixture · unseen tasks · influence functions · Bayesian optimization · LLM fine-tuning · regret bounds · data selection

The pith

DUET converges to the optimal training data mixture for an unseen task using only performance feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DUET, a method for optimizing an LLM's training data mixture when the evaluation task is unseen and its data cannot be accessed. Instead of using task data, the method relies on repeated feedback from deploying the model on the task, such as user ratings. DUET uses Bayesian optimization to search globally over mixing proportions and influence functions to select useful data locally for each candidate mixture. A cumulative-regret analysis shows that the procedure converges to the optimal mixture, and experiments on language tasks show it outperforming previous data selection and mixing approaches.

Core claim

DUET is a novel global-to-local algorithm that interleaves influence functions, used as a data selection method, with Bayesian optimization to optimize the data mixture via feedback from a specific unseen evaluation task. By analyzing DUET's cumulative regret, the paper shows that DUET converges to the optimal training data mixture for an unseen task even without any knowledge of the task's data.

What carries the argument

Global-to-local interleaving of influence functions for approximating data utility and Bayesian optimization for searching mixture proportions.
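The local half of this interleaving can be illustrated with a minimal sketch of influence-weighted sampling. It assumes the influence scores have already been computed per data point; the function name `if_weighted_sample`, the softmax weighting, the `temperature` parameter, and the rounding rule are all hypothetical choices made here for illustration, not the paper's procedure.

```python
import numpy as np

def if_weighted_sample(if_scores, ratios, n_total, temperature=1.0, seed=0):
    """Select n_total training points across domains.

    if_scores: list of 1-D arrays, one per domain, holding precomputed
               influence scores (assumed convention: higher = more useful).
    ratios:    mixing ratios over domains (non-negative, summing to 1).
    Returns a list of index arrays, one per domain.
    """
    rng = np.random.default_rng(seed)
    ratios = np.asarray(ratios, dtype=float)
    if ratios.min() < 0 or not np.isclose(ratios.sum(), 1.0):
        raise ValueError("ratios must be non-negative and sum to 1")

    # Global step: turn mixing ratios into per-domain counts
    # (largest-remainder rounding so the counts sum to n_total).
    raw = ratios * n_total
    counts = np.floor(raw).astype(int)
    remainder = n_total - counts.sum()
    order = np.argsort(-(raw - counts))
    counts[order[:remainder]] += 1

    picked = []
    for scores, k in zip(if_scores, counts):
        scores = np.asarray(scores, dtype=float)
        if k > len(scores):
            raise ValueError("domain has fewer points than requested")
        # Local step: sample without replacement, with probability
        # proportional to a softmax of the influence scores.
        logits = (scores - scores.max()) / temperature
        probs = np.exp(logits)
        probs /= probs.sum()
        picked.append(rng.choice(len(scores), size=k, replace=False, p=probs))
    return picked
```

In a full pipeline, an outer optimizer would propose `ratios`, the selected points would be used for fine-tuning, and the resulting feedback score would be returned to the optimizer.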

If this is right

DUET applies to cases where task data is encrypted or private.
The method guarantees convergence to the optimal mixture through regret bounds.
It outperforms standard data mixing methods when task data is unavailable.
Multiple rounds of model deployment feedback are sufficient to guide the optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This approach could support fine-tuning models on user-specific interactions while preserving privacy.
The framework might generalize to optimizing data for other machine learning models beyond LLMs.
Testing on tasks with known optima could validate the regret analysis in practice.

Load-bearing premise

The influence function provides a sufficiently accurate local approximation of how data affects performance on the unseen task.

What would settle it

Running DUET on a controlled task where the true optimal mixture is known in advance and observing whether it reaches that mixture or gets stuck due to poor influence estimates.
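A toy version of such a controlled check might look like the sketch below. It is not DUET: the fine-tune-and-evaluate step is replaced by a synthetic, hypothetical `feedback` function with a known optimal mixing ratio, the mixture is a single ratio between two domains, and the optimizer is a hand-written Gaussian-process UCB loop. All names and constants (`run_toy`, `true_opt`, `beta`, the kernel length scale) are assumptions for illustration.

```python
import numpy as np

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_grid, noise=1e-2):
    """GP posterior mean and std on x_grid, with a constant prior mean."""
    y_mean = y_obs.mean()
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_grid, x_obs)
    mean = y_mean + Ks @ np.linalg.solve(K, y_obs - y_mean)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def run_toy(true_opt=0.7, rounds=25, beta=2.0, seed=0):
    """Optimize a mixing ratio r in [0, 1] from noisy scalar feedback only."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 101)

    def feedback(r):
        # Stand-in for "fine-tune at ratio r, deploy, collect a rating".
        return 1.0 - (r - true_opt) ** 2 + 0.01 * rng.standard_normal()

    xs = [0.0, 1.0]                      # two initial probes
    ys = [feedback(x) for x in xs]
    for _ in range(rounds):
        mean, std = gp_posterior(np.array(xs), np.array(ys), grid)
        r_next = float(grid[np.argmax(mean + beta * std)])  # UCB acquisition
        xs.append(r_next)
        ys.append(feedback(r_next))
    best = xs[int(np.argmax(ys))]        # best ratio observed so far
    return best, xs
```

Comparing the returned `best` against `true_opt`, and repeating with a deliberately corrupted inner selection step, is one way to see whether poor influence estimates cause the search to stall.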

Figures

Figures reproduced from arXiv: 2502.00270 by Bryan Kian Hsiang Low, Chuan-Sheng Foo, Gregory Kang Ruey Lau, Zhiliang Chen.

**Figure 1.** DUET exploits a feedback loop to optimize the data mixture for an unseen evaluation task.

**Figure 2.** Empirical distribution of the uniform random and IF-driven estimator $\tilde{y}^*_r$. Red line is the true inner problem solution.

**Figure 3.** Results on unseen LLM evaluation task domains over 10 iterations (higher is better) for …

**Figure 4.** Ablation of different components of DUET.

**Figure 5.** Ablation of using different data selection methods in DUET.

**Figure 6.** Ablation of sampling size k in DUET.

**Figure 7.** (a): Empirical distribution of evaluation task accuracy …

**Figure 8.** Results on unseen LLM evaluation task domains over 10 iterations (higher …

Original abstract

The performance of an LLM depends heavily on the relevance of its training data to the downstream evaluation task. However, in practice, the data involved in an unseen evaluation task is often unknown (e.g., conversations between an LLM and a user are end-to-end encrypted). Hence, it is unclear what data are relevant for fine-tuning the LLM to maximize its performance on the specific unseen evaluation task. Instead, one can only deploy the LLM on the unseen task to gather multiple rounds of feedback on how well the model performs (e.g., user ratings). This novel setting offers a refreshing perspective towards optimizing training data mixtures via feedback from an unseen evaluation task, which prior data mixing and selection works do not consider. Our paper presents DUET, a novel global-to-local algorithm that interleaves influence function as a data selection method with Bayesian optimization to optimize data mixture via feedback from a specific unseen evaluation task. By analyzing DUET's cumulative regret, we theoretically show that DUET converges to the optimal training data mixture for an unseen task even without any data knowledge of the task. Finally, our experiments across a variety of language tasks demonstrate that DUET outperforms existing data selection and mixing methods in the unseen-task setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DUET gives a feedback-driven way to tune LLM data mixtures for unseen tasks via influence functions plus Bayesian optimization, with a regret bound, though the local approximation step looks like the weakest link.

DUET tackles data mixture optimization when the target task is completely unseen and the only signal is performance feedback, such as user ratings, rather than any task data. It runs a global-to-local loop that uses influence functions to pick data locally and Bayesian optimization to adjust the mixture weights, then proves a cumulative regret bound implying that the procedure converges to the best mixture without ever seeing the task data itself. Experiments on language tasks show it beating standard selection and mixing baselines in this setting.

The new element is the unseen-task feedback loop plus the regret guarantee; most prior mixing papers assume some form of task knowledge or access. That combination is a clear step beyond routine extensions. The experiments appear to control for the usual baselines and report consistent gains.

The soft spot sits in the theory. The regret analysis treats the influence-function estimate as a faithful enough local utility surrogate for the Bayesian step to make reliable global progress. For non-convex LLMs that approximation often carries bias or variance, and the abstract supplies no explicit error term or Lipschitz-style control on the gap when the evaluation task is unseen. If that gap is not bounded, the sub-linear regret claim does not go through.

The paper is aimed at people working on data-centric methods for deployed LLMs under privacy constraints. A reader who needs a practical recipe for mixture tuning with limited task information will find usable ideas here. It deserves peer review because the setting is realistic and the technical framing is honest, even if the approximation analysis will need tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces DUET, a global-to-local algorithm that interleaves influence-function-based data selection with Bayesian optimization to tune LLM training data mixtures using only feedback (e.g., user ratings) from an unseen evaluation task. It claims a cumulative-regret analysis proving convergence to the optimal mixture without any knowledge of the task data, and reports experimental outperformance versus existing data-selection and mixing baselines across language tasks.

Significance. If the regret bound holds, the work supplies a theoretically grounded method for data-mixture optimization in privacy-sensitive regimes where task data cannot be inspected. The global-to-local interleaving and the explicit regret guarantee are the primary contributions; the experiments provide supporting empirical evidence but are secondary to the theoretical claim.

major comments (2)

[Regret analysis (global-to-local interleaving)] Regret analysis section (derivation of cumulative regret bound): the sub-linear regret guarantee treats the influence-function estimate as a sufficiently faithful local utility surrogate for the Bayesian optimization step to make progress toward the global optimum. No explicit error term, bias bound, or Lipschitz-style control on the approximation gap is supplied when the evaluation task is unseen; this assumption is load-bearing for the convergence claim.
[Global-to-local interleaving description] Description of the influence-function step (global-to-local procedure): the analysis assumes the local approximation remains accurate enough across rounds of unseen-task feedback, yet no quantitative control is given on how non-convexity of the LLM or distribution shift between training mixtures and the unseen task affects the surrogate quality. This directly affects whether the regret bound remains valid.
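
The dependence both comments point to can be made concrete with a short worked bound. This is a sketch under an assumption the paper does not state: that the surrogate utility $\hat f$ used in the analysis differs from the true unseen-task utility $f$ by at most a uniform $\varepsilon$. The symbols $\hat f$, $\hat r^*$, $\hat R_T$, and $\varepsilon$ are introduced here for illustration only.

```latex
% Assumption (ours, not the paper's): \sup_r |\hat f(r) - f(r)| \le \varepsilon.
% r^* maximizes f, \hat r^* maximizes \hat f, r_t is the mixture chosen at round t.
\begin{align*}
R_T &= \sum_{t=1}^{T} \bigl( f(r^*) - f(r_t) \bigr) \\
    &\le \sum_{t=1}^{T} \bigl( \hat f(r^*) + \varepsilon - \hat f(r_t) + \varepsilon \bigr) \\
    &\le \sum_{t=1}^{T} \bigl( \hat f(\hat r^*) - \hat f(r_t) \bigr) + 2\varepsilon T
     \;=\; \hat R_T + 2\varepsilon T .
\end{align*}
```

Under this assumption, a sublinear surrogate regret $\hat R_T$ yields a sublinear bound on the true regret only when the gap itself vanishes; a fixed $\varepsilon > 0$ leaves a term that grows linearly in $T$.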

minor comments (2)

[Abstract] The abstract states that the regret bound is derived but does not indicate the section or equation numbers where the full proof appears, making it difficult to locate the precise assumptions.
[Experiments] Experimental section: the description of how influence functions are computed for each candidate mixture and how the Bayesian optimization acquisition function is defined could be expanded for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and insightful comments on the theoretical foundations of DUET. We address each major comment below and will incorporate clarifications into the revised manuscript.

Referee: Regret analysis section (derivation of cumulative regret bound): the sub-linear regret guarantee treats the influence-function estimate as a sufficiently faithful local utility surrogate for the Bayesian optimization step to make progress toward the global optimum. No explicit error term, bias bound, or Lipschitz-style control on the approximation gap is supplied when the evaluation task is unseen; this assumption is load-bearing for the convergence claim.

Authors: The cumulative regret bound is derived with respect to the surrogate utility defined by the influence-function estimates; under this surrogate the global-to-local interleaving yields sublinear regret. We agree that the manuscript does not supply an explicit error term, bias bound, or Lipschitz control quantifying the gap between the surrogate and the true (unseen-task) utility. In the revision we will add an explicit statement of this modeling assumption together with a short discussion of its role in the convergence claim.
revision: partial

Referee: Description of the influence-function step (global-to-local procedure): the analysis assumes the local approximation remains accurate enough across rounds of unseen-task feedback, yet no quantitative control is given on how non-convexity of the LLM or distribution shift between training mixtures and the unseen task affects the surrogate quality. This directly affects whether the regret bound remains valid.

Authors: Non-convexity of the LLM loss and distribution shift between training mixtures and the unseen evaluation task can indeed degrade surrogate quality. The current analysis treats the influence function as a first-order local approximation and establishes regret relative to that surrogate; the global-to-local loop uses fresh feedback to periodically refresh the selection. We will revise the text to state clearly that the regret guarantee is conditional on the surrogate remaining sufficiently faithful and to note that quantitative controls on non-convexity and shift effects are left for future work.
revision: partial

Circularity Check

0 steps flagged

No circularity: regret bound is a standard analysis under stated assumptions

The paper's central claim is a cumulative-regret bound showing convergence of the DUET interleaving of influence functions and Bayesian optimization to the optimal mixture for an unseen task. The abstract and description present this as a derived theoretical result rather than a tautology, fit, or reduction to prior self-citation. No equations or text in the supplied material exhibit self-definitional structure, a fitted parameter renamed as prediction, or load-bearing self-citation. The influence-function approximation is treated as an explicit modeling assumption whose validity is external to the bound itself; this does not constitute circularity under the evaluation criteria. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard regret analysis for Bayesian optimization and the validity of influence functions as a data utility proxy; no new entities are introduced and no free parameters are explicitly fitted in the abstract description.

axioms (2)

domain assumption: Influence functions yield a reliable local estimate of data point importance for the current model.
Invoked when the algorithm interleaves influence-based selection with the Bayesian optimization loop.

domain assumption: The black-box feedback function satisfies conditions that allow standard Bayesian optimization regret bounds to apply.
Required for the cumulative regret analysis to conclude convergence to the optimal mixture.
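
For orientation, the second axiom refers to the kind of guarantee that standard Gaussian-process bandit analyses provide. The form below is the generic GP-UCB-style bound, given as background rather than as the paper's exact theorem; $\beta_T$ and $\gamma_T$ are the usual confidence-width and maximum-information-gain quantities from that literature, and the bound holds with high probability under that literature's regularity conditions on $f$.

```latex
% Generic GP-UCB-style cumulative regret (background, not the paper's statement):
\[
R_T \;=\; \sum_{t=1}^{T} \bigl( f(r^*) - f(r_t) \bigr)
     \;=\; O\!\left( \sqrt{T \, \beta_T \, \gamma_T} \right),
\qquad
\min_{t \le T} \bigl( f(r^*) - f(r_t) \bigr) \;\le\; \frac{R_T}{T}.
\]
% If \beta_T \gamma_T grows slower than T (for example polylogarithmically,
% as with a squared-exponential kernel), then R_T / T \to 0 and the best
% mixture queried so far approaches the optimum.
```

The second inequality is why a sublinear cumulative regret is read as convergence: the smallest per-round gap cannot exceed the average gap.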

pith-pipeline@v0.9.0 · 5761 in / 1369 out tokens · 32470 ms · 2026-05-23T03:54:46.289006+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks
cs.LG · 2026-05 · conditional · novelty 8.0

BoLT is a benchmark of surrogate models fitted to real LLM experiment data that enables evaluation of Bayesian and black-box optimization methods on multi-fidelity, multi-objective, high-dimensional LLM tasks.

Data Mixing for Large Language Models Pretraining: A Survey and Outlook
cs.CL · 2026-03 · accept · novelty 4.0

A survey that taxonomizes data mixing strategies for LLM pretraining into static rule-based, learning-based, and dynamic adaptive families while highlighting transferability challenges and evaluation gaps.


[1] [1]

Efficient online data mixing for language model pre-training

Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. Efficient online data mixing for language model pre-training. arXiv:2312.02406,

work page arXiv

[2] [2]

arXiv preprint arXiv:2402.16827

Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and William Yang Wang. A survey on data selection for language models. arXiv:2402.16827,

work page arXiv

[3] [3]

Chen, Michael Y

Mayee F. Chen, Michael Y . Hu, Nicholas Lourie, Kyunghyun Cho, and Christopher Ré. Aioli: A unified optimization framework for language model data mixing. arXiv:2411.05735, 2024a. Zhiliang Chen, Chuan-Sheng Foo, and Bryan Kian Hsiang Low. Towards AutoAI: Optimizing a machine learning system with black-box and differentiable components. In Proc. ICML, 2024...

work page arXiv

[4] [4]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Doge: Domain reweighting with generalization estimation

Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge: Domain reweighting with generalization estimation. arXiv:2310.15393,

work page arXiv

[6] [6]

He, B., Yin, L., Zhen, H.-L., Liu, S., Wu, H., Zhang, X., Yuan, M., and Ma, C

URL https://zenodo.org/records/12608602. Jacob Gardner, Matt Kusner, Xu Zhixiang, Kilian Weinberger, and John Cunningham. Bayesian optimization with inequality constraints. In Proc. ICML,

work page arXiv

[7] [7]

Bimix: A bivariate data mixing law for language model pretraining

Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, and Bolin Ding. Bimix: A bivariate data mixing law for language model pretraining. arXiv:2405.14908,

work page arXiv

[8] [8]

Ziyao Guo, Kai Wang, George Cazenavette, Hui Li, Kaipeng Zhang, and Yang You

doi: 10.1109/ACCESS.2020.2966228. Ziyao Guo, Kai Wang, George Cazenavette, Hui Li, Kaipeng Zhang, and Yang You. Towards lossless dataset distillation via difficulty-aligned trajectory matching. arXiv:2310.05773,

work page doi:10.1109/access.2020.2966228 2020

[9] [9]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv:2106.09685,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Fastshap: Real-time shapley value estimation

Neil Jethani, Mukund Sudarshan, Ian Covert, Su-In Lee, and Rajesh Ranganath. Fastshap: Real-time shapley value estimation. arXiv:2107.07436,

work page arXiv

[11] [11]

PubMedQA: A Dataset for Biomedical Research Question Answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv:1909.06146,

work page internal anchor Pith review arXiv 1909

[12] [12]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv:1705.03551,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Generalizing the german tank problem

Anthony Lee and Steven J Miller. Generalizing the german tank problem. arXiv:2210.15339,

work page arXiv

[14] [14]

Human- centered privacy research in the age of large language models

Tianshi Li, Sauvik Das, Hao-Ping Lee, Dakuo Wang, Bingsheng Yao, and Zhiping Zhang. Human- centered privacy research in the age of large language models. arXiv:2402.01994,

work page arXiv

[15] [15]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv:2109.07958,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv:1609.07843,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Coresets for data-efficient training of machine learning models

Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. arXiv:1906.01827,

work page arXiv 1906

[18] [18]

Domain Generalization via Invariant Feature Representation

Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. arXiv:1301.2115,

work page internal anchor Pith review Pith/arXiv arXiv

[19] Garima Pruthi, Frederick Liu, Mukund Sundararajan, and Satyen Kale. Estimating training data influence by tracing gradient descent. arXiv:2002.08484.

[20] Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, ... Qwen2.5 technical report.

[21] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv:1811.00937.

[22] Daniel Ting and Eric Brochu. Optimal sub-sampling with influence functions. arXiv:1709.01716.

[23] Kushal Tirumala, Aram H. Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models. arXiv:2205.10770.

[24] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv:2302.13971.

[25] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip S. Yu. Generalizing to unseen domains: A survey on domain generalization. arXiv:2103.03097.

[26] Jingtan Wang, Xiaoqiang Lin, Rui Qiao, Chuan-Sheng Foo, and Bryan Kian Hsiang Low. Helpful or harmful data? Fine-tuning-free Shapley attribution for explaining language model predictions. In Proc. ICML, 2024a.

Peiqi Wang, Yikang Shen, Zhen Guo, Matthew Stallone, Yoon Kim, Polina Golland, and Rameswar Panda. Diversity measurement and subset selection for i...

[27] Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. LESS: Selecting influential data for targeted instruction tuning. arXiv:2402.04333.

[28] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv:1905.07830.

[29] Wenyu Zhang, Li Shen, Wanyue Zhang, and Chuan-Sheng Foo. Few-shot adaptation of pre-trained networks for domain shift. arXiv:2205.15234.

[30] Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Chao Shen, Tianlin Li, Weipeng Jiang, and Yang Liu. Speculative coreset selection for task-specific fine-tuning. arXiv:2410.01296.

[31] A Technical Appendices and Supplementary Material. B Additional Discussions. B.1 Real-world examples of our problem setting. In our problem setting, (a) there is no direct access to the data (e.g., its domain, distribution, or labels) involved in the unseen evaluation task but (b) multiple rounds of coarse feedback (details covered in Sec. 2.2) can be gather...

[32] showed that training a model with strategically selected data points allows it to perform better. In addition, data mixing works (Xie et al., 2023; Ge et al., 2025; Albalak et al.,

[33] irrelevant information that is difficult to overwrite in later BO iterations. DUET for extremely large datasets used in pre-training. We can amortize the computational cost of IF computation by pre-computing and storing the IF values beforehand (App. B.4) in our paper’s fine-tuning setting. However, the size of datasets used in pre-training could be extremel...

[34] In our algorithm, we repeat this procedure for every data domain. IF values can be pre-computed and stored. In addition, we just need to pre-compute the IF values of every data point once before reusing them repeatedly at every BO iteration to perform IF-weighted sampling. This greatly improves our algorithm’s efficiency and runtime, as compared to other...
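The pre-compute-once, reuse-at-every-iteration pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `if_weighted_sample`, the softmax weighting with a `temperature` parameter, the convention that a higher IF score marks a more useful point, and the synthetic scores are all hypothetical. Weighted sampling without replacement is done with the Efraimidis-Spirakis key trick, which is a swapped-in standard technique rather than something the text specifies.

```python
import math
import random


def if_weighted_sample(if_scores, ratios, budget, temperature=1.0, seed=0):
    """Pick data indices per domain, given a mixing ratio and stored IF scores.

    if_scores: list of non-empty score lists, one per domain (computed once).
    ratios:    mixing ratio over domains (non-negative, sums to 1).
    budget:    total number of training points to draw.
    Returns one sorted list of selected indices per domain.
    """
    if any(r < 0 for r in ratios) or not math.isclose(sum(ratios), 1.0):
        raise ValueError("ratios must be non-negative and sum to 1")
    rng = random.Random(seed)
    picks = []
    for scores, r in zip(if_scores, ratios):
        # Per-domain quota; rounding may make quotas differ slightly from budget.
        k = min(round(r * budget), len(scores))
        top = max(scores)
        # Assumption: softmax-style weights, higher IF score = more useful.
        weights = [max(math.exp((s - top) / temperature), 1e-300) for s in scores]
        # Efraimidis-Spirakis: the k largest log(u)/w keys form a weighted
        # sample without replacement.
        keys = [(math.log(1.0 - rng.random()) / w, i) for i, w in enumerate(weights)]
        keys.sort(reverse=True)
        picks.append(sorted(i for _, i in keys[:k]))
    return picks


# Synthetic stand-ins for IF scores that would be computed once and stored.
data_rng = random.Random(1)
if_scores = [[data_rng.gauss(0, 1) for _ in range(n)] for n in (100, 80, 60)]

# Each BO iteration proposes a new mixing ratio and reuses the stored scores.
for ratios in ([0.5, 0.3, 0.2], [0.1, 0.1, 0.8]):
    subset = if_weighted_sample(if_scores, ratios, budget=10)
    print(ratios, [len(idx) for idx in subset])
```

Only the mixing ratio changes between iterations here, so the per-iteration cost is a sort over stored scores rather than a fresh IF computation.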


[35] w.r.t. $\delta_1 = \sqrt{\delta}$
• $\overset{(4)}{\leq}$ uses Chebyshev’s inequality over $\epsilon_t$ with probability at least $1 - \delta_2$.
• $\overset{(5)}{=}$ uses $\sum_{t=1}^{T} \sigma_{t-1}(x_t) \leq O(\sqrt{T \gamma_T})$, as shown in Lemma 4 of Chowdhury & Gopalan (2017).
• $\overset{(6)}{=}$ uses the fact that $\epsilon_t$ is bounded on $[0, c]$ and all bounded random variables are $R$-sub-Gaussian with $R = \frac{c^2}{4}$ (Arbel et al., 2019).
Next, we ...

[36] Derive attainable cumulative regret. Lastly, we analyze the convergence rate of our algorithm using the growth of the attained cumulative regret (Chen et al., 2024b) $\tilde{R}_T = \sum_{t=1}^{T} |\tilde{y}^{*}_{r_t} - f(r_t)| = \sum_{t=1}^{T} |f(r^*) + \epsilon_t - f(r_t)|$ for $T$ BO iterations. Since the error term $\epsilon_t$ has the same expectation and variance as our estimator, we can use the results from Step...