
arxiv: 2606.30077 · v1 · pith:4YKN2PYMnew · submitted 2026-06-29 · 💻 cs.LG · cs.AI

Online Data Selection for Instruction Tuning via Gaussian Processes

Pith reviewed 2026-06-30 06:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords data selection · instruction tuning · Gaussian processes · online learning · LLM fine-tuning · dynamic regret · hedge algorithm

The pith

GAIA uses global Gaussian Process estimation to select high-utility data during online LLM instruction tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to move online data selection for LLM instruction tuning beyond methods that only look inside random batches. It introduces GAIA, which treats data utility as a continuous function over the full semantic space and estimates that function with Gaussian Process regression. An adaptive fusion step then prioritizes samples according to the current estimate. The update rule is rewritten as an instance of the fixed-share Hedge algorithm, which supplies a dynamic-regret bound that holds even when data quality changes during training. A sympathetic reader would care because modern LLM work has shifted emphasis from data volume to data quality, so a method that reliably finds the better samples could reduce compute while raising final performance.

Core claim

GAIA formulates data valuation as a global estimation process that uses Gaussian Process regression to model continuous utility manifolds across semantic space. An adaptive strategy fusion mechanism dynamically prioritizes high-utility samples. By casting the strategy-posterior update as an instance of the classical fixed-share Hedge framework, the method inherits a dynamic-regret guarantee that characterizes robustness under non-stationary quality scores. Empirical results on three datasets show significant outperformance over baselines such as GREATS.

What carries the argument

Gaussian Process regression that models the continuous utility manifold in semantic space, updated through an adaptive fusion rule derived from the fixed-share Hedge algorithm.
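The fixed-share Hedge update named above can be illustrated with a minimal sketch. This is the textbook form of the algorithm (exponential weighting followed by uniform mixing), not necessarily the paper's exact fusion rule; the function name, the learning rate `eta`, the share rate `alpha`, and the toy loss sequence are all assumptions made for illustration.

```python
import math


def fixed_share_hedge_step(weights, losses, eta=1.0, alpha=0.05):
    """One fixed-share Hedge update over N strategies (experts).

    weights: current probability vector over strategies
    losses:  per-strategy loss observed this round (assumed to lie in [0, 1])
    eta:     learning rate (hypothetical default)
    alpha:   share rate; alpha = 0 recovers plain Hedge
    """
    n = len(weights)
    # Exponential-weights step: down-weight strategies that incurred loss.
    scaled = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(scaled)
    posterior = [s / total for s in scaled]
    # Fixed-share step: mix in a uniform share so no weight collapses to zero.
    return [(1.0 - alpha) * p + alpha / n for p in posterior]


# Toy non-stationary run: strategy 0 is best for 50 rounds, then strategy 2.
weights = [1.0 / 3] * 3
for t in range(100):
    best = 0 if t < 50 else 2
    losses = [0.0 if i == best else 1.0 for i in range(3)]
    weights = fixed_share_hedge_step(weights, losses)

print([round(w, 3) for w in weights])
```

The uniform share keeps every strategy's weight at or above `alpha / N`, which is what lets the weights move back toward a strategy that becomes the best one partway through the run.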

If this is right

  • GAIA significantly outperforms state-of-the-art batch-constrained baselines on three instruction-tuning datasets.
  • The method remains robust when quality scores change during the course of training.
  • It supplies a scalable procedure for efficient instruction tuning that still carries a dynamic-regret bound.
  • The regret analysis directly quantifies how well the selector tracks the best data under non-stationary conditions.
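For reference, the quantity that a dynamic-regret bound controls can be written as follows. The notation (weights $w_{t,i}$ over $N$ strategies, losses $\ell_{t,i}$, and a comparator sequence $i_1,\dots,i_T$ with at most $m$ switches) is the standard one for tracking the best expert and is assumed here; the paper's own symbols may differ.

```latex
% Dynamic (tracking) regret of the selector against a switching comparator.
% w_{t,i}: weight on strategy i at round t;  \ell_{t,i}: its loss at round t.
R_T(i_{1:T})
  = \sum_{t=1}^{T} \sum_{i=1}^{N} w_{t,i}\,\ell_{t,i}
  \;-\; \sum_{t=1}^{T} \ell_{t,i_t},
\qquad
\bigl|\{\, t < T : i_t \neq i_{t+1} \,\}\bigr| \le m .
```

With $m = 0$ the comparator is a single fixed strategy and the expression reduces to ordinary static regret; allowing $m > 0$ is what makes the guarantee meaningful when the best strategy changes during training.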

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same global-manifold view could be tested on other sequential selection problems where the value of each item drifts over time.
  • If the manifold assumption weakens at very large model scales, replacing the Gaussian Process with a neural surrogate would be a direct next experiment.
  • The approach suggests a general template for turning any online expert-tracking algorithm into a data-selection procedure by reinterpreting posterior weights as utility estimates.

Load-bearing premise

Data utility forms a continuous manifold in semantic space that Gaussian Process regression can reliably estimate from the observed samples during training.
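To make the premise concrete, here is a dependency-free sketch of Gaussian Process regression used as a utility estimator. It assumes an RBF kernel, a zero prior mean, fixed hyperparameters, and 2-D toy "embeddings" with made-up utility scores; none of these choices is taken from the paper, and all function names are hypothetical. A real system would more likely rely on a GP library.

```python
import math


def rbf(x, y, length_scale=1.0):
    """RBF (squared-exponential) kernel between two embedding vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq / length_scale ** 2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_cholesky(low, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def gp_posterior_mean(train_x, train_y, query_x, noise=1e-2, length_scale=1.0):
    """Posterior mean k(q, X) (K + noise * I)^{-1} y under a zero prior mean."""
    n = len(train_x)
    k = [[rbf(train_x[i], train_x[j], length_scale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    coeffs = solve_cholesky(cholesky(k), train_y)
    return [sum(rbf(q, x, length_scale) * c for x, c in zip(train_x, coeffs))
            for q in query_x]


# Toy warm-up set: embeddings with observed utility scores (invented numbers).
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
train_y = [0.9, 0.2, 0.4, 0.1]
# Unscored pool: near a high-utility point, near a low one, and far from all.
pool = [(0.1, 0.1), (0.9, 0.9), (5.0, 5.0)]
scores = gp_posterior_mean(train_x, train_y, pool)
print([round(s, 3) for s in scores])
```

The far-away pool point receives a score close to the prior mean of zero, which shows the flip side of the premise: the estimate is only informative where scored samples lie within roughly a length scale in embedding space.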

What would settle it

On a new dataset the data points selected by GAIA produce no measurable gain in final model accuracy or loss compared with random or batch-constrained selection, or the Gaussian Process predictions show no correlation with the actual performance contribution of each sample.
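The second half of this test, checking whether predictions track realized contributions, could be run as a rank-correlation check. The sketch below computes Spearman's rho in plain Python; how the "realized contribution" of a sample is measured is left open here, and the numbers are invented for illustration only.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman(pred, actual):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(pred), ranks(actual))


# Hypothetical values: predicted utility vs. realized contribution per sample.
predicted = [0.9, 0.1, 0.5, 0.7, 0.3]
realized = [0.8, 0.2, 0.4, 0.9, 0.1]
rho = spearman(predicted, realized)
print(round(rho, 3))  # 0.8 for this toy data
```

A rho near zero on held-out samples would count as evidence against the premise, while a consistently positive rho would support it without by itself proving that the downstream gains come from the global formulation.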

Figures

Figures reproduced from arXiv: 2606.30077 by Julien Monteil, Jun Wang, Quoc Phong Nguyen, Vu Nguyen.

Figure 1: Workflow. Our framework operates in two distinct phases to decouple valuation from training: (1) Warm-up & GP Training: A lightweight Embedding Model maps the raw data pool into a semantic embedding space. Concurrently, a warm-up training phase on a small dataset collects initial utility scores. These scores and embeddings are used to fit the Gaussian Process (GP) model, establishing an initial global u…
Figure 2: Training dynamics on MMLU-Sociology (top row) and SAMSUM (bottom row). We benchmark GAIA-GREATS against full-data training (Regular) and various selection baselines. The left and middle columns report validation and test perplexity (lower is better), while the right column presents downstream task performance (Accuracy for MMLU, ROUGE for SAMSUM). GAIA-GREATS demonstrates superior convergence speed and rob…
Figure 3: Performance comparison of four different scoring functions on …
Figure 4: Cumulative data value scores (LLAMA-2-7B, MMLU-Sociology). As illustrated in …
Figure 5: Impact of strategy number on test perplexity for …
Figure 6: Training dynamics on MMLU-Sociology using the LLAMA-2-7B backbone, comparing downstream performance when using alternative surrogate models for data sampling.
Figure 7: Figure 7 illustrates the convergence and downstream performance on the TYDIQA dataset. Although GP-GREATS exhibits a slightly more gradual decline in evaluation and test perplexity during the initial training phase, it eventually achieves a superior convergence floor compared to GREATS. Specifically, the final perplexity levels (both eval and test) of our method are marginally lower than those of the strongest base…
Figure 8: Training dynamics on MMLU-Sociology and SAMSUM using the Qwen3-4B backbone. We compare GAIA-GREATS against Regular training and the GREATS baseline to evaluate cross-architecture generalization. … over both Regular training and the strong GREATS baseline. On the MMLU-Sociology task, GAIA achieves a noticeably faster reduction in both evaluation and test perplexity, which directly translates to a higher and mo…
Figure 9: Performance comparison on additional MMLU subjects (Abstract Algebra, Anatomy, US Foreign Policy, Astronomy). We evaluate GAIA-GREATS against Regular training and the GREATS baseline, reporting both test perplexity (PPL) and accuracy. Notably, our method achieves a significant and consistent reduction in PPL across all tasks. While the accuracy metrics exhibit some variance, typical for the inherent stocha…
Figure 10: Visualization of sampling dynamics on MMLU-Sociology over the course of training. The scatter plot maps sampled data indices (y-axis) against training steps (x-axis), with colors representing the distinct strategies from which samples were drawn. The distinct clustering observed in the initial phase demonstrates that our method prioritizes high-value data first. As training progresses, the selection becom…
Figure 11: Impact of sampling temperature on test perplexity for …
Figure 13: Impact of warm-up dataset size on test perplexity ( …
Original abstract

With Large Language Model (LLM) pre-training and fine-tuning shifting its focus from data volume to data quality, quality data selection has emerged as a critical research topic. Existing online data selection methods for LLM training are typically "batch-constrained", limiting optimization to local utility within random batches. To overcome this, we propose GAIA (Global Adaptive Instruction tuning via GAussian processes), a framework that formulates data valuation as a global estimation process. GAIA employs Gaussian Process regression to model continuous utility manifolds across the semantic space, utilizing an adaptive strategy fusion mechanism to dynamically prioritize high-utility samples. By casting the strategy-posterior update as an instance of the classical fixed-share Hedge framework for tracking the best expert, we inherit a dynamic-regret guarantee that characterizes GAIA's robustness under non-stationary quality scores during training. Empirical evaluations on three datasets demonstrate that GAIA significantly outperforms state-of-the-art baselines like GREATS, establishing our method as a scalable and robust solution for efficient instruction tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes GAIA for online data selection in LLM instruction tuning. It models data utility as a continuous manifold in semantic space via Gaussian Process regression for global estimation, uses an adaptive strategy fusion mechanism to prioritize samples, and reduces the strategy-posterior update to the classical fixed-share Hedge framework to inherit a dynamic-regret guarantee under non-stationary quality scores. Empirical results on three datasets claim significant outperformance over baselines including GREATS.

Significance. If the GP regression reliably produces global utility estimates from the limited online observations and the reduction to fixed-share Hedge is exact (without hidden bias from the GP posterior), the work would supply a theoretically grounded global alternative to batch-constrained selection methods together with a dynamic-regret bound. The explicit inheritance of an existing regret guarantee is a positive feature when the mapping is shown to preserve the required conditions.

major comments (1)
  1. [Abstract] The central claim that Gaussian Process regression yields reliable global utility estimates across semantic space (used both for prioritization and to justify the exact fixed-share Hedge reduction) rests on the unexamined assumption that the utility surface is sufficiently smooth and that sparse online samples suffice for accurate extrapolation. No derivation, kernel specification, or error analysis is supplied to support this step; if the GP estimates are no better than local heuristics, the dynamic-regret bound applies only to a noisy expert and the reported gains over GREATS cannot be attributed to the global formulation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the Gaussian Process assumptions. We address the concern point-by-point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claim that Gaussian Process regression yields reliable global utility estimates across semantic space (used both for prioritization and to justify the exact fixed-share Hedge reduction) rests on the unexamined assumption that the utility surface is sufficiently smooth and that sparse online samples suffice for accurate extrapolation. No derivation, kernel specification, or error analysis is supplied to support this step; if the GP estimates are no better than local heuristics, the dynamic-regret bound applies only to a noisy expert and the reported gains over GREATS cannot be attributed to the global formulation.

    Authors: We agree the abstract is concise and omits explicit kernel details and error bounds. The method section specifies a Gaussian Process with RBF kernel over semantic embeddings, where the posterior mean provides the global utility estimate used for prioritization and strategy fusion. The fixed-share Hedge reduction is exact on the sequence of fused posteriors and the dynamic-regret guarantee holds irrespective of GP accuracy (it bounds regret against the best dynamic expert sequence). Empirical gains over GREATS are shown via ablations that isolate the global component. In revision we will add a subsection with kernel specification, the standard GP smoothness assumption, and a reference to posterior convergence rates under accumulating observations to justify extrapolation from sparse samples.

    revision: yes

Circularity Check

0 steps flagged

No circularity: external inheritance of regret bound and independent GP modeling

full rationale

The paper's derivation casts the strategy-posterior update explicitly as an instance of the classical fixed-share Hedge framework (an external result) to inherit its dynamic-regret guarantee, while using Gaussian Process regression to model utility as a continuous manifold. Neither step reduces a claimed prediction or result to its own inputs by definition, nor relies on load-bearing self-citation, ansatz smuggling, or renaming of known results. The outperformance claim is presented as an empirical outcome rather than a mathematical necessity forced by the method's construction. The chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are extractable beyond the high-level modeling choice of continuous utility manifold.

pith-pipeline@v0.9.1-grok · 5702 in / 928 out tokens · 25355 ms · 2026-06-30T06:58:34.948127+00:00 · methodology

