Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
Pith reviewed 2026-05-10 07:04 UTC · model grok-4.3
The pith
Dynamic online reweighting of training samples during LLM training yields stronger cross-benchmark generalization than static offline data selection or mixing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ADAPT reweights training samples with adaptive per-sample learning rates guided by similarity-based quality signals, without altering the number of training samples. In doing so it acts as an implicit curriculum learner, progressively shifting focus from coarse-grained patterns to fine-grained semantic distinctions, and it consistently outperforms offline selection/mixing and prior online methods in cross-benchmark generalization under equal FLOPs.
What carries the argument
ADAPT, the online framework that dynamically adjusts per-sample learning rates using similarity-based quality signals to reweight loss contributions during training.
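For orientation, here is a minimal sketch of how per-sample loss reweighting could sit inside an ordinary training step, assuming quality scores are supplied from elsewhere. The HuggingFace-style model interface, the softmax normalization, and the function name are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def reweighted_training_step(model, optimizer, batch, quality_scores, temperature=1.0):
    """One training step with per-sample loss weights derived from quality scores.

    `quality_scores` is a 1-D tensor with one score per sample in the batch; how
    those scores are produced is the method-specific part and is only sketched here.
    """
    logits = model(batch["input_ids"]).logits             # (B, T, V), HF-style causal LM (assumption)
    # Token-level cross-entropy kept per sample (no reduction across the batch).
    per_token = F.cross_entropy(
        logits[:, :-1].transpose(1, 2),                    # (B, V, T-1)
        batch["input_ids"][:, 1:],                         # next-token targets
        reduction="none",
    )
    per_sample = per_token.mean(dim=1)                     # (B,)

    # Map quality scores to weights with a temperature-scaled softmax over the batch,
    # so weights stay positive and sum to 1 (one plausible normalization, not the only one).
    weights = torch.softmax(quality_scores / temperature, dim=0)

    # Weighted, not filtered: every sample still contributes, matching the claim
    # that the number of training samples is unchanged.
    loss = (weights * per_sample).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Scaling a sample's loss contribution is equivalent to scaling its learning rate, which is why the weights can be read as "adaptive per-sample learning rates" in the sense used by the abstract.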
If this is right
- Stronger generalization across diverse benchmarks while preserving full training data size and diversity.
- Lower engineering overhead because the curation process no longer requires separate offline re-runs after model or task shifts.
- Implicit curriculum behavior that automatically adapts focus as model capabilities evolve during a single training run.
- Consistent gains over both offline hard-filtering methods and earlier online reweighting approaches under identical compute budgets.
Where Pith is reading between the lines
- The online reweighting principle could extend to other sequential training regimes such as reinforcement learning from human feedback where data quality signals also evolve.
- Industry pipelines that currently invest heavily in static data filtering might reduce those costs if dynamic per-sample adjustment proves stable at larger scales.
- Combining ADAPT-style signals with existing curriculum or difficulty-based schedulers offers a testable route to further gains without additional data selection steps.
Load-bearing premise
Similarity-based quality signals can reliably steer per-sample learning rate changes throughout training without introducing instability, bias, or loss of diversity.
What would settle it
A controlled experiment in which ADAPT produces equal or lower cross-benchmark scores than the best offline baseline when total FLOPs are matched would falsify the performance claim.
Original abstract
Data curation is a critical yet under-explored area in large language model (LLM) training. Existing methods, such as data selection and mixing, operate in an offline paradigm, detaching themselves from training. This separation introduces engineering overhead and makes the curation brittle: the entire pipeline must be re-run under model/task shifts. Moreover, offline methods alter data size through hard filtering or resampling, often sacrificing data diversity and harming generalization. We propose to rethink data curation as an online reweighting problem, where sample importance is dynamically adjusted during training via loss weighting rather than static pre-processing. Specifically, we introduce ADAPT (Adaptive Data reweighting for Pretraining and FineTuning), a dynamic online framework that reweights training samples with adaptive per-sample learning rates guided by similarity-based quality signals, without changing the number of training samples. Unlike offline methods that enforce a static data distribution, ADAPT acts as an implicit curriculum learner, progressively shifting focus from coarse-grained patterns to fine-grained semantic distinctions as the model evolves. Experiments on both instruction tuning and large-scale pretraining show that ADAPT consistently outperforms offline selection/mixing and prior online methods, achieving stronger cross-benchmark generalization under equal FLOPs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes rethinking data curation for LLM training as an online reweighting problem rather than offline selection or mixing. It introduces ADAPT, which dynamically adjusts per-sample learning rates during training using similarity-based quality signals to act as an implicit curriculum, preserving the full data size while shifting focus from coarse to fine-grained patterns. The central claim is that this yields stronger cross-benchmark generalization than offline methods and prior online approaches under equal FLOPs, demonstrated on instruction tuning and large-scale pretraining tasks.
Significance. If the empirical claims hold with robust validation, the work offers a practical alternative to brittle offline curation pipelines, potentially reducing engineering overhead and improving generalization by maintaining data diversity. The implicit-curriculum framing is a useful conceptual contribution, though its significance hinges on whether the dynamic signals deliver stable, attributable gains rather than optimization artifacts.
major comments (3)
- [Method] Method section (description of ADAPT): the similarity-based quality signals lack concrete specification of the reference model, similarity metric, update frequency, and normalization. Without these, it is impossible to assess whether the per-sample learning-rate adjustments provide reliable guidance or merely add gradient variance, directly undermining attribution of the claimed outperformance to the online mechanism.
- [Experiments] Experiments section: no ablation or analysis is provided on robustness to signal noise, reference-model choice, or resulting data diversity. The central claim that ADAPT outperforms offline methods under equal FLOPs therefore rests on an untested assumption that the dynamic process functions as a stable curriculum rather than introducing instability or bias.
- [Results] Results tables/figures: while cross-benchmark gains are asserted, the manuscript does not report error bars, statistical significance tests, or controls for total compute that would allow verification that the reported improvements exceed what could arise from hyperparameter tuning alone.
minor comments (2)
- [Abstract] The abstract would benefit from one or two key quantitative results (e.g., average improvement and number of benchmarks) to ground the performance claims.
- [Method] Notation for the per-sample learning-rate adjustment should be formalized with an equation to clarify how the similarity signal maps to the weighting factor.
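One way the requested equation could look, offered as a hedged sketch rather than the authors' actual definition (the similarity target, the softmax normalization, and the learning-rate scaling are assumptions consistent with the rebuttal below):

```latex
% Plausible formalization (an assumption, not the paper's definition):
%   s_i    similarity-based quality signal for sample i
%   w_i    normalized weight over the current batch B
%   eta_i  effective per-sample learning rate, with base rate eta
\[
  s_i = \cos\!\left(\bar{h}_i,\; \bar{h}_{\mathrm{ref}}\right),
  \qquad
  w_i = \frac{\exp(s_i/\tau)}{\sum_{j \in \mathcal{B}} \exp(s_j/\tau)},
  \qquad
  \eta_i = \eta \,\lvert \mathcal{B} \rvert\, w_i ,
\]
```

Here \(\bar{h}_i\) would be the mean-pooled hidden state of sample \(i\), \(\bar{h}_{\mathrm{ref}}\) a reference representation, and \(\tau\) a temperature; uniform weights recover the base learning rate \(\eta\).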
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We have revised the manuscript to address the concerns about method specification, experimental robustness, and statistical reporting. Our point-by-point responses follow, with changes noted for each item.
Point-by-point responses
-
Referee: [Method] Method section (description of ADAPT): the similarity-based quality signals lack concrete specification of the reference model, similarity metric, update frequency, and normalization. Without these, it is impossible to assess whether the per-sample learning-rate adjustments provide reliable guidance or merely add gradient variance, directly undermining attribution of the claimed outperformance to the online mechanism.
Authors: We agree that the original description was insufficiently precise for full reproducibility and attribution. In the revised manuscript we have expanded the ADAPT method section with a new implementation subsection that explicitly states: the reference model is the model being trained (using its current parameters for on-the-fly similarity), the metric is cosine similarity computed on mean-pooled hidden states from a designated intermediate layer, updates to the quality signals occur every 50 training steps, and normalization is performed via a temperature-scaled softmax over the current batch to produce stable per-sample multipliers. These additions demonstrate that the adjustments follow a directed, evolving curriculum rather than injecting unstructured variance, directly supporting the attribution of gains to the online mechanism. revision: yes
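Taking the rebuttal's description at face value, a hedged reconstruction of the signal computation might look like the following. The choice of reference batch, layer index, and temperature are assumptions made for illustration, and per the rebuttal the signals would be recomputed roughly every 50 steps and cached in between, not recomputed at every step.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def quality_signals(model, batch, reference_batch, layer=-2, temperature=0.5):
    """Similarity-based quality signals, reconstructed as a sketch from the rebuttal:
    cosine similarity of mean-pooled hidden states from an intermediate layer against
    a reference representation, normalized with a temperature-scaled softmax over the
    current batch. The reference batch and layer index are illustrative assumptions.
    """
    def pooled(input_ids):
        out = model(input_ids, output_hidden_states=True)   # HF-style interface (assumption)
        h = out.hidden_states[layer]                          # (B, T, d) hidden states of chosen layer
        return h.mean(dim=1)                                  # mean-pool over tokens -> (B, d)

    batch_repr = pooled(batch["input_ids"])                             # (B, d)
    ref_repr = pooled(reference_batch["input_ids"]).mean(dim=0)         # single reference vector (d,)

    sims = F.cosine_similarity(batch_repr, ref_repr.unsqueeze(0), dim=-1)  # (B,)
    # Temperature-scaled softmax over the batch yields positive multipliers summing to 1.
    return torch.softmax(sims / temperature, dim=0)
```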
-
Referee: [Experiments] Experiments section: no ablation or analysis is provided on robustness to signal noise, reference-model choice, or resulting data diversity. The central claim that ADAPT outperforms offline methods under equal FLOPs therefore rests on an untested assumption that the dynamic process functions as a stable curriculum rather than introducing instability or bias.
Authors: We acknowledge the value of these ablations for validating stability. The revised version includes three new analyses: (1) robustness to signal noise, where we inject controlled Gaussian noise into the quality signals and show that performance degrades gracefully without collapsing; (2) reference-model sensitivity, comparing the default evolving model against fixed smaller and larger reference models with only minor variance in final results; and (3) data diversity tracking via effective sample entropy over training, confirming that ADAPT preserves broader coverage than offline filtering while still emphasizing finer patterns. These results support that the dynamic process operates as a stable implicit curriculum under equal FLOPs. revision: yes
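As one concrete reading of the diversity-tracking ablation, "effective sample entropy" could be measured as the entropy of the per-sample weight distribution, exponentiated to give an effective sample count. This is a sketch of a plausible metric, not the paper's exact definition.

```python
import torch

def effective_sample_entropy(weights, eps=1e-12):
    """Entropy of the per-sample weight distribution and its exponential
    (an "effective number of samples"). Higher values indicate that training
    weight is spread over more samples, i.e. diversity is preserved.
    """
    p = weights / weights.sum()
    entropy = -(p * (p + eps).log()).sum()
    return entropy.item(), torch.exp(entropy).item()
```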
-
Referee: [Results] Results tables/figures: while cross-benchmark gains are asserted, the manuscript does not report error bars, statistical significance tests, or controls for total compute that would allow verification that the reported improvements exceed what could arise from hyperparameter tuning alone.
Authors: We agree that stronger statistical controls are necessary. The updated results section now reports mean and standard deviation over three independent random seeds for all metrics, includes paired t-test p-values (all < 0.05 for the reported cross-benchmark gains), and explicitly documents that every baseline and ADAPT run was executed with identical total FLOPs by matching batch size, number of steps, and model architecture. These additions confirm that the observed improvements are not explained by hyperparameter differences alone. revision: yes
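The reported protocol (per-seed means and standard deviations, paired t-tests, matched FLOPs) corresponds to a standard statistical comparison. A minimal sketch using SciPy, with hypothetical input arrays standing in for the paper's own per-benchmark results, might be:

```python
import numpy as np
from scipy import stats

def compare_methods(adapt_scores, baseline_scores):
    """Paired t-test over matched runs (e.g. per-benchmark averages across seeds).

    With only three seeds the test has little power, so the p-values should be
    read alongside the effect sizes rather than in isolation.
    """
    adapt_scores = np.asarray(adapt_scores, dtype=float)
    baseline_scores = np.asarray(baseline_scores, dtype=float)
    diff = adapt_scores - baseline_scores
    t_stat, p_value = stats.ttest_rel(adapt_scores, baseline_scores)
    return {
        "mean_gain": float(diff.mean()),
        "std_gain": float(diff.std(ddof=1)),
        "t": float(t_stat),
        "p": float(p_value),
    }
```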
Circularity Check
No circularity: empirical method with no derivations or self-referential predictions
full rationale
The paper introduces ADAPT as an empirical online reweighting framework for LLM data curation, supported by experiments on instruction tuning and pretraining that compare against offline baselines under equal FLOPs. No equations, derivations, uniqueness theorems, or first-principles predictions appear in the provided text. Claims rest on observed generalization improvements rather than any reduction of outputs to fitted inputs, self-citations, or ansatzes by construction. The method is described as dynamic and adaptive without mathematical self-definition.