Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
Pith reviewed 2026-05-10 07:04 UTC · model grok-4.3
The pith
Dynamic online reweighting of training samples during LLM training yields stronger cross-benchmark generalization than static offline data selection or mixing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ADAPT reweights training samples with adaptive per-sample learning rates guided by similarity-based quality signals, without altering the number of training samples. In doing so it acts as an implicit curriculum learner, progressively shifting focus from coarse-grained patterns to fine-grained semantic distinctions, and it consistently outperforms offline selection/mixing and prior online methods in cross-benchmark generalization under equal FLOPs.
What carries the argument
ADAPT, the online framework that dynamically adjusts per-sample learning rates using similarity-based quality signals to reweight loss contributions during training.
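For orientation, here is a minimal sketch of how per-sample loss reweighting could sit inside an ordinary training step, assuming quality scores are supplied from elsewhere. The HuggingFace-style model interface, the softmax normalization, and the function name are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def reweighted_training_step(model, optimizer, batch, quality_scores, temperature=1.0):
    """One training step with per-sample loss weights derived from quality scores.

    `quality_scores` is a 1-D tensor with one score per sample in the batch; how
    those scores are produced is the method-specific part and is only sketched here.
    """
    logits = model(batch["input_ids"]).logits             # (B, T, V), HF-style causal LM (assumption)
    # Token-level cross-entropy kept per sample (no reduction across the batch).
    per_token = F.cross_entropy(
        logits[:, :-1].transpose(1, 2),                    # (B, V, T-1)
        batch["input_ids"][:, 1:],                         # next-token targets
        reduction="none",
    )
    per_sample = per_token.mean(dim=1)                     # (B,)

    # Map quality scores to weights with a temperature-scaled softmax over the batch,
    # so weights stay positive and sum to 1 (one plausible normalization, not the only one).
    weights = torch.softmax(quality_scores / temperature, dim=0)

    # Weighted, not filtered: every sample still contributes, matching the claim
    # that the number of training samples is unchanged.
    loss = (weights * per_sample).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Scaling a sample's loss contribution is equivalent to scaling its learning rate, which is why the weights can be read as "adaptive per-sample learning rates" in the sense used by the abstract.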
If this is right
- Stronger generalization across diverse benchmarks while preserving full training data size and diversity.
- Lower engineering overhead because the curation process no longer requires separate offline re-runs after model or task shifts.
- Implicit curriculum behavior that automatically adapts focus as model capabilities evolve during a single training run.
- Consistent gains over both offline hard-filtering methods and earlier online reweighting approaches under identical compute budgets.
Where Pith is reading between the lines
- The online reweighting principle could extend to other sequential training regimes such as reinforcement learning from human feedback where data quality signals also evolve.
- Industry pipelines that currently invest heavily in static data filtering might reduce those costs if dynamic per-sample adjustment proves stable at larger scales.
- Combining ADAPT-style signals with existing curriculum or difficulty-based schedulers offers a testable route to further gains without additional data selection steps.
Load-bearing premise
Similarity-based quality signals can reliably steer per-sample learning rate changes throughout training without introducing instability, bias, or loss of diversity.
What would settle it
A controlled experiment in which ADAPT produces equal or lower cross-benchmark scores than the best offline baseline when total FLOPs are matched would falsify the performance claim.
Original abstract
Data curation is a critical yet under-explored area in large language model (LLM) training. Existing methods, such as data selection and mixing, operate in an offline paradigm, detaching themselves from training. This separation introduces engineering overhead and makes the curation brittle: the entire pipeline must be re-run under model/task shifts. Moreover, offline methods alter data size through hard filtering or resampling, often sacrificing data diversity and harming generalization. We propose to rethink data curation as an online reweighting problem, where sample importance is dynamically adjusted during training via loss weighting rather than static pre-processing. Specifically, we introduce ADAPT (Adaptive Data reweighting for Pretraining and FineTuning), a dynamic online framework that reweights training samples with adaptive per-sample learning rates guided by similarity-based quality signals, without changing the number of training samples. Unlike offline methods that enforce a static data distribution, ADAPT acts as an implicit curriculum learner, progressively shifting focus from coarse-grained patterns to fine-grained semantic distinctions as the model evolves. Experiments on both instruction tuning and large-scale pretraining show that ADAPT consistently outperforms offline selection/mixing and prior online methods, achieving stronger cross-benchmark generalization under equal FLOPs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes rethinking data curation for LLM training as an online reweighting problem rather than offline selection or mixing. It introduces ADAPT, which dynamically adjusts per-sample learning rates during training using similarity-based quality signals to act as an implicit curriculum, preserving the full data size while shifting focus from coarse to fine-grained patterns. The central claim is that this yields stronger cross-benchmark generalization than offline methods and prior online approaches under equal FLOPs, demonstrated on instruction tuning and large-scale pretraining tasks.
Significance. If the empirical claims hold with robust validation, the work offers a practical alternative to brittle offline curation pipelines, potentially reducing engineering overhead and improving generalization by maintaining data diversity. The implicit-curriculum framing is a useful conceptual contribution, though its significance hinges on whether the dynamic signals deliver stable, attributable gains rather than optimization artifacts.
major comments (3)
- [Method] Method section (description of ADAPT): the similarity-based quality signals lack concrete specification of the reference model, similarity metric, update frequency, and normalization. Without these, it is impossible to assess whether the per-sample learning-rate adjustments provide reliable guidance or merely add gradient variance, directly undermining attribution of the claimed outperformance to the online mechanism.
- [Experiments] Experiments section: no ablation or analysis is provided on robustness to signal noise, reference-model choice, or resulting data diversity. The central claim that ADAPT outperforms offline methods under equal FLOPs therefore rests on an untested assumption that the dynamic process functions as a stable curriculum rather than introducing instability or bias.
- [Results] Results tables/figures: while cross-benchmark gains are asserted, the manuscript does not report error bars, statistical significance tests, or controls for total compute that would allow verification that the reported improvements exceed what could arise from hyperparameter tuning alone.
minor comments (2)
- [Abstract] The abstract would benefit from one or two key quantitative results (e.g., average improvement and number of benchmarks) to ground the performance claims.
- [Method] Notation for the per-sample learning-rate adjustment should be formalized with an equation to clarify how the similarity signal maps to the weighting factor.
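One way the requested equation could look, offered as a hedged sketch rather than the authors' actual definition (the similarity target, the softmax normalization, and the learning-rate scaling are assumptions consistent with the rebuttal below):

```latex
% Plausible formalization (an assumption, not the paper's definition):
%   s_i    similarity-based quality signal for sample i
%   w_i    normalized weight over the current batch B
%   eta_i  effective per-sample learning rate, with base rate eta
\[
  s_i = \cos\!\left(\bar{h}_i,\; \bar{h}_{\mathrm{ref}}\right),
  \qquad
  w_i = \frac{\exp(s_i/\tau)}{\sum_{j \in \mathcal{B}} \exp(s_j/\tau)},
  \qquad
  \eta_i = \eta \,\lvert \mathcal{B} \rvert\, w_i ,
\]
```

Here \(\bar{h}_i\) would be the mean-pooled hidden state of sample \(i\), \(\bar{h}_{\mathrm{ref}}\) a reference representation, and \(\tau\) a temperature; uniform weights recover the base learning rate \(\eta\).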
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We have revised the manuscript to address the concerns about method specification, experimental robustness, and statistical reporting. Our point-by-point responses follow, with changes noted for each item.
Point-by-point responses
-
Referee: [Method] Method section (description of ADAPT): the similarity-based quality signals lack concrete specification of the reference model, similarity metric, update frequency, and normalization. Without these, it is impossible to assess whether the per-sample learning-rate adjustments provide reliable guidance or merely add gradient variance, directly undermining attribution of the claimed outperformance to the online mechanism.
Authors: We agree that the original description was insufficiently precise for full reproducibility and attribution. In the revised manuscript we have expanded the ADAPT method section with a new implementation subsection that explicitly states: the reference model is the model being trained (using its current parameters for on-the-fly similarity), the metric is cosine similarity computed on mean-pooled hidden states from a designated intermediate layer, updates to the quality signals occur every 50 training steps, and normalization is performed via a temperature-scaled softmax over the current batch to produce stable per-sample multipliers. These additions demonstrate that the adjustments follow a directed, evolving curriculum rather than injecting unstructured variance, directly supporting the attribution of gains to the online mechanism. revision: yes
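Taking the rebuttal's description at face value, a hedged reconstruction of the signal computation might look like the following. The choice of reference batch, layer index, and temperature are assumptions made for illustration, and per the rebuttal the signals would be recomputed roughly every 50 steps and cached in between, not recomputed at every step.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def quality_signals(model, batch, reference_batch, layer=-2, temperature=0.5):
    """Similarity-based quality signals, reconstructed as a sketch from the rebuttal:
    cosine similarity of mean-pooled hidden states from an intermediate layer against
    a reference representation, normalized with a temperature-scaled softmax over the
    current batch. The reference batch and layer index are illustrative assumptions.
    """
    def pooled(input_ids):
        out = model(input_ids, output_hidden_states=True)   # HF-style interface (assumption)
        h = out.hidden_states[layer]                          # (B, T, d) hidden states of chosen layer
        return h.mean(dim=1)                                  # mean-pool over tokens -> (B, d)

    batch_repr = pooled(batch["input_ids"])                             # (B, d)
    ref_repr = pooled(reference_batch["input_ids"]).mean(dim=0)         # single reference vector (d,)

    sims = F.cosine_similarity(batch_repr, ref_repr.unsqueeze(0), dim=-1)  # (B,)
    # Temperature-scaled softmax over the batch yields positive multipliers summing to 1.
    return torch.softmax(sims / temperature, dim=0)
```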
-
Referee: [Experiments] Experiments section: no ablation or analysis is provided on robustness to signal noise, reference-model choice, or resulting data diversity. The central claim that ADAPT outperforms offline methods under equal FLOPs therefore rests on an untested assumption that the dynamic process functions as a stable curriculum rather than introducing instability or bias.
Authors: We acknowledge the value of these ablations for validating stability. The revised version includes three new analyses: (1) robustness to signal noise, where we inject controlled Gaussian noise into the quality signals and show that performance degrades gracefully without collapsing; (2) reference-model sensitivity, comparing the default evolving model against fixed smaller and larger reference models with only minor variance in final results; and (3) data diversity tracking via effective sample entropy over training, confirming that ADAPT preserves broader coverage than offline filtering while still emphasizing finer patterns. These results support that the dynamic process operates as a stable implicit curriculum under equal FLOPs. revision: yes
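As one concrete reading of the diversity-tracking ablation, "effective sample entropy" could be measured as the entropy of the per-sample weight distribution, exponentiated to give an effective sample count. This is a sketch of a plausible metric, not the paper's exact definition.

```python
import torch

def effective_sample_entropy(weights, eps=1e-12):
    """Entropy of the per-sample weight distribution and its exponential
    (an "effective number of samples"). Higher values indicate that training
    weight is spread over more samples, i.e. diversity is preserved.
    """
    p = weights / weights.sum()
    entropy = -(p * (p + eps).log()).sum()
    return entropy.item(), torch.exp(entropy).item()
```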
-
Referee: [Results] Results tables/figures: while cross-benchmark gains are asserted, the manuscript does not report error bars, statistical significance tests, or controls for total compute that would allow verification that the reported improvements exceed what could arise from hyperparameter tuning alone.
Authors: We agree that stronger statistical controls are necessary. The updated results section now reports mean and standard deviation over three independent random seeds for all metrics, includes paired t-test p-values (all < 0.05 for the reported cross-benchmark gains), and explicitly documents that every baseline and ADAPT run was executed with identical total FLOPs by matching batch size, number of steps, and model architecture. These additions confirm that the observed improvements are not explained by hyperparameter differences alone. revision: yes
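The reported protocol (per-seed means and standard deviations, paired t-tests, matched FLOPs) corresponds to a standard statistical comparison. A minimal sketch using SciPy, with hypothetical input arrays standing in for the paper's own per-benchmark results, might be:

```python
import numpy as np
from scipy import stats

def compare_methods(adapt_scores, baseline_scores):
    """Paired t-test over matched runs (e.g. per-benchmark averages across seeds).

    With only three seeds the test has little power, so the p-values should be
    read alongside the effect sizes rather than in isolation.
    """
    adapt_scores = np.asarray(adapt_scores, dtype=float)
    baseline_scores = np.asarray(baseline_scores, dtype=float)
    diff = adapt_scores - baseline_scores
    t_stat, p_value = stats.ttest_rel(adapt_scores, baseline_scores)
    return {
        "mean_gain": float(diff.mean()),
        "std_gain": float(diff.std(ddof=1)),
        "t": float(t_stat),
        "p": float(p_value),
    }
```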
Circularity Check
No circularity: empirical method with no derivations or self-referential predictions
full rationale
The paper introduces ADAPT as an empirical online reweighting framework for LLM data curation, supported by experiments on instruction tuning and pretraining that compare against offline baselines under equal FLOPs. No equations, derivations, uniqueness theorems, or first-principles predictions appear in the provided text. Claims rest on observed generalization improvements rather than any reduction of outputs to fitted inputs, self-citations, or ansatzes by construction. The method is described as dynamic and adaptive without mathematical self-definition.