Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

Chenyan Xiong; Zichun Yu

arxiv: 2605.17849 · v1 · pith:3SVQYHNPnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI · cs.LG

Zichun Yu, Chenyan Xiong

Pith reviewed 2026-05-20 11:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords synthetic data generation · LLM pretraining · data-bound scaling · reinforcement learning · organic data · rephrasing · reformatting · data efficiency

The pith

SynPro generates rephrased and reformatted versions of the same organic text to unlock 3.7-5.2 times the effective pretraining tokens of repetition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM pretraining has entered a data-bound regime where available organic text falls short of scaling needs, yet simple repetition leaves much of that text underutilized. SynPro counters this by presenting the original sources in diverse forms through two operations: rephrasing and reformatting. The variants are produced by generators trained with reinforcement learning on rewards for quality, faithfulness to the source, and influence on content the model has not yet absorbed. The generators are refreshed whenever pretraining performance plateaus so they continually target new content. In experiments that pretrain 400M and 1.1B models on only 10 percent of their Chinchilla-optimal tokens, drawn from the fixed organic corpus DCLM-Baseline, SynPro extracts 3.7-5.2 times the effective tokens of repetition and, at the 1.1B scale, even exceeds the performance of training on an equivalent amount of unique data.

Core claim

SynPro applies rephrasing and reformat operations to present the same organic source in diverse forms, with both generators optimized via reinforcement learning using quality, faithfulness, and data influence rewards and continuously updated as pretraining plateaus to target content the model has yet to absorb, thereby unlocking substantially higher effective utilization of limited organic data.

What carries the argument

RL-optimized rephrasing and reformat generators guided by quality, faithfulness, and data-influence rewards that are refreshed when pretraining plateaus.
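The refresh trigger can be pictured with a small plateau check. This is a minimal sketch only: the source says the generators are updated when pretraining plateaus but does not say how a plateau is detected, so the function name `has_plateaued`, the `window` size, and the `min_rel_gain` threshold are all hypothetical choices made for illustration.

```python
from typing import Sequence


def has_plateaued(
    val_losses: Sequence[float],
    window: int = 3,             # hypothetical: number of recent evaluations to inspect
    min_rel_gain: float = 0.005,  # hypothetical: relative improvement counted as progress
) -> bool:
    """Return True when recent evaluations barely improve on earlier ones.

    Compares the best loss inside the last `window` evaluations with the
    best loss seen before that window. This rule is an assumption, not the
    paper's stated criterion.
    """
    if len(val_losses) <= window:
        return False  # not enough history to judge
    best_before = min(val_losses[:-window])
    best_recent = min(val_losses[-window:])
    rel_gain = (best_before - best_recent) / best_before
    return rel_gain < min_rel_gain


# Toy loss curves (made-up values, not results from the paper).
still_improving = [3.0, 2.5, 2.2, 2.0]
flattened = [3.0, 2.5, 2.2, 2.199, 2.198, 2.1985]

print(has_plateaued(still_improving))  # loss is still dropping
print(has_plateaued(flattened))        # a generator refresh would be triggered here
```

In a training loop, a True result would be the point at which the rephrasing and reformat generators are retrained against the current model state.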

If this is right

Organic corpora can support longer effective training than repetition allows when internal diversity is generated faithfully.
Data-bound scaling continues when synthesis targets content the model has not yet absorbed.
Models reach higher performance at 1.1B scale using only a fraction of unique tokens compared with training on fully distinct data.
Faithful, model-aware synthesis avoids distribution collapse while increasing data utilization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rephrasing and reformat approach could be applied to other scarce domains such as code or scientific documents.
Generator update frequency during extended pretraining may need tuning to avoid either under- or over-adaptation.
Combining this synthesis with explicit curriculum ordering of variants could further improve absorption of hard content.

Load-bearing premise

Rewards for quality, faithfulness, and data influence can be computed reliably from the current model state without introducing distribution shift away from the organic sources.

What would settle it

A controlled run in which the same organic sources are presented with non-RL or non-faithful generators and performance fails to show the reported multiple of effective tokens.

Figures

Figures reproduced from arXiv: 2605.17849 by Chenyan Xiong, Zichun Yu.

**Figure 2.** Overview of SYNPRO. We train generators to provide faithful and informative synthetic data from organic source, enabling sustained improvement for data-bound scaling.

**Figure 3.** Faithfulness analysis on 1,000 randomly sampled organic documents not seen in …

**Figure 4.** Distribution preservation analysis. t-SNE illustration of Voronoi clusters, where …

**Figure 5.** Model-awareness analysis on the 1.1B model. (a) Influence correlation and (b, c) …

**Figure 6.** 1B model & 2.2B unique organic tokens

**Figure 7.** Quality analysis on 1,000 randomly sampled organic documents not seen in RL.

**Figure 8.** Example validation reward (400M pretraining model, …

Original abstract

LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs more thoroughly learn from limited organic data. SynPro applies two operations, rephrasing and reformat, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline, reflecting a realistic data-bound regime in frontier pretraining. Our results reveal that organic data is significantly underutilized by standard repetition: SynPro unlocks 3.7-5.2x the effective tokens of repetition, even surpassing the non-data-bound oracle that trains on equivalent unique data at the 1.1B scale. Analyses confirm that faithful, model-aware synthesis sustains data-bound scaling without causing distribution collapse. We open-source our code at https://github.com/cxcscmu/SynPro.
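The token budgets quoted in the abstract can be checked with a short calculation. The sketch below assumes the common rule of thumb of roughly 20 training tokens per parameter for Chinchilla-optimal training; that ratio is not stated in the text above, and the names `TOKENS_PER_PARAM` and `data_bound_budget` are hypothetical. Under that assumption the numbers line up with the 0.8B and 2.2B figures in the abstract.

```python
# Assumption: Chinchilla-optimal training uses about 20 tokens per parameter.
# This ratio is a widely used rule of thumb, not a value taken from the abstract.
TOKENS_PER_PARAM = 20


def data_bound_budget(
    n_params: float,
    fraction: float = 0.10,  # the abstract's "10% of Chinchilla-optimal tokens"
    tokens_per_param: float = TOKENS_PER_PARAM,
) -> float:
    """Organic token budget for a model with `n_params` parameters."""
    return n_params * tokens_per_param * fraction


for n_params in (400e6, 1.1e9):
    budget = data_bound_budget(n_params)
    print(f"{n_params / 1e9:.1f}B params -> {budget / 1e9:.1f}B organic tokens")
```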

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SynPro gets 3.7-5.2x effective tokens over repetition via RL rephrasing and reformatting, but the oracle-beating claim at 1.1B hinges on rewards staying strictly inside the organic slice.

The main point is that this paper shows a concrete way to stretch fixed organic corpora during pretraining. On 400M and 1.1B models trained with only 10% of Chinchilla-optimal tokens from DCLM-Baseline, their SynPro setup delivers 3.7-5.2 times the effective tokens of plain repetition and even beats an oracle that sees the same number of unique tokens at the 1.1B scale. The generators for rephrasing and reformatting are updated on the fly to target content the model has not yet absorbed, using RL with quality, faithfulness, and data-influence rewards. They report no distribution collapse and open-source the code.

That combination of continuous model-aware updates plus actual small-scale pretraining runs is the clearest new element relative to static synthetic data or simple repetition baselines. The experiments are grounded enough to be worth looking at if you care about data-bound scaling.

The soft spots are in the measurement and the reward construction. The effective-token multiplier appears tied to the same data-influence signal used in training, which raises a circularity risk unless they have an independent external check. More critically, the stress-test concern lands: if any reward component pulls in a separate judge model, embedding, or pre-existing knowledge, the generated tokens are no longer purely organic and the comparison to the unique-data oracle loses its force. The abstract does not spell out how the rewards are realized without external information, so that detail needs verification in the full text.

This work is aimed at people running or analyzing frontier-scale pretraining who want practical levers for better data utilization. A reader focused on scaling laws or synthetic data would find the empirical numbers and the open code useful. It is worth sending to referees because the problem is timely, the experiments are real, and the methodological questions are fixable with clearer reward definitions and independent metrics.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SynPro, a framework for generating synthetic pretraining tokens from organic data using rephrasing and reformat generators optimized with reinforcement learning. These generators are trained with rewards for quality, faithfulness, and data influence, and updated during pretraining to target under-learned content. Experiments pretrain 400M and 1.1B models using only 10% of Chinchilla-optimal tokens from the DCLM-Baseline dataset, demonstrating that SynPro yields 3.7-5.2 times the effective tokens of standard repetition and even outperforms an oracle trained on equivalent unique data at the 1.1B scale.

Significance. If validated, this result suggests that organic data is substantially underutilized in standard pretraining and that model-aware synthetic augmentation can unlock additional scaling gains without requiring more unique data or external sources. This has potential implications for data-efficient pretraining in the data-bound regime. The open-sourcing of code is a positive step for reproducibility.

major comments (2)

[Abstract] Abstract: The claim that SynPro surpasses the non-data-bound oracle at the 1.1B scale is load-bearing on the assertion that generated tokens remain strictly within the organic data distribution. The RL rewards for quality, faithfulness, and data influence must be shown to be computed solely from the current model state and the organic corpus without external pretrained components or embeddings; otherwise the comparison to the unique-data oracle is invalid.
[Experiments] Experiments section (results on effective tokens): The 3.7-5.2x multiplier and effective-token metric need an independent definition and evaluation protocol that does not reuse the data-influence reward signal from the RL training loop, to rule out circularity in the reported gains over repetition.

minor comments (2)

[Abstract] Abstract: The term 'non-data-bound oracle' should be briefly defined or referenced to a methods subsection for clarity.
[Introduction] Introduction: Expand the acronym DCLM-Baseline on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our work. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract] Abstract: The claim that SynPro surpasses the non-data-bound oracle at the 1.1B scale is load-bearing on the assertion that generated tokens remain strictly within the organic data distribution. The RL rewards for quality, faithfulness, and data influence must be shown to be computed solely from the current model state and the organic corpus without external pretrained components or embeddings; otherwise the comparison to the unique-data oracle is invalid.

Authors: We agree that explicit verification of the reward computation is necessary to support the oracle comparison. In the revised manuscript, we will add a dedicated subsection in Methods detailing that quality is measured via the current model's perplexity on generated text, faithfulness via token-level overlap and semantic similarity computed from the current model's hidden states on the organic source, and data influence via the loss reduction on the specific organic sample under the current model parameters. All computations use only the organic corpus and the model being trained; no external pretrained models or fixed embeddings are involved. We will include the exact reward equations and a diagram of the computation graph to confirm the generated tokens stay within the organic distribution.

Revision: yes

Referee: [Experiments] Experiments section (results on effective tokens): The 3.7-5.2x multiplier and effective-token metric need an independent definition and evaluation protocol that does not reuse the data-influence reward signal from the RL training loop, to rule out circularity in the reported gains over repetition.

Authors: We acknowledge the risk of circularity if the evaluation metric directly reuses the RL reward. The reported multiplier is currently derived from downstream benchmark accuracy and validation loss differences between SynPro and repetition runs, normalized by tokens seen. To eliminate any overlap, we will revise the Experiments section to introduce an independent protocol: effective tokens are computed as the volume of unique organic data required for a repetition baseline to reach the same final validation loss as the SynPro model, measured on a held-out validation split never used in RL. This protocol will be described with pseudocode and reported in updated tables and figures.

Revision: yes
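The independent protocol described in this simulated response can be sketched as a lookup on a reference curve. The code below is an illustration under stated assumptions, not the authors' pseudocode: it assumes a set of reference runs is available as (unique tokens, final validation loss) pairs, that loss decreases as tokens increase, and that interpolating linearly in log-token space between neighboring runs is acceptable. The function name `effective_tokens` and the toy numbers are hypothetical.

```python
import math
from typing import Sequence, Tuple


def effective_tokens(
    reference_runs: Sequence[Tuple[float, float]],
    target_loss: float,
) -> float:
    """Estimate how many unique tokens the reference setup needs to hit `target_loss`.

    `reference_runs` holds (unique_tokens, final_val_loss) pairs. The estimate
    interpolates log(tokens) linearly in loss between the two runs that bracket
    the target. The interpolation rule is an assumption made for this sketch.
    """
    runs = sorted(reference_runs)  # ascending in token count
    for (tok_lo, loss_hi), (tok_hi, loss_lo) in zip(runs, runs[1:]):
        if loss_lo <= target_loss <= loss_hi:
            if loss_hi == loss_lo:
                return tok_lo
            frac = (loss_hi - target_loss) / (loss_hi - loss_lo)
            log_tokens = math.log(tok_lo) + frac * (math.log(tok_hi) - math.log(tok_lo))
            return math.exp(log_tokens)
    raise ValueError("target_loss lies outside the range covered by the reference runs")


# Toy reference curve (made-up values, not results from the paper).
reference = [(1e9, 3.2), (2e9, 3.0), (4e9, 2.8)]
estimate = effective_tokens(reference, target_loss=2.9)
print(f"{estimate / 1e9:.2f}B effective tokens")
```

Dividing such an estimate by the number of unique organic tokens actually available would give a multiplier of the kind reported in the abstract, provided the validation split is never used in the RL loop.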

Circularity Check

0 steps flagged

No significant circularity; empirical gains reported from controlled pretraining experiments

The paper defines SynPro as an RL-based rephrasing/reformatting framework using quality, faithfulness, and data-influence rewards to generate variants from a fixed organic corpus (DCLM-Baseline). It then reports downstream pretraining results at 400M and 1.1B scales, comparing repetition baselines against an oracle that uses equivalent unique tokens. The 3.7-5.2x effective-token multiplier and oracle-surpassing claim are presented as measured outcomes of these runs, not as quantities algebraically derived from the reward functions themselves. No equation equates the reported multiplier to a fitted reward term by construction, and the central scaling comparison rests on external performance metrics rather than self-referential definitions or self-citation chains. The derivation chain therefore remains self-contained against the stated experimental benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the premise that rephrasing and reformatting preserve all necessary information while increasing learnability, plus the assumption that the three RL rewards can be balanced without external data or collapse.

free parameters (1)

RL reward weights for quality, faithfulness, and data influence
The abstract states the generators are optimized via reinforcement learning with these three rewards; their relative weighting is a free parameter that must be chosen or tuned.
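To make the free parameter concrete, the sketch below combines the three reward signals with explicit weights. The source does not say how the rewards are combined, so the normalized weighted average, the `RewardWeights` class, the `combined_reward` function, and the assumption that each reward is already scaled to a comparable range are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical relative weights for the three reward signals."""
    quality: float = 1.0
    faithfulness: float = 1.0
    influence: float = 1.0


def combined_reward(
    quality: float,
    faithfulness: float,
    influence: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted average of the three rewards (an assumed combination rule)."""
    total = weights.quality + weights.faithfulness + weights.influence
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = (
        weights.quality * quality
        + weights.faithfulness * faithfulness
        + weights.influence * influence
    )
    return weighted_sum / total


# Toy scores for one generated sample (made-up values).
print(combined_reward(0.9, 0.6, 0.3))
print(combined_reward(0.9, 0.6, 0.3, RewardWeights(quality=0.0, faithfulness=1.0, influence=0.0)))
```

Shifting weight toward faithfulness, as in the second call, is the kind of tuning decision the ledger entry refers to.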

axioms (1)

domain assumption: Rephrasing and reformatting operations introduce no external information beyond the organic source.
Stated directly in the abstract as a core property of the two operations.

pith-pipeline@v0.9.0 · 5789 in / 1512 out tokens · 29048 ms · 2026-05-20T11:31:15.660949+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "SYNPRO applies two operations, rephrasing and reformat... optimized via reinforcement learning with quality, faithfulness, and data influence rewards... unlocks 3.7-5.2x the effective tokens of repetition"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens... from DCLM-Baseline"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page
[70]

Scaling Up Influence Functions , year =

Andrea Schioppa and Polina Zablotskaia and David Vilar and Artem Sokolov , booktitle =. Scaling Up Influence Functions , year =

work page
[71]

KR , year=

The Winograd Schema Challenge , author=. KR , year=

work page
[72]

Keisuke Sakaguchi and Ronan Le Bras and Chandra Bhagavatula and Yejin Choi , booktitle =

work page
[73]

Arnold Overwijk and Chenyan Xiong and Jamie Callan , booktitle =

work page
[74]

Sung Min Park and Kristian Georgiev and Andrew Ilyas and Guillaume Leclerc and Aleksander Madry , booktitle =

work page
[75]

Understanding Black-box Predictions via Influence Functions , year =

Pang Wei Koh and Percy Liang , booktitle =. Understanding Black-box Predictions via Influence Functions , year =

work page
[76]

SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models , year =

Yang, Yu and Mishra, Siddhartha and Chiang, Jeffrey N and Mirzasoleiman, Baharan , booktitle =. SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models , year =

work page
[77]

Yue, Xiang and Zheng, Tianyu and Zhang, Ge and Chen, Wenhu , booktitle=

work page
[78]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning , year =

Dao, Tri , booktitle =. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning , year =

work page
[79]

Reducing activation recomputation in large transformer models , year =

Korthikanti, Vijay Anand and Casper, Jared and Lym, Sangkug and McAfee, Lawrence and Andersch, Michael and Shoeybi, Mohammad and Catanzaro, Bryan , journal =. Reducing activation recomputation in large transformer models , year =

work page
[80]

Lundberg and Su

Scott M. Lundberg and Su. Proc. of NeurIPS , title =

work page
