
arXiv: 2605.17849 · v1 · pith:3SVQYHNPnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI · cs.LG

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

Pith reviewed 2026-05-20 11:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords synthetic data generation · LLM pretraining · data-bound scaling · reinforcement learning · organic data · rephrasing · reformatting · data efficiency

The pith

SynPro generates rephrased and reformatted versions of the same organic text to unlock 3.7-5.2 times more effective pretraining tokens than repetition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM pretraining has entered a data-bound regime where available organic text falls short of scaling needs, yet simple repetition leaves much of that text underutilized. SynPro counters this by creating diverse presentations of the original sources through two operations: rephrasing and reformatting. These operations are produced by generators trained with reinforcement learning on rewards that enforce quality, faithfulness to the source, and influence on what the model has not yet absorbed. The generators are refreshed whenever pretraining performance plateaus so they continually target new content. Experiments pretraining 400M and 1.1B models on only 10 percent of Chinchilla-optimal tokens from a fixed organic baseline show that SynPro extracts several times the learning value of repetition and even exceeds the performance of training on fully unique data at the larger scale.
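The plateau-triggered refresh described above can be sketched as a simple check on validation loss. This is a minimal illustration, not the paper's criterion: the function name `has_plateaued`, the `window` size, and the `min_delta` threshold are all hypothetical choices made for the example.

```python
def has_plateaued(losses, window=3, min_delta=0.01):
    """Return True when the best loss in the last `window` evaluations
    improves on the best earlier loss by less than `min_delta`.

    `window` and `min_delta` are illustrative assumptions, not values
    taken from the paper.
    """
    if len(losses) <= window:
        return False  # not enough history to compare against
    best_before = min(losses[:-window])
    best_recent = min(losses[-window:])
    return best_before - best_recent < min_delta


# Hypothetical validation-loss traces.
still_improving = [3.0, 2.5, 2.2, 2.0]
stalled = [3.0, 2.5, 2.2, 2.195, 2.196, 2.194]

print(has_plateaued(still_improving))  # no refresh needed yet
print(has_plateaued(stalled))          # a generator refresh would be triggered
```

In a training loop, a check like this would run after each evaluation, and a positive result would start another round of generator training against the current model state.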

Core claim

SynPro applies rephrasing and reformat operations to present the same organic source in diverse forms, with both generators optimized via reinforcement learning using quality, faithfulness, and data influence rewards and continuously updated as pretraining plateaus to target content the model has yet to absorb, thereby unlocking substantially higher effective utilization of limited organic data.

What carries the argument

RL-optimized rephrasing and reformat generators guided by quality, faithfulness, and data-influence rewards that are refreshed when pretraining plateaus.

If this is right

  • Organic corpora can support longer effective training than repetition allows when internal diversity is generated faithfully.
  • Data-bound scaling continues when synthesis targets content the model has not yet absorbed.
  • Models reach higher performance at 1.1B scale using only a fraction of unique tokens compared with training on fully distinct data.
  • Faithful, model-aware synthesis avoids distribution collapse while increasing data utilization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rephrasing and reformat approach could be applied to other scarce domains such as code or scientific documents.
  • Generator update frequency during extended pretraining may need tuning to avoid either under- or over-adaptation.
  • Combining this synthesis with explicit curriculum ordering of variants could further improve absorption of hard content.

Load-bearing premise

Rewards for quality, faithfulness, and data influence can be computed reliably from the current model state without introducing distribution shift away from the organic sources.

What would settle it

A controlled run in which the same organic sources are presented with non-RL or non-faithful generators and performance fails to show the reported multiple of effective tokens.

Figures

Figures reproduced from arXiv: 2605.17849 by Chenyan Xiong, Zichun Yu.

Figure 1: (a) Paradigm shift in frontier pretraining from compute-bound to data-bound.
Figure 2: Overview of SYNPRO. We train generators to provide faithful and informative synthetic data from organic source, enabling sustained improvement for data-bound scaling.
Figure 3: Faithfulness analysis on 1,000 randomly sampled organic documents not seen in …
Figure 4: Distribution preservation analysis. t-SNE illustration of Voronoi clusters, where …
Figure 5: Model-awareness analysis on the 1.1B model. (a) Influence correlation and (b, c) …
Figure 6: 1B model & 2.2B unique organic tokens
Figure 7: Quality analysis on 1,000 randomly sampled organic documents not seen in RL.
Figure 8: Example validation reward (400M pretraining model, …
Original abstract

LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs more thoroughly learn from limited organic data. SynPro applies two operations, rephrasing and reformat, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline, reflecting a realistic data-bound regime in frontier pretraining. Our results reveal that organic data is significantly underutilized by standard repetition: SynPro unlocks 3.7-5.2x the effective tokens of repetition, even surpassing the non-data-bound oracle that trains on equivalent unique data at the 1.1B scale. Analyses confirm that faithful, model-aware synthesis sustains data-bound scaling without causing distribution collapse. We open-source our code at https://github.com/cxcscmu/SynPro.
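The token budgets quoted in the abstract can be checked with a short calculation. The sketch below assumes the widely used approximation of about 20 training tokens per parameter for a Chinchilla-optimal budget. That ratio is an assumption of this example rather than a figure stated in the abstract, although it does reproduce the 0.8B and 2.2B budgets.

```python
# Assumed rule of thumb for Chinchilla-optimal training; not stated in the abstract.
TOKENS_PER_PARAM = 20


def data_bound_budget(n_params: int, percent: int = 10) -> int:
    """Unique-token budget at `percent` of the Chinchilla-optimal token count."""
    optimal_tokens = n_params * TOKENS_PER_PARAM
    return optimal_tokens * percent // 100


for name, n_params in (("400M", 400_000_000), ("1.1B", 1_100_000_000)):
    budget = data_bound_budget(n_params)
    print(f"{name}: {budget / 1e9:.1f}B unique organic tokens")
```

Under this assumption, the 400M model has a full budget of 8B tokens and the 1.1B model 22B tokens, so the data-bound runs see one tenth of that as unique organic text.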

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SynPro, a framework that generates synthetic pretraining tokens from organic data using rephrasing and reformat generators optimized with reinforcement learning. The generators are trained with rewards for quality, faithfulness, and data influence, and are updated during pretraining to target under-learned content. Experiments pretrain 400M and 1.1B models using only 10% of Chinchilla-optimal tokens from the DCLM-Baseline dataset, demonstrating that SynPro yields 3.7-5.2 times the effective tokens of standard repetition and even outperforms an oracle trained on equivalent unique data at the 1.1B scale.

Significance. If validated, this result suggests that organic data is substantially underutilized in standard pretraining and that model-aware synthetic augmentation can unlock additional scaling gains without requiring more unique data or external sources. This has potential implications for data-efficient pretraining in the data-bound regime. The open-sourcing of code is a positive step for reproducibility.

major comments (2)
  1. [Abstract] The claim that SynPro surpasses the non-data-bound oracle at the 1.1B scale depends on the assertion that generated tokens remain strictly within the organic data distribution. The RL rewards for quality, faithfulness, and data influence must be shown to be computed solely from the current model state and the organic corpus, without external pretrained components or embeddings; otherwise the comparison to the unique-data oracle is invalid.
  2. [Experiments, results on effective tokens] The 3.7-5.2x multiplier and the effective-token metric need an independent definition and evaluation protocol that does not reuse the data-influence reward signal from the RL training loop, to rule out circularity in the reported gains over repetition.
minor comments (2)
  1. [Abstract] The term 'non-data-bound oracle' should be briefly defined or referenced to a methods subsection for clarity.
  2. [Introduction] Expand the acronym DCLM-Baseline on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our work. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The claim that SynPro surpasses the non-data-bound oracle at the 1.1B scale depends on the assertion that generated tokens remain strictly within the organic data distribution. The RL rewards for quality, faithfulness, and data influence must be shown to be computed solely from the current model state and the organic corpus, without external pretrained components or embeddings; otherwise the comparison to the unique-data oracle is invalid.

    Authors: We agree that explicit verification of the reward computation is necessary to support the oracle comparison. In the revised manuscript, we will add a dedicated subsection in Methods detailing that quality is measured via the current model's perplexity on generated text, faithfulness via token-level overlap and semantic similarity computed from the current model's hidden states on the organic source, and data influence via the loss reduction on the specific organic sample under the current model parameters. All computations use only the organic corpus and the model being trained; no external pretrained models or fixed embeddings are involved. We will include the exact reward equations and a diagram of the computation graph to confirm the generated tokens stay within the organic distribution. revision: yes

  2. Referee: [Experiments, results on effective tokens] The 3.7-5.2x multiplier and the effective-token metric need an independent definition and evaluation protocol that does not reuse the data-influence reward signal from the RL training loop, to rule out circularity in the reported gains over repetition.

    Authors: We acknowledge the risk of circularity if the evaluation metric directly reuses the RL reward. The reported multiplier is currently derived from downstream benchmark accuracy and validation loss differences between SynPro and repetition runs, normalized by tokens seen. To eliminate any overlap, we will revise the Experiments section to introduce an independent protocol: effective tokens are computed as the volume of unique organic data required for a repetition baseline to reach the same final validation loss as the SynPro model, measured on a held-out validation split never used in RL. This protocol will be described with pseudocode and reported in updated tables and figures. revision: yes
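The independent protocol proposed in the second response can be sketched as an interpolation over a reference loss curve. This is a hedged illustration only: the function `effective_tokens`, the choice to interpolate linearly in log-token space, and every number in the example are hypothetical and do not come from the paper or the rebuttal.

```python
import math


def effective_tokens(reference_curve, target_loss):
    """Estimate how many unique tokens the reference runs need to reach `target_loss`.

    `reference_curve` is a list of (unique_tokens, final_validation_loss) pairs,
    sorted by increasing tokens with non-increasing loss, measured on a held-out
    split. Interpolation is linear in log(tokens), an assumption of this sketch.
    Returns None when `target_loss` lies outside the measured range.
    """
    for (t0, l0), (t1, l1) in zip(reference_curve, reference_curve[1:]):
        if l1 <= target_loss <= l0:
            if l0 == l1:
                return float(t0)
            frac = (l0 - target_loss) / (l0 - l1)
            log_t = math.log(t0) + frac * (math.log(t1) - math.log(t0))
            return math.exp(log_t)
    return None


# Hypothetical reference runs: (unique tokens, held-out validation loss).
curve = [(1e9, 3.0), (4e9, 2.8), (16e9, 2.6)]
# Hypothetical final loss of a synthesis-augmented run.
tokens_needed = effective_tokens(curve, target_loss=2.7)
print(f"{tokens_needed / 1e9:.1f}B unique tokens to match the target loss")
```

Dividing the returned value by the unique tokens actually available to the augmented run would give a multiplier of the kind reported. Because it depends only on held-out validation loss, it does not reuse the data-influence reward.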

Circularity Check

0 steps flagged

No significant circularity; empirical gains reported from controlled pretraining experiments

full rationale

The paper defines SynPro as an RL-based rephrasing/reformatting framework using quality, faithfulness, and data-influence rewards to generate variants from a fixed organic corpus (DCLM-Baseline). It then reports downstream pretraining results at 400M and 1.1B scales, comparing repetition baselines against an oracle that uses equivalent unique tokens. The 3.7-5.2x effective-token multiplier and oracle-surpassing claim are presented as measured outcomes of these runs, not as quantities algebraically derived from the reward functions themselves. No equation equates the reported multiplier to a fitted reward term by construction, and the central scaling comparison rests on external performance metrics rather than self-referential definitions or self-citation chains. The derivation chain therefore remains self-contained against the stated experimental benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the premise that rephrasing and reformatting preserve all necessary information while increasing learnability, plus the assumption that the three RL rewards can be balanced without external data or collapse.

free parameters (1)
  • RL reward weights for quality, faithfulness, and data influence
    The abstract states the generators are optimized via reinforcement learning with these three rewards; their relative weighting is a free parameter that must be chosen or tuned.
axioms (1)
  • Domain assumption: rephrasing and reformatting operations introduce no external information beyond the organic source.
    Stated directly in the abstract as a core property of the two operations.
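To make the free parameter in the ledger concrete, the sketch below combines the three rewards as a normalized weighted average. The source states only that quality, faithfulness, and data influence rewards are used. The weighted-average form, the `RewardWeights` class, and the example scores are assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical relative weights; these are the free parameter in question."""
    quality: float = 1.0
    faithfulness: float = 1.0
    influence: float = 1.0


def combined_reward(quality, faithfulness, influence, weights=RewardWeights()):
    """Normalized weighted average of the three reward signals (assumed form)."""
    total = weights.quality + weights.faithfulness + weights.influence
    if total <= 0:
        raise ValueError("reward weights must sum to a positive value")
    weighted = (
        weights.quality * quality
        + weights.faithfulness * faithfulness
        + weights.influence * influence
    )
    return weighted / total


# Same hypothetical scores under two weightings.
scores = dict(quality=0.9, faithfulness=0.6, influence=0.3)
print(combined_reward(**scores))
print(combined_reward(**scores, weights=RewardWeights(1.0, 2.0, 1.0)))
```

Changing the weights changes which candidate generations are preferred, which is why the ledger counts the weighting as a parameter that has to be chosen or tuned.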

pith-pipeline@v0.9.0 · 5789 in / 1512 out tokens · 29048 ms · 2026-05-20T11:31:15.660949+00:00 · methodology


