Guilherme Penedo
Identifiers
- name variant Guilherme Penedo 0.60 · backfill
Papers (5)
- How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data cs.CL · 2026 · author #4
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model cs.CL · 2025 · author #5
- The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale cs.CL · 2024 · author #1
- The Falcon Series of Open Language Models cs.CL · 2023 · author #14
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only cs.CL · 2023 · author #1
Mentions
- 2311.16867 #14 · arxiv_oai · confidence 0.70 Guilherme Penedo
Frequent Coauthors
- Colin Raffel 3 shared papers
- Hynek Kydlíček 3 shared papers
- Leandro Von Werra 3 shared papers
- Thomas Wolf 3 shared papers
- Alessandro Cappelli 2 shared papers
- Anton Lozhkov 2 shared papers
- Baptiste Pannier 2 shared papers
- Daniel Hesslow 2 shared papers
- Ebtesam Almazrouei 2 shared papers
- Elie Bakouch 2 shared papers
- Hamza Alobeidli 2 shared papers
- Julien Launay 2 shared papers
- Lewis Tunstall 2 shared papers
- Loubna Ben Allal 2 shared papers
- Quentin Malartic 2 shared papers
- Ruxandra Cojocaru 2 shared papers
- Abdulaziz Alshamsi 1 shared paper
- Agustín Piqueres Lajarín 1 shared paper
- Andrés Marafioti 1 shared paper
- Atsuki Yamaguchi 1 shared paper