Guilherme Penedo

Identifiers

  • name variant Guilherme Penedo 0.60 · backfill

Papers (5)

  1. How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data · cs.CL · 2026 · author #4
  2. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model · cs.CL · 2025 · author #5
  3. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale · cs.CL · 2024 · author #1
  4. The Falcon Series of Open Language Models · cs.CL · 2023 · author #14
  5. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only · cs.CL · 2023 · author #1

Mentions

  • 2311.16867 #14 · arxiv_oai · confidence 0.70 Guilherme Penedo

Frequent Coauthors