
arxiv: 2605.27605 · v1 · pith:LLHPEH76new · submitted 2026-05-26 · 💻 cs.AI · cs.SE

Laguna M.1/XS.2 Technical Report

Julien Abadji, Marah Abdin, Connor Adams, Eric Alcaide, Mustafa Altun, Michele Artoni, Junze Bao, Uday Barar, and 88 more authors

Pith reviewed 2026-06-29 16:50 UTC · model grok-4.3

classification 💻 cs.AI cs.SE
keywords Mixture of Experts · agentic coding · SWE-bench · foundation models · technical report · open source models · Model Factory

The pith

Laguna M.1 and XS.2 Mixture-of-Experts models perform competitively with state-of-the-art open models on agentic coding benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Laguna M.1 and XS.2, two Mixture-of-Experts models for long-horizon agentic coding tasks. M.1 activates 23.4 billion parameters out of 225.8 billion total, while XS.2 activates 3 billion out of 33.4 billion. Both were trained from scratch using an integrated Model Factory system for data, training, evaluation, and inference. The models achieve competitive results on SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0 compared to other open models in their size classes. A reader would care because this advances practical AI systems capable of handling extended software engineering workflows.

Core claim

Laguna M.1 and Laguna XS.2 are Mixture-of-Experts foundation models built for long-horizon, agentic coding. M.1 has 225.8B total parameters with 23.4B activated per token, and XS.2 has 33.4B total with 3B activated. Trained end-to-end in the Model Factory system, they prove competitive with state-of-the-art open models on agentic software engineering and terminal benchmarks including SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0. XS.2 weights are released under Apache 2.0.
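The sparsity implied by these figures follows from simple arithmetic. The sketch below computes the fraction of parameters activated per token for each model, using only the parameter counts quoted above; the function and dictionary names are illustrative and do not come from the paper.

```python
def active_fraction(active_b: float, total_b: float) -> float:
    """Return the share of parameters activated per token.

    Both arguments are parameter counts in billions.
    """
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active_b <= total_b and total_b > 0")
    return active_b / total_b


# (activated, total) parameter counts in billions, as quoted in the summary.
MODELS = {
    "Laguna M.1": (23.4, 225.8),
    "Laguna XS.2": (3.0, 33.4),
}

for name, (active, total) in MODELS.items():
    share = active_fraction(active, total)
    print(f"{name}: {share:.1%} of parameters active per token")
```

Both models activate roughly a tenth of their weights per token, which is what makes the per-token compute much smaller than the total parameter count suggests.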

What carries the argument

The Model Factory, an integrated stack of versioned data, training, evaluation, and inference components that industrializes model development.

If this is right

  • M.1 delivers high performance in its weight class for complex coding agents.
  • XS.2 offers accessible open weights for smaller-scale agentic applications.
  • The described training process supports systematic development of similar models.
  • Competitive benchmark scores indicate readiness for real-world long-horizon tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the benchmarks hold, these models could reduce the gap between open and proprietary agentic coding systems.
  • Releasing XS.2 enables community testing on additional tasks beyond the reported benchmarks.
  • The factory approach may generalize to training models for other long-horizon domains like scientific computing.

Load-bearing premise

The selected benchmarks accurately measure real-world agentic coding performance without overfitting or data issues.

What would settle it

Demonstration that the models fail on a new set of unseen long-horizon coding problems or evidence that training data overlapped with benchmark test cases.

Original abstract

We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding: M.1 has 225.8B total parameters (23.4B activated per token) and XS.2 has 33.4B total (3B activated). Both models were trained from scratch end-to-end inside the same internal system that we refer to as our Model Factory: a tightly-integrated stack of versioned data, training, evaluation, and inference components that turn model development into an industrial process. We describe the principles and design choices of the Model Factory and also detail the end-to-end training process of our models, throughout pre-training data and architecture, post-training stages, evaluation, and quantization. On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0) M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes. Laguna XS.2 weights are released under Apache 2.0 at https://huggingface.co/collections/poolside/laguna-xs2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents Laguna M.1 (225.8B total / 23.4B active parameters) and Laguna XS.2 (33.4B total / 3B active parameters), two Mixture-of-Experts models trained end-to-end from scratch inside an internal Model Factory pipeline for long-horizon agentic coding. It describes the Model Factory architecture, pre-training data and architecture choices, post-training stages, evaluation, and quantization, and asserts that both models are competitive with state-of-the-art open models on SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0. XS.2 weights are released under Apache 2.0.

Significance. If the benchmark competitiveness claims are substantiated with full scores, protocols, and decontamination evidence, the work would provide useful open-weight baselines for agentic software engineering and demonstrate an integrated industrial training stack for MoE models at these scales.

major comments (2)
  1. [Abstract] The central claim that M.1 and XS.2 'are competitive with state-of-the-art open models' on the four named benchmarks is unsupported by any scores, tables, error bars, or evaluation methodology details, making the primary result impossible to assess.
  2. [Abstract] Abstract / Evaluation description: no decontamination steps, membership-inference results, or exact agent harness specifications (retries, temperature, tool setup) are provided for SWE-bench variants or Terminal-Bench 2.0, which are load-bearing for the generalization claim given known contamination risks in code pre-training.
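As an illustration of the error bars requested in the first comment, a benchmark resolve rate can be reported with a binomial confidence interval. The sketch below uses the Wilson score interval, which is one common choice and not necessarily what the authors would adopt; the function name and the task counts are hypothetical and are not results from the paper.

```python
import math


def wilson_interval(resolved: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a resolve rate (z = 1.96 gives about 95%)."""
    if total <= 0 or not 0 <= resolved <= total:
        raise ValueError("expected 0 <= resolved <= total and total > 0")
    p = resolved / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts for illustration only; not taken from the paper.
low, high = wilson_interval(resolved=250, total=500)
print(f"resolve rate 50.0%, approx. 95% interval: [{low:.1%}, {high:.1%}]")
```

An interval of this kind only captures sampling noise over tasks; run-to-run variance from the agent harness (retries, temperature, infrastructure) would need repeated runs to quantify.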

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The feedback highlights important areas for improving clarity and rigor in presenting our results. We address each major comment below with specific plans for revision.

Point-by-point responses
  1. Referee: [Abstract] The central claim that M.1 and XS.2 'are competitive with state-of-the-art open models' on the four named benchmarks is unsupported by any scores, tables, error bars, or evaluation methodology details, making the primary result impossible to assess.

    Authors: We agree that the abstract would benefit from greater specificity to allow immediate assessment of the competitiveness claim. While the full manuscript contains comparative tables and scores in the Evaluation section (including point estimates against open models on SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0), these are not summarized in the abstract. We will revise the abstract to include the key headline scores (with brief mention of the evaluation protocol) so that the primary result is directly supported by numbers rather than a qualitative statement alone.

    revision: yes

  2. Referee: [Abstract] Abstract / Evaluation description: no decontamination steps, membership-inference results, or exact agent harness specifications (retries, temperature, tool setup) are provided for SWE-bench variants or Terminal-Bench 2.0, which are load-bearing for the generalization claim given known contamination risks in code pre-training.

    Authors: We concur that explicit documentation of these elements is necessary given the contamination risks in code benchmarks. The manuscript's Evaluation section describes the overall harness but lacks the requested granularity. We will add a dedicated subsection specifying the exact agent configuration (temperature, retries, tool setup, and prompting template) for each benchmark. We will also detail the decontamination procedure applied during pre-training data curation (n-gram overlap filtering against benchmark test sets). Membership-inference testing was not performed; we will note this explicitly as a limitation while emphasizing the standard decontamination steps taken.

    revision: partial
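The n-gram overlap filtering mentioned in the second response can be sketched as follows. This is a minimal illustration of the general technique, not the authors' pipeline: the whitespace tokenization, the default n-gram length of 13, the overlap threshold, and all function names are assumptions, since the text does not specify them.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: list[str], n: int) -> set[tuple[str, ...]]:
    """Union of all n-grams appearing in the benchmark test sets."""
    index: set[tuple[str, ...]] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(doc: str, index: set[tuple[str, ...]], n: int,
                    threshold: float = 0.0) -> bool:
    """True if the share of the document's n-grams found in the index exceeds threshold."""
    grams = ngrams(doc, n)
    if not grams:
        return False  # document shorter than n tokens: nothing to match
    overlap = len(grams & index) / len(grams)
    return overlap > threshold


def decontaminate(corpus: list[str], benchmark_texts: list[str],
                  n: int = 13, threshold: float = 0.0) -> list[str]:
    """Keep only documents whose n-gram overlap with the benchmarks is at most threshold."""
    index = build_benchmark_index(benchmark_texts, n)
    return [doc for doc in corpus if not is_contaminated(doc, index, n, threshold)]


# Toy data for illustration only; n is shortened so the example is readable.
benchmark = ["fix the off by one error in the parser loop"]
corpus = [
    "unrelated text about sorting algorithms in general",
    "someone wrote: fix the off by one error in the parser loop today",
]
clean = decontaminate(corpus, benchmark, n=3)
print(f"kept {len(clean)} of {len(corpus)} documents")
```

Exact n-gram matching of this kind misses paraphrased or reformatted benchmark content, which is one reason the referee also asks about membership-inference results.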

Circularity Check

0 steps flagged

No circularity; empirical benchmark results on external suites

full rationale

The paper is a technical report describing end-to-end training of two MoE models inside an internal Model Factory and reporting scores on named external benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, Terminal-Bench 2.0). No equations, first-principles derivations, or fitted quantities are presented as predictions; the central claim is simply competitiveness with other open models on those benchmarks. No self-citation chains, ansatzes, or renamings appear in the provided text that would reduce the reported results to the inputs by construction. The evaluation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical technical report on model training and evaluation with no mathematical axioms, free parameters, or new invented entities described.

pith-pipeline@v0.9.1-grok · 6197 in / 1143 out tokens · 53101 ms · 2026-06-29T16:50:04.318367+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Aurora: A Leverage-Aware Spectral Optimizer

    cs.LG 2026-06 unverdicted novelty 6.0

    Aurora is a leverage-aware spectral optimizer that enforces uniform row norms in matrix updates while preserving Muon's polar geometry, outperforming Muon and achieving SOTA among spectral methods on modded-nanoGPT.
