pith. machine review for the scientific record.

arxiv: 2407.21783 · v3 · submitted 2024-07-31 · 💻 cs.AI · cs.CL · cs.CV

Recognition: 3 Lean theorem links

The Llama 3 Herd of Models

Aaron Grattafiori , Abhimanyu Dubey , Abhinav Jauhri , Abhinav Pandey , Abhishek Kadian , Ahmad Al-Dahle , Aiesha Letman , Akhil Mathur
and 549 more authors:
Alan Schelten Alex Vaughan Amy Yang Angela Fan Anirudh Goyal Anthony Hartshorn Aobo Yang Archi Mitra Archie Sravankumar Artem Korenev Arthur Hinsvark Arun Rao Aston Zhang Aurelien Rodriguez Austen Gregerson Ava Spataru Baptiste Roziere Bethany Biron Binh Tang Bobbie Chern Charlotte Caucheteux Chaya Nayak Chloe Bi Chris Marra Chris McConnell Christian Keller Christophe Touret Chunyang Wu Corinne Wong Cristian Canton Ferrer Cyrus Nikolaidis Damien Allonsius Daniel Song Danielle Pintz Danny Livshits Danny Wyatt David Esiobu Dhruv Choudhary Dhruv Mahajan Diego Garcia-Olano Diego Perino Dieuwke Hupkes Egor Lakomkin Ehab AlBadawy Elina Lobanova Emily Dinan Eric Michael Smith Filip Radenovic Francisco Guzm\'an Frank Zhang Gabriel Synnaeve Gabrielle Lee Georgia Lewis Anderson Govind Thattai Graeme Nail Gregoire Mialon Guan Pang Guillem Cucurell Hailey Nguyen Hannah Korevaar Hu Xu Hugo Touvron Iliyan Zarov Imanol Arrieta Ibarra Isabel Kloumann Ishan Misra Ivan Evtimov Jack Zhang Jade Copet Jaewon Lee Jan Geffert Jana Vranes Jason Park Jay Mahadeokar Jeet Shah Jelmer van der Linde Jennifer Billock Jenny Hong Jenya Lee Jeremy Fu Jianfeng Chi Jianyu Huang Jiawen Liu Jie Wang Jiecao Yu Joanna Bitton Joe Spisak Jongsoo Park Joseph Rocca Joshua Johnstun Joshua Saxe Junteng Jia Kalyan Vasuden Alwala Karthik Prasad Kartikeya Upasani Kate Plawiak Ke Li Kenneth Heafield Kevin Stone Khalid El-Arini Krithika Iyer Kshitiz Malik Kuenley Chiu Kunal Bhalla Kushal Lakhotia Lauren Rantala-Yeary Laurens van der Maaten Lawrence Chen Liang Tan Liz Jenkins Louis Martin Lovish Madaan Lubo Malo Lukas Blecher Lukas Landzaat Luke de Oliveira Madeline Muzzi Mahesh Pasupuleti Mannat Singh Manohar Paluri Marcin Kardas Maria Tsimpoukelli Mathew Oldham Mathieu Rita Maya Pavlova Melanie Kambadur Mike Lewis Min Si Mitesh Kumar Singh Mona Hassan Naman Goyal Narjes Torabi Nikolay Bashlykov Nikolay Bogoychev Niladri Chatterji Ning Zhang Olivier Duchenne Onur \c{C}elebi Patrick Alrassy Pengchuan Zhang Pengwei Li Petar Vasic Peter Weng Prajjwal Bhargava Pratik Dubal Praveen Krishnan Punit Singh Koura Puxin Xu Qing He Qingxiao Dong Ragavan Srinivasan Raj Ganapathy Ramon Calderer Ricardo Silveira Cabral Robert Stojnic Roberta Raileanu Rohan Maheswari Rohit Girdhar Rohit Patel Romain Sauvestre Ronnie Polidoro Roshan Sumbaly Ross Taylor Ruan Silva Rui Hou Rui Wang Saghar Hosseini Sahana Chennabasappa Sanjay Singh Sean Bell Seohyun Sonia Kim Sergey Edunov Shaoliang Nie Sharan Narang Sharath Raparthy Sheng Shen Shengye Wan Shruti Bhosale Shun Zhang Simon Vandenhende Soumya Batra Spencer Whitman Sten Sootla Stephane Collot Suchin Gururangan Sydney Borodinsky Tamar Herman Tara Fowler Tarek Sheasha Thomas Georgiou Thomas Scialom Tobias Speckbacher Todor Mihaylov Tong Xiao Ujjwal Karn Vedanuj Goswami Vibhor Gupta Vignesh Ramanathan Viktor Kerkez Vincent Gonguet Virginie Do Vish Vogeti V\'itor Albiero Vladan Petrovic Weiwei Chu Wenhan Xiong Wenyin Fu Whitney Meers Xavier Martinet Xiaodong Wang Xiaofang Wang Xiaoqing Ellen Tan Xide Xia Xinfeng Xie Xuchao Jia Xuewei Wang Yaelle Goldschlag Yashesh Gaur Yasmine Babaei Yi Wen Yiwen Song Yuchen Zhang Yue Li Yuning Mao Zacharie Delpierre Coudert Zheng Yan Zhengxing Chen Zoe Papakipos Aaditya Singh Aayushi Srivastava Abha Jain Adam Kelsey Adam Shajnfeld Adithya Gangidi Adolfo Victoria Ahuva Goldstand Ajay Menon Ajay Sharma Alex Boesenberg Alexei Baevski Allie Feinstein Amanda Kallet Amit Sangani Amos Teo Anam Yunus Andrei Lupu Andres Alvarado Andrew Caples Andrew Gu Andrew Ho Andrew Poulton Andrew Ryan 
Ankit Ramchandani Annie Dong Annie Franco Anuj Goyal Aparajita Saraf Arkabandhu Chowdhury Ashley Gabriel Ashwin Bharambe Assaf Eisenman Azadeh Yazdan Beau James Ben Maurer Benjamin Leonhardi Bernie Huang Beth Loyd Beto De Paola Bhargavi Paranjape Bing Liu Bo Wu Boyu Ni Braden Hancock Bram Wasti Brandon Spence Brani Stojkovic Brian Gamido Britt Montalvo Carl Parker Carly Burton Catalina Mejia Ce Liu Changhan Wang Changkyu Kim Chao Zhou Chester Hu Ching-Hsiang Chu Chris Cai Chris Tindal Christoph Feichtenhofer Cynthia Gao Damon Civin Dana Beaty Daniel Kreymer Daniel Li David Adkins David Xu Davide Testuggine Delia David Devi Parikh Diana Liskovich Didem Foss Dingkang Wang Duc Le Dustin Holland Edward Dowling Eissa Jamil Elaine Montgomery Eleonora Presani Emily Hahn Emily Wood Eric-Tuan Le Erik Brinkman Esteban Arcaute Evan Dunbar Evan Smothers Fei Sun Felix Kreuk Feng Tian Filippos Kokkinos Firat Ozgenel Francesco Caggioni Frank Kanayet Frank Seide Gabriela Medina Florez Gabriella Schwarz Gada Badeer Georgia Swee Gil Halpern Grant Herman Grigory Sizov Guangyi (Jack) Zhang Guna Lakshminarayanan Hakan Inan Hamid Shojanazeri Han Zou Hannah Wang Hanwen Zha Haroun Habeeb Harrison Rudolph Helen Suk Henry Aspegren Hunter Goldman Hongyuan Zhan Ibrahim Damlaj Igor Molybog Igor Tufanov Ilias Leontiadis Irina-Elena Veliche Itai Gat Jake Weissman James Geboski James Kohli Janice Lam Japhet Asher Jean-Baptiste Gaya Jeff Marcus Jeff Tang Jennifer Chan Jenny Zhen Jeremy Reizenstein Jeremy Teboul Jessica Zhong Jian Jin Jingyi Yang Joe Cummings Jon Carvill Jon Shepard Jonathan McPhie Jonathan Torres Josh Ginsburg Junjie Wang Kai Wu Kam Hou U Karan Saxena Kartikay Khandelwal Katayoun Zand Kathy Matosich Kaushik Veeraraghavan Kelly Michelena Keqian Li Kiran Jagadeesh Kun Huang Kunal Chawla Kyle Huang Lailin Chen Lakshya Garg Lavender A Leandro Silva Lee Bell Lei Zhang Liangpeng Guo Licheng Yu Liron Moshkovich Luca Wehrstedt Madian Khabsa Manav Avalani Manish Bhatt Martynas Mankus Matan Hasson Matthew Lennie Matthias Reso Maxim Groshev Maxim Naumov Maya Lathi Meghan Keneally Miao Liu Michael L. 
Seltzer Michal Valko Michelle Restrepo Mihir Patel Mik Vyatskov Mikayel Samvelyan Mike Clark Mike Macey Mike Wang Miquel Jubert Hermoso Mo Metanat Mohammad Rastegari Munish Bansal Nandhini Santhanam Natascha Parks Natasha White Navyata Bawa Nayan Singhal Nick Egebo Nicolas Usunier Nikhil Mehta Nikolay Pavlovich Laptev Ning Dong Norman Cheng Oleg Chernoguz Olivia Hart Omkar Salpekar Ozlem Kalinli Parkin Kent Parth Parekh Paul Saab Pavan Balaji Pedro Rittner Philip Bontrager Pierre Roux Piotr Dollar Polina Zvyagina Prashant Ratanchandani Pritish Yuvraj Qian Liang Rachad Alao Rachel Rodriguez Rafi Ayub Raghotham Murthy Raghu Nayani Rahul Mitra Rangaprabhu Parthasarathy Raymond Li Rebekkah Hogan Robin Battey Rocky Wang Russ Howes Ruty Rinott Sachin Mehta Sachin Siby Sai Jayesh Bondu Samyak Datta Sara Chugh Sara Hunt Sargun Dhillon Sasha Sidorov Satadru Pan Saurabh Mahajan Saurabh Verma Seiji Yamamoto Sharadh Ramaswamy Shaun Lindsay Sheng Feng Shenghao Lin Shengxin Cindy Zha Shishir Patil Shiva Shankar Shuqiang Zhang Sinong Wang Sneha Agarwal Soji Sajuyigbe Soumith Chintala Stephanie Max Stephen Chen Steve Kehoe Steve Satterfield Sudarshan Govindaprasad Sumit Gupta Summer Deng Sungmin Cho Sunny Virk Suraj Subramanian Sy Choudhury Sydney Goldman Tal Remez Tamar Glaser Tamara Best Thilo Koehler Thomas Robinson Tianhe Li Tianjun Zhang Tim Matthews Timothy Chou Tzook Shaked Varun Vontimitta Victoria Ajayi Victoria Montanez Vijai Mohan Vinay Satish Kumar Vishal Mangla Vlad Ionescu Vlad Poenaru Vlad Tiberiu Mihailescu Vladimir Ivanov Wei Li Wenchen Wang Wenwen Jiang Wes Bouaziz Will Constable Xiaocheng Tang Xiaojian Wu Xiaolan Wang Xilun Wu Xinbo Gao Yaniv Kleinman Yanjun Chen Ye Hu Ye Jia Ye Qi Yenda Li Yilin Zhang Ying Zhang Yossi Adi Youngjin Nam Yu (Sid) Wang Yu Zhao Yuchen Hao Yundi Qian Yunlu Li Yuzi He Zach Rait Zachary DeVito Zef Rosnbrick Zhaoduo Wen Zhenyu Yang Zhiwei Zhao Zhiyu Ma

Pith reviewed 2026-05-08 21:44 UTC · model claude-opus-4-7

classification 💻 cs.AI · cs.CL · cs.CV
keywords foundation models · large language models · dense Transformer · long context · multilinguality · tool use · compositional multimodality · open weights

The pith

A 405B-parameter dense Transformer with a 128K context matches GPT-4-class quality across language, code, reasoning, and tool use, and reaches competitive multimodal performance by attaching image, video, and speech encoders rather than training a fused multimodal model end to end.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a family of foundation language models topped by a 405B-parameter dense Transformer with a 128K-token context window, and reports an empirical case that this open model is roughly on par with the leading closed models on multilingual text, coding, reasoning, and tool-use benchmarks. It defends a deliberately conservative architectural choice — a single dense Transformer rather than a mixture of experts or other sparsity scheme — and argues that careful data curation, scaling, and post-training pipelines (supervised fine-tuning, preference optimization, safety tuning) are what carry the quality. For multimodality, it advances a compositional thesis: train modality encoders for image, video, and speech and attach them to the language model through adapters, rather than pretraining a single fused multimodal model from scratch. The paper claims this composition is already competitive with state-of-the-art on standard perception benchmarks. The release of weights for both pretrained and post-trained versions, along with a separate input/output safety classifier, is part of the contribution: it puts a frontier-scale model in third-party hands so others can reproduce or contest the parity claim.

Core claim

A 405-billion-parameter dense Transformer with a 128K-token context window, trained and post-trained at scale, reaches quality comparable to the strongest closed language models on a wide span of tasks: multilingual text, code, mathematical and general reasoning, and tool use. The paper further argues that you do not need a single end-to-end multimodal model to be competitive on images, video, and speech: bolting modality-specific encoders onto the frozen-or-lightly-adapted language model — a compositional rather than fused design — already lands near state-of-the-art on standard benchmarks for those modalities.

What carries the argument

A 405B dense Transformer scaled with disciplined data, long-context (128K) training, and a multi-stage post-training stack (SFT + preference optimization + safety tuning), combined with a compositional multimodal recipe in which separately trained image, video, and speech encoders are attached via adapters to the language model rather than co-trained from scratch.
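
To make the compositional recipe concrete, here is a minimal, hedged sketch of the pattern: a separately trained vision encoder attached to a frozen language model through a small trainable adapter. It is illustrative only; the module names, dimensions, and the visual-prefix design are placeholder assumptions, not the paper's actual adapter architecture.

import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Project frozen vision-encoder features into the language model's embedding space."""
    def __init__(self, vision_dim=1024, lm_dim=4096, num_visual_tokens=32):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
        self.num_visual_tokens = num_visual_tokens

    def forward(self, vision_features):                  # (batch, num_patches, vision_dim)
        tokens = self.proj(vision_features)               # (batch, num_patches, lm_dim)
        return tokens[:, : self.num_visual_tokens]        # fixed-length visual prefix

def multimodal_forward(image, text_ids, vision_encoder, adapter, language_model):
    # Encoder and language model stay frozen; only the adapter's parameters are trained.
    with torch.no_grad():
        features = vision_encoder(image)                   # separately pretrained image encoder
    visual_tokens = adapter(features)
    text_embeds = language_model.get_input_embeddings()(text_ids)
    inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
    return language_model(inputs_embeds=inputs_embeds)     # assumes an HF-style causal LM interface

The same pattern, with a different encoder and adapter, covers video and speech; the appeal of the design is that any of the encoders can be retrained or replaced without touching the language core.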

If this is right

  • An openly released 405B model with 128K context lets outside groups reproduce, audit, and red-team frontier-scale behavior, including contamination checks the paper itself cannot fully rule out.
  • If a plain dense Transformer at this scale really matches the mixture-of-experts and other architectural variants used by competitors, the marginal value of architectural novelty over data and post-training is smaller than commonly assumed.
  • The compositional multimodal result implies that strong vision, video, and speech capabilities can be attached to a text-only language model through adapters, without end-to-end multimodal pretraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Editorial inference: the parity-with-GPT-4 framing is partly a claim about ceilings — that a dense, well-curated recipe at 405B is near a plateau where further gains from scale-and-data alone are sublinear, and that the next frontier is post-training, tools, and modalities rather than parameter count.
  • Editorial inference: the compositional multimodal choice is also a hedging strategy — encoders can be swapped or upgraded without retraining the language core, which matters more for a release pipeline than for a single benchmark number.
  • Editorial inference: open-weighting a 405B model effectively externalizes evaluation, handing third parties, not the authors, the job of stress-testing the parity claim, including the contamination question the paper cannot settle on its own.

Load-bearing premise

That the public benchmark scores used to claim parity with the strongest closed models actually reflect general capability, rather than overlap between evaluation sets and the (undisclosed) pretraining corpus or evaluation choices that flatter the released model.

What would settle it

A controlled head-to-head evaluation on tasks constructed after Llama 3's training cutoff and verified to be absent from its training data — covering multilingual reasoning, code, math, and long-context retrieval — in which the 405B model trails the named frontier models by a wide margin would directly undermine the parity claim.

read the original abstract

Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 4 minor

Summary. The manuscript presents the Llama 3 family of foundation models, headlined by a dense 405B-parameter Transformer with a 128K-token context window, alongside 8B and 70B variants. The authors describe pretraining, post-training (SFT and preference optimization), a safety model (Llama Guard 3), and a compositional approach for adding image, video, and speech encoders to the language backbone. The empirical claim is that Llama 3 reaches quality "comparable to leading language models such as GPT-4" across multilingual text, code, reasoning, and tool-use benchmarks, and that the multimodal extensions are competitive with the state of the art on their respective tasks. Pretrained and post-trained 405B weights and Llama Guard 3 are released; the multimodal extensions are not.

Significance. If the parity claim holds, this is a significant artifact contribution: an openly released 405B dense model with a 128K context window narrows the gap between open-weights and closed frontier models and enables third-party scientific work (interpretability, fine-tuning, contamination audits, red-teaming) that is impossible on closed APIs. The release of Llama Guard 3 as a separate safety classifier and the description of a working compositional multimodal pipeline are independently useful. The paper's strengths that should be credited explicitly: (i) the weights themselves are released, so the headline inference claims are independently verifiable in a way that closed-model papers are not; (ii) the scope of the empirical evaluation is unusually broad; (iii) the compositional rather than end-to-end multimodal recipe is a concrete, reproducible design choice. The weakness, addressed below, is that the relative claim against GPT-4 is the load-bearing scientific assertion and is the part hardest to verify from the paper alone.

major comments (4)
  1. [Abstract / Evaluation sections] The central comparative claim — 'comparable quality to leading language models such as GPT-4' — is a relative claim against a closed, moving target. The manuscript should make explicit, in one place, for every headline benchmark: (a) whether the GPT-4 number was re-run by the authors or quoted, (b) the API snapshot/date, (c) prompt template, system message, few-shot exemplars, decoding parameters, and CoT policy used for both models, and (d) whether these were held identical across systems. Several percentage points on MMLU/GSM8K/MATH/HumanEval/MBPP can be moved by these choices alone, and the claimed parity gaps are of that order. Without this matrix the parity claim cannot be audited.
  2. [Pretraining data / decontamination] The decontamination methodology (typically n-gram overlap with eval sets in releases of this kind) catches near-duplicates but not paraphrases, translations, or solutions discussed in web text. Because the pretraining corpus is not disclosed at document level, an independent contamination probe is needed to support the headline benchmark numbers: e.g., performance on freshly constructed or post-cutoff held-out variants, perplexity gap between benchmark items and matched controls, or membership-inference-style tests on benchmark instances. Please add at least one such probe, or qualify the parity framing accordingly.
  3. [Multimodal experiments] The abstract states the compositional image/video/speech approach 'performs competitively with the state-of-the-art,' but the corresponding models are 'not yet being broadly released.' For a non-released system the burden on evaluation transparency is higher, not lower: please ensure the multimodal sections specify exactly which baselines, checkpoints, and protocols are compared, and which numbers are taken from prior work versus re-run.
  4. [Scope of contribution] It would help the reader if the manuscript stated which elements are intended as scientific contributions (e.g., scaling-law analyses, post-training recipe ablations, the compositional multimodal recipe, Llama Guard 3 design) versus engineering/release documentation. As written, the paper mixes both, and reviewers cannot easily identify which claims are meant to be defended on methodological grounds.
minor comments (4)
  1. [Abstract] 'comparable quality to leading language models such as GPT-4' would be more precise as a quantified statement (e.g., 'within X points on benchmark suite Y under matched protocol Z'). The current phrasing invites overreading.
  2. [Abstract] Clarify what 'compositional approach' means at the abstract level (frozen LM + trained adapter + modality encoder, or otherwise), since this is the multimodal design contribution being claimed.
  3. [Release] Specify in the abstract or introduction the license under which Llama 3 and Llama Guard 3 are released, as this materially affects the artifact's scientific value (third-party reproducibility, contamination audits, fine-tuning studies).
  4. [Terminology] The phrase 'natively support multilinguality, coding, reasoning, and tool usage' conflates capability with training emphasis; consider rewording to indicate that these are explicitly targeted in the data mixture and post-training, not architectural features.

Simulated Author's Rebuttal

4 responses · 2 unresolved

We thank the referee for a careful and constructive report, and in particular for crediting the open release of weights as the mechanism by which our headline inference claims can be independently audited. We agree with the central thrust of the major comments: the load-bearing scientific assertion in the manuscript is the parity claim against closed frontier models, and that claim deserves a more explicit evaluation-protocol matrix and an independent contamination probe than the current draft provides. We will revise accordingly. Below we respond point by point, indicate where the manuscript will be amended, and note one item (closed-model evaluation transparency) where our ability to comply is intrinsically limited by the closed nature of the comparator.
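
To make the requested matrix concrete before the point-by-point responses, one row could be recorded as a small structured record along the lines of the sketch below; the field names and the example values are illustrative assumptions, not entries taken from the paper.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalProtocolRow:
    """One benchmark row of the proposed evaluation-protocol matrix (illustrative fields only)."""
    benchmark: str                   # e.g. "MMLU"
    comparator: str                  # e.g. "GPT-4"
    number_source: str               # "re-run by authors" or "quoted from vendor/leaderboard"
    api_model_id: Optional[str]      # exact API snapshot if re-run, else None
    eval_date_window: Optional[str]  # when the API calls were made, if re-run
    num_shots: int
    chain_of_thought: bool
    decoding: str                    # e.g. "greedy" or "temperature=0.0, top_p=1.0"
    identical_protocol: bool         # same prompt/decoding applied to both systems?

# Hypothetical example row; every value below is a placeholder, not a result from the paper.
example_row = EvalProtocolRow(
    benchmark="MMLU", comparator="GPT-4",
    number_source="quoted from vendor report",
    api_model_id=None, eval_date_window=None,
    num_shots=5, chain_of_thought=False,
    decoding="greedy", identical_protocol=False,
)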

read point-by-point responses
  1. Referee: The 'comparable to GPT-4' claim needs an explicit per-benchmark matrix: re-run vs. quoted, API snapshot/date, prompt template, system message, few-shot exemplars, decoding parameters, CoT policy, and whether these were held identical across systems.

    Authors: We agree and will add such a matrix as an appendix table covering every headline benchmark in the main text (MMLU, MMLU-Pro, GSM8K, MATH, HumanEval, MBPP, GPQA, IFEval, multilingual and tool-use suites). For each row we will state: (a) re-run by us vs. quoted from the source paper/leaderboard; (b) for re-runs of GPT-4/GPT-4o/Claude/Gemini, the exact API model identifier and the date window in which calls were made; (c) the full prompt, system message, k-shot exemplars, temperature/top-p/max-tokens, and CoT/no-CoT setting; and (d) an explicit indicator of whether the protocol was held identical across systems. Where we quoted vendor-reported numbers (because re-running was not possible or not faithful, e.g. tool-use harnesses we do not control) we will mark this and avoid framing those rows as parity evidence. We will also soften abstract language from 'comparable quality' to a more precise statement keyed to the matrix, and flag that the residual gaps are within the range that prompt/decoding choices alone can move. We acknowledge that some transparency limits are intrinsic: we cannot disclose the internals of closed comparators, only our calling conditions. revision: yes

  2. Referee: n-gram decontamination misses paraphrases/translations/solutions in web text; an independent contamination probe is needed (post-cutoff variants, perplexity-gap, membership-inference) or the parity framing should be qualified.

    Authors: This is a fair point. Our released decontamination procedure is indeed n-gram-based and we agree it does not bound paraphrase or translated leakage. In revision we will add at least two probes: (i) evaluation on post-training-cutoff held-out variants — we will report results on benchmarks released after our data cutoff (e.g. recent contest-math and code competitions, post-cutoff GPQA-style items, and freshly authored multilingual items) and contrast with the headline numbers; and (ii) a perplexity-gap analysis comparing model NLL on benchmark items vs. matched controls drawn from the same source distribution but not in any benchmark. Where the gap is non-trivial we will flag the affected benchmarks and weaken the parity framing for those specific tasks rather than the overall claim. A full membership-inference study at 405B is more involved; we will scope what is feasible and report it, and otherwise will explicitly qualify the framing as the referee suggests. revision: yes

  3. Referee: For the unreleased multimodal models the bar on evaluation transparency is higher: specify baselines, checkpoints, protocols, and which numbers are quoted vs. re-run.

    Authors: We accept this. The multimodal sections will be revised so that every reported comparison lists: the baseline model and exact checkpoint/version, whether the number is taken from the original publication or re-run by us, and the evaluation protocol (prompt, decoding, frame-sampling for video, audio preprocessing for speech, scoring script). We will also add a per-task table separating 're-run by us under matched protocol' from 'quoted from prior work' rows, mirroring the language-model matrix described above. We will additionally weaken 'competitively with the state-of-the-art' in the abstract to a task-conditional statement, since the unreleased status of the multimodal models means readers cannot independently verify these numbers and we should not lean on them as if they could. revision: yes

  4. Referee: State which elements are intended as scientific contributions versus engineering/release documentation, so reviewers can identify which claims are defended on methodological grounds.

    Authors: We agree this clarification will help readers. We will add a short 'Scope of contributions' subsection in the introduction that explicitly classifies the components. Our intended scientific contributions are: the scaling-law analysis used to choose the 405B compute/data point and its predictive validation; the post-training recipe (rejection sampling + SFT + DPO iteration) and its ablations; the compositional multimodal recipe; and the Llama Guard 3 taxonomy and classifier design. The remaining material — infrastructure, parallelism, data-pipeline engineering, and the benchmark suite itself — is release/engineering documentation supporting reproducibility of the released weights, and we will label it as such rather than as methodological claims to be defended. Headline benchmark numbers are evidence about the released artifact, not standalone scientific claims, and we will frame them that way. revision: yes
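
For readers unfamiliar with the acronym, the "DPO iteration" named in the post-training recipe above is usually the direct preference optimization objective; a standard form, given here as background rather than as a quotation from the paper, is

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],

where y_w and y_l are the preferred and rejected responses for prompt x, \pi_{\mathrm{ref}} is a frozen reference policy (typically the SFT model), \sigma is the logistic function, and \beta controls how far the policy may drift from the reference.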

standing simulated objections (unresolved)
  • Full transparency on the comparator side of the GPT-4 parity claim is intrinsically bounded: we can disclose our API snapshot, prompts, and decoding settings, but not the closed model's internals, version drift between snapshots, or any server-side prompt processing. We will document our side completely and qualify the parity claim accordingly, but cannot eliminate this asymmetry.
  • A complete membership-inference contamination study at 405B scale across all headline benchmarks may exceed what we can include in revision; we will report what is feasible and qualify the remainder rather than over-claim coverage.
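
As a concrete illustration of the lighter-weight probe committed to in response 2, a perplexity-gap check can be sketched in a few lines; the model loading, the data lists, and any threshold for "non-trivial" are assumptions for illustration, not the authors' actual analysis.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_token_nll(model, tokenizer, texts, device="cuda"):
    """Average per-token negative log-likelihood of the model over a list of strings."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        with torch.no_grad():
            loss = model(ids, labels=ids).loss      # mean NLL over the predicted tokens
        n_predicted = ids.shape[-1] - 1
        total_nll += loss.item() * n_predicted
        total_tokens += n_predicted
    return total_nll / total_tokens

# benchmark_items: benchmark questions (and answers) rendered as plain text.
# matched_controls: passages from the same source distribution that appear in no benchmark.
# A benchmark-item NLL noticeably below the control NLL is a contamination signal for that suite:
#   gap = mean_token_nll(model, tok, matched_controls) - mean_token_nll(model, tok, benchmark_items)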

Circularity Check

0 steps flagged

No significant circularity: Llama 3 is an empirical engineering report whose central claim is benchmarked against external systems and externally reproducible weights, not a self-derivation.

full rationale

The paper's load-bearing claim — that the 405B dense Transformer with 128K context delivers "comparable quality to leading language models such as GPT-4 on a plethora of tasks," and that compositional image/video/speech encoders are competitive with SOTA — is a relative empirical claim against external systems on external benchmarks (MMLU, GSM8K, MATH, HumanEval, MBPP, etc.). It is not derived from a chain of equations, fitted parameters, or a uniqueness theorem; there is therefore no structural way for the conclusion to reduce to its own inputs by definition.

The artifact itself (open weights, Llama Guard 3) is externally reproducible: any third party can rerun the released model and check the numbers, which satisfies the "code-reproduced / externally falsifiable" exception to circularity. Self-citation to prior Llama work is bibliographic rather than load-bearing for the parity claim.

The genuine concerns flagged by the reader — benchmark contamination given undisclosed pretraining data, asymmetric evaluation harnesses against a closed GPT-4 API, n-gram-only decontamination missing paraphrases — are real, but they are measurement-validity / correctness-risk issues, not circularity in the technical sense used here. They would show up as "the benchmark numbers may not measure what the paper says they measure," not as "the prediction equals the input by construction." Per the analyzer's hard rule #5, "this is not standard consensus" or "the eval protocol is suspect" belongs under correctness risk, not circularity.

Only the abstract text is available, so a thorough section-by-section walk is not possible, but nothing in the abstract describes a derivation step that fits the seven circularity patterns. Score: 1, reflecting routine self-citation to prior Llama models without any load-bearing reduction.

Axiom & Free-Parameter Ledger

4 free parameters · 3 axioms · 1 invented entity

Llama 3 introduces no new physical or theoretical entities. The 'ledger' is dominated by engineering free parameters (scale, context, data mixture, post-training settings) plus standard benchmark-validity assumptions. The single new released artifact beyond the LLMs themselves is Llama Guard 3, which is empirically falsifiable because its weights are public.

free parameters (4)
  • Model scale (8B, 70B, 405B parameters) = 405B flagship
    Chosen via internal scaling-law experiments described in the report; not derived from theory.
  • Context length = 128K tokens
    Engineering choice based on long-context training and evaluation.
  • Data mixture weights across domains and languages = not disclosed at document level
    Tuned empirically; central to the parity claim but not externally auditable.
  • Post-training hyperparameters (SFT/preference optimization) = internal
    Tuned against internal eval suites.
axioms (3)
  • domain assumption Public benchmarks validly measure the capabilities they name (MMLU, HumanEval, GSM8K, MATH, multilingual, ASR, etc.).
    The parity claim is benchmark-mediated.
  • domain assumption Pretraining data is adequately decontaminated against evaluation sets.
    Required to interpret benchmark parity as capability parity.
  • domain assumption Compositional multimodality (frozen-ish encoders + adapters + LLM) is a fair comparator to end-to-end multimodal systems on the chosen tasks.
    Underlies the 'competitive with state-of-the-art' multimodal claim.
invented entities (1)
  • Llama Guard 3 independent evidence
    purpose: Classifier for unsafe model inputs and outputs, released alongside the language models.
    Released with weights, so its behavior is directly testable by third parties; this is a new artifact rather than an unfalsifiable postulate.
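
Because the classifier's weights are public, the "directly testable by third parties" point can be exercised in a few lines. The sketch below assumes the model is published as meta-llama/Llama-Guard-3-8B on Hugging Face and that, as with earlier Llama Guard releases, its chat template wraps a conversation in the moderation prompt and the model replies with "safe" or "unsafe" plus a category code; both are assumptions to verify against the model card.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"   # assumed public repository id; check the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

conversation = [{"role": "user", "content": "Example input to classify."}]
input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)

# Expected reply format under the Llama Guard convention: "safe", or "unsafe" followed by a category code.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))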

pith-pipeline@v0.9.0 · 14829 in / 5495 out tokens · 89954 ms · 2026-05-08T21:44:46.085270+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Language Models Lack Temporal Awareness of Medical Knowledge

    cs.LG 2026-05 unverdicted novelty 8.0

    LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

  2. Inference-Time Machine Unlearning via Gated Activation Redirection

    cs.LG 2026-05 conditional novelty 8.0

    GUARD-IT performs machine unlearning in LLMs via inference-time gated activation redirection, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.

  3. Pretraining Exposure Explains Popularity Judgments in Large Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.

  4. Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

    cs.LG 2026-05 accept novelty 8.0

    Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

  5. SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 8.0

    SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

  6. Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

    cs.CR 2026-05 unverdicted novelty 8.0

    Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this pro...

  7. Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

    cs.AI 2026-05 unverdicted novelty 8.0

    Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

  8. OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

    cs.LG 2026-05 unverdicted novelty 8.0

    OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...

  9. MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

    cs.LG 2026-05 unverdicted novelty 8.0

    MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

  10. Narrow Secret Loyalty Dodges Black-Box Audits

    cs.CR 2026-05 unverdicted novelty 8.0

    Narrow secret loyalties implanted via fine-tuning in LLMs at multiple scales evade black-box audits unless the auditor knows the target principal.

  11. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  12. DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

    cs.LG 2026-05 conditional novelty 8.0

    INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.

  13. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  14. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.

  15. HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

    cs.SD 2026-04 unverdicted novelty 8.0

    HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...

  16. ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

    cs.CL 2026-04 unverdicted novelty 8.0

    ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

  17. Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

    cs.CL 2026-04 conditional novelty 8.0

    Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

  18. LINE: LLM-based Iterative Neuron Explanations for Vision Models

    cs.CV 2026-04 unverdicted novelty 8.0

    LINE iteratively refines open-vocabulary neuron concepts in vision models via LLM proposals and text-to-image testing, achieving up to 0.11 AUC gains on ImageNet while uncovering 27% new concepts missed by fixed vocabularies.

  19. Backdoor Attacks on Decentralised Post-Training

    cs.CR 2026-03 conditional novelty 8.0

    An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequen...

  20. Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection

    cs.CL 2026-03 unverdicted novelty 8.0

    Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.

  21. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  22. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  23. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  24. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    cs.AI 2024-08 unverdicted novelty 8.0

    The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

  25. LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    cs.CL 2024-06 unverdicted novelty 8.0

    LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

  26. RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

    cs.LG 2026-05 unverdicted novelty 7.0

    RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.

  27. From Table to Cell: Attention for Better Reasoning with TABALIGN

    cs.AI 2026-05 unverdicted novelty 7.0

    TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...

  28. Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

    cs.CL 2026-05 unverdicted novelty 7.0

    New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

  29. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  30. Dynamic Latent Routing

    cs.LG 2026-05 unverdicted novelty 7.0

    Dynamic Latent Routing jointly learns discrete latent codes, routing policies, and model parameters via dynamic search to match or exceed supervised fine-tuning by 6.6 points on average in low-data settings across fou...

  31. How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

    cs.LG 2026-05 unverdicted novelty 7.0

    The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, an...

  32. BOOKMARKS: Efficient Active Storyline Memory for Role-playing

    cs.CL 2026-05 unverdicted novelty 7.0

    BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.

  33. Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards

    cs.CL 2026-05 unverdicted novelty 7.0

    Fine-tuned LLMs trained with reinforcement learning using verifiable rewards produce floor plans that satisfy connectivity and numerical constraints, outperforming prior methods with at least 94% relative improvement ...

  34. Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 7.0

    Mistletoe is a stealthy attack that collapses the speedup of speculative decoding by reducing average accepted length τ without changing output semantics or perplexity.

  35. Inducing Artificial Uncertainty in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

  36. LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking

    cs.CL 2026-05 unverdicted novelty 7.0

    LongBEL improves biomedical entity linking consistency by combining full-document context with memory of previous predictions trained via cross-validation rather than gold labels.

  37. IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

    cs.CL 2026-05 unverdicted novelty 7.0

    A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.

  38. Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

    cs.LG 2026-05 conditional novelty 7.0

    ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...

  39. From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

    cs.LG 2026-05 conditional novelty 7.0

    AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.

  40. Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

    math.OC 2026-05 conditional novelty 7.0

    Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.

  41. Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators

    cs.CL 2026-05 conditional novelty 7.0

    LLM simulators exhibit near-zero selective response to targeted misconception feedback and behave sycophantically, but SFT and SFS-aligned RL improve this property.

  42. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  43. BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts

    cs.AI 2026-05 conditional novelty 7.0

    BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.

  44. From Noise to Diversity: Random Embedding Injection in LLM Reasoning

    cs.AI 2026-05 conditional novelty 7.0

    Random Soft Prompts (RSPs) sampled from the embedding distribution improve Pass@N on reasoning benchmarks by increasing early-stage token diversity without any training.

  45. Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    LoRA adapters fix collapsed visual CLS token attention in CLIP for superior cross-domain few-shot learning, and the new Semantic Probe framework revives prompt methods to reach state-of-the-art on four benchmarks.

  46. CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

    cs.AI 2026-05 unverdicted novelty 7.0

    CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.

  47. Much of Geospatial Web Search Is Beyond Traditional GIS

    cs.IR 2026-05 unverdicted novelty 7.0

    Analysis of 1.01 million unfiltered Bing queries identifies 18% as geospatial, dominated by transactional categories like costs (15.3%) that exceed traditional GIS scope.

  48. Count Anything at Any Granularity

    cs.CV 2026-05 unverdicted novelty 7.0

    Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...

  49. Uniform Scaling Limits in AdamW-Trained Transformers

    stat.ML 2026-05 unverdicted novelty 7.0

    AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...

  50. SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

    cs.LG 2026-05 unverdicted novelty 7.0

    SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.

  51. ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    ConQuR is a post-training rotation calibration technique that aligns activations to hypercube corners via Procrustes optimization and online updates, delivering competitive LLM quantization performance without end-to-...

  52. MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives ...

  53. Locking Pretrained Weights via Deep Low-Rank Residual Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via mo...

  54. Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

    cs.CL 2026-05 unverdicted novelty 7.0

    GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.

  55. BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization

    cs.LG 2026-05 unverdicted novelty 7.0

    BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.

  56. Infinite Mask Diffusion for Few-Step Distillation

    cs.CL 2026-05 unverdicted novelty 7.0

    Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.

  57. Privacy-preserving Chunk Scheduling in a BitTorrent Implementation of Federated Learning

    cs.DC 2026-05 unverdicted novelty 7.0

    FLTorrent achieves within-round source unlinkability in decentralized federated learning via a BitTorrent warm-up with pre-round obfuscation, randomized lags, and coordination-only non-owner-first scheduling, reaching...

  58. SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

    cs.LG 2026-05 unverdicted novelty 7.0

    SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

  59. Unsupervised Process Reward Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

  60. GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

    cs.SI 2026-05 unverdicted novelty 7.0

    GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...