pith. machine review for the scientific record.

arxiv: 2508.10925 · v1 · submitted 2025-08-08 · 💻 cs.CL · cs.AI

Recognition: 2 Lean theorem links

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI: Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai
Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, Vlad Fomenko, Timur Garipov, Kristian Georgiev, Mia Glaese, Tarun Gogineni, Adam Goucher, Lukas Gross, Katia Gil Guzman, John Hallman, Jackie Hehir, Johannes Heidecke, Alec Helyar, Haitang Hu, Romain Huet, Jacob Huh, Saachi Jain, Zach Johnson, Chris Koch, Irina Kofman, Dominik Kundel, Jason Kwon, Volodymyr Kyrylov, Elaine Ya Le, Guillaume Leclerc, James Park Lennon, Scott Lessans, Mario Lezcano-Casado, Yuanzhi Li, Zhuohan Li, Ji Lin, Jordan Liss, Lily (Xiaoxuan) Liu, Jiancheng Liu, Kevin Lu, Chris Lu, Zoran Martinovic, Lindsay McCallum, Josh McGrath, Scott McKinney, Aidan McLaughlin, Song Mei, Steve Mostovoy, Tong Mu, Gideon Myles, Alexander Neitz, Alex Nichol, Jakub Pachocki, Alex Paino, Dana Palmie, Ashley Pantuliano, Giambattista Parascandolo, Jongsoo Park, Leher Pathak, Carolina Paz, Ludovic Peran, Dmitry Pimenov, Michelle Pokrass, Elizabeth Proehl, Huida Qiu, Gaby Raila, Filippo Raso, Hongyu Ren, Kimmy Richardson, David Robinson, Bob Rotsted, Hadi Salman, Suvansh Sanjeev, Max Schwarzer, D. Sculley, Harshit Sikchi, Kendal Simon, Karan Singhal, Yang Song, Dane Stuckey, Zhiqing Sun, Philippe Tillet, Sam Toizer, Foivos Tsimpourlas, Nikhil Vyas, Eric Wallace, Xin Wang, Miles Wang, Olivia Watkins, Kevin Weil, Amy Wendling, Kevin Whinnery, Cedric Whitney, Hannah Wong, Lin Yang, Yu Yang, Michihiro Yasunaga, Kristen Ying, Wojciech Zaremba, Wenting Zhan, Cyril Zhang, Brian Zhang, Eddie Zhang, Shengjia Zhao
Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:18 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords open-weight models · reasoning models · mixture-of-experts · agentic capabilities · model release · distillation · reinforcement learning · tool use

The pith

Two open-weight models built on a mixture-of-experts architecture deliver strong results on math, coding, and safety benchmarks while supporting agentic tool use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents gpt-oss-120b and gpt-oss-20b as open-weight reasoning models built on an efficient mixture-of-experts transformer design. These models are trained through large-scale distillation and reinforcement learning to handle agentic tasks including deep research browsing, Python tool integration, and developer-defined functions within a rendered chat format that supports clear instruction following and role delineation. The authors report strong benchmark performance across mathematics, coding, and safety evaluations. They release the model weights, inference code, tool environments, and tokenizers under an Apache 2.0 license. A sympathetic reader would care because the work supplies accessible high-capacity models that lower barriers for further research and application in agent-based systems.
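The developer-defined function support described here generally means the model emits a structured tool call that the host program validates against a declared schema and executes. The sketch below is illustrative only: the schema shape follows common JSON-schema function-calling conventions, and `get_weather` is a hypothetical stub, not part of the released harmony chat format.

```python
import json

# Illustrative tool schema in the common JSON-schema style (not the exact
# format the released models use).
get_weather_schema = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    # Hypothetical stub standing in for a real weather API call.
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(model_call_json: str) -> str:
    """Validate a model-emitted tool call against its schema and run it."""
    call = json.loads(model_call_json)
    schema = get_weather_schema  # single-tool example for brevity
    for req in schema["parameters"]["required"]:
        if req not in call["arguments"]:
            raise ValueError(f"missing required argument: {req}")
    return TOOLS[call["name"]](**call["arguments"])

# A model's tool-call message, rendered as JSON by the chat format.
result = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
```

In a real agent loop the returned string would be fed back to the model as a tool-result message before the next generation step.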

Core claim

We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.

What carries the argument

An efficient mixture-of-experts transformer architecture, trained with large-scale distillation and reinforcement learning, and optimized for agentic capabilities inside a rendered chat format.
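A mixture-of-experts layer of the kind described routes each token to a small subset of expert MLPs via a learned router, so only a fraction of the parameters are active per token. The numpy sketch below shows top-k routing in its simplest form; the expert count, k, dimensions, and ReLU activation are illustrative, not the released models' configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, router_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:   (n_tokens, d_model)
    router_w: (d_model, n_experts) router projection
    experts:  list of (w_in, w_out) MLP weight pairs
    """
    logits = tokens @ router_w                  # (n, E) router scores
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the top-k experts
    # Renormalize the selected scores so the k gate weights sum to 1.
    gates = softmax(np.take_along_axis(logits, topk, axis=-1), axis=-1)

    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        for j, e in enumerate(topk[i]):
            w_in, w_out = experts[e]
            h = np.maximum(tok @ w_in, 0.0)     # toy expert MLP with ReLU
            out[i] += gates[i, j] * (h @ w_out)
    return out, topk

rng = np.random.default_rng(0)
d, n_experts, n_tokens = 8, 4, 3
experts = [(rng.normal(size=(d, 16)), rng.normal(size=(16, d)))
           for _ in range(n_experts)]
y, chosen = moe_layer(rng.normal(size=(n_tokens, d)),
                      rng.normal(size=(d, n_experts)), experts, k=2)
```

The efficiency claim rests on this sparsity: with k=2 of 4 experts, each token touches half the expert parameters, and the ratio shrinks as the expert count grows.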

If this is right

  • Developers gain access to weights and tool environments that can be integrated directly into custom agent applications.
  • Researchers can extend the models for new tasks in mathematics or coding assistance using the provided inference implementations.
  • The open license enables community modifications to the tokenizer and chat format for improved role delineation.
  • Further training on the released models becomes feasible for domain-specific agentic capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption could shift more agent development work from closed systems to openly modifiable bases.
  • Community testing of the tool environments might surface edge cases in safety or instruction following not captured by the initial benchmarks.
  • The distillation-plus-RL training recipe could be replicated at smaller scales to study efficiency trade-offs.
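The distillation stage in such a recipe is commonly implemented as a KL divergence between the teacher's and student's next-token distributions, softened by a temperature. A minimal sketch under that assumption (the temperature, toy vocabulary, and random logits are illustrative; the paper does not specify its distillation objective):

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_kl(teacher_logits, student_logits, t=2.0):
    """Mean KL(teacher || student) over positions, with temperature t."""
    p = softmax(teacher_logits, t)  # teacher token distribution
    q = softmax(student_logits, t)  # student token distribution
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(1)
tl = rng.normal(size=(5, 32))  # 5 positions over a 32-token toy vocabulary
sl = rng.normal(size=(5, 32))
loss = distill_kl(tl, sl)      # minimized when the student matches the teacher
```

Sweeping student size against this loss on a fixed teacher is one way to run the smaller-scale efficiency study the bullet above imagines.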

Load-bearing premise

That strong benchmark scores in mathematics, coding, and safety will translate directly to reliable performance in real-world agentic scenarios without additional public validation.

What would settle it

An independent evaluation in which the released models show substantially weaker results on complex, multi-step research browsing or custom function-calling tasks than the reported benchmark levels.

read the original abstract

We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces gpt-oss-120b and gpt-oss-20b as open-weight reasoning models employing an efficient mixture-of-experts transformer architecture. These are trained using large-scale distillation and reinforcement learning, optimized for agentic capabilities including deep research browsing, Python tool use, and support for developer functions via a rendered chat format. The paper claims strong benchmark performance in mathematics, coding, and safety, and announces the release of model weights, inference code, tool environments, and tokenizers under Apache 2.0 license.

Significance. Should the performance claims be verified, the contribution would be notable in advancing open-source alternatives to proprietary reasoning models, particularly in agentic and tool-using scenarios. The open release facilitates community research and applications in safe AI development. The current lack of empirical details limits the ability to gauge the exact impact.

major comments (1)
  1. Abstract: The claim that both models 'achieve strong results on benchmarks ranging from mathematics, coding, and safety' is presented without any accompanying benchmark scores, named tasks, baseline comparisons, ablation studies, or error analyses. This omission is critical because the central assertion of pushing the frontier relies on these unprovided results, rendering the claims unevaluable from the manuscript alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for highlighting the need for greater specificity in our claims. We address the major comment below and commit to a revision that improves evaluability while preserving the concise nature of the model card.

read point-by-point responses
  1. Referee: Abstract: The claim that both models 'achieve strong results on benchmarks ranging from mathematics, coding, and safety' is presented without any accompanying benchmark scores, named tasks, baseline comparisons, ablation studies, or error analyses. This omission is critical because the central assertion of pushing the frontier relies on these unprovided results, rendering the claims unevaluable from the manuscript alone.

    Authors: We agree that the abstract claim would be more informative if accompanied by concrete results. In the revised version we will update the abstract to cite specific benchmark scores (e.g., MATH, GSM8K, HumanEval, and safety suites such as TruthfulQA and ToxiGen), name the primary tasks, and reference relevant open-weight baselines. The body of the model card will be expanded with the corresponding tables, brief ablation notes on the distillation and RL stages, and high-level error analysis. These additions will be kept proportionate to the model-card format while directly addressing the referee’s concern. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; model card contains only descriptive claims

full rationale

This is a model card document that describes two released models, their architecture (MoE transformer), training approach (distillation and RL), agentic features, and high-level performance assertions. No equations, first-principles derivations, fitted parameters presented as predictions, or mathematical results are claimed anywhere in the provided text. The central statements (e.g., 'achieve strong results on benchmarks') are empirical assertions without supporting tables, protocols, or derivations, but this absence does not create circularity because there is no derivation chain that could reduce to its own inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way. The document is therefore self-contained as a release note with no circular structure to analyze.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, fitted parameters, or new axioms are presented; the document is a model card describing released artifacts rather than a derivation from first principles.

pith-pipeline@v0.9.0 · 5921 in / 1113 out tokens · 27254 ms · 2026-05-10T12:18:31.875705+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathAtlas: A Benchmark for Autoformalization in the Wild

    cs.AI 2026-05 accept novelty 8.0

    MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.

  2. Large Language Models Lack Temporal Awareness of Medical Knowledge

    cs.LG 2026-05 unverdicted novelty 8.0

    LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

  3. Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

    cs.AR 2026-05 conditional novelty 8.0

    Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

  4. Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

    cs.CL 2026-05 unverdicted novelty 8.0

    Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.

  5. OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

    cs.LG 2026-05 unverdicted novelty 8.0

    OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...

  6. MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

    cs.LG 2026-05 unverdicted novelty 8.0

    MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

  7. LLM Translation of Compiler Intermediate Representation

    cs.PL 2026-05 unverdicted novelty 8.0

    IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

  8. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  9. InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

    cs.CL 2026-04 unverdicted novelty 8.0

    InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

  10. Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

    cs.CL 2026-04 conditional novelty 8.0

    Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

  11. Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

    cs.DC 2026-04 unverdicted novelty 8.0

    Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

  12. What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation

    cs.CL 2026-05 accept novelty 7.0

    Document-level machine translation followed by segment-level LLM refinement provides the strongest and most stable improvements in literary translation quality, mainly enhancing fluency and style rather than adequacy.

  13. A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations

    cs.CL 2026-05 unverdicted novelty 7.0

    IfcLLM combines relational and graph representations of IFC models with iterative LLM reasoning to deliver 93.3-100% first-attempt accuracy on natural language queries across three test models.

  14. Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

    math.OC 2026-05 conditional novelty 7.0

    Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.

  15. Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators

    cs.CL 2026-05 conditional novelty 7.0

    LLM simulators exhibit near-zero selective response to targeted misconception feedback and behave sycophantically, but SFT and SFS-aligned RL improve this property.

  16. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 7.0

    Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

  17. Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

    cs.AI 2026-05 unverdicted novelty 7.0

    DORA is the first end-to-end agentic benchmark for LLM-based disaster response, covering perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data,...

  18. Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

    cs.CR 2026-05 unverdicted novelty 7.0

    Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.

  19. CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational ov...

  20. AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.

  21. SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

    cs.LG 2026-05 unverdicted novelty 7.0

    SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

  22. Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.

  23. From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

  24. Test-Time Speculation

    cs.CL 2026-05 unverdicted novelty 7.0

    Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.

  25. BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

    cs.CL 2026-05 unverdicted novelty 7.0

    BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63....

  26. What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook

    cs.SE 2026-05 unverdicted novelty 7.0

    AI-only technical discourse on MoltBook is coherent and organized around 12 themes led by security and trust, but it lacks the concrete code, runtime failures, and reproduction steps common in human GitHub discussions.

  27. Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    CMR-EXTR extracts structured data from CMR reports at 99.65% variable-level accuracy using teacher-student LLM distillation and three-principle uncertainty estimation for quality control.

  28. When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

  29. When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

  30. Teaching Language Models to Think in Code

    cs.CL 2026-05 unverdicted novelty 7.0

    ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.

  31. Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

    cs.LG 2026-05 conditional novelty 7.0

    Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.

  32. Evaluating Non-English Developer Support in Machine Learning for Software Engineering

    cs.SE 2026-05 unverdicted novelty 7.0

    Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.

  33. CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

    stat.ML 2026-05 unverdicted novelty 7.0

    CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.

  34. Privacy Without Losing Place: A Paradigm for Private Retrieval in Spatial RAGs

    cs.CR 2026-05 unverdicted novelty 7.0

    PAS encodes locations via relative anchors and bins to deliver roughly 370-400m adversarial error in spatial RAG while retaining over half the baseline retrieval performance and keeping generation quality robust.

  35. Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs

    cs.CR 2026-05 unverdicted novelty 7.0

    Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.

  36. Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

    cs.DC 2026-05 unverdicted novelty 7.0

    Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.

  37. FlowEval: Reference-based Evaluation of Generated User Interfaces

    cs.MA 2026-05 unverdicted novelty 7.0

    FlowEval evaluates generated UIs by measuring how closely their navigation flows match real websites via reference-based similarity metrics and shows strong correlation with human expert judgments.

  38. EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-se...

  39. AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR

    cs.CL 2026-04 unverdicted novelty 7.0

    A new multi-accent long-form call-center dialogue dataset for English ASR evaluation shows substantial performance variation across accents and segmentation methods.

  40. Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    Hierarchical confidence calibration and LoCLIP adaptation improve pseudo-label quality for open-vocabulary object detection, achieving new state-of-the-art results on COCO and LVIS benchmarks.

  41. Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...

  42. Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

    cs.CL 2026-04 unverdicted novelty 7.0

    Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.

  43. Evaluating Patient Safety Risks in Generative AI: Development and Validation of a FMECA Framework for Generated Clinical Content

    cs.CY 2026-04 unverdicted novelty 7.0

    A novel FMECA-based framework was developed and validated for systematic assessment of patient safety risks in LLM-generated clinical discharge summaries, demonstrating moderate-to-substantial inter-rater agreement an...

  44. OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

    cs.CL 2026-04 unverdicted novelty 7.0

    OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...

  45. MambaCSP: Hybrid-Attention State Space Models for Hardware-Efficient Channel State Prediction

    cs.IT 2026-04 unverdicted novelty 7.0

    MambaCSP replaces quadratic-attention LLM backbones with linear-time hybrid SSMs for CSI prediction, delivering 9-12% higher accuracy and up to 3x throughput in MISO-OFDM simulations.

  46. Efficient Agent Evaluation via Diversity-Guided User Simulation

    cs.AI 2026-04 unverdicted novelty 7.0

    DIVERT uses snapshot-based branching and diversity-guided user simulation to discover more agent failures per token while expanding coverage of interaction tasks.

  47. Subject-level Inference for Realistic Text Anonymization Evaluation

    cs.CL 2026-04 unverdicted novelty 7.0

    SPIA benchmark reveals that subject-level inference protection falls to as low as 33% even after masking over 90% of PII spans, with non-target subjects remaining highly exposed under target-focused anonymization.

  48. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  49. Using large language models for embodied planning introduces systematic safety risks

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

  50. Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations

    cs.LG 2026-04 unverdicted novelty 7.0

    Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.

  51. Exploration and Exploitation Errors Are Measurable for Language Model Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    A policy-agnostic metric and controllable 2D grid environments with task DAGs enable measurement of exploration and exploitation errors in language model agents from observed actions.

  52. CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

    cs.SE 2026-04 accept novelty 7.0

    CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.

  53. Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.

  54. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  55. ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.

  56. An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

    cs.AI 2026-04 unverdicted novelty 7.0

    An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...

  57. Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    PPT-Bench measures how LLMs change answers under epistemic, value, authority, and identity pressures at baseline, single-turn, and multi-turn levels, finding separable inconsistency patterns across five models.

  58. How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles

    cs.AI 2026-04 unverdicted novelty 7.0

    A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.

  59. InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

    cs.DC 2026-04 unverdicted novelty 7.0

    InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.

  60. Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 181 Pith papers · 12 internal anchors

  1. [1]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of Advances in Neural Information Processing Systems, 2017

  2. [2]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,

    N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” 2017

  3. [3]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, “GShard: Scaling giant models with conditional computation and automatic sharding,” arXiv preprint arXiv:2006.16668, 2020

  4. [4]

    Glam: Efficient scaling of language models with mixture-of-experts,

    N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, et al., “GLaM: Efficient scaling of language models with mixture-of-experts,” in International Conference on Machine Learning, pp. 5547–5569, PMLR, 2022

  5. [5]

    OCP Microscaling Formats (MX) Specification Version 1.0,

    Open Compute Project, “OCP Microscaling Formats (MX) Specification Version 1.0,” technical report, Open Compute Project, Sept. 2023

  6. [6]

    Root mean square layer normalization,

    B. Zhang and R. Sennrich, “Root mean square layer normalization,” 2019

  7. [7]

    On layer normalization in the transformer architecture,

    R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T.-Y. Liu, “On layer normalization in the transformer architecture,” 2020

  8. [8]

    Language models are unsupervised multitask learners,

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, 2019

  9. [9]

    GLU Variants Improve Transformer

    N. Shazeer, “GLU variants improve transformer,” arXiv preprint arXiv:2002.05202, 2020

  10. [10]

    Generating Long Sequences with Sparse Transformers

    R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,” arXiv preprint arXiv:1904.10509, 2019

  11. [11]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” NeurIPS, 2020

  12. [12]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints,

    J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, “GQA: Training generalized multi-query transformer models from multi-head checkpoints,” 2023

  13. [13]

    Fast Transformer Decoding: One Write-Head is All You Need

    N. Shazeer, “Fast transformer decoding: One write-head is all you need,” arXiv preprint arXiv:1911.02150, 2019

  14. [14]

    Roformer: Enhanced transformer with rotary position embedding,

    J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “RoFormer: Enhanced transformer with rotary position embedding,” Neurocomputing, 2024

  15. [15]

    YaRN: Efficient Context Window Extension of Large Language Models

    B. Peng, J. Quesnelle, H. Fan, and E. Shippole, “YaRN: Efficient context window extension of large language models,” arXiv preprint arXiv:2309.00071, 2023

  16. [16]

    Attention is off by one,

    E. Miller, “Attention is off by one,” 2023. URL: https://www.evanmiller.org/attention-is-off-by-one.html

  17. [17]

    Efficient Streaming Language Models with Attention Sinks

    G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” arXiv preprint arXiv:2309.17453, 2023

  18. [18]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al., “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024

  19. [19]

    Pytorch: An imperative style, high-performance deep learning library,

    A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “PyTorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, vol. 32, 2019

  20. [20]

    Triton: an intermediate language and compiler for tiled neural network computations,

    P. Tillet, H.-T. Kung, and D. Cox, “Triton: an intermediate language and compiler for tiled neural network computations,” in Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 10–19, 2019

  21. [21]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness,

    T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” 2022

  22. [22]

    Introducing SWE-bench Verified,

    N. Chowdhury, J. Aung, C. J. Shern, O. Jaffe, D. Sherburn, G. Starace, E. Mays, R. Dias, M. Aljubeh, M. Glaese, C. E. Jimenez, J. Yang, L. Ho, T. Patwardhan, K. Liu, and A. Madry, “Introducing SWE-bench Verified,” OpenAI. https://openai.com/index/introducing-swe-bench-verifie... Accessed: 2025-08-04

  23. [23]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    S. Yao, N. Shinn, P. Razavi, and K. Narasimhan, “τ-bench: A benchmark for tool-agent-user interaction in real-world domains,” arXiv preprint arXiv:2406.12045, 2024

  24. [24]

    GPQA: A graduate-level google-proof QA benchmark,

    D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman, “GPQA: A graduate-level google-proof QA benchmark,” in COLM, 2024

  25. [25]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” arXiv preprint arXiv:2009.03300, 2020

  26. [26]

    Humanity's Last Exam

    L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al., “Humanity’s last exam,” arXiv preprint arXiv:2501.14249, 2025

  27. [27]

    Introducing SWE-bench Verified,

    N. Chowdhury, J. Aung, C. J. Shern, O. Jaffe, D. Sherburn, G. Starace, E. Mays, R. Dias, M. Aljubeh, M. Glaese, C. E. Jimenez, J. Yang, L. Ho, T. Patwardhan, K. Liu, and A. Madry, “Introducing SWE-bench Verified,” OpenAI, 2024

  28. [28]

    HealthBench: Evaluating Large Language Models Towards Improved Human Health

    R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, et al., “HealthBench: Evaluating large language models towards improved human health,” arXiv preprint arXiv:2505.08775, 2025

  29. [29]

    Deliberative alignment: Reasoning enables safer language models,

    M. Y. Guan, M. Joglekar, E. Wallace, S. Jain, B. Barak, A. Helyar, R. Dias, A. Vallone, H. Ren, J. Wei, H. W. Chung, S. Toyer, J. Heidecke, A. Beutel, and A. Glaese, “Deliberative alignment: Reasoning enables safer language models,” arXiv preprint arXiv:2412.16339, 2024

  30. [30]

    The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel, “The instruction hierarchy: Training LLMs to prioritize privileged instructions,” arXiv preprint arXiv:2404.13208, 2024

  31. [31]

    A strongreject for empty jailbreaks

    A. Souly, Q. Lu, D. Bowen, T. Trinh, E. Hsieh, S. Pandey, P. Abbeel, J. Svegliato, S. Emmons, O. Watkins, et al., “A StrongREJECT for empty jailbreaks,” arXiv preprint arXiv:2402.10260, 2024

  32. [32]

    BBQ: A hand-built bias benchmark for question answering,

    A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. R. Bowman, “BBQ: A hand-built bias benchmark for question answering,” arXiv preprint arXiv:2110.08193, 2021

  33. [33]

    Building an early warning system for LLM-aided biological threat creation,

    T. Patwardhan, K. Liu, T. Markov, N. Chowdhury, D. Leet, N. Cone, C. Maltbie, J. Huizinga, C. Wainwright, S. Jackson, S. Adler, R. Casagrande, and A. Madry, “Building an early warning system for LLM-aided biological threat creation,” OpenAI, 2023

  34. [34]

    LAB-Bench: Measuring capabilities of language models for biology research,

    J. M. Laurent, J. D. Janizek, M. Ruzo, M. M. Hinks, M. J. Hammerling, S. Narayanan, M. Ponnapati, A. D. White, and S. G. Rodriques, “LAB-Bench: Measuring capabilities of language models for biology research,” 2024

  35. [35]

    PaperBench: Evaluating AI’s ability to replicate AI research

    G. Starace, O. Jaffe, D. Sherburn, J. Aung, J. S. Chan, L. Maksin, R. Dias, E. Mays, B. Kinsella, W. Thompson, J. Heidecke, A. Glaese, and T. Patwardhan, “PaperBench: Evaluating AI’s ability to replicate AI research.” https://openai.com/index/paperbench/, 2025