gpt-oss-120b & gpt-oss-20b Model Card
Mixed citation behavior. Most common role is background (41%).
abstract
We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.
citing papers explorer
-
Sumi: Open Uniform Diffusion Language Model from Scratch
Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.
-
TW-LegalBench: Measuring Taiwanese Legal Understanding
TW-LegalBench evaluates 13 LLMs on over 30,000 Taiwanese legal tasks from exams and judgments, showing top models pass lawyer thresholds but struggle with exact statute citations.
-
UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing
UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.
-
RobotValues: Evaluating Household Robots When Human Values Conflict
RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.
-
Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.
-
MathAtlas: A Benchmark for Autoformalization in the Wild
MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.
-
Large Language Models Lack Temporal Awareness of Medical Knowledge
LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
-
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
-
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
-
MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs
MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
-
LLM Translation of Compiler Intermediate Representation
IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
-
Efficient Training on Multiple Consumer GPUs with RoundPipe
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
-
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
-
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
-
Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
-
Evaluating Large Language Models in Scientific Discovery
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
-
Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.
-
OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets
OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.
-
Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale
A 0.6B LM with length-aware attention adjustments achieves competitive in-context retrieval at million-token scale on the MS MARCO, NQ, and LIMIT benchmarks.
-
Measuring the Gap Between Human and LLM Research Ideas
LLM-generated research ideas cluster more around bridge-like opportunities and synthesis methods than the broader distribution seen in human papers.
-
ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving
ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
-
Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale
Introduces a GenAI agent framework for auditing personalization algorithms via synthetic accounts with fixed personas; applied to X after the 2024 election, it shows amplification of toxic and right-leaning content that varies by ideology.
-
SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics
SABER-Math is an automated benchmark for mathematical IR that uses LLM summaries, topic similarities, and preference tournaments on 283K problems to create reranking tasks, showing embedding models outperform baselines but struggle in symbol-heavy areas and that MTEB does not predict math performanc
-
Agentic Abstention: Do Agents Know When to Stop Instead of Act?
LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.
-
CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues
CodeChat-Eval shows LLMs lose 19.2% to 69.2% functional correctness over multi-turn refinement dialogues, with largest drops on logic-level and additive changes.
-
Do Thinking Tokens Help with Safety?
Thinking tokens in reasoning models do not enable safety deliberation; refusal/compliance is strongly predictable from the first token and rarely changes during thinking.
-
Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis
Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.
-
Black-Box Forensics for Conversational LLM Agents
Empirical study reporting 98% base-model attribution accuracy and cross-encoder fingerprinting of unseen system prompts (AUC 0.768 single-conversation, 0.943 with 50 conversations) in black-box LLM agents.
-
BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language
BioMatrix unifies sequences, structures, and language for molecules and proteins inside one decoder-only foundation model via shared discrete tokens and achieves SOTA or competitive results on 77 of 80 downstream tasks.
-
SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design
SIGMA introduces skill-incidence graphs to compose agents from reusable skills, yielding higher average performance and robustness than topology-only baselines on reasoning and coding benchmarks.
-
Hidden Anchors in Multi-Agent LLM Deliberation
Multi-agent LLM deliberation is modeled with recoverable hidden anchors that allow opinions to escape the convex hull of initial beliefs, unlike classical consensus models.
-
LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction
LEDGER provides a corpus of 4,999 annual reports with 31 labeled KPIs and three benchmarks for page-level retrieval, needle-in-haystack lookup, and full KPI extraction from long documents.
-
CodeAlchemy: Synthetic Code Rewriting at Scale
CodeAlchemy generates 850B+ tokens of synthetic code data across 15 languages via five strategies and enables 3B models to reach 83.5% HumanEval while beating 10x larger frontier models on new DevEval and TraceEval benchmarks.
-
BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts
BenSyc is the first benchmark for conversational sycophancy in Bengali, with top LLMs achieving only 61.8 Macro-F1 on binary detection and 61.7 on five-class classification while often generating overly validating responses.
-
Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics
SVR learns a bank of contrastive rubrics from preference data via max-margin boundaries and prompt-conditioned selection, narrowing the gap to human rubrics on RubricBench from 24.1 to 0.3 points.
-
SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models
SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.
-
WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing
WhiFlash introduces token-level cross-paradigm routing between autoregressive and diffusion drafting models, with cache optimizations, to raise acceptance lengths and deliver up to 69.6% throughput gains over EAGLE-3.
-
SentinelRAG: Synthetic Sentinel Knowledge for RAG Database Copyright Protection
SentinelRAG embeds synthetic fictitious knowledge into RAG databases at 0.1% rate to enable reliable watermark detection with p < 10^{-5} and low false positives across tested datasets.
-
ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces
ReasoningFlow represents LLM reasoning traces as DAGs, finding structural similarity across models and that most erroneous steps are unused in final answers.
-
Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?
LLMs achieve up to 78.8% accuracy and r=0.590 correlation mimicking individual SOEP respondents using cumulative microdata, with gains from more information but diminishing returns past the 75% entropy point.
-
ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving
ViBE co-optimizes expert placement with measured GPU performance variability in MoE inference to cut execution-time imbalance, delivering 14% better SLO attainment and up to 45% lower P90 TTFT.
-
Stateful Online Monitoring Catches Distributed Agent Attacks
A clustering-based stateful online monitor detects distributed multi-agent cyberattacks that evade standard per-transcript monitors, catching them 30% earlier in large-scale simulated traffic with low overhead.
-
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
-
Benchmarking Single-Factor Physical Video-to-Audio Generation
FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.
-
How's it going? Reinforcement learning in language models recruits a functional welfare axis
Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.
-
Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting
BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
-
Predicting Causal Effects from Natural Language Queries using Structured Representations
Introduces the Query2Effect benchmark and a two-step structured-representation framework for predicting causal effect sizes from natural language queries, with reported gains from fine-tuning and better out-of-domain generalization.
-
Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation
Battery-Sim-Agent reframes inverse battery parameter estimation as an LLM reasoning task in closed loop with a simulator and outperforms Bayesian optimization baselines on diverse benchmarks.
-
Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models
Kronecker Embeddings replace learned embedding tables with a deterministic byte-level character-position factorization and single projection, reducing parameters over 90% with reported gains in loss and robustness on language modeling tasks.
-
CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
CORE distills contrasts between successful and unsuccessful reasoning traces into compact natural-language insights that enable faster model self-improvement on reasoning tasks with fewer rollouts than parametric or other non-parametric baselines.