OpenAI GPT-5 System Card
This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries.

This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health.

All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.
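The routing behavior the abstract describes can be sketched in a few lines. This is a hedged illustration only: the feature names, the complexity score, the 0.7 threshold, and the routing policy are assumptions for exposition, not OpenAI's actual implementation (which the card says is continuously trained on real signals such as model switches, preference rates, and measured correctness).

```python
from dataclasses import dataclass

@dataclass
class Request:
    """Illustrative per-request signals the card says the router considers."""
    complexity: float          # estimated difficulty in [0, 1] (hypothetical score)
    needs_tools: bool          # whether the query likely requires tool calls
    explicit_intent: bool      # e.g. the prompt says "think hard about this"
    usage_limit_reached: bool  # per-user quota state

def route(req: Request, threshold: float = 0.7) -> str:
    """Pick a model tier for one request (hypothetical hand-written policy;
    the real router is a learned model, not fixed rules)."""
    # Explicit intent, high estimated complexity, or tool needs escalate
    # the request to the deeper reasoning model.
    wants_thinking = (
        req.explicit_intent
        or req.needs_tools
        or req.complexity >= threshold
    )
    tier = "gpt-5-thinking" if wants_thinking else "gpt-5-main"
    # Once usage limits are hit, a mini version of each model takes over.
    if req.usage_limit_reached:
        tier += "-mini"
    return tier

print(route(Request(0.9, False, False, False)))  # -> gpt-5-thinking
print(route(Request(0.2, False, False, True)))   # -> gpt-5-main-mini
```

A learned router would replace the hard-coded rules above with a classifier trained on the feedback signals the card lists, but the input/output shape -- request features in, model tier out -- is the same.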
Forward citations
Cited by 60 Pith papers
-
AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
-
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
-
Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but inco...
-
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...
-
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due...
-
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
MMRareBench is the first rare-disease benchmark for multimodal and multi-image clinical evaluation of MLLMs, revealing fragmented capabilities, low treatment-planning scores, and medical models underperforming general...
-
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.
-
MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models
MultiEmo-Bench supplies 10,344 images with aggregated multi-label emotion votes from 20 annotators each to evaluate MLLMs on dominant emotion and full distribution prediction.
-
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
-
Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
-
Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
Omnimodal LLMs encode premise-perception mismatches in hidden states yet almost never reject false textual claims, exposing a representation-action gap that is modality-asymmetric and prompt-resistant.
-
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
-
Scaling Laws for Mixture Pretraining Under Data Constraints
Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.
-
From Web to Pixels: Bringing Agentic Search into Visual Perception
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
-
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
-
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
-
PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments
PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
-
Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning
GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...
-
Count Anything at Any Granularity
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...
-
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
-
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
-
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
-
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
PlantMarkerBench supplies 5,550 literature sentences annotated for plant marker gene evidence validity and type across Arabidopsis, maize, rice and tomato, showing frontier LLMs handle direct expression evidence but s...
-
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
PlantMarkerBench is a new multi-species benchmark with 5,550 evidence instances for evaluating language models on literature-grounded plant marker gene reasoning across expression, localization, function, indirect, an...
-
The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space
MLLMs scoring 70-83% on Cartesian visual tasks drop to 31-39% on logically equivalent polar versions, exposing reliance on grid discretization shortcuts instead of topology-invariant reasoning.
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...
-
Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows
EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-speciali...
-
UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning
UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.
-
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.
-
LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification
LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.
-
Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
-
SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding
SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.
-
Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs
ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.
-
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
-
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
-
Do Joint Audio-Video Generation Models Understand Physics?
Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
-
Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Stateful Agent Backdoor
A stateful backdoor for LLM agents, modeled as a Mealy machine with a decomposition framework, enables incremental malicious actions across sessions and achieves 80-95% attack success rate on four models.
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
-
Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models
ContactPrompt uses part-wise vertex grids and multi-stage part-conditioned reasoning in MLLMs to achieve training-free dense hand contact estimation that outperforms prior supervised methods.
-
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
-
Pro²Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks
Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action u...
-
Segmenting Human-LLM Co-authored Text via Change Point Detection
Adapts change point detection to segment human-LLM co-authored text using weighted and generalized algorithms with minimax optimality and strong empirical results against baselines.
-
LLM-Foraging: Large Language Models for Decentralized Swarm Robot Foraging
LLM-Foraging uses off-the-shelf LLMs for decentralized tactical decisions in CPFA-based swarm foraging, collecting more resources than GA-tuned baselines across 36 varied configurations while showing greater consistency.
-
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.
-
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
OralMLLM-Bench is a new benchmark with 27 tasks in four cognitive categories that evaluates six MLLMs on dental radiographs and shows clear performance gaps versus clinicians.
-
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents
EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-se...
-
E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems
E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-bas...
-
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?
Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.