OpenAI GPT-5 System Card
Mixed citation behavior. Most common role is background (51%).
abstract
This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.
citing papers explorer
-
AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
-
TW-LegalBench: Measuring Taiwanese Legal Understanding
TW-LegalBench evaluates 13 LLMs on over 30,000 Taiwanese legal tasks from exams and judgments, showing top models pass lawyer thresholds but struggle with exact statute citations.
-
Vision-language models for chest radiography do not always need the image
A causal audit with image interventions shows text-only models reach within 5.7 accuracy points of top multimodal VLMs on chest radiography, with some large multimodal models statistically indistinguishable from small text-only baselines.
-
Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio
MetaSyn benchmark shows LLM pipelines recover at most 52.7% of ground-truth included studies due to screening failures on PI/ECO eligibility, despite 90.9% retrieval recall at K=200.
-
EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries
EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.
-
Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation
Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.
-
RobotValues: Evaluating Household Robots When Human Values Conflict
RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.
-
Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?
Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.
-
AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis
AMNESIA is a benchmark suite of 70,560 medical QA pairs that evaluates unlearning methods and shows that patient-level unlearning erodes disease-shared knowledge.
-
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
-
Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
-
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
-
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.
-
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
-
Seek to Segment: Active Perception for Panoramic Referring Segmentation
Introduces APRS task and PanoSeeker agent using VLM plus EgoSphere memory for active 360° search and segmentation, outperforming baselines on a new benchmark.
-
DisciplineGen-1M: A Large-Scale Dataset for Multidisciplinary Visual Generation and Editing
DisciplineGen-1M is a million-scale multidisciplinary dataset for text-to-image generation and editing, paired with a discipline-informed model that improves results on discipline-specific benchmarks.
-
AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models
AnyGroundBench is a domain-adaptation benchmark for spatio-temporal video grounding across animal, industry, sports, surgery, and public security domains that finds 15 state-of-the-art VLMs fail in zero-shot and ICL settings.
-
LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension
LongEgoRefer is a new benchmark of 1,498 referring expressions in egocentric videos averaging 45 minutes in length that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.
-
OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets
OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.
-
LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models
LongVQUBench introduces a hierarchical benchmark with local, cross-event, and global quality understanding tasks plus needle distortion QA to measure LVLMs' long-term video quality reasoning.
-
Think While You Map: Asynchronous Vision-Language Agents for Incremental 3D Scene Graphs
An asynchronous architecture decouples incremental voxel-based mapping from VLM-based semantic enrichment to produce queryable open-vocabulary 3D scene graphs that match or exceed prior methods on segmentation and grounding benchmarks.
-
OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning
OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.
-
CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph
CORTEX uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.
-
MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs
MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.
-
OP3DSG: Open-Vocabulary Part-Aware 3D Scene Graph Generation for Real-World Environments
OP3DSG generates unified part-aware open-vocabulary 3D scene graphs via knowledge-guided detection, 3D fusion, and LLM-refined prior graphs, with a new UniGraph3D benchmark showing SOTA results for robotics tasks.
-
Metadata, Structure, or Strategy? A Decomposition of RAG Context Enrichment
Controlled experiments across six benchmarks and four models show RAG context enrichment with metadata, structure, or strategies mostly lowers accuracy, with model-context alignment as the determining factor.
-
An AI agent for treatment reasoning over a biomedical tool universe
ATHENA-R1 is an RL-trained agent using 212 biomedical tools that achieves 94.7% accuracy on drug reasoning and 82.9% on treatment reasoning tasks, outperforming GPT-5 by 17.8 and 10.7 points respectively.
-
See & Sniff: Learning Visuo-Olfactory Representations
Introduces SmellNet-V synthetic visuo-olfactory dataset and See & Sniff self-supervised framework that learns aligned representations and produces smell saliency maps.
-
MKG-RAG-Bench: Benchmarking Retrieval in Multimodal Knowledge Graph-Augmented Generation
MKG-RAG-Bench is a cross-domain benchmark for retrieval in multimodal knowledge graph-augmented generation, constructed via LLM curation from two MKGs with aligned QA datasets.
-
Trustworthy Image Authentication using Forensic Knowledge Graphs
Forensic Knowledge Graphs integrate forensic traces, causal dependencies, and scene links via a new authentication network and Iterative Context Refinement to outperform standard detectors and VLMs on detection, localization, and justification.
-
The Topology of Ill-Posed Questions: Persistent Homology for Detection and Steering in LLMs
Zero-dimensional persistent homology on transformer layer hidden states yields three descriptors per layer whose concatenation improves ill-posedness classification and enables topology-conditioned activation steering across three LLMs.
-
RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis
RoboGaze presents a structured multi-agent VLM pipeline and robotics-specific error taxonomy that improves video evaluation metrics by up to 43 F1 points over zero-shot baselines on a 382-clip dataset.
-
Agent-Assisted Side-Channel Attacks on Non-Prefix KV Cache in RAG
SpliceLeak is the first end-to-end side-channel attack on non-prefix KV cache in RAG, using Step-Wave timing leaks to fingerprint private prompt lengths and extract tokens with up to 100% success using 63 requests per token on vLLM+LMCache.
-
CheXpercept: A Benchmark for Evaluating Expert-Level Lesion Perception in Chest X-rays
CheXpercept is a sequential multi-level perception benchmark showing VLMs perform adequately only on coarse lesion detection in chest X-rays while degrading sharply on finer tasks, with medical VLMs offering no advantage over general models.
-
A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models
A unified benchmark of 24 black-box UE methods for LLMs finds no universal winner but favors methods that reason over answer candidates and hybrid combinations of signals.
-
Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play
MAFP applies fictitious play to LLM multi-agent systems to resolve stance entanglement in competitive decision-making, outperforming single-round and multi-round baselines on tournament strength and robustness.
-
LADBench: A Benchmark for Logical Fault Detection in Images
LADBench is a new benchmark showing leading VLMs reach at most 70.11% accuracy on logical fault detection even after explicit hints.
-
Enhancing Pathological VLMs with Cross-scale Reasoning
Presents Scale-VQA benchmark for cross-scale pathology VQA and RL-trained ScaleReasoner-R1 model that reaches SOTA on the new benchmark plus existing single-scale tasks.
-
PDAGENT-BENCH: Characterizing, Grounding, and Architecting LLM Agents for VLSI Physical Design
PDAGENT-BENCH is a new benchmark suite with 353 curated problems and an agentic workflow framework for evaluating LLM/VLM agents across five capability dimensions in VLSI physical design.
-
Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments
GauntletBench reveals frontier AI agents achieve 19.1% success on 100 tasks in video editing, 3D modeling, and similar tools versus over 80% for humans, exposing limitations in overlooked capabilities.
-
One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders
FORGE benchmark shows search-augmented LLMs recommend fake products at rates up to 27% from one polluted page and 73.8% from top-3 replacement across 12 models and 225 products.
-
Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents
Introduces a stakeholder-centric benchmark showing current web agents fail all tested prompt injection objectives, with failures falling into stealthy parasitism, misaligned disruption, or compounded failure modes.
-
ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm
Proposes the COM-as-Action paradigm for deterministic software manipulation and introduces the ComCADBench benchmark and the ComActor agent, which achieves SOTA performance over GUI baselines.
-
VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving
VLADriveBench combines observational metrics and CoT intervention protocols to evaluate the relevance and causality of reasoning in vision-language-action models for autonomous driving, revealing divergent model behaviors.
-
FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents
FORT synthesizes shortcut-resistant search tasks by controlling four identified shortcut risks across entity selection, graph construction, question formulation, and refinement, producing training data that yields agents with longer search trajectories and top performance among open-source models on
-
GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs
Presents GraphInfer-Bench to demonstrate that no evaluated LLM-based method family closes the performance gap on graph inference tasks requiring multi-node reasoning, with plain GNNs matching or exceeding them.
-
Where You Inject Diversity Matters: A Unified Framework for Diverse Generation
A new framework for diverse LLM generation via diversity source characterization and transmission scoring, with specification-level injection outperforming test-time baselines across five tasks and four models.