AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
super hub Mixed citations
OpenAI GPT-5 System Card
Mixed citation behavior. Most common role is background (51%).
abstract
This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits ar
co-cited works
representative citing papers
A causal audit with image interventions shows text-only models reach within 5.7 accuracy points of top multimodal VLMs on chest radiography, with some large multimodal models statistically indistinguishable from small text-only baselines.
MetaSyn benchmark shows LLM pipelines recover at most 52.7% of ground-truth included studies due to screening failures on PI/ECO eligibility, despite 90.9% retrieval recall at K=200.
EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.
Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.
RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.
Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.
AMNESIA is a benchmark suite of 70,560 medical QA pairs that evaluates unlearning methods and shows that patient-level unlearning erodes disease-shared knowledge.
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
Introduces APRS task and PanoSeeker agent using VLM plus EgoSphere memory for active 360° search and segmentation, outperforming baselines on a new benchmark.
DisciplineGen-1M is a million-scale multidisciplinary dataset for text-to-image generation and editing, paired with a discipline-informed model that improves results on discipline-specific benchmarks.
AnyGroundBench is a domain-adaptation benchmark for spatio-temporal video grounding across animal, industry, sports, surgery, and public security domains that finds 15 state-of-the-art VLMs fail in zero-shot and ICL settings.
LongEgoRefer is a new benchmark of 1,498 referring expressions in 45-minute average egocentric videos that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.
OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.
LongVQUBench introduces a hierarchical benchmark with local, cross-event, and global quality understanding tasks plus needle distortion QA to measure LVLMs' long-term video quality reasoning.
An asynchronous architecture decouples incremental voxel-based mapping from VLM-based semantic enrichment to produce queryable open-vocabulary 3D scene graphs that match or exceed prior methods on segmentation and grounding benchmarks.
OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.
citing papers explorer
-
Vision-language models for chest radiography do not always need the image
A causal audit with image interventions shows text-only models reach within 5.7 accuracy points of top multimodal VLMs on chest radiography, with some large multimodal models statistically indistinguishable from small text-only baselines.
-
Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?
Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
-
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
-
Seek to Segment: Active Perception for Panoramic Referring Segmentation
Introduces APRS task and PanoSeeker agent using VLM plus EgoSphere memory for active 360° search and segmentation, outperforming baselines on a new benchmark.
-
DisciplineGen-1M: A Large-Scale Dataset for Multidisciplinary Visual Generation and Editing
DisciplineGen-1M is a million-scale multidisciplinary dataset for text-to-image generation and editing, paired with a discipline-informed model that improves results on discipline-specific benchmarks.
-
AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models
AnyGroundBench is a domain-adaptation benchmark for spatio-temporal video grounding across animal, industry, sports, surgery, and public security domains that finds 15 state-of-the-art VLMs fail in zero-shot and ICL settings.
-
LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension
LongEgoRefer is a new benchmark of 1,498 referring expressions in 45-minute average egocentric videos that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.
-
LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models
LongVQUBench introduces a hierarchical benchmark with local, cross-event, and global quality understanding tasks plus needle distortion QA to measure LVLMs' long-term video quality reasoning.
-
Think While You Map: Asynchronous Vision-Language Agents for Incremental 3D Scene Graphs
An asynchronous architecture decouples incremental voxel-based mapping from VLM-based semantic enrichment to produce queryable open-vocabulary 3D scene graphs that match or exceed prior methods on segmentation and grounding benchmarks.
-
OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning
OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.
-
MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs
MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.
-
OP3DSG: Open-Vocabulary Part-Aware 3D Scene Graph Generation for Real-World Environments
OP3DSG generates unified part-aware open-vocabulary 3D scene graphs via knowledge-guided detection, 3D fusion, and LLM-refined prior graphs, with a new UniGraph3D benchmark showing SOTA results for robotics tasks.
-
LADBench: A Benchmark for Logical Fault Detection in Images
LADBench is a new benchmark showing leading VLMs reach at most 70.11% accuracy on logical fault detection even after explicit hints.
-
Enhancing Pathological VLMs with Cross-scale Reasoning
Presents Scale-VQA benchmark for cross-scale pathology VQA and RL-trained ScaleReasoner-R1 model that reaches SOTA on the new benchmark plus existing single-scale tasks.
-
VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving
VLADriveBench combines observational metrics and CoT intervention protocols to evaluate the relevance and causality of reasoning in vision-language-action models for autonomous driving, revealing divergent model behaviors.
-
Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning
Rea2Seg turns image segmentation into candidate mask discovery from MLLM attention followed by MLLM-based comparative scoring and selection, plus a new multi-dimensional reasoning benchmark ReasonSeg-SGDR.
-
NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis
NutriMLLM models fine-tuned on 1.1 million synthetic food image-nutrient triplets from population dietary recalls achieve near-complete coverage and competitive accuracy on real food images for comprehensive micronutrient estimation compared to proprietary MLLMs.
-
Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?
Reasoning VLMs show lower robustness to semantic visual distractions than to perceptual corruptions, with distractions entering their reasoning chains and causing errors.
-
TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding
TVI-CoT introduces learnable control tokens <THINK>, <LOOK>, <ANSWER> that let multimodal LLMs interleave textual reasoning with dynamic visual feature access, reporting gains of 3.4-6.1% on eight benchmarks over prior CoT baselines.
-
Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems
Sci-Rho is a dynamic multilingual visually-grounded symbolic benchmark for STEM problems that reveals robustness gaps in current VLMs between average and worst-case performance.
-
Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment
RED-Aes learns aesthetic changes from edit-induced image pairs and a new RED-20k dataset via three-stage relative ranking training, claiming SOTA generalization over absolute MOS regression.
-
Benchmarking Visual State Tracking in Multimodal Video Understanding
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
-
Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
Moment-Video benchmark shows top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.
-
X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding
X-Stream benchmark shows SOTA MLLMs score ~50% on concurrent multi-stream tasks and lack proactive ability, using a dual-verification pipeline to avoid single-stream bias.
-
ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats
ChartArena is a new benchmark dataset and evaluation protocol for chart parsing by MLLMs that covers numeric and diagrammatic charts in multiple languages and real-world visual conditions.
-
MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue
MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.
-
Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
Frontier VLMs overconfidently answer spatial questions under occlusion (~30% accuracy) and perspective ambiguity (<10% accuracy) instead of abstaining, and often fail to select helpful additional views.
-
Why Far Looks Up: Probing Spatial Representation in Vision-Language Models
VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.
-
CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations
CardioLens is a leakage-resistant CMR testbed of 473k slices and 13k QA pairs showing current MLLMs exhibit a large clinical reality gap with category-collapse failures on real workflows.
-
Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models
SD-MIA is a black-box membership inference attack that detects pre-training data in diffusion models via cross-modal perturbations on images and textual instructions.
-
DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving
DriveSpatial benchmark shows the strongest of 15 VLMs trails humans by 28.4 points on spatiotemporal tasks, with cognitive scene construction as the primary weakness.
-
VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding
VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.
-
Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence
FundusGround is a new benchmark with 10,719 fundus images, 15,595 ETDRS-grid localized lesions, and 72,706 VQA questions to support clinically interpretable ophthalmic visual question answering.
-
JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
JMed48k is a new benchmark of Japanese healthcare licensing exams used to evaluate 21 VLMs, with a paired image-removal audit revealing large differences in how models and professions benefit from visual content.
-
ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing most models lag human baselines especially in transformation and configuration.
-
SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
SetCon achieves state-of-the-art open-ended referring segmentation by using LVLM-generated set-level concepts for joint mask decoding, with gains increasing for multi-target cases on image and video benchmarks.
-
Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models
SP-CoR is a multimodal LLM framework using dynamics-aware sampling, spectral-physics view fusion, and prompt distillation that outperforms baselines on the new CoopSR benchmark and EgoTeam dataset for multi-robot cooperative spatial reasoning.
-
UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings
The paper presents UniPPTBench and UniPPTEval, a unified benchmark and scenario-aware evaluation framework for presentation generation from vague prompts, long documents, multimodal documents, and multi-source inputs.
-
Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
VLMs fail to detect image swaps during self-reflective reasoning with accuracy drops up to 60%, revealing that self-generated reflections do not trigger genuine visual re-examination.
-
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.
-
MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models
MultiEmo-Bench supplies 10,344 images with aggregated multi-label emotion votes from 20 annotators each to evaluate MLLMs on dominant emotion and full distribution prediction.
-
Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
-
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
-
From Web to Pixels: Bringing Agentic Search into Visual Perception
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
-
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
-
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
-
Count Anything at Any Granularity
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for improved accuracy.