GPT-4o System Card
Mixed citation behavior. Most common role is background (54%).
abstract
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.
representative citing papers
Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.
ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.
SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
citing papers explorer
-
UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL
UniQL is a human-verified benchmark providing aligned natural language questions and dialect-specific SQL queries for 16 SQL systems to evaluate cross-dialect generalization.
-
VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.
-
Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
-
From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.
-
CHASM: Unveiling Covert Advertisements on Chinese Social Media
CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.
-
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
-
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
-
DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
-
EgoSound: Benchmarking Sound Understanding in Egocentric Videos
EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
-
VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing
VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
-
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
-
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
-
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
-
LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension
LongEgoRefer is a new benchmark of 1,498 referring expressions in 45-minute average egocentric videos that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.
-
A Cost-Aware, Paired Protocol for Auditing Dynamic Tool Synthesis in Agentic Video Question Answering
Introduces a cost-aware paired protocol with six outcome groups and applies it to Dynamic-SAGE versus SAGE, reporting 7.5-point accuracy gain, 28% fewer tool calls, but 34% higher token use.
-
Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning
P2R decouples perception from reasoning in VLMs via a two-stage process and PRA-GRPO alternating RL training, reporting gains such as 93.2% on V-Star for the 4B model over its Qwen3-VL backbone.
-
EgoGapBench: Benchmarking Egocentric Action Selection in Multi-Agent Scenes
EgoGapBench shows humans reliably select egocentric actions in multi-agent scenes while MLLMs systematically choose other agents' actions, and standard egocentric training data fails to close the gap.
-
(A)I Sees What You Don't: Exploiting New Attack Surfaces in Third-Party Mobile Agents
Identifies Screen Perception and Misused Channel attack surfaces in VLM-powered mobile agents and demonstrates seven attacks enabling arbitrary command execution on five frameworks without privileges.
-
SpheRoPE: Zero-Shot Optimization-Free 360 Panorama Generation with Spherical RoPE
SpheRoPE modifies rotary position embeddings in diffusion transformers to enforce spherical topology for zero-shot 360 panorama generation across multiple backbones.
-
No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs
Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.
-
Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments
EQMs, sixty LLM-scored reasoning patterns, predict forecast accuracy at both item and person levels and outperform prior text-analysis methods in a large pre-registered tournament dataset.
-
OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning
OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.
-
MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs
MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.
-
A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents
A diagnostic framework called EPC reveals that proprietary LLM evaluators can exhibit large preference shifts between versions, as evidenced by a GPT-4o May-to-June drift that inverted study conclusions, rendering single-snapshot evaluations unreliable.
-
GigaSpeechBench: A Real-World Multilingual Speech-to-Text Benchmark
GigaSpeechBench is a new 680-hour in-the-wild multilingual ASR/AST benchmark with five modules for low-resource languages, Chinese dialects, English accents, domain terminology, and age-varied speech, showing model performance drops.
-
HumanMoveVQA: Can Video MLLMs reason about human movement in videos?
HumanMoveVQA is a new benchmark that generates 10K+ QA pairs from 3D-lifted video tracks to evaluate video MLLMs on global human trajectory and orientation reasoning.
-
Reflect-R1: Evidence-Driven Reflection for Self-Correction in Long Video Understanding
Reflect-R1 introduces the first evidence-driven self-correction framework for long video understanding using a three-stage pipeline, stage-decoupled RL via SD-GRPO, and a 120K dataset to achieve SOTA on VideoMME and LongVideoBench.
-
PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing
PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.
-
CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes
CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.
-
Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
-
Agentic AI Enhances Physician Trust in Clinical Decision Making
Empirical comparison shows physicians have higher cognitive and behavioral trust in agentic AI versus baselines on clinical cases, with noted over-reliance risk.
-
FARM: Find Anything using Relational Spatial Memory
FARM creates an open-vocabulary relational spatial memory that improves object retrieval recall by 164-224% over prior methods on 44k language queries across 67 scenes while running at 5-10 Hz.
-
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
-
AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages
AfriSUD supplies new SUD-annotated dependency treebanks for nine Sub-Saharan African languages and demonstrates that existing models exhibit clear limitations on their syntax.
-
Tail-Aware Adaptive-k: Query-Adaptive Context Selection for Retrieval-Augmented Generation
TAA-k finds query-adaptive retrieval cutoffs by first using knee detection to isolate a candidate window around the relevance-to-noise transition, then applying EVT goodness-of-fit tests inside that window.
-
Alignment Defends LLMs from Property Inference Attacks
Alignment defenses adapted from DPO and GRPO mitigate property inference attacks on LLMs while preserving utility.
-
From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
AVLLMs route audio-visual information sequentially in video tasks and via parallel streams for interleaved items, allowing early token discard with little performance loss across models and scales.
-
H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions
H2HMem is a multimodal memory benchmark evaluating LLM agents on recall, reasoning, and application in dyadic and multi-party human-human conversations with phenomena such as anaphora and deixis.
-
CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning
CapRL++ applies reinforcement learning with verifiable rewards to dense image and video captioning by scoring captions via the accuracy of a vision-free LLM answering MCQs from the caption alone.
-
Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?
Reasoning VLMs show lower robustness to semantic visual distractions than to perceptual corruptions, with distractions entering their reasoning chains and causing errors.
-
Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics
SVR learns a bank of contrastive rubrics from preference data via max-margin boundaries and prompt-conditioned selection, narrowing the gap to human rubrics on RubricBench from 24.1 to 0.3 points.
-
Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning
Skill-3D improves MLLM agent tool use in 3D spatial reasoning from 39% to 78% on VSI-Bench by evolving reusable scene-aware skills from aggregated trajectories stored in a scene memory.
-
SkelDPO: A Skeleton-Guided Direct Preference Optimization Framework for Efficient Code Generation
SkelDPO improves code generation efficiency by 2-7% over prior DPO methods via joint preference losses on full code and efficiency-critical skeletons.
-
Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception
VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.
-
DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments
DisasterBench is a new multi-stage multimodal reasoning benchmark for UAV disaster response with 14 scenes and 9 tasks; the accompanying 2B DisasterVL model outperforms open-source MLLMs and approaches GPT-4o efficiency.
-
Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment
RED-Aes learns aesthetic changes from edit-induced image pairs and a new RED-20k dataset via three-stage relative ranking training, claiming SOTA generalization over absolute MOS regression.
-
PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models
PlanBench-V is a new benchmark and dataset for evaluating VLMs on spatial planning map interpretation via a four-stage framework of Perception, Reasoning, Association, and Implementation.
-
Multilingual Coreference Resolution via Cycle-Consistent Machine Translation
A cycle-consistent MT pipeline generates and similarity-weights training data for coreference resolution, producing gains on four low-resource languages and enabling the task where no corpora existed.