Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.
super hub Mixed citations
GPT-4o System Card
Mixed citation behavior. Most common role is background (53%).
abstract
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while
authors
co-cited works
representative citing papers
VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.
CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.
SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
The paper delivers the first theoretical analysis and practical zeroth-order framework for algorithmic recourse under in-context learning for tabular prediction.
citing papers explorer
-
PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs
PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.
-
Retrieval Augmented Conversational Recommendation with Reinforcement Learning
RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.
-
MisEdu-RAG: A Misconception-Aware Dual-Hypergraph RAG for Novice Math Teachers
MisEdu-RAG builds concept and instance hypergraphs for two-stage retrieval of pedagogical knowledge and student errors, improving feedback quality on the MisstepMath benchmark by 10.95% token-F1 and up to 15.3% on response dimensions.
-
Mixture-of-Experts Knowledge Graph Retrieval-Augmented Generation for Multi-Agent LLM-based Recommendation
MixRAGRec is a multi-agent KG-RAG framework with an MoE retrieval agent for query-specific granularity, a knowledge alignment agent, and a contrastive recommendation agent trained jointly via MMAPO.
-
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
-
Filling the Gaps: Selective Knowledge Augmentation for LLM Recommenders
KnowSA_CKP uses comparative knowledge probing to selectively augment LLM prompts for items with knowledge gaps, improving recommendation accuracy and context efficiency.
-
Toward Robust GraphRAG: Mitigating Retrieval Drift and Hallucination from Imperfect Knowledge Graphs
CS-RAG is a GraphRAG framework that plans queries as ordered atomic constraints, uses anchor-relation aware retrieval, applies sufficiency checks, and falls back to text recovery to reduce drift and hallucination from imperfect KGs.
-
Traditional statistical representations outperform generative AI in identifying expert peer reviewers
TF-IDF identifies labeled experts in the top 25 recommendations 79.5% of the time versus 51.5% for GPT-4o mini on an astronomy observatory dataset.
-
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
-
SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs
SUMMIR is a multimetric ranking model that orders LLM-generated sports insights by importance while incorporating hallucination detection to improve factual reliability across cricket, soccer, basketball, and baseball articles.
-
Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation
HaNoRec dynamically weights harder preference samples and applies Gaussian perturbations to output distributions to improve multimodal LLM performance on sequential recommendation tasks.