GPT-4o System Card

GPT-4o System Card · 2024 · cs.CL · arXiv 2410.21276

Mixed citation behavior. Most common role is background (53%).

811 Pith papers citing it

Background 53% of classified citations

abstract

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

citation-role summary

background 97 · baseline 51 · method 23 · dataset 3

citation-polarity summary

background 93 · baseline 51 · use method 22 · unclear 4 · use dataset 3 · support 1
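
The 53% background share quoted near the top of this page can be checked against the two summary lines above. The following is a minimal sketch in Python, assuming the counts are copied by hand from this page and that a share is simply a label's count divided by the total of classified citations; the page does not state its formula, and the names `role_counts`, `polarity_counts`, and `shares` are illustrative only.

```python
# Counts copied by hand from the citation-role and citation-polarity summaries.
role_counts = {"background": 97, "baseline": 51, "method": 23, "dataset": 3}
polarity_counts = {
    "background": 93,
    "baseline": 51,
    "use method": 22,
    "unclear": 4,
    "use dataset": 3,
    "support": 1,
}


def shares(counts):
    """Return each label's share of the total as a percentage (one decimal)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

# Print the total and the background share for each summary.
print(sum(role_counts.values()), role_shares["background"])
print(sum(polarity_counts.values()), polarity_shares["background"])
```

Under these assumptions both summaries total 174 classified citations. Background comes to 55.7% of the role counts and 53.4% of the polarity counts, so the headline 53% appears to match the polarity tally rather than the role tally.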

representative citing papers

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

CHASM: Unveiling Covert Advertisements on Chinese Social Media

cs.LG · 2026-04-22 · unverdicted · novelty 8.0

CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

ReConText3D: Replay-based Continual Text-to-3D Generation

cs.CV · 2026-04-15 · conditional · novelty 8.0

ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

cs.CL · 2025-12-08 · accept · novelty 8.0

SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

A diagnostic framework called EPC reveals that proprietary LLM evaluators can exhibit large preference shifts between versions, as evidenced by a GPT-4o May-to-June drift that inverted study conclusions, rendering single-snapshot evaluations unreliable.

GigaSpeechBench: A Real-World Multilingual Speech-to-Text Benchmark

eess.AS · 2026-06-27 · unverdicted · novelty 7.0

GigaSpeechBench is a new 680-hour in-the-wild multilingual ASR/AST benchmark with five modules for low-resource languages, Chinese dialects, English accents, domain terminology, and age-varied speech, showing model performance drops.

HumanMoveVQA: Can Video MLLMs reason about human movement in videos?

cs.CV · 2026-06-26 · unverdicted · novelty 7.0 · 2 refs

HumanMoveVQA is a new benchmark that generates 10K+ QA pairs from 3D-lifted video tracks to evaluate video MLLMs on global human trajectory and orientation reasoning.

PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.

citing papers explorer

Showing 50 of 811 citing papers.

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration? cs.CV · 2026-05-31 · accept · none · ref 50 · internal anchor
Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.
VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents cs.CV · 2026-05-28 · unverdicted · none · ref 18 · internal anchor
VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.
Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation cs.CR · 2026-05-11 · unverdicted · none · ref 60 · internal anchor
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation cs.SE · 2026-04-30 · unverdicted · none · ref 28 · internal anchor
MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.
CHASM: Unveiling Covert Advertisements on Chinese Social Media cs.LG · 2026-04-22 · unverdicted · none · ref 20 · internal anchor
CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models cs.SD · 2026-04-21 · unverdicted · none · ref 14 · internal anchor
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations cs.CV · 2026-04-20 · unverdicted · none · ref 15 · internal anchor
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents? cs.CR · 2026-04-16 · unverdicted · none · ref 49 · internal anchor
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
ReConText3D: Replay-based Continual Text-to-3D Generation cs.CV · 2026-04-15 · conditional · none · ref 11 · internal anchor
ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark cs.CV · 2026-04-12 · unverdicted · none · ref 11 · 2 links · internal anchor
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues cs.AI · 2026-04-09 · unverdicted · none · ref 11 · internal anchor
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
EgoSound: Benchmarking Sound Understanding in Egocentric Videos cs.CV · 2026-02-15 · unverdicted · none · ref 17 · internal anchor
EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing cs.CV · 2026-02-04 · unverdicted · none · ref 17 · internal anchor
VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents cs.CL · 2025-12-08 · accept · none · ref 30 · internal anchor
SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark cs.AI · 2025-09-30 · unverdicted · none · ref 19 · internal anchor
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
Flow-GRPO: Training Flow Matching Models via Online RL cs.CV · 2025-05-08 · unverdicted · none · ref 18 · internal anchor
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 90 · internal anchor
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
LiveBench: A Challenging, Contamination-Limited LLM Benchmark cs.CL · 2024-06-27 · unverdicted · none · ref 21 · internal anchor
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning cs.CV · 2026-06-29 · unverdicted · none · ref 15 · internal anchor
OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.
MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs cs.CV · 2026-06-29 · unverdicted · none · ref 20 · internal anchor
MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.
A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents cs.LG · 2026-06-29 · unverdicted · none · ref 2 · internal anchor
A diagnostic framework called EPC reveals that proprietary LLM evaluators can exhibit large preference shifts between versions, as evidenced by a GPT-4o May-to-June drift that inverted study conclusions, rendering single-snapshot evaluations unreliable.
GigaSpeechBench: A Real-World Multilingual Speech-to-Text Benchmark eess.AS · 2026-06-27 · unverdicted · none · ref 1 · internal anchor
GigaSpeechBench is a new 680-hour in-the-wild multilingual ASR/AST benchmark with five modules for low-resource languages, Chinese dialects, English accents, domain terminology, and age-varied speech, showing model performance drops.
HumanMoveVQA: Can Video MLLMs reason about human movement in videos? cs.CV · 2026-06-26 · unverdicted · none · ref 1 · 2 links · internal anchor
HumanMoveVQA is a new benchmark that generates 10K+ QA pairs from 3D-lifted video tracks to evaluate video MLLMs on global human trajectory and orientation reasoning.
PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing cs.CV · 2026-06-25 · unverdicted · none · ref 22 · internal anchor
PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.
CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes cs.CR · 2026-06-24 · unverdicted · none · ref 38 · internal anchor
CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.
SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation cs.CV · 2026-05-31 · unverdicted · none · ref 22 · internal anchor
SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects cs.CL · 2026-05-31 · unverdicted · none · ref 57 · internal anchor
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
DeepLatent: Think with Images via Parallel Latent Visual Reasoning cs.CV · 2026-05-30 · unverdicted · none · ref 115 · internal anchor
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning cs.CL · 2026-05-30 · unverdicted · none · ref 25 · internal anchor
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
Algorithmic Recourse of In-Context Learning for Tabular Data cs.LG · 2026-05-29 · unverdicted · none · ref 21 · internal anchor
The paper delivers the first theoretical analysis and practical zeroth-order framework for algorithmic recourse under in-context learning for tabular prediction.
PrivacyPeek: Auditing What LLM-Based Agents Acquire, Not Just What They Say cs.CR · 2026-05-29 · unverdicted · none · ref 32 · internal anchor
PrivacyPeek is a benchmark with 1,182 cases across 7 acquisition behaviors and 16 domains that evaluates acquisition-stage privacy leakage in LLM agents, finding it widespread with limited prompt mitigation.
StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning cs.CV · 2026-05-29 · unverdicted · none · ref 36 · internal anchor
StemBind benchmark diagnoses MLLM failures in abstract visual reasoning by separating perception, rule induction, and answer selection on shared stems, finding a persistent rule-to-instance binding gap even when perception and rule are correct.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 7 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
YoCausal: How Far is Video Generation from World Model? A Causality Perspective cs.CV · 2026-05-28 · unverdicted · none · ref 52 · internal anchor
YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields cs.AI · 2026-05-28 · unverdicted · none · ref 14 · internal anchor
OmniMatBench is a new human-calibrated benchmark for multimodal materials-science reasoning that reveals the best evaluated MLLM scores only 0.372 overall.
Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese cs.CL · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
Introduces ChiSafe-PAS, a 1,897-prompt human-annotated Chinese adversarial benchmark for LLM safety with 3-class labels, 9-category obfuscation taxonomy, and domain coverage in self-harm, drugs, fraud, and satire.
K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance cs.LG · 2026-05-28 · unverdicted · none · ref 23 · internal anchor
K-FinHallu is the first multi-turn Korean financial RAG hallucination benchmark; frontier LLMs struggle especially on justified abstention while an 8B fine-tuned model reaches competitive performance.
Honest Lying: Understanding Memory Confabulation in Reflexive Agents cs.LG · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
Reflexive agents confabulate incorrect task interpretations in memory, detected via Reflection Repetition Rate metric, with a programmatic mitigation raising correct object mentions from 0% to 86% in frozen ALFWorld cases.
Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents cs.CV · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
Introduces GUI-RobustEval benchmark and RoTS synthesis framework to train GUI agents on error recovery, with RoTS-32B reaching 47.4% success on OSWorld.
Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation cs.CV · 2026-05-28 · unverdicted · none · ref 21 · internal anchor
Orthogonal Negative Guidance subtracts only the orthogonal component of negative-prompt attention features from positive ones in FLUX models to suppress concepts while preserving semantics and quality.
Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models cs.CV · 2026-05-27 · unverdicted · none · ref 26 · internal anchor
Embodied3DBench creates a new evaluation benchmark for low-level embodied spatial intelligence in VLMs, evaluates 13 models showing gaps in interaction perception, and supplies a large synthetic training set that yields measurable gains.
VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora cs.AI · 2026-05-27 · unverdicted · none · ref 25 · internal anchor
VeriTrip is a new benchmark using a Multimodal Retrieval Base and Verifiable Knowledge Base to evaluate evidence-grounded reasoning and factual reliability in travel planning agents over unstructured multimodal web data.
VoiceGiraffe: A Benchmark for Extreme Long-Context Audio-Language Understanding cs.SD · 2026-05-27 · unverdicted · none · ref 14 · internal anchor
VoiceGiraffe is a new benchmark showing that long-range memory persistence remains a key bottleneck for large audio language models on hour-scale audio.
Touch-R1: Reinforcing Touch Reasoning in MLLMs cs.CV · 2026-05-26 · unverdicted · none · ref 18 · internal anchor
Touch-R1 applies GRPO reinforcement learning on a new 1M tactile dataset and benchmark to train a Qwen2.5-VL-7B model that outperforms baselines on tactile perception and visual-tactile conflict tasks.
STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models cs.CV · 2026-05-25 · unverdicted · none · ref 19 · internal anchor
STORM teaches LVLMs to internalize spatial-temporal reasoning via bounded latent trajectories trained with generated thought videos in two stages, improving accuracy on VideoMME, MVBench and similar benchmarks while lowering inference overhead.
ETCHR: Editing To Clarify and Harness Reasoning cs.CV · 2026-05-22 · unverdicted · none · ref 12 · internal anchor
A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval cs.CV · 2026-05-22 · unverdicted · none · ref 9 · internal anchor
ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs cs.CV · 2026-05-21 · conditional · none · ref 25 · internal anchor
Video-LLMs exhibit directional motion blindness from a direction binding gap; DeltaDirect projector objective lifts synthetic accuracy to 85.4% and real accuracy by 21.9 points while preserving other video capabilities.
Forecasting Scientific Progress with Artificial Intelligence cs.AI · 2026-05-21 · unverdicted · none · ref 31 · internal anchor
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture cs.CV · 2026-05-21 · unverdicted · none · ref 20 · internal anchor
AgroTools is a new benchmark for tool-augmented multimodal agents in agriculture featuring 539 QA pairs, 1,097 images, five task families, and 14 tools, with evaluations showing major limitations in current models' tool planning and execution.
