GPT-4o System Card

2024 · cs.CL · arXiv 2410.21276

Mixed citation behavior. Most common role is background (53%).

791 Pith papers citing it

Background 53% of classified citations

abstract

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

citation-role summary

background: 97 · baseline: 51 · method: 23 · dataset: 3

citation-polarity summary

background: 93 · baseline: 51 · use method: 22 · unclear: 4 · use dataset: 3 · support: 1

representative citing papers

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

CHASM: Unveiling Covert Advertisements on Chinese Social Media

cs.LG · 2026-04-22 · unverdicted · novelty 8.0

CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

ReConText3D: Replay-based Continual Text-to-3D Generation

cs.CV · 2026-04-15 · conditional · novelty 8.0

ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

cs.CL · 2025-12-08 · accept · novelty 8.0

SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.

CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes

cs.CR · 2026-06-24 · unverdicted · novelty 7.0

CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.

citing papers explorer

Showing 50 of 791 citing papers.

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces cs.CL · 2026-04-09 · unverdicted · none · ref 19 · 2 links · internal anchor
Introduces OmniBehavior benchmark from real-world data and shows LLMs exhibit hyper-activity, persona homogenization, and utopian bias in behavior simulation.
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding cs.CV · 2026-04-09 · unverdicted · none · ref 22 · internal anchor
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models cs.CV · 2026-04-09 · unverdicted · none · ref 22 · internal anchor
MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.
Open-Ended Video Game Glitch Detection with Agentic Reasoning and Temporal Grounding cs.MA · 2026-04-09 · unverdicted · none · ref 14 · internal anchor
Introduces the first benchmark for open-ended video game glitch detection with temporal localization and proposes GliDe, an agentic framework that achieves stronger performance than vanilla multimodal models.
Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video cs.CV · 2026-04-09 · unverdicted · none · ref 23 · internal anchor
C-MET transfers emotions from speech to facial video by learning cross-modal semantic vectors with pretrained audio and disentangled expression encoders, yielding 14% higher emotion accuracy on MEAD and CREMA-D even for unseen emotions.
MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments cs.CV · 2026-04-09 · unverdicted · none · ref 15 · internal anchor
MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles cs.AI · 2026-04-08 · unverdicted · none · ref 8 · internal anchor
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models cs.CV · 2026-04-08 · unverdicted · none · ref 10 · internal anchor
VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can outperform specialized streaming models.
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill cs.CL · 2026-04-08 · conditional · none · ref 29 · internal anchor
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis cs.RO · 2026-04-08 · unverdicted · none · ref 19 · internal anchor
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
CCD-CBT: Multi-Agent Therapeutic Interaction for CBT Guided by Cognitive Conceptualization Diagram cs.CL · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
CCD-CBT is a multi-agent framework for CBT simulation that dynamically reconstructs Cognitive Conceptualization Diagrams via a Control Agent and enforces information asymmetry between Therapist and Client agents, with the released CCDCHAT dataset enabling fine-tuned models to outperform baselines in
Learning to Interrupt in Language-based Multi-agent Communication cs.CL · 2026-04-07 · unverdicted · none · ref 17 · internal anchor
HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, scheduling, and debate.
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference cs.CV · 2026-04-07 · unverdicted · none · ref 14 · internal anchor
ID-Selection combines importance scoring with iterative diversity suppression to prune 97.2% of visual tokens in LVLMs while retaining 91.8% performance and cutting FLOPs by over 97% without retraining.
SCOPE: A Dataset of Stereotyped Prompts for Counterfactual Fairness Assessment of LLMs cs.SE · 2026-04-07 · unverdicted · none · ref 15 · internal anchor
SCOPE is a new large-scale dataset of counterfactual prompt pairs for evaluating fairness and stereotype sensitivity in LLMs across 1,438 topics, nine bias dimensions, 1,536 groups, and four communicative intents.
Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition cs.AI · 2026-04-07 · unverdicted · none · ref 5 · internal anchor
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features cs.CL · 2026-04-06 · unverdicted · none · ref 4 · internal anchor
Effective multilingual reasoning in large models relies on language-specific patterns in reasoning features rather than uniform English-like traces.
Retrieval Augmented Conversational Recommendation with Reinforcement Learning cs.IR · 2026-04-06 · unverdicted · none · ref 21 · internal anchor
RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.
BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing cs.CV · 2026-04-06 · unverdicted · none · ref 15 · internal anchor
BoxComm is the first large-scale benchmark for category-aware commentary generation and rhythm assessment in boxing, showing state-of-the-art multimodal models struggle with tactical analysis and temporal pacing.
Talk2AI: A Longitudinal Dataset of Human–AI Persuasive Conversations cs.HC · 2026-04-06 · unverdicted · none · ref 15 · internal anchor
Talk2AI is a new longitudinal dataset of 3,080 human-AI conversations with linked opinion-change and psychometric measures collected from 770 participants over four weeks.
GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models cs.CV · 2026-04-05 · unverdicted · none · ref 9 · internal anchor
GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.
MisEdu-RAG: A Misconception-Aware Dual-Hypergraph RAG for Novice Math Teachers cs.IR · 2026-04-05 · unverdicted · none · ref 17 · internal anchor
MisEdu-RAG builds concept and instance hypergraphs for two-stage retrieval of pedagogical knowledge and student errors, improving feedback quality on the MisstepMath benchmark by 10.95% token-F1 and up to 15.3% on response dimensions.
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence cs.CL · 2026-04-03 · unverdicted · none · ref 23 · internal anchor
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models cs.CL · 2026-04-03 · unverdicted · none · ref 5 · internal anchor
A new Latent Imagination Module uses cross-attention to predict latent visual embeddings from text, improving accuracy and calibration of vision-language models on text-only inputs.
OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments cs.HC · 2026-04-03 · unverdicted · none · ref 11 · internal anchor
OmniGUI is the first step-level benchmark supplying interleaved image, audio, and video inputs across 709 expert episodes in 29 smartphone apps to evaluate multimodal GUI agents.
THOM: Generating Physically Plausible Hand-Object Meshes From Text cs.CV · 2026-04-03 · unverdicted · none · ref 23 · internal anchor
THOM is a training-free two-stage framework that generates physically plausible hand-object 3D meshes directly from text by combining text-guided Gaussians with contact-aware physics optimization and VLM refinement.
XrayClaw: Cooperative-Competitive Multi-Agent Alignment for Trustworthy Chest X-ray Diagnosis cs.CV · 2026-04-03 · unverdicted · none · ref 12 · internal anchor
XrayClaw deploys cooperative-competitive multi-agent alignment and Competitive Preference Optimization to raise diagnostic accuracy, reasoning fidelity, and generalization on chest X-ray benchmarks.
TOL: Textual Localization with OpenStreetMap cs.CV · 2026-04-02 · unverdicted · none · ref 14 · internal anchor
TOLoc localizes textual scene descriptions to accurate 2D positions on OpenStreetMap tiles via coarse-to-fine semantic and directional matching, outperforming prior methods on a new multi-city benchmark.
JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation cs.CV · 2026-04-01 · conditional · none · ref 2 · internal anchor
JAMMEval delivers refined Japanese VQA benchmarks that produce evaluation scores more reflective of true model capability, with lower run-to-run variance and stronger separation between models of differing ability.
Internalized Reasoning for Long-Context Visual Document Understanding cs.CV · 2026-03-31 · unverdicted · none · ref 37 · internal anchor
A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators cs.CV · 2026-03-31 · unverdicted · none · ref 10 · internal anchor
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.
PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models cs.CL · 2026-03-27 · unverdicted · none · ref 6 · internal anchor
PR-CAD unifies text-to-CAD generation and editing via progressive refinement with LLMs, a new interaction dataset, and RL-enhanced reasoning to achieve better controllability and faithfulness.
PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments cs.AI · 2026-03-24 · unverdicted · none · ref 40 · internal anchor
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark cs.CL · 2026-03-23 · conditional · none · ref 2 · internal anchor
CFMS is the first fine-grained Chinese multimodal sarcasm benchmark with detailed annotations, paired with a PGDS reinforcement learning strategy that improves model results on sarcasm tasks.
SteelDefectX: A Multi-Form Vision-Language Dataset and Benchmark for Steel Surface Defect Analysis cs.CV · 2026-03-23 · unverdicted · none · ref 11 · internal anchor
SteelDefectX is a new multi-form vision-language dataset and benchmark for analyzing steel surface defects using 7,778 images across 25 categories.
Topo-R1: Detecting Topological Anomalies via Vision-Language Models cs.CV · 2026-03-13 · unverdicted · none · ref 29 · internal anchor
Topo-R1 fine-tunes a vision-language model using a topology-aware reward and GRPO to detect anomalies such as broken or spurious connections in tubular segmentation masks, outperforming standard VLMs.
PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses cs.CL · 2026-03-11 · unverdicted · none · ref 19 · internal anchor
PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents cs.CV · 2026-03-02 · unverdicted · none · ref 12 · internal anchor
MM-Mem distills video input through a hierarchical memory of sensory buffer, episodic stream, and symbolic schema, optimized by a semantic information bottleneck and SIB-GRPO, to achieve SOTA on long-horizon video benchmarks.
BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation cs.CL · 2026-02-27 · unverdicted · none · ref 14 · internal anchor
BRIDGE reduces bias against high-scoring ELL students in automated scoring by generating synthetic samples via inter-group content pasting and quality discrimination, achieving fairness gains comparable to additional real data.
ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices cs.AI · 2026-02-25 · conditional · none · ref 16 · internal anchor
ProactiveMobile is a new benchmark for proactive mobile agents that tests latent intent inference from context and executable API generation, where a fine-tuned 7B model reaches 19.15% success versus 15.71% for o1 and 7.39% for GPT-5.
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding cs.CV · 2026-02-24 · unverdicted · none · ref 17 · internal anchor
LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation cs.AI · 2026-02-19 · unverdicted · none · ref 13 · internal anchor
Conv-FinRe is a new benchmark built from real market data and human trajectories that tests LLMs on generating utility-grounded stock rankings over fixed horizons while distinguishing rational analysis from behavioral mimicry or momentum.
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling cs.CV · 2026-02-11 · unverdicted · none · ref 10 · internal anchor
DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
Thinking with Geometry: Active Geometry Integration for Spatial Reasoning cs.CV · 2026-02-05 · unverdicted · none · ref 13 · internal anchor
GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.
PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation cs.CV · 2026-02-04 · unverdicted · none · ref 16 · internal anchor
PerpetualWonder introduces a closed-loop generative simulator with a unified physical-visual representation for long-horizon action-conditioned 4D scene generation from one image.
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning cs.CV · 2026-01-30 · unverdicted · none · ref 22 · internal anchor
CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors cs.CV · 2026-01-28 · conditional · none · ref 25 · internal anchor
AnomalyVFM converts vision foundation models into zero-shot anomaly detectors via three-stage synthetic dataset generation plus low-rank adapters and weighted pixel loss, reaching 94.1% average image AUROC across nine datasets.
ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction cs.CV · 2026-01-23 · unverdicted · none · ref 15 · internal anchor
ReWeaver reconstructs topology-accurate 3D garments and sewing patterns from sparse multi-view images by predicting seams and panels in 2D UV and 3D space using a new 100k-sample synthetic dataset.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning cs.CV · 2026-01-22 · unverdicted · none · ref 11 · internal anchor
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology cs.CV · 2026-01-20 · conditional · none · ref 25 · internal anchor
Weather-R1 is a multimodal reasoning model for meteorology that uses logical consistency rewards during reinforcement fine-tuning to cut self-contradictory outputs and raises benchmark accuracy by 9.8 points over baselines.
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning cs.CV · 2026-01-11 · unverdicted · none · ref 3 · internal anchor
VideoDR is a new benchmark for open-web video deep research that tests multimodal models on cross-frame visual anchor extraction, interactive retrieval, and multi-hop reasoning over joint video-web evidence.
