super hub Mixed citations

GPT-4o System Card

author=, Gpt-4o system card · 2024 · cs.CL · arXiv 2410.21276

Mixed citation behavior. Most common role is background (53%).

797 Pith papers citing it

Background 53% of classified citations

open full Pith review browse 797 citing papers more from author= arXiv PDF

abstract

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50\% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 97 baseline 51 method 23 dataset 3

citation-polarity summary

background 93 baseline 51 use method 22 unclear 4 use dataset 3 support 1

claims ledger

abstract GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while

authors

author= Gpt-4o system card

co-cited works

representative citing papers

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

CHASM: Unveiling Covert Advertisements on Chinese Social Media

cs.LG · 2026-04-22 · unverdicted · novelty 8.0

CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

ReConText3D: Replay-based Continual Text-to-3D Generation

cs.CV · 2026-04-15 · conditional · novelty 8.0

ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

cs.CL · 2025-12-08 · accept · novelty 8.0

SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.

CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes

cs.CR · 2026-06-24 · unverdicted · novelty 7.0

CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.

citing papers explorer

Showing 50 of 797 citing papers.

Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning cs.CV · 2026-04-06 · unverdicted · none · ref 15 · internal anchor
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning cs.CV · 2026-04-06 · unverdicted · none · ref 21 · internal anchor
G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding cs.CV · 2026-04-03 · unverdicted · none · ref 32 · internal anchor
ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.
Differential Mental Disorder Detection with Psychology-Inspired Multimodal Stimuli cs.MM · 2026-04-03 · unverdicted · none · ref 30 · internal anchor
Introduces the MMH dataset collected via psychology-inspired multimodal stimuli and a paradigm-aware framework that uses inter-disorder prior knowledge as prompts, outperforming baselines on differential detection of depression, anxiety and schizophrenia.
GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning cs.AI · 2026-04-03 · unverdicted · none · ref 22 · internal anchor
GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.
Understanding the Effects of Safety Unalignment on Large Language Models cs.CR · 2026-04-02 · unverdicted · none · ref 16 · internal anchor
Weight orthogonalization unalignment enables LLMs to assist malicious activities more effectively than jailbreak-tuning, with less hallucination and better retained performance, while supervised fine-tuning mitigates the added attack capabilities.
STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering cs.CV · 2026-04-02 · unverdicted · none · ref 8 · internal anchor
STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning cs.CV · 2026-03-28 · unverdicted · none · ref 19 · internal anchor
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
PersonaVLM: Long-Term Personalized Multimodal LLMs cs.CL · 2026-03-20 · unverdicted · none · ref 12 · internal anchor
PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.
Universal Skeleton Understanding via Differentiable Rendering and MLLMs cs.CV · 2026-03-18 · unverdicted · none · ref 2 · 3 links · internal anchor
SkeletonLLM translates skeleton kinematics into compact image sequences via an end-to-end differentiable renderer DrAction and uses cooperative training to enable MLLMs to perform open-vocabulary action recognition, motion captioning, and QA across heterogeneous skeleton formats.
Investigating Vaccine Buyer's Remorse: Post-Vaccination Decision Regret in COVID-19 Social Media Using Politically Diverse Human Annotation cs.CY · 2026-03-18 · unverdicted · none · ref 16 · internal anchor
Vaccine buyer's remorse occurs in under 2% of COVID-19 social media posts, mostly in skeptic communities via personal stories of health issues.
Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation cs.AI · 2026-03-18 · unverdicted · none · ref 20 · internal anchor
Safety degradation in large reasoning models occurs only after chain-of-thought is enabled; adding pre-CoT safety signals from a BERT classifier on safe models improves safety while preserving reasoning ability.
Toward Robust GraphRAG: Mitigating Retrieval Drift and Hallucination from Imperfect Knowledge Graphs cs.IR · 2026-03-16 · unverdicted · none · ref 6 · internal anchor
CS-RAG is a GraphRAG framework that plans queries as ordered atomic constraints, uses anchor-relation aware retrieval, applies sufficiency checks, and falls back to text recovery to reduce drift and hallucination from imperfect KGs.
TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models cs.CL · 2026-03-13 · unverdicted · none · ref 57 · internal anchor
TDA-RC embeds topological patterns from multi-round reasoning into CoT via persistent homology and a repair agent, yielding better accuracy-efficiency trade-offs than ToT or GoT on tested datasets.
BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning cs.RO · 2026-03-12 · unverdicted · none · ref 19 · internal anchor
BrainMem equips LLM-based embodied planners with working, episodic, and semantic memory that evolves interaction histories into retrievable knowledge graphs and guidelines, raising success rates on long-horizon 3D benchmarks.
Intrinsic Concept Extraction Based on Compositional Interpretability cs.CV · 2026-03-12 · unverdicted · none · ref 1 · internal anchor
HyperExpress extracts composable intrinsic concepts from single images via hyperbolic concept learning and concept-wise optimization in diffusion-based models.
Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations cs.CV · 2026-03-09 · unverdicted · none · ref 14 · internal anchor
GR3D turns 3D scene geometry into ID-indexed text references, enabling zero-shot MLLM spatial reasoning gains of 9% on VSI-Bench and 12% on MindCube.
Automating Crash Diagram Generation Using Vision-Language Models: A Case Study on Multi-Lane Roundabouts cs.HC · 2026-03-09 · unverdicted · none · ref 12 · internal anchor
Vision-language models can produce crash diagrams from reports with moderate quality, GPT-4o scoring highest at 6.29/10 across 79 roundabout cases using a structured prompt and 10-metric evaluation.
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens cs.CL · 2026-03-06 · unverdicted · none · ref 31 · internal anchor
MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% degradation.
How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning cs.CL · 2026-03-01 · unverdicted · none · ref 54 · internal anchor
Reinforcement learning with three causal constraints enables multimodal models to internalize diagram-reasoning links in geometry, unlike SFT which only mimics surface format and harms performance.
GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant cs.CL · 2026-03-01 · unverdicted · none · ref 27 · internal anchor
GroupGPT decouples intervention timing from response generation via edge-cloud collaboration for multi-user chats, scoring 4.72/5 on the new MUIR benchmark of 2500 segments while cutting token use by up to 3x and adding privacy sanitization.
A Minimal Agent for Automated Theorem Proving cs.AI · 2026-02-27 · unverdicted · none · ref 61 · internal anchor
A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.
FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting cs.CL · 2026-02-25 · unverdicted · none · ref 14 · internal anchor
FinReasoning is a hierarchical benchmark that decomposes LLM financial research capabilities into semantic consistency, data alignment, and deep insight, revealing model-type differences in auditing versus insight generation.
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs? cs.LG · 2026-02-20 · conditional · none · ref 36 · 2 links · internal anchor
MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.
Thinking with Drafting: Optical Decompression via Logical Reconstruction cs.CL · 2026-02-12 · unverdicted · none · ref 19 · internal anchor
Thinking with Drafting reconceptualizes visual reasoning as optical decompression by forcing models to draft mental models into executable DSL code for deterministic self-verification on the VisAlg benchmark.
Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning cs.CV · 2026-02-07 · unverdicted · none · ref 12 · internal anchor
Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained categories with 4-shot training.
Vision-aligned Latent Reasoning for Multi-modal Large Language Model cs.CV · 2026-02-04 · unverdicted · none · ref 13 · internal anchor
VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making cs.AI · 2026-02-03 · unverdicted · none · ref 36 · internal anchor
Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding cs.CV · 2026-01-21 · unverdicted · none · ref 33 · internal anchor
HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.
Compounding Disadvantage: Auditing Intersectional Bias in LLM-Generated Explanations Across Indian and American STEM Education cs.CY · 2026-01-20 · unverdicted · none · ref 35 · internal anchor
LLMs generate lower-quality STEM explanations for marginalized student profiles in Indian and American contexts, with intersectional compounding producing gaps of up to 2.55 grade levels.
EGM: Efficient Visual Grounding Language Models cs.CV · 2026-01-20 · unverdicted · none · ref 14 · internal anchor
EGM enables 8B VLMs to reach 91.4 IoU on RefCOCO at 737 ms latency, outperforming a 235B model at 4320 ms, by substituting volume of mid-quality tokens for model scale.
MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness cs.AI · 2026-01-13 · unverdicted · none · ref 14 · internal anchor
MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.
Token-Level LLM Collaboration via FusionRoute cs.AI · 2026-01-08 · unverdicted · none · ref 10 · internal anchor
FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and merging methods on reasoning and generation benchmarks.
LinMU: Multimodal Understanding Made Linear cs.CV · 2026-01-04 · conditional · none · ref 9 · internal anchor
LinMU achieves linear-complexity multimodal understanding by swapping self-attention for an M-MATE dual-branch block and distilling from a frozen teacher VLM, matching accuracy with up to 2.7x faster TTFT and 9x higher throughput.
Streaming Video Instruction Tuning cs.CV · 2025-12-24 · unverdicted · none · ref 53 · internal anchor
Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction cs.CL · 2025-12-21 · unverdicted · none · ref 3 · internal anchor
LLMs systematically misalign with human student difficulty perceptions, converging to a machine consensus rather than simulating human cognitive limits across medical and math domains.
Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback cs.CV · 2025-12-21 · unverdicted · none · ref 13 · internal anchor
An RL-trained lightweight agent uses MLLM perceptual rewards to perform efficient label-free image restoration, matching SOTA on full-reference metrics and surpassing prior work on no-reference metrics.
VeruSAGE: A Study of Agent-Based Verification for Rust Systems cs.OS · 2025-12-20 · unverdicted · none · ref 14 · internal anchor
LLM agents complete over 80% of tasks on a new 849-task Rust verification benchmark and over 90% on unfinished human proofs.
Kling-Omni Technical Report cs.CV · 2025-12-18 · unverdicted · none · ref 12 · internal anchor
Kling-Omni is a unified multimodal generative system that produces cinematic videos from diverse inputs by integrating generation, editing, and intelligent reasoning in a single end-to-end model.
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning cs.CV · 2025-12-17 · unverdicted · none · ref 24 · internal anchor
Skyra is an MLLM that detects AI-generated videos by identifying and reasoning over grounded visual artifacts, supported by a new annotated dataset and benchmark.
AgentIAD: Agentic Industrial Anomaly Detection via Adaptive Memory Augmentation cs.CV · 2025-12-15 · unverdicted · none · ref 17 · internal anchor
AgentIAD introduces an agentic VLM with Perceptive Zoomer, Web Searcher, and Comparative Retriever tools plus two-stage SFT-then-RL training, achieving 5.92% higher classification accuracy than prior SOTA on the MMAD benchmark.
LLaDA2.0: Scaling Up Diffusion Language Models to 100B cs.LG · 2025-12-10 · conditional · none · ref 15 · internal anchor
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs cs.AI · 2025-12-09 · unverdicted · none · ref 18 · internal anchor
State-of-the-art MLLMs show substantial inconsistency when reasoning over the same information presented in image, text, or mixed modalities, even after accounting for OCR errors, with inconsistency linked to visual factors and modality gap.
Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding cs.CV · 2025-12-07 · conditional · none · ref 27 · internal anchor
DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on HC-STVG.
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding cs.CV · 2025-12-06 · conditional · none · ref 13 · internal anchor
MedGRPO applies cross-dataset reward normalization and a clinical LLM judge within multi-task RL to improve vision-language models on heterogeneous medical video understanding tasks using the new MedVidBench dataset.
GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization cs.CV · 2025-12-02 · unverdicted · none · ref 15 · internal anchor
GeoBridge introduces a semantic-anchor mechanism using text to bridge multi-view image features for bidirectional cross-view and language-to-image geo-localization, supported by the new GeoLoc dataset of over 50,000 aligned pairs.
PhotoFramer: Multi-modal Image Composition Instruction cs.CV · 2025-11-30 · conditional · none · ref 40 · internal anchor
PhotoFramer is a multi-modal model that jointly produces textual composition instructions and illustrative corrected images from poorly framed inputs.
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling cs.CV · 2025-11-25 · unverdicted · none · ref 16 · internal anchor
LongVT adds native video-cropping tool calling to LMMs for interleaved multimodal chain-of-tool-thought reasoning on long videos and releases VideoSIAH data for training and evaluation.
Boosting Reasoning in Large Multimodal Models via Activation Replay cs.CV · 2025-11-25 · unverdicted · none · ref 16 · internal anchor
Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.
GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning cs.CL · 2025-11-24 · unverdicted · none · ref 13 · internal anchor
GraphMind models multi-step reasoning as an evolving heterogeneous graph, using GNN encoding and semantic matching to select theorems and generate conclusions iteratively, reporting performance gains over baselines on QA datasets.

GPT-4o System Card

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer