LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Dong Guo, Feng Li, Hao Zhang, Renrui Zhang, Yuanhan Zhang · 2024 · cs.CV · arXiv 2408.03326

Mixed citation behavior. Most common role is background (55%).

329 Pith papers citing it

Background 55% of classified citations

abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

citation-role summary

background 55 · baseline 32 · dataset 7 · method 5

citation-polarity summary

background 54 · baseline 32 · use dataset 7 · use method 5 · unclear 1

representative citing papers

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

cs.CV · 2026-05-10 · accept · novelty 8.0

DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

An Attribute-Based Measure of Video Complexity

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

Chartographer generates seed-controlled counterfactual charts from existing QA datasets to expose generalization failures in VLMs that single-chart benchmarks miss.

Touch-R1: Reinforcing Touch Reasoning in MLLMs

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

Touch-R1 applies GRPO reinforcement learning on a new 1M tactile dataset and benchmark to train a Qwen2.5-VL-7B model that outperforms baselines on tactile perception and visual-tactile conflict tasks.

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.

SDGBiasBench: Benchmarking and Mitigating Vision--Language Models' Biases in Sustainable Development Goals

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

cs.CV · 2026-05-20 · conditional · novelty 7.0

WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.

EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.

EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.

EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

cs.CV · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

EgoInteract is a new simulator for generating synthetic egocentric videos with precise control over camera, body, hand, and object motions, producing a dataset that improves model performance on real-world benchmarks for temporal action segmentation, next-active object detection, interaction Anticip

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.

GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.

WirelessSenseLLM: Zero-Shot Human Activity Understanding by Bridging Wireless Signals and Human Language

cs.NI · 2026-05-13 · unverdicted · novelty 7.0

WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.

citing papers explorer

Showing 50 of 272 citing papers after filters.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models cs.CV · 2026-05-11 · unverdicted · none · ref 56 · internal anchor
SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents cs.CV · 2026-05-10 · accept · none · ref 37 · internal anchor
DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning cs.CV · 2026-04-03 · conditional · none · ref 4 · internal anchor
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 59 · internal anchor
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations cs.CV · 2026-04-20 · unverdicted · none · ref 20
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
An Attribute-Based Measure of Video Complexity cs.CV · 2026-05-30 · unverdicted · none · ref 28 · internal anchor
VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.
DeepLatent: Think with Images via Parallel Latent Visual Reasoning cs.CV · 2026-05-30 · unverdicted · none · ref 114 · internal anchor
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
Touch-R1: Reinforcing Touch Reasoning in MLLMs cs.CV · 2026-05-26 · unverdicted · none · ref 20 · internal anchor
Touch-R1 applies GRPO reinforcement learning on a new 1M tactile dataset and benchmark to train a Qwen2.5-VL-7B model that outperforms baselines on tactile perception and visual-tactile conflict tasks.
VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 14 · internal anchor
VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.
SDGBiasBench: Benchmarking and Mitigating Vision--Language Models' Biases in Sustainable Development Goals cs.CV · 2026-05-21 · unverdicted · none · ref 33 · internal anchor
SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning cs.CV · 2026-05-20 · unverdicted · none · ref 11 · 2 links · internal anchor
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata cs.CV · 2026-05-20 · conditional · none · ref 125 · internal anchor
WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning cs.CV · 2026-05-19 · unverdicted · none · ref 19 · internal anchor
EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.
EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos cs.CV · 2026-05-18 · unverdicted · none · ref 45 · internal anchor
EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.
EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation cs.CV · 2026-05-18 · unverdicted · none · ref 26 · 2 links · internal anchor
EgoInteract is a new simulator for generating synthetic egocentric videos with precise control over camera, body, hand, and object motions, producing a dataset that improves model performance on real-world benchmarks for temporal action segmentation, next-active object detection, interaction Anticip
HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation cs.CV · 2026-05-16 · unverdicted · none · ref 19 · internal anchor
HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.
GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 11 · internal anchor
GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 31 · internal anchor
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding cs.CV · 2026-05-13 · unverdicted · none · ref 53 · internal anchor
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding cs.CV · 2026-05-13 · unverdicted · none · ref 11 · internal anchor
AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
Count Anything at Any Granularity cs.CV · 2026-05-11 · unverdicted · none · ref 47 · internal anchor
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for improved accuracy.
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning cs.CV · 2026-05-11 · unverdicted · none · ref 11 · internal anchor
V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models cs.CV · 2026-05-11 · unverdicted · none · ref 21 · internal anchor
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video cs.CV · 2026-05-11 · unverdicted · none · ref 43 · internal anchor
StreamPro introduces a benchmark and training method using CB-Stream Loss and GRPO to enable proactive decision-making in streaming videos, achieving 41.5 on StreamPro-Bench compared to 10.4 previously.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models cs.CV · 2026-05-11 · conditional · none · ref 19 · 2 links · internal anchor
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding cs.CV · 2026-05-08 · unverdicted · none · ref 13 · internal anchor
SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding cs.CV · 2026-05-08 · unverdicted · none · ref 14 · internal anchor
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs cs.CV · 2026-05-08 · unverdicted · none · ref 7 · internal anchor
ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance cs.CV · 2026-05-02 · unverdicted · none · ref 23 · internal anchor
Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and verified across 60+ training runs.
Don't Pause! Every prediction matters in a streaming video cs.CV · 2026-04-27 · unverdicted · none · ref 35 · internal anchor
SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models cs.CV · 2026-04-27 · unverdicted · none · ref 11 · internal anchor
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding cs.CV · 2026-04-24 · unverdicted · none · ref 25 · internal anchor
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
Towards Unconstrained Human-Object Interaction cs.CV · 2026-04-15 · unverdicted · none · ref 27 · internal anchor
Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.
Why MLLMs Struggle to Determine Object Orientations cs.CV · 2026-04-14 · accept · none · ref 15 · internal anchor
Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding cs.CV · 2026-04-13 · unverdicted · none · ref 17 · internal anchor
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote sensing interpretation.
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos cs.CV · 2026-04-10 · unverdicted · none · ref 14 · internal anchor
SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning cs.CV · 2026-04-09 · unverdicted · none · ref 11 · internal anchor
CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding cs.CV · 2026-04-09 · unverdicted · none · ref 29 · internal anchor
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments cs.CV · 2026-04-09 · unverdicted · none · ref 19 · internal anchor
MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.
PLUME: Latent Reasoning Based Universal Multimodal Embedding cs.CV · 2026-04-02 · unverdicted · none · ref 26 · internal anchor
PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators cs.CV · 2026-03-31 · unverdicted · none · ref 13 · internal anchor
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models cs.CV · 2026-03-16 · unverdicted · none · ref 34 · internal anchor
LLMind uses bio-inspired non-uniform sampling via a Mobius module and closed-loop semantic feedback to retain 82-97% of full-resolution VLM performance with only 1-5% of pixels on VQA benchmarks.
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes cs.CV · 2026-03-10 · unverdicted · none · ref 20 · internal anchor
Panorama-Language Models with a sparse attention module and PanoVQA dataset deliver superior holistic reasoning on 360° adverse omni-scenes compared to stitched pinhole views.
SCP: Spatial Causal Prediction in Video cs.CV · 2026-03-04 · unverdicted · none · ref 29 · internal anchor
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding cs.CV · 2026-02-24 · unverdicted · none · ref 19 · internal anchor
LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
Thinking with Geometry: Active Geometry Integration for Spatial Reasoning cs.CV · 2026-02-05 · unverdicted · none · ref 15 · internal anchor
GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.
VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? cs.CV · 2026-02-04 · conditional · none · ref 7 · internal anchor
VISTA-Bench shows vision-language models degrade on visualized text in images compared to equivalent pure text, with larger gaps under increased perceptual difficulty.
Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution cs.CV · 2026-02-03 · unverdicted · none · ref 23 · internal anchor
Tiled Prompts generates tile-specific text prompts for each latent tile in diffusion super-resolution to reduce errors from global prompts and improve perceptual quality.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning cs.CV · 2026-01-22 · unverdicted · none · ref 13 · internal anchor
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
DarkQA: Benchmarking Vision-Language Models on Visual-Primitive Question Answering in Low-Light Indoor Scenes cs.CV · 2025-12-31 · accept · none · ref 13 · internal anchor
DarkQA is a new benchmark that measures vision-language model performance on basic visual questions under controlled low-light degradations modeled from real camera physics.
