LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Dong Guo, Feng Li, Hao Zhang, Renrui Zhang, Yuanhan Zhang · 2024 · cs.CV · arXiv 2408.03326

Mixed citation behavior. Most common role is background (55%).

352 Pith papers citing it

Background 55% of classified citations

abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

citation-role summary

background 56 · baseline 32 · dataset 7 · method 5

citation-polarity summary

background 55 · baseline 32 · use dataset 7 · use method 5 · unclear 1

authors

Bo Li, Dong Guo, Feng Li, Hao Zhang, Renrui Zhang, Yuanhan Zhang

representative citing papers

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding: horizontal relations are grounded, vertical relations are prior-driven, and depth is inverted.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

cs.CV · 2026-05-10 · accept · novelty 8.0

DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

RoboGaze presents a structured multi-agent VLM pipeline and robotics-specific error taxonomy that improves video evaluation metrics by up to 43 F1 points over zero-shot baselines on a 382-clip dataset.

An Attribute-Based Measure of Video Complexity

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

Chartographer generates seed-controlled counterfactual charts from existing QA datasets to expose generalization failures in VLMs that single-chart benchmarks miss.

Touch-R1: Reinforcing Touch Reasoning in MLLMs

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

Touch-R1 applies GRPO reinforcement learning on a new 1M tactile dataset and benchmark to train a Qwen2.5-VL-7B model that outperforms baselines on tactile perception and visual-tactile conflict tasks.

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

STORM teaches LVLMs to internalize spatial-temporal reasoning via bounded latent trajectories trained with generated thought videos in two stages, improving accuracy on VideoMME, MVBench and similar benchmarks while lowering inference overhead.

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.

SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

cs.CV · 2026-05-20 · conditional · novelty 7.0

WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.

EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.

EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.

EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

cs.CV · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

EgoInteract is a new simulator for generating synthetic egocentric videos with precise control over camera, body, hand, and object motions, producing a dataset that improves model performance on real-world benchmarks for temporal action segmentation, next-active object detection, interaction Anticip

An Efficient Streaming Video Understanding Framework with Agentic Control

cs.CV · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

R3-Streaming uses cascaded control with age-aware memory forgetting and TB-GRPO reinforcement learning to reach SOTA scores of 57.92 on OVO-Bench and 76.36 on StreamingBench with 95-96% fewer visual tokens.

citing papers explorer

Showing 50 of 352 citing papers.

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation cs.CV · 2025-10-24 · unverdicted · none · ref 18 · internal anchor
NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal CoT generalization.
Online In-Context Distillation for Low-Resource Vision Language Models cs.CV · 2025-10-20 · unverdicted · none · ref 13 · internal anchor
Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.
Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning cs.AI · 2025-09-26 · unverdicted · none · ref 20 · internal anchor
MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gains on visual reasoning tasks.
EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence cs.CV · 2025-09-18 · unverdicted · none · ref 3 · internal anchor
EchoVLM applies a Mixture-of-Experts vision-language model to ultrasound imaging across seven body regions, reporting gains of 10.15 BLEU-1 and 4.77 ROUGE-1 over Qwen2-VL on report generation.
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search cs.CV · 2025-09-09 · unverdicted · none · ref 15 · internal anchor
Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning cs.CV · 2025-07-22 · unverdicted · none · ref 17 · internal anchor
ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.
WorldVLA: Towards Autoregressive Action World Model cs.RO · 2025-06-26 · unverdicted · none · ref 19 · internal anchor
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
OmniGen2: Towards Instruction-Aligned Multimodal Generation cs.CV · 2025-06-23 · unverdicted · none · ref 35 · internal anchor
OmniGen2 introduces a unified generative model with two distinct decoding pathways and a decoupled image tokenizer that achieves competitive results on text-to-image and editing benchmarks plus state-of-the-art consistency among open-source models on the new OmniContext benchmark.
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation cs.CV · 2025-06-03 · unverdicted · none · ref 17 · internal anchor
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs cs.CV · 2025-03-04 · unverdicted · none · ref 22 · internal anchor
Modality-mutual attention (MMA) is introduced to replace causal attention in MLLMs, enabling mutual attention between image and text tokens and claiming SOTA results on 12 multimodal benchmarks with no extra parameters.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs cs.CL · 2025-03-03 · unverdicted · none · ref 32 · internal anchor
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling cs.CV · 2025-01-21 · unverdicted · none · ref 14 · internal anchor
InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding better benchmark performance, 6x longer video memory, and new capabilities like object
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding cs.CV · 2025-01-09 · unverdicted · none · ref 29 · internal anchor
LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding cs.CV · 2024-12-13 · accept · none · ref 46 · internal anchor
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.
NVILA: Efficient Frontier Visual Language Models cs.CV · 2024-12-05 · unverdicted · none · ref 4 · internal anchor
NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 101 · internal anchor
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
Perceptual Flow Network for Visually Grounded Reasoning cs.CV · 2026-05-04 · unverdicted · none · ref 19
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
Make Your LVLM KV Cache More Lightweight cs.CV · 2026-05-01 · unverdicted · none · ref 53
LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection cs.CV · 2026-04-20 · unverdicted · none · ref 32
ZSG-IAD is a zero-shot multimodal system that uses language-guided two-hop grounding and rule-based reinforcement learning to produce anomaly masks and explainable reports from industrial sensor data.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 201
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
Consistency as Inductive Bias: Learning Cross-View Invariance for Robust Multimodal Reasoning cs.CV · 2026-06-29 · unverdicted · none · ref 19 · internal anchor
ConsistRoll enforces cross-view consistency during RLVR training for MLLMs by joint rewards on grouped original and augmented views, yielding robustness gains on math, general, and hallucination benchmarks.
SteerSeg: Attention Steering for Reasoning Video Segmentation cs.CV · 2026-05-14 · unverdicted · none · ref 23 · internal anchor
SteerSeg improves attention-based grounding for reasoning video segmentation by training soft prompts and applying CoT-derived attributes to produce concentrated attention maps that guide point-prompted segmentation and tracklet selection.
ZAYA1-VL-8B Technical Report cs.CV · 2026-05-08 · unverdicted · none · ref 66 · internal anchor
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning cs.CV · 2026-05-05 · unverdicted · none · ref 40 · internal anchor
UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
CurEvo: Curriculum-Guided Self-Evolution for Video Understanding cs.CV · 2026-04-29 · unverdicted · none · ref 35 · internal anchor
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.
Agentic AI for Remote Sensing: Technical Challenges and Research Directions cs.CV · 2026-04-27 · unverdicted · none · ref 56 · 2 links · internal anchor
Position paper identifies structural challenges in applying generic agentic AI to Earth Observation and outlines design principles for EO-native agents focused on geospatial state and validity.
STAND: Semantic Anchoring Constraint with Dual-Granularity Disambiguation for Remote Sensing Image Change Captioning cs.CV · 2026-04-25 · unverdicted · none · ref 13 · internal anchor
STAND adds semantic anchoring and dual-granularity disambiguation modules to address viewpoint, scale, and knowledge ambiguities in remote sensing change captioning.
DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization cs.DC · 2026-03-26 · unverdicted · none · ref 36 · internal anchor
DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than existing frameworks.
Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge cs.CV · 2026-02-25 · unverdicted · none · ref 8 · internal anchor
An edge-deployed multimodal LLM pipeline for online episodic memory QA reaches 51.76% accuracy on an 8 GB GPU and 54.40% on a local server, within 4-5 points of a 56% cloud baseline.
TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning cs.AI · 2026-01-29 · unverdicted · none · ref 8 · internal anchor
TCAP detects backdoor samples in MLLM fine-tuning via tri-component attention profiling, GMM-based head identification, and EM vote aggregation.
Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning cs.CV · 2025-07-01 · unverdicted · none · ref 16 · internal anchor
A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on average across seven other benchmarks.
Show-o2: Improved Native Unified Multimodal Models cs.CV · 2025-06-18 · unverdicted · none · ref 56 · internal anchor
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning cs.CV · 2025-06-07 · unverdicted · none · ref 35 · internal anchor
Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding cs.CV · 2025-01-22 · unverdicted · none · ref 29 · internal anchor
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction cs.CV · 2025-01-03 · conditional · none · ref 47 · internal anchor
VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs cs.CV · 2024-06-11 · unverdicted · none · ref 27 · internal anchor
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments cs.CV · 2026-04-20 · unverdicted · none · ref 52
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
Toward Native Multimodal Modeling: A Roadmap cs.CV · 2026-05-25 · unverdicted · none · ref 210 · internal anchor
A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation cs.CV · 2026-04-13 · unverdicted · none · ref 80 · internal anchor
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey cs.CV · 2025-03-16 · unverdicted · none · ref 275 · internal anchor
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments cs.CV · 2026-04-24 · unreviewed · ref 15 · internal anchor
Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings cs.CV · 2026-04-24 · unreviewed · ref 22 · internal anchor
HyLaR: Hybrid Latent Reasoning with Decoupled Policy Optimization cs.CV · 2026-04-22 · unreviewed · ref 18 · internal anchor
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference cs.LG · 2026-04-19 · unreviewed · ref 2 · 2 links · internal anchor
BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs cs.CV · 2026-04-12 · unreviewed · ref 15 · internal anchor
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning cs.AI · 2026-04-04 · unreviewed · ref 26 · internal anchor
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs cs.CV · 2026-04-01 · unreviewed · ref 20 · internal anchor
R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation cs.CV · 2026-01-25 · unreviewed · ref 24 · internal anchor
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs cs.CV · 2025-11-18 · unreviewed · ref 31 · internal anchor
FreeRet: MLLMs as Training-Free Retrievers cs.CV · 2025-09-29 · unreviewed · ref 13 · internal anchor
