super hub Mixed citations

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Dong Guo, Feng Li, Hao Zhang, Renrui Zhang, Yuanhan Zhang · 2024 · cs.CV · arXiv 2408.03326

Mixed citation behavior. Most common role is background (55%).

347 Pith papers citing it

Background 55% of classified citations

open full Pith review browse 347 citing papers more from Bo Li arXiv PDF

abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 56 baseline 32 dataset 7 method 5

citation-polarity summary

background 55 baseline 32 use dataset 7 use method 5 unclear 1

claims ledger

abstract We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particu

authors

Bo Li Dong Guo Feng Li Hao Zhang Renrui Zhang Yuanhan Zhang

co-cited works

representative citing papers

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

cs.CV · 2026-05-10 · accept · novelty 8.0

DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

RoboGaze presents a structured multi-agent VLM pipeline and robotics-specific error taxonomy that improves video evaluation metrics by up to 43 F1 points over zero-shot baselines on a 382-clip dataset.

An Attribute-Based Measure of Video Complexity

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

Chartographer generates seed-controlled counterfactual charts from existing QA datasets to expose generalization failures in VLMs that single-chart benchmarks miss.

Touch-R1: Reinforcing Touch Reasoning in MLLMs

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

Touch-R1 applies GRPO reinforcement learning on a new 1M tactile dataset and benchmark to train a Qwen2.5-VL-7B model that outperforms baselines on tactile perception and visual-tactile conflict tasks.

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

STORM teaches LVLMs to internalize spatial-temporal reasoning via bounded latent trajectories trained with generated thought videos in two stages, improving accuracy on VideoMME, MVBench and similar benchmarks while lowering inference overhead.

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.

SDGBiasBench: Benchmarking and Mitigating Vision--Language Models' Biases in Sustainable Development Goals

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

cs.CV · 2026-05-20 · conditional · novelty 7.0

WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.

EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.

EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.

EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

cs.CV · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

EgoInteract is a new simulator for generating synthetic egocentric videos with precise control over camera, body, hand, and object motions, producing a dataset that improves model performance on real-world benchmarks for temporal action segmentation, next-active object detection, interaction Anticip

An Efficient Streaming Video Understanding Framework with Agentic Control

cs.CV · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

R3-Streaming uses cascaded control with age-aware memory forgetting and TB-GRPO reinforcement learning to reach SOTA scores of 57.92 on OVO-Bench and 76.36 on StreamingBench with 95-96% fewer visual tokens.

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.

citing papers explorer

Showing 50 of 347 citing papers.

Boosting Reasoning in Large Multimodal Models via Activation Replay cs.CV · 2025-11-25 · unverdicted · none · ref 20 · internal anchor
Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding cs.CV · 2025-11-17 · unverdicted · none · ref 17 · internal anchor
REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.
DeepEyesV2: Toward Agentic Multimodal Model cs.CV · 2025-11-07 · unverdicted · none · ref 24 · internal anchor
DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 51 · internal anchor
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling cs.CV · 2025-10-23 · unverdicted · none · ref 11 · internal anchor
RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy cs.RO · 2025-10-15 · unverdicted · none · ref 20 · internal anchor
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models cs.CV · 2025-10-12 · unverdicted · none · ref 13 · internal anchor
ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, outperforming standalone and two-stage baselines on diverse benchmarks.
Unlocking Zero-Shot Geospatial Reasoning via Indirect Rewards cs.CV · 2025-09-29 · unverdicted · none · ref 6 · internal anchor
Geo-R1 uses indirect proxy rewards from cross-view alignment with geolocation metadata to drive reinforcement learning, enabling zero-shot geospatial reasoning that transfers across 25+ tasks and sometimes exceeds supervised specialists.
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents cs.CV · 2025-09-29 · unverdicted · none · ref 17 · internal anchor
CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices cs.DC · 2025-09-25 · unverdicted · none · ref 9 · internal anchor
Nanomind decomposes LMMs into modular bricks mapped to heterogeneous accelerators with TABM zero-copy transfers, fused low-bit kernels, and a battery-aware scheduler, cutting energy 42.3% and enabling 18.8-hour runtime on a 2000 mAh battery for LLaVA-OneVision-Qwen2-0.5B.
LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA cs.CV · 2025-09-12 · unverdicted · none · ref 27 · internal anchor
LaV-CoT introduces a multi-stage visual CoT pipeline and GRPO training with language-consistency rewards, delivering up to 9.5% accuracy gains on multilingual VQA benchmarks over similar-sized open models.
Training-Free Multimodal Large Language Model Orchestration cs.CL · 2025-08-06 · unverdicted · none · ref 25 · 2 links · internal anchor
LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs cs.CV · 2025-07-29 · unverdicted · none · ref 18 · internal anchor
ReGATE introduces a teacher-student adaptive token elision method that reduces training tokens to 38% while matching or exceeding baseline accuracy on multimodal benchmarks.
Exploring the Secondary Risks of Large Language Models cs.LG · 2025-06-14 · unverdicted · none · ref 22 · internal anchor
Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning cs.AI · 2025-06-11 · unverdicted · none · ref 38 · internal anchor
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence cs.CV · 2025-05-29 · unverdicted · none · ref 6 · 2 links · internal anchor
Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction cs.CV · 2025-05-26 · unverdicted · none · ref 39 · internal anchor
VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.
MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models cs.CV · 2025-05-26 · unverdicted · none · ref 14 · internal anchor
MEBench is a new benchmark and data-generation pipeline that measures mutual exclusivity bias in VLMs, finding weak bias but some use of spatial context to resolve novel-object ambiguity.
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models cs.CV · 2025-05-22 · unverdicted · none · ref 36 · internal anchor
Multi-SpatialMLLM integrates depth perception, visual correspondence, and dynamic perception into MLLMs via a 27M-sample MultiSPA dataset and benchmark, yielding gains on multi-frame spatial tasks.
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning cs.LG · 2025-05-22 · conditional · none · ref 3 · internal anchor
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs cs.CV · 2025-05-21 · unverdicted · none · ref 21 · internal anchor
Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval cs.CV · 2025-05-21 · unverdicted · none · ref 12 · internal anchor
LiveVLM introduces VSB and PaR to compress and retrieve KV cache in streaming video LLMs, enabling LLaVA-OneVision to reach SOTA accuracy among training-free query-agnostic and training-based online models.
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation cs.CV · 2025-05-08 · unverdicted · none · ref 35 · internal anchor
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.
Visual Compositional Tuning cs.CV · 2025-04-30 · unverdicted · none · ref 10 · internal anchor
COMPACT synthesizes compositional visual instruction data to reduce VIT training data by 90% while achieving 100.2% of full performance across eight multimodal benchmarks.
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners cs.AI · 2025-04-19 · unverdicted · none · ref 24 · internal anchor
InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model cs.CV · 2025-04-10 · unverdicted · none · ref 25 · internal anchor
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning? cs.CV · 2025-03-29 · unverdicted · none · ref 70 · internal anchor
Presents YesBut (V2) benchmark and shows state-of-the-art VLMs significantly underperform humans on tasks requiring comparative reasoning for contradictory humor in comics.
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles cs.CV · 2025-03-21 · conditional · none · ref 36 · internal anchor
Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems cs.CV · 2025-03-19 · unverdicted · none · ref 30 · internal anchor
MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.
FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO cs.CV · 2025-03-12 · unverdicted · none · ref 8 · internal anchor
FaVChat proposes hierarchical prompt-query guided visual features and Data-Efficient GRPO for efficient training, plus the FaVChat-170K dataset, claiming consistent outperformance over prior VLLMs on facial video tasks.
Visual-RFT: Visual Reinforcement Fine-Tuning cs.CV · 2025-03-03 · conditional · none · ref 13 · internal anchor
Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation cs.RO · 2025-02-19 · unverdicted · none · ref 21 · internal anchor
MapNav uses annotated semantic maps as memory for VLN agents, claiming SOTA results in simulation and real-world tests while promising code and data release.
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos cs.CV · 2025-01-07 · conditional · none · ref 47 · internal anchor
Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models cs.CV · 2025-01-06 · unverdicted · none · ref 18 · internal anchor
MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling cs.CV · 2024-12-31 · unverdicted · none · ref 25 · internal anchor
VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces cs.CV · 2024-12-18 · unverdicted · none · ref 40 · internal anchor
MLLMs achieve competitive but subhuman performance on the new VSI-Bench for visual-spatial intelligence from videos, with spatial reasoning as the main bottleneck and explicit cognitive map generation improving distance estimation.
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning cs.CV · 2024-12-18 · unverdicted · none · ref 296 · internal anchor
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction cs.CL · 2024-12-05 · conditional · none · ref 88 · internal anchor
Aguvis presents a pure vision-based framework for autonomous GUI agents using structured reasoning via inner monologue, a new multimodal dataset, and two-stage training to reach SOTA on offline and online benchmarks.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization cs.CL · 2024-11-15 · conditional · none · ref 45 · internal anchor
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding cs.CV · 2024-10-22 · unverdicted · none · ref 15 · internal anchor
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.
Pixtral 12B cs.CV · 2024-10-09 · unverdicted · none · ref 9 · internal anchor
Pixtral-12B is a 12B multimodal LLM with a custom vision encoder that ingests images at native resolution and aspect ratio, achieving leading benchmark results among open models while preserving text capabilities.
LLaVA-Video: Video Instruction Tuning With Synthetic Data cs.CV · 2024-10-03 · unverdicted · none · ref 158 · internal anchor
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
Emu3: Next-Token Prediction is All You Need cs.CV · 2024-09-27 · unverdicted · none · ref 44 · internal anchor
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
LongVILA: Scaling Long-Context Visual Language Models for Long Videos cs.CV · 2024-08-19 · unverdicted · none · ref 19 · internal anchor
LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models cs.CV · 2026-05-03 · unverdicted · none · ref 22
VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditioned video model.
TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation cs.CV · 2026-04-21 · unverdicted · none · ref 13
TS-Attn dynamically separates and rearranges attention in existing text-to-video models to improve temporal consistency and prompt adherence for videos with multiple sequential actions.
Weakly-Supervised Referring Video Object Segmentation through Text Supervision cs.CV · 2026-04-20 · unverdicted · none · ref 19
WSRVOS enables referring video object segmentation with text-only supervision by combining MLLM-based expression augmentation, multimodal feature interaction, pseudo-mask fusion, and temporal ranking constraints.
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology cs.AI · 2026-04-19 · unverdicted · none · ref 23
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models cs.CL · 2026-04-15 · unverdicted · none · ref 10
Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding cs.CV · 2026-04-15 · unverdicted · none · ref 32
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.

LLaVA-OneVision: Easy Visual Task Transfer

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer