Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin · 2023 · cs.CV · arXiv 2311.10122

Mixed citation behavior. Most common role is background (64%).

82 Pith papers citing it

Background 64% of classified citations

abstract

The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM. Code: https://github.com/PKU-YuanGroup/Video-LLaVA

citation-role summary

background 8 · baseline 3 · method 2 · other 1

citation-polarity summary

background 9 · baseline 3 · unclear 1 · use method 1
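
The page does not state how the "64% of classified citations" figure is derived. As a rough check, assuming a label's share is simply its count divided by the total of all counts in a summary (an assumption, not a documented formula), the two summaries above can be tallied as follows. The variable and function names are illustrative only.

```python
from collections import Counter

# Counts copied from the two summaries on this page.
role_counts = Counter({"background": 8, "baseline": 3, "method": 2, "other": 1})
polarity_counts = Counter({"background": 9, "baseline": 3, "unclear": 1, "use method": 1})


def shares(counts):
    """Return each label's share of the total as a whole-number percentage.

    Assumption: share = count / sum of all counts in the same summary.
    """
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

print(role_shares)      # background comes out at 57
print(polarity_shares)  # background comes out at 64
```

Under that assumption, both summaries cover 14 classified citations. The polarity counts (9 of 14) reproduce the 64% figure, while the role counts (8 of 14) give 57%.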

claims ledger

baseline "×" indicates the model is incapable of performing the task.
Columns: Model; Understanding (MMBV, MMBI, MMMU, MM-Vet); Image Generation (GenEval, WISE); Image Editing (Overall, Add, Adjust, Extract, Replace, Remove, Hybrid)
Image Understanding
LLaVA-1.5 [25]: × 36.4 67.8 36.3; × ×; × × × × × × ×
LLaVA-NeXT [57]: × 79.3 51.1 57.4; × ×; × × × × × × ×
Image & Video Understanding
Video-LLaVA [22]: 1.05 60.9 32.8 32.0; × ×; × × × × × × ×
LLaVA-OV [17]: 0.94 80.8 48.8 57.5; × ×; × × × × × × ×
Text-to-Image Generation
SDXL [34]: × × × ×; 0
method backbone [13], as illustrated in Figure 2. The edit instruction and the original image are jointly fed into VLM, while the image is processed simultaneously by the vision encoder. The hidden states of VLM and the visual feature of the vision encoder are separately projected by MLPs and then concatenated, forming the text-branch input to DiT. Training proceeds in two stages [41], first optimizing MLPs and then jointly fine-tuning FLUX and MLPs. 3.4 Dataset Statistics ImgEdit comprises 1.2 million
background SigLIP outperforms the other two vision encoders, especially in fine-grained understanding tasks involving texts. Based on this ablation study, we choose the pretrained SigLIP as our base vision encoder, and then adapt it to taking dynamic resolutions as inputs. 5 Related Work Multimodal LLMs for Native Video Understanding. Early video MLLMs primarily relied on sparsely sampled frames and simple connectors, such as MLPs [12, 13, 139], discrete visual tokenizers [140], and Q-formers [141, 142], t
background SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv, abs/2510.08531, 2025. [31] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, volume 202, pages 19730-19742, 2023. [32] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment
method [74] Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. StoryGAN: A sequential conditional GAN for story visualization, 2019. [75] Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan Yuille. Super-CLEVR: A virtual benchmark to diagnose domain robustness in visual reasoning, 2023. [76] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united
background language models. In ECCV, 2024. [37] Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, et al. Mapsparse: Accelerating pre-filling for long-context visual language models via modality-aware permutation sparse attention. In ICLR 2025 Workshop on Foundation Models in the Wild. [38] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alig

representative citing papers

RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection

cs.CV · 2025-12-11 · conditional · novelty 8.0

RobustSora benchmark demonstrates that current AI video detectors rely heavily on visible watermarks, with average accuracy drops of 6.6 percentage points when watermarks are erased and increased false alarms when watermarks are spoofed onto real videos.

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.

Balancing Image Compression and Generation with Bootstrapped Tokenization

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.

EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models

cs.CV · 2026-06-01 · conditional · novelty 7.0

EvoCut is a training-free visual token compression technique that identifies important tokens via multi-layer evolution deviation, retaining 11.1% tokens with 94.4% average performance preserved on LLaVA-1.5-7B.

AffectVerse: Emotional World Models for Multimodal Affective Computing

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.

WirelessSenseLLM: Zero-Shot Human Activity Understanding by Bridging Wireless Signals and Human Language

cs.NI · 2026-05-13 · unverdicted · novelty 7.0

WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.

Grounding Video Reasoning in Physical Signals

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.

Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

cs.CV · 2026-01-22 · unverdicted · novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.

LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning

cs.CV · 2026-01-21 · conditional · novelty 7.0

LFS learns to select temporally diverse and event-aware frames for video captioning by using direct feedback from frozen video-LLMs, yielding gains up to 2% on VDC and over 4% on the new ICH-CC benchmark.

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

cs.CV · 2026-01-04 · unverdicted · novelty 7.0

GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.

StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

cs.CV · 2025-12-01 · unverdicted · novelty 7.0

StreamGaze is a new benchmark and QA generation pipeline that measures how well MLLMs leverage gaze trajectories for temporal reasoning and proactive intention prediction in streaming egocentric videos.

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

cs.CV · 2025-11-27 · unverdicted · novelty 7.0

Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.

TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?

cs.CV · 2025-09-19 · unverdicted · novelty 7.0

Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated pipelines and human verification.

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

cs.CV · 2025-02-06 · unverdicted · novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

cs.CV · 2024-10-22 · accept · novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

MLVU: Benchmarking Multi-task Long Video Understanding

cs.CV · 2024-06-06 · conditional · novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning

cs.CV · 2026-06-19 · unverdicted · novelty 6.0

HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.

Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

LyraV uses FDTC and SToP for per-frame incremental decoding to reach 98.29% video synchrony at 3.89 FPS while preserving general understanding.

citing papers explorer

Showing 45 of 45 citing papers after filters.

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs cs.CV · 2026-06-30 · unverdicted · none · ref 44 · internal anchor
Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.
Balancing Image Compression and Generation with Bootstrapped Tokenization cs.LG · 2026-06-04 · unverdicted · none · ref 41 · internal anchor
SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.
EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models cs.CV · 2026-06-01 · conditional · none · ref 25 · internal anchor
EvoCut is a training-free visual token compression technique that identifies important tokens via multi-layer evolution deviation, retaining 11.1% tokens with 94.4% average performance preserved on LLaVA-1.5-7B.
AffectVerse: Emotional World Models for Multimodal Affective Computing cs.CV · 2026-05-19 · unverdicted · none · ref 25 · internal anchor
AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 23 · internal anchor
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
WirelessSenseLLM: Zero-Shot Human Activity Understanding by Bridging Wireless Signals and Human Language cs.NI · 2026-05-13 · unverdicted · none · ref 16 · internal anchor
WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding cs.CV · 2026-05-13 · unverdicted · none · ref 56 · internal anchor
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding cs.CV · 2026-05-08 · unverdicted · none · ref 28 · internal anchor
EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models cs.CV · 2026-04-27 · unverdicted · none · ref 13 · internal anchor
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.
Grounding Video Reasoning in Physical Signals cs.CV · 2026-04-23 · unverdicted · none · ref 15 · internal anchor
A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.
Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic cs.AI · 2026-04-21 · unverdicted · none · ref 12 · internal anchor
SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration cs.CV · 2026-04-06 · unverdicted · none · ref 17 · internal anchor
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning cs.CV · 2026-01-22 · unverdicted · none · ref 19 · internal anchor
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning cs.CV · 2026-01-21 · conditional · none · ref 16 · internal anchor
LFS learns to select temporally diverse and event-aware frames for video captioning by using direct feedback from frozen video-LLMs, yielding gains up to 2% on VDC and over 4% on the new ICH-CC benchmark.
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation cs.CV · 2026-01-04 · unverdicted · none · ref 38 · internal anchor
GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.
HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning cs.CV · 2026-06-19 · unverdicted · none · ref 49 · internal anchor
HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.
Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding cs.CV · 2026-06-05 · unverdicted · none · ref 68 · internal anchor
LyraV uses FDTC and SToP for per-frame incremental decoding to reach 98.29% video synchrony at 3.89 FPS while preserving general understanding.
MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models cs.CV · 2026-06-05 · unverdicted · none · ref 15 · internal anchor
MotionEnhancer distills motion priors from video diffusion models into VLMs via parameter-free attention alignment modules to improve motion-level video understanding.
When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics cs.CV · 2026-06-02 · unverdicted · none · ref 16 · internal anchor
STS is a two-stage pruning framework that decouples structural diversity via repulsion sampling from semantic filtering via cross-attention to reduce redundancy in visual tokens for VLMs.
An Analysis Focused on Womens Safety: Can VAD Models Be Enhanced by a Multi-modal Dataset? cs.CV · 2026-05-25 · unverdicted · none · ref 11 · internal anchor
Introduces ExtrAnom, a new multi-modal VAD dataset with 1001 videos and four textual descriptions per video, focused on women-centric anomalies like stalking and chain snatching.
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly cs.CV · 2026-05-20 · unverdicted · none · ref 14 · internal anchor
Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models cs.CV · 2026-05-20 · conditional · none · ref 26 · internal anchor
SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
Dynamic Model Merging Made Slim cs.LG · 2026-05-17 · unverdicted · none · ref 52 · internal anchor
DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.
OProver: A Unified Framework for Agentic Formal Theorem Proving cs.CL · 2026-05-17 · unverdicted · none · ref 7 · internal anchor
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs cs.CV · 2026-05-10 · unverdicted · none · ref 32 · internal anchor
SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization cs.CV · 2026-05-04 · unverdicted · none · ref 23 · internal anchor
WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection cs.CV · 2026-04-27 · unverdicted · none · ref 30 · internal anchor
ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels cs.LG · 2026-04-17 · unverdicted · none · ref 10 · internal anchor
UniCon unifies contrastive alignment across encoders and alignment types using kernels to enable exact closed-form updates instead of stochastic optimization.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding cs.CV · 2026-04-15 · unverdicted · none · ref 38 · internal anchor
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.
ViLL-E: Video LLM Embeddings for Retrieval cs.CV · 2026-04-13 · unverdicted · none · ref 28 · internal anchor
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning cs.AI · 2026-04-13 · unverdicted · none · ref 22 · internal anchor
CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.
Spatio-Temporal Grounding of Large Language Models from Perception Streams cs.RO · 2026-04-08 · unverdicted · none · ref 17 · internal anchor
FESTS uses Spatial Regular Expressions compiled from queries to generate 27k training tuples that raise a 3B-parameter LLM's frame-level F1 on spatio-temporal video reasoning from 48.5% to 87.5%, matching GPT-4.1 while staying far smaller.
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding cs.CV · 2026-04-03 · unverdicted · none · ref 6 · internal anchor
ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling cs.CV · 2026-03-24 · unverdicted · none · ref 29 · internal anchor
ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy cs.AI · 2026-03-02 · unverdicted · none · ref 35 · internal anchor
Nano-EmoX is a compact 2.2B multimodal model that unifies six core affective tasks across perception, understanding, and interaction levels via a curriculum framework, achieving competitive benchmark performance.
Structure Over Scale: Learning Visual Reasoning from Pedagogical Video cs.CV · 2026-01-30 · unverdicted · none · ref 5 · internal anchor
Fine-tuning VLMs on 10K QA pairs from pedagogical children's videos produces consistent gains on NExT-QA, Video-MME, and MotionBench, indicating that explicit structure can substitute for data scale.
Question-Aware Evidence Ledgers for Video Relational Reasoning cs.CV · 2026-06-01 · unverdicted · none · ref 4 · internal anchor
A pipeline using question-aware evidence ledgers with GPT-5.5 achieves 92.95% overall and 93.79% macro accuracy on the VRR-QA video relational reasoning challenge.
Vision-language Models for Driver Monitoring Systems: A Driver Activity Description Dataset cs.CV · 2026-06-01 · unverdicted · none · ref 16 · internal anchor
Introduces Drive&Act description dataset of fine-grained driver activity text and reports fine-tuned VLM reaching 76 ACCR on DMD dataset versus 66 for zero-shot baseline.
ADMFormer: An Adaptive-Decomposition Transformer with Time-Varying Masked Spatial Attention for Traffic Forecasting cs.AI · 2026-05-25 · unverdicted · none · ref 12 · internal anchor
ADMFormer decouples traffic into regular and fluctuating components with time-node gating, processes them in dual temporal branches, and uses time-varying masked spatial attention to reach SOTA on four datasets.
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering cs.CV · 2026-05-21 · conditional · none · ref 22 · internal anchor
MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models cs.CV · 2026-03-02 · unverdicted · none · ref 30 · internal anchor
AOT reduces visual tokens in VLLMs via intra-frame and inter-frame anchors with local-global optimal transport, delivering competitive benchmark performance and efficiency gains in a training-free way.
UNIVID: Unified Vision-Language Model for Video Moderation cs.MM · 2026-06-04 · unverdicted · none · ref 36 · internal anchor
UNIVID generates policy-aware captions for video moderation, reducing violation leakage by 42.7% and overkill rate by 37.0% while replacing over 1,000 policy-specific models with a single backbone.
CREST: Curvature-Regulated Event-Centric Sampling for Efficient Long-Video Understanding cs.CV · 2026-05-09 · unverdicted · none · ref 6 · internal anchor
CREST uses local curvature of query-frame relevance over time to select informative frames, outperforming a lightweight baseline and approaching a costly pipeline at far lower preprocessing cost on long-video benchmarks.
Toward Native Multimodal Modeling: A Roadmap cs.CV · 2026-05-25 · unverdicted · none · ref 26 · internal anchor
A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark cs.CV · 2026-03-28 · unreviewed · ref 26 · internal anchor
