VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207–1216
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 1
citation-polarity summary
polarities: background 1
representative citing papers
SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Transformer MLLMs.
Introduces a benchmarking suite for compound AI applications to support cross-stack performance, cost, and resource analysis for hardware-software co-design.
Survey summarizing video-language understanding tasks, challenges, and methods from architecture, training, and data perspectives, including performance comparisons and future directions.
citing papers explorer
- Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
  VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.
- Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
  SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
- GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
  GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
- ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
  ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Transformer MLLMs.
- Benchmarking Compound AI Applications for Hardware-Software Co-Design
  Introduces a benchmarking suite for compound AI applications to support cross-stack performance, cost, and resource analysis for hardware-software co-design.
- Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
  Survey summarizing video-language understanding tasks, challenges, and methods from architecture, training, and data perspectives, including performance comparisons and future directions.
- On Efficient Variants of Segment Anything Model: A Survey