In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207–1216

COIN: A large-scale dataset for comprehensive instructional video analysis · 2021 · arXiv 2312.17432

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

cs.CL · 2025-06-08 · unverdicted · novelty 7.0

VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

cs.CV · 2026-05-20 · conditional · novelty 6.0

SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.

GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.

ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Transformer MLLMs.

Benchmarking Compound AI Applications for Hardware-Software Co-Design

cs.DC · 2026-03-04 · unverdicted · novelty 6.0

Introduces a benchmarking suite for compound AI applications to support cross-stack performance, cost, and resource analysis for hardware-software co-design.

Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

cs.CL · 2024-06-09 · unverdicted · novelty 2.0

Survey summarizing video-language understanding tasks, challenges, and methods from architecture, training, and data perspectives, including performance comparisons and future directions.

On Efficient Variants of Segment Anything Model: A Survey

cs.CV · 2024-10-07

citing papers explorer

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
cs.CL · 2025-06-08 · unverdicted · none · ref 38

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
cs.CV · 2026-05-20 · conditional · none · ref 49

GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
cs.CV · 2026-05-13 · unverdicted · none · ref 15

ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
cs.CV · 2026-04-09 · unverdicted · none · ref 57

Benchmarking Compound AI Applications for Hardware-Software Co-Design
cs.DC · 2026-03-04 · unverdicted · none · ref 30

Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
cs.CL · 2024-06-09 · unverdicted · none · ref 13

On Efficient Variants of Segment Anything Model: A Survey
cs.CV · 2024-10-07 · unreviewed · ref 18
