In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Lin, B · 2024

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning, and MCQA benchmarks.

SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

SurgOnAir introduces a streaming vision-language model trained on a hierarchical surgical dataset to generate real-time, multi-level narrations with explicit transition tokens.

Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

cs.CV · 2026-03-18 · unverdicted · novelty 6.0

Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at higher speed.

Why Do Vision Language Models Struggle To Recognize Human Emotions?

cs.CV · 2026-04-16 · unverdicted · novelty 5.0

VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using language summaries of skipped frames is proposed to mitigate this.

citing papers explorer

Showing 4 of 4 citing papers.

Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

cs.LG · 2026-04-22 · unverdicted · none · ref 25

SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

cs.CV · 2026-05-20 · unverdicted · none · ref 11

Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

cs.CV · 2026-03-18 · unverdicted · none · ref 38

Why Do Vision Language Models Struggle To Recognize Human Emotions?

cs.CV · 2026-04-16 · unverdicted · none · ref 33
