In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Fu, C · 2025

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning, and MCQA benchmarks.

OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

cs.HC · 2026-04-03 · unverdicted · novelty 7.0

OmniGUI is the first step-level benchmark supplying interleaved image, audio, and video inputs across 709 expert episodes in 29 smartphone apps to evaluate multimodal GUI agents.

Motion-o: Trajectory-Grounded Video Reasoning

cs.CV · 2026-03-19 · conditional · novelty 7.0

Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.

Exploring High-Order Self-Similarity for Video Understanding

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.

OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.

citing papers explorer

Showing 5 of 5 citing papers.

Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs · cs.LG · 2026-04-22 · unverdicted · none · ref 12
OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments · cs.HC · 2026-04-03 · unverdicted · none · ref 6
Motion-o: Trajectory-Grounded Video Reasoning · cs.CV · 2026-03-19 · conditional · none · ref 6
Exploring High-Order Self-Similarity for Video Understanding · cs.CV · 2026-04-22 · unverdicted · none · ref 20
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering · cs.CV · 2026-04-09 · unverdicted · none · ref 7
