arxiv: 2605.14908 · v1 · pith:TDXIDFFTnew · submitted 2026-05-14 · 💻 cs.CV

SteerSeg: Attention Steering for Reasoning Video Segmentation

Ali Cheraghian, Hamidreza Dastmalchi, Abdelwahed Khamis, Morteza Saberi, Aijun An, Lars Petersson

classification 💻 cs.CV

keywords attention, grounding, segmentation, steerseg, maps, prompts, reasoning, spatial


Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention-based grounding and proposes to steer attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. The soft prompts reshape the attention distribution to produce more spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects by guiding attention toward the correct instance. The resulting attention maps are converted into point prompts across keyframes to guide a segmentation model, while candidate tracklets are ranked and selected using correlation-based scoring. Our approach freezes the LVLM and segmentation model parameters and learns only a small set of soft prompts, preserving the model's pretrained reasoning capabilities while significantly improving grounding. Despite being trained only on Ref-YouTube-VOS, SteerSeg generalizes well across diverse benchmarks, significantly improving the spatial grounding capability of LVLMs. Project page: https://steerseg.github.io
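The step in which attention maps are converted into point prompts can be illustrated with a small sketch. The abstract does not specify how points are chosen, so everything below is an assumption made for illustration: the function name `attention_to_points`, the greedy peak selection, and the `k` and `min_dist` parameters are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

def attention_to_points(
    attn: List[List[float]], k: int = 2, min_dist: int = 2
) -> List[Tuple[int, int]]:
    """Pick up to k peak locations (x, y) from a 2D attention map.

    Hypothetical selection rule (not from the paper): repeatedly take the
    highest-valued cell that is at least `min_dist` cells (Chebyshev
    distance) away from every point already chosen.
    """
    if k <= 0:
        return []
    # Flatten to (value, row, col) and sort by descending attention value.
    cells = [(v, y, x) for y, row in enumerate(attn) for x, v in enumerate(row)]
    cells.sort(key=lambda c: -c[0])  # stable sort: ties keep row-major order
    points: List[Tuple[int, int]] = []
    for _, y, x in cells:
        far_enough = all(
            max(abs(x - px), abs(y - py)) >= min_dist for px, py in points
        )
        if far_enough:
            points.append((x, y))
            if len(points) == k:
                break
    return points

# Toy 4x4 attention map for one keyframe (made-up values).
demo_attn = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.3, 0.5],
]
demo_points = attention_to_points(demo_attn, k=2, min_dist=2)
```

In a pipeline of the kind the abstract describes, points like `demo_points` would be computed per keyframe and passed to a promptable segmentation model. A more spatially concentrated attention map yields peaks that sit on the target object, which is why sharpening the map at its source matters for this step.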


