STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
One token to seg them all: Language instructed reasoning seg- mentation in videos.Advances in Neural Information Pro- cessing Systems, 37:6833–6859
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
WSRVOS enables referring video object segmentation with text-only supervision by combining MLLM-based expression augmentation, multimodal feature interaction, pseudo-mask fusion, and temporal ranking constraints.
citing papers explorer
-
STORM: End-to-End Referring Multi-Object Tracking in Videos
STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
-
Weakly-Supervised Referring Video Object Segmentation through Text Supervision
WSRVOS enables referring video object segmentation with text-only supervision by combining MLLM-based expression augmentation, multimodal feature interaction, pseudo-mask fusion, and temporal ranking constraints.