Event-Aware Instructed Assistant for Referring Video Segmentation

Henghui Ding; Jinyu Liu; Shuting He; Yu-Gang Jiang

arxiv: 2606.26994 · v1 · pith:M7OWQRN6new · submitted 2026-06-25 · 💻 cs.CV · cs.AI

Jinyu Liu, Henghui Ding, Shuting He, Yu-Gang Jiang

Pith reviewed 2026-06-26 05:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords referring video segmentation · event queries · video event decomposition · object tracking · vision-language models · multimodal segmentation · hierarchical video understanding

The pith

Text-guided event queries partition videos into simple segments to improve referring video segmentation accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current referring video segmentation methods fail because they process an entire video as one complex event, leading to confusion when matching text references. Instead, it proposes decomposing the video into simpler events using learnable queries guided by the input text. This creates a step-by-step hierarchical understanding where each event is handled separately. The model also combines object-level and pixel-level features to maintain tracking across long sequences. If correct, this would make segmentation more reliable when language describes compound actions spread over time.

Core claim

EVIS utilizes text-guided Event Queries to partition a video into simple events, extracting event-aware visual-text features to achieve a hierarchical understanding of the video. Object-Pixel-Hybrid Learning enables the model to track targets in long-term videos by integrating fine-grained pixel features with prior object queries, resulting in stronger performance on referring video segmentation benchmarks.

What carries the argument

Text-guided Event Queries that partition the video into events for sequential processing, paired with Object-Pixel-Hybrid Learning to merge pixel and object information.
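As a rough illustration of how such a partition might work, the sketch below assigns each object query to its top-k most similar event queries by cosine similarity, which is how the Figure 5 and Figure 10 captions describe the selection and scoring. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names, the toy 2-D vectors, and the hard top-k grouping are all hypothetical, and in the paper the queries are learned embeddings inside the EAFM module.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_to_events(object_queries, event_queries, k=1):
    """For each object query, return the indices of its k best-matching event queries."""
    assignments = []
    for q in object_queries:
        scores = [cosine(q, e) for e in event_queries]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        assignments.append(ranked[:k])
    return assignments


def group_by_event(assignments, num_events):
    """Invert the assignment: list the object-query indices that fall into each event."""
    groups = [[] for _ in range(num_events)]
    for obj_idx, events in enumerate(assignments):
        for ev in events:
            groups[ev].append(obj_idx)
    return groups


# Toy data (hypothetical): four object queries, two event queries.
object_queries = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
event_queries = [[1.0, 0.0], [0.0, 1.0]]

assignments = assign_to_events(object_queries, event_queries, k=1)
groups = group_by_event(assignments, len(event_queries))
print(groups)  # [[0, 1], [2, 3]]
```

With k=1 each object query lands in exactly one event; a larger k lets a query contribute to several events, which is the knob the Figure 7 ablation varies.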

If this is right

Video content is understood event by event rather than all at once, lowering the chance of hallucinations.
Hierarchical event-aware features improve matching between text references and visual content.
Object-Pixel-Hybrid Learning supports consistent target tracking across extended video durations.
Results improve across five public referring video segmentation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The event decomposition approach could be tested on related tasks such as video question answering where references span multiple actions.
Performance may drop on videos whose events lack strong alignment with the accompanying text descriptions.

Load-bearing premise

Natural language expressions divide a video into distinct text-related segments each representing a separate event.

What would settle it

A test set of videos where referring expressions cross event boundaries without clear divisions, on which the event-query model shows no accuracy gain over standard single-event baselines.
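A comparison of this kind would presumably be scored with the metrics the paper reports: region similarity J, contour accuracy F, and their average J&F. The sketch below is a minimal illustration, assuming J is computed as intersection-over-union on binary masks and that F is supplied by a separate boundary-matching routine not shown here. The function names, the flat-list mask format, and the empty-mask convention are assumptions for illustration.

```python
def region_similarity(pred_mask, gt_mask):
    """J: intersection-over-union of two binary masks given as flat lists of 0/1.

    Assumption: two empty masks count as a perfect match (score 1.0).
    """
    inter = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    union = sum(1 for p, g in zip(pred_mask, gt_mask) if p or g)
    return 1.0 if union == 0 else inter / union


def j_and_f(j_score, f_score):
    """Combined score: the plain average of J and F."""
    return (j_score + f_score) / 2.0


def mean_gain(candidate_scores, baseline_scores):
    """Average per-video difference between a candidate model and a baseline."""
    diffs = [c - b for c, b in zip(candidate_scores, baseline_scores)]
    return sum(diffs) / len(diffs)


# Toy masks (hypothetical): one of two predicted pixels overlaps the ground truth.
j = region_similarity([1, 1, 0, 0], [1, 0, 0, 0])
print(j)  # 0.5
```

A mean gain near zero on such a cross-boundary test set would be the outcome the paragraph above describes.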

Figures

Figures reproduced from arXiv: 2606.26994 by Henghui Ding, Jinyu Liu, Shuting He, Yu-Gang Jiang.

**Figure 2.** Comparison of previous methods and ours. a) Previous methods often struggle to directly understand the complex content within a single video. b) …

**Figure 3.** Overview of the proposed Event-Aware Video Instructed Segmentation Assistant (EVIS). EVIS employs the event query and Event-Aware Frame Merging Module (EAFM) to learn hierarchical video features in an event-by-event manner. First, we decouple the visual tokens to pixel tokens $V_p$ and object tokens $V_o$, where tokens are split along the temporal dimension. Using object queries $Q_o$ generated by the detector, EAF…

**Figure 4.** Event-Aware Frame Merging Module (EAFM). The EAFM module effectively comprehends various objects in an event-by-event manner, guiding the MLLMs to capture event-intra and event-inter information in a video.

… Merging Module. First, the frame merging block groups object queries into distinct simple events. Subsequently, an event-intra attention mechanism is applied to capture fine-grained spatial-temporal info…

**Figure 5.** Frame Merging Block. We compute the event to assign an object query to by selecting the top-k Event Queries.

… (BCE) loss and DICE loss [49], weighted by their respective coefficients $\lambda_{bce}$ and $\lambda_{dice}$. Given the ground truth $(\hat{y}_{txt}, \hat{m})$ and model predictions $(y_{txt}, m)$, $L_{txt}$ and $L_m$ are defined as:

$$L_{txt} = \mathrm{CE}(\hat{y}_{txt}, y_{txt}), \tag{9}$$

$$L_m = \lambda_{bce}\,\mathrm{BCE}(\hat{m}, m) + \lambda_{dice}\,\mathrm{DICE}(\hat{m}, m), \tag{10}$$

where $\hat{y}_{txt}$, $y_{txt}$ correspond to textual…

**Figure 6.** Model architecture of the re-implemented EVIS by removing LLM.

**Figure 7.** Ablation on the effect of top-k event query selection on MeViS [1] dataset. Metrics are region similarity J, contour accuracy F and their combined average score J&F, respectively.

… all evaluation metrics. On Ref-YouTube-VOS, EVIS with the InternVL-1B achieves a J&F score of 64.4%, surpassing VideoLISA-3.8B [10] by 0.7%. On the Ref-DAVIS17 dataset, EVIS achieves a 68.8% in J&F, surpassing the state-of-th…

**Figure 8.** Example success and failure cases of EVIS. The black font denotes text that is not visible to the model.

**Figure 9.** Qualitative results of EVIS on ReasonSeg [7] Dataset.

**Figure 10.** Visualization for event decomposition. Scores are calculated by cosine similarity between global object and event queries in EAFM.

**Figure 11.** Qualitative comparisons with VideoLISA [10] and DsHmp [2].

**Figure 12.** Visualization of global queries learned w/o training (left) and w/ …
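The mask loss quoted in the Figure 5 excerpt, a weighted sum of BCE and DICE terms, can be sketched as follows. This is a minimal pure-Python illustration rather than the authors' training code: the coefficient defaults, the smoothing constant, the clamping epsilon, and the flat-list mask representation are all assumptions, since the text available here does not give the values of λbce and λdice.

```python
import math


def bce(target, pred, eps=1e-7):
    """Mean binary cross-entropy between target labels (0/1) and predicted probabilities."""
    total = 0.0
    for t, p in zip(target, pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(target)


def dice_loss(target, pred, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum(target) + sum(pred) + smooth)."""
    inter = sum(t * p for t, p in zip(target, pred))
    denom = sum(target) + sum(pred)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def mask_loss(target, pred, lambda_bce=1.0, lambda_dice=1.0):
    """Weighted sum of BCE and Dice, mirroring the form of Eq. (10).

    lambda_bce and lambda_dice default to 1.0 here purely as placeholders.
    """
    return lambda_bce * bce(target, pred) + lambda_dice * dice_loss(target, pred)


# Toy example (hypothetical): a 4-pixel mask and an uninformative prediction.
gt = [1, 0, 1, 0]
uncertain = [0.5, 0.5, 0.5, 0.5]
print(mask_loss(gt, uncertain))
```

Pairing the two terms is a common choice because BCE scores each pixel independently while Dice scores the overlap of the whole mask, so small targets are not swamped by background pixels.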

Original abstract

Existing referring video segmentation methods often treat a video as a single event consisting of multiple images, overlooking the fact that a video typically contains multiple distinct events. Under such a mechanism, the model needs to directly understand all the complex content in the video and text, which can easily lead to confusion and hallucinations. To address this issue, we propose to decompose a video to a set of simple events by learnable Event Query, and understand complex video content in an event-by-event, easy-to-understand manner. This is based on the observation that natural language expressions often divide a video into distinct, text-related segments, each representing a separate event within a compound event. We introduce EVIS, an Event-Aware Video Instructed Segmentation Assistant, which utilizes text-guided Event Queries to partition a video into simple events, extracting event-aware visual-text features to achieve a hierarchical understanding of the video. Additionally, we propose Object-Pixel-Hybrid Learning, which enables the MLLMs to track targets in long-term videos by integrating fine-grained pixel features with prior object queries. Extensive experimental results on 5 public benchmarks demonstrate EVIS's strong performance in addressing the referring video segmentation task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EVIS breaks videos into text-guided events for referring segmentation, but the abstract gives almost no technical detail to judge the gains.

The paper's main move is to stop treating a whole video as one event and instead use learnable, text-guided Event Queries to split it into simpler segments, then apply Object-Pixel-Hybrid Learning so the model can keep track of targets across longer clips. That decomposition is presented as a direct response to confusion and hallucination when models try to parse complex video-text pairs at once.

The idea itself is straightforward and targets a real pain point in referring video segmentation. If the event split actually reduces the load on the model, it could be a useful incremental step for MLLM-based approaches. The hybrid learning part also looks like a practical way to combine coarse object queries with fine pixel features for long-term tracking.

The soft spot is that almost everything rests on the abstract's performance claim. There are no equations, no architecture diagram, no ablation numbers, and no dataset specifics here, so it is impossible to tell whether the event queries are meaningfully different from prior query mechanisms or whether the reported gains on the five benchmarks are large enough to matter. The central assumption—that natural language expressions reliably carve videos into distinct, text-related events—also goes untested in the provided text.

This work is aimed at people already working on referring video segmentation or video MLLMs. A reader in that narrow subfield might pick up the event-decomposition idea and try it, but only after seeing the actual experiments and comparisons.

I would send it to peer review so the method and results can be checked properly.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes EVIS, an Event-Aware Video Instructed Segmentation Assistant for referring video segmentation. It decomposes input videos into simple events via text-guided learnable Event Queries (motivated by the observation that natural language expressions divide videos into distinct text-related segments), extracts event-aware visual-text features for hierarchical understanding, and introduces Object-Pixel-Hybrid Learning to integrate pixel features with object queries for long-term target tracking. The abstract states that extensive experiments on five public benchmarks demonstrate strong performance.

Significance. If the central architectural claims hold and are validated by rigorous experiments, the event-decomposition approach could address a plausible limitation of treating entire videos as single events, potentially reducing hallucinations in complex referring segmentation scenarios. The Object-Pixel-Hybrid Learning component might offer a practical mechanism for long-video tracking. However, with only the abstract available and no equations, architecture diagrams, training details, ablation studies, or quantitative results, the significance cannot be assessed beyond the high-level motivation.

major comments (2)

Abstract: The central performance claim ('strong performance' on 5 public benchmarks) is stated without any supporting numbers, tables, baselines, or error analysis, rendering the claim unevaluable and load-bearing for the paper's contribution.
Abstract: No equations, pseudocode, or architectural details are supplied for the Event Query mechanism or Object-Pixel-Hybrid Learning, so it is impossible to determine whether these components are independent innovations or reduce to standard query-based attention with minor modifications.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments on our manuscript. The full paper contains the architectural details, equations, and experimental results referenced in the abstract. We address the two major comments point by point below.

Referee: Abstract: The central performance claim ('strong performance' on 5 public benchmarks) is stated without any supporting numbers, tables, baselines, or error analysis, rendering the claim unevaluable and load-bearing for the paper's contribution.

Authors: We agree that including concrete metrics would make the abstract's claim more immediately evaluable. In the revised manuscript we will update the abstract to report key quantitative results (e.g., mIoU gains on the five benchmarks versus recent baselines) while keeping the abstract concise. revision: yes

Referee: Abstract: No equations, pseudocode, or architectural details are supplied for the Event Query mechanism or Object-Pixel-Hybrid Learning, so it is impossible to determine whether these components are independent innovations or reduce to standard query-based attention with minor modifications.

Authors: The abstract is a high-level summary; the full manuscript supplies the requested details. Section 3.2 defines the text-guided Event Queries with the decomposition objective and associated equations. Section 3.3 presents the Object-Pixel-Hybrid Learning formulation, including the integration of pixel features with object queries, architecture diagrams, and ablations that isolate the contribution beyond standard query attention. These sections demonstrate the design choices motivated by the event-decomposition observation. revision: no

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes EVIS as an architectural innovation using learnable text-guided Event Queries to decompose videos and Object-Pixel-Hybrid Learning for tracking, motivated by the observational claim that natural language divides videos into events. No equations, parameter-fitting steps, or self-citations are present in the supplied text that would reduce any claimed prediction or result to a quantity defined by the authors' own prior inputs or fits. The performance claims rest on external benchmark evaluations rather than internal self-definition, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that language naturally segments videos into events and on the introduction of two new components (Event Query and hybrid learning) whose effectiveness is asserted via benchmark results whose details are unavailable.

free parameters (1)

Event Query parameters
Learnable parameters introduced to detect and represent events; their values are determined during training on video-text data.

axioms (1)

Domain assumption: Natural language expressions often divide a video into distinct, text-related segments, each representing a separate event within a compound event.
Explicitly invoked in the abstract as the observational foundation for the Event Query design.

invented entities (2)

Event Query (no independent evidence)
purpose: To partition a video into simple events in a text-guided manner.
New component introduced by the paper; no independent evidence outside the proposed model is provided in the abstract.
Object-Pixel-Hybrid Learning (no independent evidence)
purpose: To integrate fine-grained pixel features with object queries for long-term tracking.
New learning strategy introduced by the paper; no independent evidence outside the proposed model is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5740 in / 1465 out tokens · 40172 ms · 2026-06-26T05:11:19.636180+00:00 · methodology

Reference graph

Works this paper leans on

70 extracted references · 9 linked inside Pith

[1] H. Ding, C. Liu, S. He, X. Jiang, and C. C. Loy, “MeViS: A large-scale benchmark for video segmentation with motion expressions,” in Int. Conf. Comput. Vis., 2023, pp. 2694–2703.
[2] S. He and H. Ding, “Decoupling static and hierarchical motion perception for referring video segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 13 332–13 341.
[3] S. Seo, J. Lee, and B. Han, “URVOS: unified referring video object segmentation network with a large-scale benchmark,” in Eur. Conf. Comput. Vis., 2020, pp. 208–223.
[4] A. Khoreva, A. Rohrbach, and B. Schiele, “Video object segmentation with language referring expressions,” in ACCV, 2018, pp. 123–141.
[5] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, B. Li, P. Luo, T. Lu, Y. Qiao, and J. Dai, “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” arXiv preprint arXiv:2312.14238, 2023.
[6] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” in Int. Conf. Learn. Represent., 2024, pp. 18 378–18 394.
[7] X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia, “Lisa: Reasoning segmentation via large language model,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 9579–9589.
[8] Z. Xia, D. Han, Y. Han, X. Pan, S. Song, and G. Huang, “GSVA: generalized segmentation via multimodal large language models,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 3858–3869.
[9] C. Yan, H. Wang, S. Yan, X. Jiang, Y. Hu, G. Kang, W. Xie, and E. Gavves, “Visa: Reasoning video object segmentation via large language models,” in Eur. Conf. Comput. Vis., 2024, pp. 98–115.
[10] Z. Bai, T. He, H. Mei, P. Wang, Z. Gao, J. Chen, L. Liu, Z. Zhang, and M. Z. Shou, “One token to seg them all: Language instructed reasoning segmentation in videos,” in Adv. Neural Inform. Process. Syst., 2024, pp. 6833–6859.
[11] T. F. Shipley and J. M. Zacks, Understanding events: From perception to action. Oxford University Press, 2008.
[12] J. Wu, Y. Jiang, P. Sun, Z. Yuan, and P. Luo, “Language as queries for referring video object segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4964–4974.
[14] L. Yuan, M. Shi, Z. Yue, and Q. Chen, “Losh: Long-short text joint prediction network for referring video object segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 14 001–14 010.
[15] K. Koffka, Principles of Gestalt psychology. Routledge, 2013.
[16] R. C. Atkinson, “Human memory: A proposed system and its control processes,” The psychology of learning and motivation, vol. 2, 1968.
[17] K. Gavrilyuk, A. Ghodrati, Z. Li, and C. G. M. Snoek, “Actor and action video segmentation from a sentence,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 5958–5966.
[18] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg, “Mattnet: Modular attention network for referring expression comprehension,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 1307–1315.
[19] H. Ding, C. Liu, S. He, X. Jiang, P. H. S. Torr, and S. Bai, “MOSE: A new dataset for video object segmentation in complex scenes,” in Int. Conf. Comput. Vis., 2023, pp. 20 167–20 177.
[20] H. K. Cheng and A. G. Schwing, “Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model,” in Eur. Conf. Comput. Vis., 2022, pp. 640–658.
[21] L. Lin, X. Yu, Z. Pang, and Y.-X. Wang, “Glus: Global-local reasoning unified into a single large language model for video segmentation,” in CVPR, 2025, pp. 8658–8667.
[22] S. Gong, L. Zhang, Y. Zhuge, X. Jia, P. Zhang, and H. Lu, “Reinforcing video reasoning segmentation to think before it segments,” arXiv preprint arXiv:2508.11538, 2025.
[23] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024.
[24] H. Ding, C. Liu, S. Wang, and X. Jiang, “Vision-language transformer and query generation for referring segmentation,” in Int. Conf. Comput. Vis., 2021, pp. 16 301–16 310.
[25] R. Hu, M. Rohrbach, and T. Darrell, “Segmentation from natural language expressions,” in Eur. Conf. Comput. Vis., 2016, pp. 108–124.
[26] H. Ding, C. Liu, S. Wang, and X. Jiang, “VLT: vision-language transformer and query generation for referring segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 6, pp. 7900–7916, 2023.
[27] Z. Yang, J. Wang, Y. Tang, K. Chen, H. Zhao, and P. H. Torr, “Lavt: Language-aware vision transformer for referring image segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 18 155–18 165.
[28] C. Liu, X. Jiang, and H. Ding, “Instance-specific feature propagation for referring segmentation,” IEEE Trans. Multimedia, vol. 25, pp. 3657–3667, 2022.
[29] Y.-W. Chen, Y.-H. Tsai, T. Wang, Y.-Y. Lin, and M.-H. Yang, “Referring expression object segmentation with caption-aware consistency,” in Brit. Mach. Vis. Conf., 2019.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Adv. Neural Inform. Process. Syst., 2017.
[31] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
[32] C. Liu, H. Ding, and X. Jiang, “GRES: Generalized referring expression segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 23 592–23 601.
[33] S. Gong, Y. Zhuge, L. Zhang, Z. Yang, P. Zhang, and H. Lu, “The devil is in temporal token: High quality video reasoning segmentation,” in CVPR, 2025, pp. 29 183–29 192.
[34] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in Adv. Neural Inform. Process. Syst., 2023, pp. 34 892–34 916.
[35] J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML, 2023, pp. 19 730–19 742.
IEEE TRANSACTIONS ON IMAGE PROCESSING 13
[36] Y. Cao, P. Zhang, X. Dong, D. Lin, and J. Wang, “Dualfocus: Integrating macro and micro perspectives in multi-modal large language models,” arXiv preprint arXiv:2402.14767, 2024.
[37] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,” arXiv preprint arXiv:2308.12966, 2023.
[38] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 26 286–26 296.
[39] W. Hu, Y. Xu, Y. Li, W. Li, Z. Chen, and Z. Tu, “BLIVA: A simple multimodal LLM for better handling of text-rich visual questions,” in AAAI, 2024, pp. 2256–2264.
[40] P. Jin, R. Takanobu, W. Zhang, X. Cao, and L. Yuan, “Chat-univi: Unified visual representation empowers large language models with image and video understanding,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 13 700–13 710.
[41] X. Ma, Y. Zhou, H. Wang, C. Qin, B. Sun, C. Liu, and Y. Fu, “Image as set of points,” in Int. Conf. Learn. Represent., 2023.
[42] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C. Hsieh, “Dynamicvit: Efficient vision transformers with dynamic token sparsification,” in Adv. Neural Inform. Process. Syst., 2021, pp. 13 937–13 949.
[43] S. Ren, S. Chen, S. Li, X. Sun, and L. Hou, “TESTA: temporal-spatial token aggregation for long-form video-language understanding,” in EMNLP, 2023, pp. 932–947.
[44] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 1280–1289.
[45] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[46] J. Xu, S. D. Mello, S. Liu, W. Byeon, T. M. Breuel, J. Kautz, and X. Wang, “Groupvit: Semantic segmentation emerges from text supervision,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 18 113–18 123.
[47] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” in Int. Conf. Learn. Represent., 2017.
[48] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Adv. Neural Inform. Process. Syst., 2017.
[49] F. Milletari, N. Navab, and S. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3DV, 2016.
[50] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, “Towards understanding action recognition,” in Int. Conf. Comput. Vis., 2013, pp. 3192–3199.
[51] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. V. Gool, “The 2017 DAVIS challenge on video object segmentation,” arXiv preprint arXiv:1704.00675, 2017.
[52] Z. Chen, W. Wang, H. Tian, S. Ye, Z. Gao, E. Cui, W. Tong, K. Hu et al., “How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites,” arXiv preprint arXiv:2404.16821, 2024.
[53] A. Yang, B. Yang, B. Hui, B. Zheng et al., “Qwen2 technical report,” arXiv preprint arXiv:2407.10671, 2024.
[54] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Int. Conf. Learn. Represent., 2019.
[55] Z. Ding, T. Hui, J. Huang, X. Wei, J. Han, and S. Liu, “Language-bridged spatial-temporal interaction for referring video object segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4954–4963.
[56] D. Wu, X. Dong, L. Shao, and J. Shen, “Multi-level representation learning with semantic alignment for referring video object segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4986–4995.
[57] M. Han, Y. Wang, Z. Li, L. Yao, X. Chang, and Y. Qiao, “HTML: hybrid temporal-scale multimodal learning framework for referring video object segmentation,” in Int. Conf. Comput. Vis., 2023, pp. 13 368–13 377.
[58] X. Li, J. Wang, X. Xu, X. Li, B. Raj, and Y. Lu, “Robust referring video object segmentation with cyclic structural consensus,” in ICCV, 2023, pp. 22 179–22 188.
[59] B. Miao, M. Bennamoun, Y. Gao, and A. Mian, “Spectrum-guided multi-granularity referring video object segmentation,” in Int. Conf. Comput. Vis., 2023, pp. 920–930.
[60] D. Wu, T. Wang, Y. Zhang, X. Zhang, and J. Shen, “Onlinerefer: A simple online baseline for referring video object segmentation,” in Int. Conf. Comput. Vis., 2023, pp. 2749–2758.
[61] J. Tang, G. Zheng, and S. Yang, “Temporal collection and distribution for referring video object segmentation,” in Int. Conf. Comput. Vis., 2023, pp. 15 420–15 430.
[62]

SOC: semantic-assisted object cluster for referring video object segmentation,

Z. Luo, Y . Xiao, Y . Liu, S. Li, Y . Wang, Y . Tang, X. Li, and Y . Yang, “SOC: semantic-assisted object cluster for referring video object segmentation,” inAdv. Neural Inform. Process. Syst., 2023, pp. 26 425–26 437

2023
[63]

Tracking with human-intent reasoning,

J. Zhu, Z. Cheng, J. He, C. Li, B. Luo, H. Lu, Y . Geng, and X. Xie, “Tracking with human-intent reasoning,”arXiv preprint arXiv:2312.17448, 2023

arXiv 2023
[64]

Open-vocabulary semantic segmentation with mask- adapted CLIP,

F. Liang, B. Wu, X. Dai, K. Li, Y . Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu, “Open-vocabulary semantic segmentation with mask- adapted CLIP,” inIEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 7061–7070

2023
[65] X. Zou, Z. Dou, J. Yang, Z. Gan, L. Li, C. Li, X. Dai, H. Behl, J. Wang, L. Yuan, N. Peng, L. Wang, Y. J. Lee, and J. Gao, “Generalized decoding for pixel, image, and language,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 15 116–15 127.
[66] X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Wang, L. Wang, J. Gao, and Y. J. Lee, “Segment everything everywhere all at once,” in Adv. Neural Inform. Process. Syst., 2023, pp. 19 769–19 782.
[67] T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan, Z. Zeng, H. Zhang, F. Li, J. Yang, H. Li, Q. Jiang, and L. Zhang, “Grounded SAM: assembling open-world models for diverse visual tasks,” arXiv preprint arXiv:2401.14159, 2024.
[68] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg, “Referitgame: Referring to objects in photographs of natural scenes,” in EMNLP, 2014, pp. 787–798.
[69] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy, “Generation and comprehension of unambiguous object descriptions,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 11–20.
[70] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, and R. Ji, “Multi-task collaborative network for joint referring expression comprehension and segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 10 031–10 040.
[71] Z. Wang, Y. Lu, Q. Li, X. Tao, Y. Guo, M. Gong, and T. Liu, “CRIS: clip-driven referring image segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 11 676–11 685.

Jinyu Liu received the M.S. degree from Fudan University, Shanghai, China, in 2023. He is currently a Ph.D. student at College of Computer Science and Artificial Intelligence, Fud...
