pith. machine review for the scientific record.

arxiv: 2604.11411 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

Online Reasoning Video Object Segmentation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords online video object segmentation · causal video reasoning · referent shifts · natural language video queries · temporal token reservoir · frame-by-frame segmentation · video understanding benchmark

The pith

Reasoning video object segmentation must run causally using only past and current frames while tracking shifting referents in language queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Online Reasoning Video Object Segmentation as the task of generating pixel-level masks from natural-language queries that may contain implicit references and change what they point to as events unfold. It shows that current methods assume the full video is available at once, allowing future frames to resolve ambiguities, which does not match real deployments that demand incremental frame-by-frame decisions. To enable study of this setting, the authors release ORVOSB, a benchmark of 210 videos, 12,907 annotated frames, and 512 queries across five reasoning categories, with frame-level causal annotations and explicit referent-shift labels. They also introduce a baseline that keeps segmentation prompts continually updated and maintains a structured temporal token reservoir so long sequences can be handled without unbounded memory growth. If the claim holds, the field must move from offline retrospective methods to architectures that reason causally under strict computational bounds.

Core claim

The central claim is that reasoning video object segmentation requires strictly causal operation: models must interpret queries and produce masks incrementally from past and current frames alone, without revisiting earlier outputs or accessing future frames, while correctly handling referent shifts that occur as the video progresses. The authors support this by constructing ORVOSB with the necessary frame-level causal and shift annotations and by releasing a baseline whose continually updated prompts and temporal token reservoir allow bounded long-horizon reasoning.
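Stated compactly, using symbols of our own that follow the phrasing around Figure 1 rather than notation confirmed in the paper body, the constraint is:

```latex
% Strictly causal prediction: the mask at time t may depend only on the query q
% and the frames observed so far, with no access to future frames and no
% revision of outputs already emitted.
\hat{M}_t = f_\theta\big(q,\ I_1, \dots, I_t\big),
\qquad \hat{M}_t \ \text{independent of}\ I_{t'} \ \text{for all}\ t' > t,
\qquad \hat{M}_1, \dots, \hat{M}_{t-1} \ \text{final by time } t.

% Referent shift: the ground-truth target of q is itself time-indexed, so the
% set of objects it denotes may change as events unfold.
G_t(q) \neq G_{t'}(q) \ \text{is permitted for } t \neq t'.
```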

What carries the argument

A baseline architecture that maintains continually-updated segmentation prompts together with a structured temporal token reservoir for efficient long-horizon causal reasoning.
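The paper's own write and retrieval rules for the reservoir are not spelled out in the material quoted here, so the following is a minimal sketch under stated assumptions: a fixed-capacity store of per-frame memory tokens with FIFO eviction and dot-product retrieval. The class name `TokenReservoir` and the parameters `capacity`, `dim`, and `top_k` are illustrative, not the authors' API.

```python
import numpy as np

class TokenReservoir:
    """Fixed-capacity store of per-frame memory tokens.

    Memory stays bounded at `capacity` entries no matter how many frames
    arrive; retrieval returns the stored tokens most similar (by dot
    product) to a query vector summarizing the current frame.
    """

    def __init__(self, capacity: int = 64, dim: int = 256):
        self.capacity = capacity
        self.keys = np.empty((0, dim))       # one key vector per stored frame summary
        self.values = np.empty((0, dim))     # the memory tokens themselves
        self.timestamps = np.empty((0,), dtype=int)

    def write(self, key: np.ndarray, value: np.ndarray, t: int) -> None:
        """Insert the summary of frame t; evict the oldest entry if over capacity."""
        self.keys = np.vstack([self.keys, key[None]])
        self.values = np.vstack([self.values, value[None]])
        self.timestamps = np.append(self.timestamps, t)
        if len(self.timestamps) > self.capacity:
            # FIFO eviction keeps memory bounded; a learned importance score
            # could replace this rule without changing the interface.
            self.keys, self.values = self.keys[1:], self.values[1:]
            self.timestamps = self.timestamps[1:]

    def retrieve(self, query: np.ndarray, top_k: int = 8) -> np.ndarray:
        """Return the top_k stored tokens most relevant to the current frame."""
        if len(self.timestamps) == 0:
            return np.empty((0, self.values.shape[1]))
        scores = self.keys @ query                    # affinity to the current frame
        return self.values[np.argsort(-scores)[:top_k]]
```

The open design question, which this sketch deliberately leaves at its simplest, is the eviction rule: plain FIFO bounds memory, but something closer to the paper's "structured" reservoir would presumably score entries before deciding what to keep.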

If this is right

  • Existing offline methods lose accuracy when future frames are withheld and when queries change reference during the video.
  • Any successful model must maintain and update its interpretation of the query across time without retrospective correction.
  • Benchmarks for video segmentation now need explicit causal constraints and referent-shift labels at the frame level.
  • Long video sequences require memory mechanisms whose size remains bounded even as the number of frames grows; a minimal causal-loop sketch follows this list.
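To make the no-lookahead and no-revision constraints concrete, here is a minimal inference-loop sketch under stated assumptions: `model.encode_frame` and `model.step` are hypothetical interfaces for a model that consumes one frame at a time, and `reservoir` is the bounded store sketched earlier; none of this is the authors' code.

```python
def run_online(model, reservoir, query: str, frame_stream):
    """Strictly causal inference: one mask per frame, no lookahead, no revision.

    `frame_stream` yields frames in temporal order; predicting the mask for
    frame t uses only frames 1..t, and masks already emitted are never edited.
    """
    emitted = []                                                 # append-only history
    for t, frame in enumerate(frame_stream):
        memory = reservoir.retrieve(model.encode_frame(frame))   # past context only
        mask, summary_key, summary_value = model.step(query, frame, memory)
        reservoir.write(summary_key, summary_value, t)           # stays bounded
        emitted.append(mask)                                     # final once emitted
        yield t, mask
```

An offline method would instead buffer the whole clip, resolve ambiguous referents retrospectively, and only then answer, which is exactly the behavior the benchmark is built to penalize.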

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-time systems such as live monitoring or autonomous navigation would gain immediate usability from causal models that never wait for the end of a clip.
  • The temporal token reservoir idea could be combined with hierarchical memory to scale to hour-long videos while preserving causality.
  • Similar causal constraints likely appear in other sequential reasoning tasks such as online video captioning or action anticipation.

Load-bearing premise

The ORVOSB benchmark's frame-level causal annotations and referent-shift labels sufficiently represent the distribution of real-world online queries and video content.

What would settle it

A lightly adapted existing method that achieves substantially higher accuracy on ORVOSB while still obeying strict causality would falsify the claim that the online regime cannot be handled without major architectural redesign.

Figures

Figures reproduced from arXiv: 2604.11411 by Jinyuan Liu, Ruize Han, Song Wang, Weixin Li, Yang Wang, Zeyu Zhao.

Figure 1
Figure 1: Overview of the ORVOSB data construction and annotation pipeline. The surrounding text defines the causality constraint: the prediction at time t must not depend on any future frame I_t′ with t′ > t, and an object should be segmented only while it satisfies the condition of the referring expression q. view at source ↗
Figure 2
Figure 2: Illustrative examples of the five reasoning query types in ORVOSB. view at source ↗
Figure 3
Figure 3: Baseline architecture. (a) Overall architecture: an instruct template and the online video timeline supply context frames and the current frame to a multimodal large language model; per-frame <SEG> tokens are written into the token reservoir, retrieved for the current frame, and combined through affinity-guided adaptive fusion to produce the <TGT> prediction (example query: "which vehicles are moving in the current scene?"). A rough sketch of the fusion step follows the figure list. view at source ↗
Figure 4
Figure 4. view at source ↗
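Figure 3's "affinity-guided adaptive fusion" label suggests that retrieved memory tokens are weighted by their similarity to the current frame before being folded back in. The sketch below is only our guess at that operation, a softmax-weighted residual combination; the function name and temperature parameter are hypothetical.

```python
import numpy as np

def affinity_guided_fusion(current_tokens: np.ndarray, memory_tokens: np.ndarray,
                           temperature: float = 1.0) -> np.ndarray:
    """Fuse retrieved memory tokens into current-frame tokens, weighted by affinity.

    current_tokens: (n, d) tokens for the current frame.
    memory_tokens:  (m, d) tokens retrieved from the reservoir.
    Returns fused (n, d) tokens; with no memory, the current tokens pass through.
    """
    if memory_tokens.shape[0] == 0:
        return current_tokens
    affinity = current_tokens @ memory_tokens.T / temperature     # (n, m) similarities
    weights = np.exp(affinity - affinity.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)                 # softmax over memory
    attended = weights @ memory_tokens                            # (n, d) memory summary
    return current_tokens + attended                              # residual fusion
```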
read the original abstract

Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real-world deployments that require strictly causal, frame-by-frame decisions. We study Online Reasoning Video Object Segmentation (ORVOS), where models must incrementally interpret queries using only past and current frames without revisiting previous predictions, while handling referent shifts as events unfold. To support evaluation, we introduce ORVOSB, a benchmark with frame-level causal annotations and referent-shift labels, comprising 210 videos, 12,907 annotated frames, and 512 queries across five reasoning categories. We further propose a baseline with continually-updated segmentation prompts and a structured temporal token reservoir for long-horizon reasoning under bounded computation. Experiments show that existing methods struggle under strict causality and referent shifts, while our baseline establishes a strong foundation for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Online Reasoning Video Object Segmentation (ORVOS), a task requiring strictly causal, frame-by-frame pixel-level mask prediction from natural-language queries that may contain implicit temporal references and referent shifts. It presents the ORVOSB benchmark (210 videos, 12,907 frames, 512 queries with frame-level causal annotations and referent-shift labels across five reasoning categories) and proposes a baseline using continually-updated segmentation prompts plus a structured temporal token reservoir for bounded long-horizon reasoning. The experiments are reported to show that existing offline methods struggle under these constraints, while the baseline establishes a strong foundation for future work.

Significance. If the empirical claims hold, the work is significant for exposing the gap between offline video object segmentation methods and real-world causal deployments, while providing a new benchmark and baseline to standardize evaluation of referent-shift handling. The emphasis on bounded computation in the baseline is a practical strength. Impact depends on rigorous validation that performance gaps are not artifacts of the benchmark distribution.

major comments (2)
  1. [Benchmark section] Benchmark section (ORVOSB construction): The central claim that existing methods 'struggle under strict causality and referent shifts' rests on ORVOSB being representative, yet no quantitative comparisons are provided for query linguistic complexity, referent-shift frequency, video duration statistics, or causal annotation consistency against larger real-world online video-query corpora. This is load-bearing, as over-representation of short clips or simple shifts could artifactually inflate observed gaps.
  2. [Experiments section] Experiments section: The abstract asserts that experiments demonstrate struggles for prior methods and a 'strong foundation' for the baseline, but the manuscript provides no error bars, statistical tests, or ablations on the temporal token reservoir's contribution to long-horizon performance. Without these, the strength of the empirical support for the task definition cannot be verified.
minor comments (2)
  1. [Abstract] Abstract: The five reasoning categories are mentioned but not enumerated; listing them would improve immediate clarity for readers.
  2. [Method] Notation: The term 'structured temporal token reservoir' is introduced without a precise definition or pseudocode in the early sections; a small diagram or equation would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, indicating the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Benchmark section] Benchmark section (ORVOSB construction): The central claim that existing methods 'struggle under strict causality and referent shifts' rests on ORVOSB being representative, yet no quantitative comparisons are provided for query linguistic complexity, referent-shift frequency, video duration statistics, or causal annotation consistency against larger real-world online video-query corpora. This is load-bearing, as over-representation of short clips or simple shifts could artifactually inflate observed gaps.

    Authors: We agree that additional context on ORVOSB's characteristics is important to support the representativeness of the claims. As the first benchmark providing frame-level causal annotations and referent-shift labels for this task, direct equivalents do not exist. In the revision we will add quantitative statistics on query linguistic complexity, referent-shift frequency, and video duration distributions, along with comparisons to published figures from existing video-query benchmarks such as Refer-YouTube-VOS. We will also report inter-annotator agreement for the causal annotations. Direct comparison of causal annotation consistency is not possible with prior datasets, which we will explicitly note as a limitation. revision: partial

  2. Referee: [Experiments section] Experiments section: The abstract asserts that experiments demonstrate struggles for prior methods and a 'strong foundation' for the baseline, but the manuscript provides no error bars, statistical tests, or ablations on the temporal token reservoir's contribution to long-horizon performance. Without these, the strength of the empirical support for the task definition cannot be verified.

    Authors: We appreciate the emphasis on empirical rigor. The revised manuscript will include error bars for all reported metrics, statistical significance tests comparing method performances, and dedicated ablations isolating the temporal token reservoir's role in long-horizon reasoning. These additions will provide clearer validation of the performance gaps and the baseline's contributions. revision: yes
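Both responses above promise concrete numbers: dataset statistics with inter-annotator agreement, and error bars with paired significance tests. As a rough illustration of the mechanics only, the sketches below assume hypothetical data formats (per-query records with `n_frames`, `shift_frames`, and `tokens` fields; two annotators' per-frame binary causal labels; per-video scores for two methods on the same videos). None of these names, fields, or test choices come from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on per-frame binary labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum((count_a[c] / n) * (count_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return 1.0 if expected >= 1 else (observed - expected) / (1 - expected)

def benchmark_stats(queries):
    """Summary statistics over query records: length, complexity, and shift frequency."""
    shifts = [len(q["shift_frames"]) for q in queries]
    return {
        "queries": len(queries),
        "mean_video_frames": sum(q["n_frames"] for q in queries) / len(queries),
        "mean_query_tokens": sum(q["tokens"] for q in queries) / len(queries),
        "queries_with_shift": sum(s > 0 for s in shifts),
        "mean_shifts_per_query": sum(shifts) / len(queries),
    }
```

For error bars and a paired comparison, a percentile bootstrap over per-video scores plus a Wilcoxon signed-rank test is one standard recipe; the authors may well pick different tests or resampling units.

```python
import numpy as np
from scipy.stats import wilcoxon

def bootstrap_ci(scores, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap confidence interval for the mean per-video score."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                      for _ in range(n_boot)])
    return scores.mean(), np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

def paired_comparison(scores_method_a, scores_method_b):
    """Paired test over the same videos: do the two methods differ significantly?"""
    statistic, p_value = wilcoxon(scores_method_a, scores_method_b)
    return statistic, p_value
```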

Circularity Check

0 steps flagged

No circularity: new task definition and benchmark with independent baseline

full rationale

The paper defines a new task (ORVOS) requiring strictly causal, frame-by-frame processing with referent shifts, introduces the ORVOSB benchmark (210 videos, 12,907 frames, 512 queries with causal annotations and shift labels), and proposes a baseline using continually-updated prompts and temporal token reservoir. No equations, fitted parameters, or derivations are present that reduce to self-definition or self-citation. Central claims rest on experimental comparisons showing existing offline methods struggle on the new benchmark, which is externally falsifiable via the released annotations and does not rely on prior author results for uniqueness or ansatz. This is a standard task/benchmark contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's claims rest on standard computer-vision assumptions about video frame causality and language grounding rather than new mathematical axioms or invented physical entities.

axioms (1)
  • domain assumption Real-world video deployments require strictly causal, frame-by-frame decisions without access to future frames.
    Stated directly in the abstract as the motivation for moving from offline to online regime.

pith-pipeline@v0.9.0 · 5481 in / 1138 out tokens · 27348 ms · 2026-05-10T15:24:06.045514+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

58 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1]

    NeurIPS 35, 23716–23736 (2022)

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. NeurIPS 35, 23716–23736 (2022)

  2. [2]

    Qwen Technical Report

    Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  3. [3]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  4. [4]

    NeurIPS 37, 6833–6859 (2024)

    Bai, Z., He, T., Mei, H., Wang, P., Gao, Z., Chen, J., Zhang, Z., Shou, M.Z.: One token to seg them all: Language instructed reasoning segmentation in videos. NeurIPS 37, 6833–6859 (2024)

  5. [5]

    In: CVPR

    Botach, A., Zheltonozhskii, E., Baskin, C.: End-to-end referring video object segmentation with multimodal transformers. In: CVPR. pp. 4985–4995 (2022)

  6. [6]

    In: SIGCOMM

    Bothra, C., Gao, J., Rao, S., Ribeiro, B.: Veritas: Answering causal queries from video streaming traces. In: SIGCOMM. pp. 738–753 (2023)

  7. [7]

    In: CVPR

    Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context. In: CVPR. pp. 1209–1218 (2018)

  8. [8]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)

  9. [9]

    In: CVPR

    Chen, J., Lv, Z., Wu, S., Lin, K.Q., Song, C., Gao, D., Liu, J.W., Gao, Z., Mao, D., Shou, M.Z.: Videollm-online: Online video large language model for streaming video. In: CVPR. pp. 18407–18418 (2024)

  10. [10]

    In: CVPR

    Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., Yuille, A.: Detect what you can: Detecting and representing objects using holistic models and body parts. In: CVPR. pp. 1971–1978 (2014)

  11. [11]

    In: CVPR

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: CVPR. pp. 24185–24198 (2024)

  12. [12]

    IEEE Transactions on Mobile Computing 23(12), 12761–12777 (2024)

    Dai, P., Chao, Y., Wu, X., Liu, K., Guo, S.: Context-aware offloading for edge-assisted on-device video analytics through online learning approach. IEEE Transactions on Mobile Computing 23(12), 12761–12777 (2024)

  13. [13]

    In: CVPR

    Ding, H., Liu, C., He, S., Jiang, X., Loy, C.C.: Mevis: A large-scale benchmark for video segmentation with motion expressions. In: CVPR. pp. 2694–2703 (2023)

  14. [14]

    In: CVPR

    Ding, H., Liu, C., He, S., Jiang, X., Torr, P.H., Bai, S.: Mose: A new dataset for video object segmentation in complex scenes. In: CVPR. pp. 20224–20234 (2023)

  15. [15]

    In: CVPR

    Gong, S., Zhuge, Y., Zhang, L., Yang, Z., Zhang, P., Lu, H.: The devil is in temporal token: High quality video reasoning segmentation. In: CVPR. pp. 29183–29192 (2025)

  16. [16]

    arXiv preprint arXiv:1308.0850 (2013)

    Graves, A.: Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013)

  17. [17]

    ICLR 1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

  18. [18]

    In: CVPR

    Huang, Z., Li, X., Li, J., Wang, J., Zeng, X., Liang, C., Wu, T., Chen, X., Li, L., Wang, L.: Online video understanding: Ovbench and videochat-online. In: CVPR. pp. 3328–3338 (2025)

  19. [19]

    In: CVPR

    Hui, T., Huang, S., Liu, S., Ding, Z., Li, G., Wang, W., Han, J., Wang, F.: Collaborative spatial-temporal modeling for language-queried video actor segmentation. In: CVPR. pp. 4187–4196 (2021)

  20. [20]

    In: CVPR

    Jin, P., Takanobu, R., Zhang, W., Cao, X., Yuan, L.: Chat-univi: Unified visual representation empowers large language models with image and video understanding. In: CVPR. pp. 13700–13710 (2024)

  21. [21]

    In: EMNLP

    Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: Referring to objects in photographs of natural scenes. In: EMNLP. pp. 787–798 (2014)

  22. [22]

    In: CVPR

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: CVPR. pp. 4015–4026 (2023)

  23. [23]

    In: CVPR

    Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. In: CVPR. pp. 9579–9589 (2024)

  24. [24]

    StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

    Lin, J., Fang, Z., Chen, C., Wan, Z., Luo, F., Li, P., Liu, Y., Sun, M.: Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. arXiv preprint arXiv:2411.03628 (2024)

  25. [25]

    In: CVPR

    Lin, L., Yu, X., Pang, Z., Wang, Y.X.: Glus: Global-local reasoning unified into a single large language model for video segmentation. In: CVPR. pp. 8658–8667 (2025)

  26. [26]

    NeurIPS 36, 34892–34916 (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. NeurIPS 36, 34892–34916 (2023)

  27. [27]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  28. [28]

    Maaz, M., Rasheed, H., Khan, S., Khan, F.: Video-chatgpt: Towards detailed video understanding via large vision and language models. In: ACL. pp. 12585–12602 (2024)

  29. [29]

    In: CVPR

    Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: CVPR. pp. 11–20 (2016)

  30. [30]

    Milletari, F., Navab, N., Ahmadi, S.A.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3DV. pp. 565–571. IEEE (2016)

  31. [31]

    In: CVPR

    Munasinghe, S., Gani, H., Zhu, W., Cao, J., Xing, E., Khan, F.S., Khan, S.: Videoglamm: A large multimodal model for pixel-level visual grounding in videos. In: CVPR. pp. 19036–19046 (2025)

  32. [32]

    LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

    Ning, Z., Liu, G., Jin, Q., Ding, W., Guo, M., Zhao, J.: Livevlm: Efficient online video understanding via streaming-oriented kv cache and retrieval. arXiv preprint arXiv:2505.15269 (2025)

  33. [33]

    Niu, J., Li, Y., Miao, Z., Ge, C., Zhou, Y., He, Q., Dong, X., Duan, H., Ding, S., Qian, R., et al.: Ovo-bench: How far is your video-llms from real-world online video understanding? In: CVPR. pp. 18902–18913 (2025)

  34. [34]

    AI & Society 40(2), 677–690 (2025)

    Obrenovic, B., Gu, X., Wang, G., Godinic, D., Jakhongirov, I.: Generative ai and human–robot interaction: implications and future agenda for business, society and ethics. AI & Society 40(2), 677–690 (2025)

  35. [35]

    The 2017 DAVIS Challenge on Video Object Segmentation

    Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017)

  36. [36]

    IJCV 130(8), 2022–2039 (2022)

    Qi, J., Gao, Y., Hu, Y., Wang, X., Liu, X., Bai, X., Belongie, S., Yuille, A., Torr, P.H., Bai, S.: Occluded video instance segmentation: A benchmark. IJCV 130(8), 2022–2039 (2022)

  37. [37]

    NeurIPS 37, 119336–119360 (2024)

    Qian, R., Dong, X., Zhang, P., Zang, Y., Ding, S., Lin, D., Wang, J.: Streaming long video understanding with large language models. NeurIPS 37, 119336–119360 (2024)

  38. [38]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Qian, R., Yin, X., Dou, D.: Reasoning to attend: Try to understand how <SEG> token works. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24722–24731 (2025)

  39. [39]

    In: CVPR

    Ramanathan, V., Kalia, A., Petrovic, V., Wen, Y., Zheng, B., Guo, B., Wang, R., Marquez, A., Kovvuri, R., Kadian, A., et al.: Paco: Parts and attributes of common objects. In: CVPR. pp. 7141–7151 (2023)

  40. [40]

    In: SIGKDD

    Rasley, J., Rajbhandari, S., Ruwase, O., He, Y.: Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In: SIGKDD. pp. 3505–3506 (2020)

  41. [41]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

  42. [42]

    In: CVPR

    Ren, Z., Huang, Z., Wei, Y., Zhao, Y., Fu, D., Feng, J., Jin, X.: Pixellm: Pixel reasoning with large multimodal model. In: CVPR. pp. 26374–26383 (2024)

  43. [43]

    In: ECCV

    Seo, S., Lee, J.Y., Han, B.: Urvos: Unified referring video object segmentation network with a large-scale benchmark. In: ECCV. pp. 208–223. Springer (2020)

  44. [44]

    arXiv preprint arXiv:2505.05467 (2025)

    Wang, H., Feng, B., Lai, Z., Xu, M., Li, S., Ge, W., Dehghan, A., Cao, M., Huang, P.: Streambridge: Turning your offline video large language model into a proactive streaming assistant. arXiv preprint arXiv:2505.05467 (2025)

  45. [45]

    In: CVPR

    Wang, W., Zhou, T., Yu, F., Dai, J., Konukoglu, E., Van Gool, L.: Exploring cross-image pixel contrast for semantic segmentation. In: CVPR. pp. 7303–7313 (2021)

  46. [46]

    In: CVPR

    Wu, J., Jiang, Y., Sun, P., Yuan, Z., Luo, P.: Language as queries for referring video object segmentation. In: CVPR. pp. 4974–4984 (2022)

  47. [47]

    arXiv preprint arXiv:2501.13468 (2025)

    Xiong, H., Yang, Z., Yu, J., Zhuge, Y., Zhang, L., Zhu, J., Lu, H.: Streaming video understanding and multi-round interaction with memory-enhanced knowledge. arXiv preprint arXiv:2501.13468 (2025)

  48. [48]

    arXiv preprint arXiv:2510.09608 (2025)

    Xu, R., Xiao, G., Chen, Y., He, L., Peng, K., Lu, Y., Han, S.: Streamingvlm: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608 (2025)

  49. [49]

    In: ECCV

    Yan, C., Wang, H., Yan, S., Jiang, X., Hu, Y., Kang, G., Xie, W., Gavves, E.: Visa: Reasoning video object segmentation via large language models. In: ECCV. pp. 98–115. Springer (2024)

  50. [50]

    In: AAAI

    Yan, S., Zhang, R., Guo, Z., Chen, W., Zhang, W., Li, H., Qiao, Y., Dong, H., He, Z., Gao, P.: Referred by multi-modality: A unified temporal transformer for video object segmentation. In: AAAI. vol. 38, pp. 6449–6457 (2024)

  51. [51]

    In: ICCV

    Yang, L., Fan, Y., Xu, N.: Video instance segmentation. In: ICCV. pp. 5188–5197 (2019)

  52. [52]

    arXiv preprint arXiv:2502.09560 (2025)

    Yang, R., Chen, H., Zhang, J., Zhao, M., Qian, C., Wang, K., Wang, Q., Koripella, T.V., Movahedi, M., Li, M., et al.: Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. arXiv preprint arXiv:2502.09560 (2025)

  53. [53]

    Lisa++: An improved baseline for reasoning segmentation with large language model

    Yang, S., Qu, T., Lai, X., Tian, Z., Peng, B., Liu, S., Jia, J.: Lisa++: An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240 (2023)

  54. [54]

    arXiv preprint arXiv:2502.10810 (2025)

    Yang, Z., Hu, Y., Du, Z., Xue, D., Qian, S., Wu, J., Yang, F., Dong, W., Xu, C.: Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding. arXiv preprint arXiv:2502.10810 (2025)

  55. [55]

    In: ICCV

    Zheng, R., Qi, L., Chen, X., Wang, Y., Wang, K., Qiao, Y., Zhao, H.: Villa: Video reasoning segmentation with large language model. In: ICCV. pp. 23667–23677 (2025)

  56. [56]

    In: CVPR

    Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: CVPR. pp. 633–641 (2017)

  57. [57]

    arXiv preprint arXiv:2312.17448 (2023)

    Zhu, J., Cheng, Z.Q., He, J.Y., Li, C., Luo, B., Lu, H., Geng, Y., Xie, X.: Tracking with human-intent reasoning. arXiv preprint arXiv:2312.17448 (2023)

  58. [58]

    Deformable DETR: Deformable Transformers for End-to-End Object Detection

    Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)