pith. sign in

arxiv: 2607.02096 · v1 · pith:NQSQNP2Mnew · submitted 2026-07-02 · 💻 cs.CV

LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension

Pith reviewed 2026-07-03 15:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric videoreferring expression comprehensionlong-form videobenchmarkspatio-temporal groundingEgo4Dvideo understandinghuman-object interaction
0
0 comments X

The pith

Current state-of-the-art models struggle significantly on long-form egocentric video referring expression comprehension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs LongEgoRefer as a benchmark of 1,498 referring expressions drawn from 45-minute untrimmed Ego4D videos. It shows that short-clip Video REC settings fail to capture the sparse object appearances, detailed language, and activity transitions typical of real egocentric recordings. Experiments with training-free vision-language baselines plus Grounded SAM2 and with existing state-of-the-art models demonstrate consistently poor performance. The work therefore defines a spatio-temporal grounding task that requires models to locate both the moment and the spatial region of a referred object inside extended dynamic sequences. A reader would care because the results indicate that current video understanding techniques do not yet scale to the temporal length and sparsity of everyday human activity recordings.

Core claim

LongEgoRefer establishes a benchmark for Video REC on long untrimmed egocentric videos that exhibit extreme target sparsity, detailed linguistic descriptions, and complex human-object interactions. Existing Video REC approaches, including training-free baselines that combine vision-language models with Grounded SAM2 and current state-of-the-art models, perform poorly on the benchmark, thereby defining a demanding problem of identifying both when an event occurs and where the referred object appears within extended video sequences.

What carries the argument

The LongEgoRefer benchmark itself, whose construction from Ego4D enforces long duration, extreme target sparsity, and the need for joint temporal and spatial localization under complex activity transitions.

If this is right

  • Video REC systems must incorporate mechanisms that maintain object identity across dozens of minutes of sparse appearances.
  • Linguistic detail in queries must be resolved jointly with long-range temporal context rather than frame-level or short-clip features.
  • Models will need explicit handling of activity transitions that interrupt object visibility in egocentric narratives.
  • Training regimes for spatio-temporal grounding must shift from dense short clips to long, sparsely labeled sequences.
  • Progress on real-world egocentric applications will require architectures that scale beyond current temporal windows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar long-form benchmarks could be built for other video domains such as surveillance or instructional footage to expose comparable sparsity issues.
  • The performance gap suggests that simply scaling current transformer-based video models may not close the task without new mechanisms for long-term memory and sparse attention.
  • Deployment in wearable devices or robotics would likely remain limited until models demonstrate reliable localization over multi-minute horizons.
  • The benchmark could serve as a stress test for future video-language models that claim general long-context understanding.

Load-bearing premise

The referring expressions and video segments chosen from Ego4D accurately represent the sparsity, linguistic detail, and activity transitions of real-world long-form egocentric recordings without systematic annotation bias or selection effects.

What would settle it

A new model that achieves substantially higher accuracy than the reported baselines on the LongEgoRefer test set, or independent re-annotation of a subset of the expressions that reveals consistent selection bias, would directly test the central claim.

Figures

Figures reproduced from arXiv: 2607.02096 by Chenhui Chu, Mahiro Ukai, Nakamasa Inoue, Shuhei Kurita, Shunya Kato, Taiki Miyanishi.

Figure 1
Figure 1. Figure 1: Data comparison of LongEgoRefer and RefEgo. (a) Video durations in LongE￾goRefer are orders of magnitude longer than those in RefEgo. (b) The appearance rate is significantly lower and sparser in LongEgoRefer compared to RefEgo. (c) To handle the complexity of long-form videos, captions in LongEgoRefer are substantially longer and more descriptive than those in RefEgo. Dashed lines indicate the mean value … view at source ↗
Figure 2
Figure 2. Figure 2: An overview of our benchmark construction pipeline. 3 Benchmark 3.1 Task We address the task of Spatio–Temporal Grounding of Object Occurrences in long-form egocentric videos. Given an untrimmed video V of length TV and a natural language query Q, the model must localize a single, continuous appear￾ance of the referred object within the untrimmed video. The prediction is repre￾sented as a tuple (tstart, te… view at source ↗
Figure 3
Figure 3. Figure 3: Analysis of temporal resolution, reasoning effort, and spatial grounding. 2.5 Pro performed best, and 2.5 Flash followed, suggesting that both the model scale and architectural sophistication contribute directly to the performance im￾provements on complex long-form context understanding. Despite achieving the best results, the absolute performance of GPT-5 re￾mains relatively low, indicating substantial ro… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results. example (c), both InternVL and GPT predict the wrong bounding boxes, even though the temporal duration of GPT is close to the ground truth. 6 Conclusion In this work, we introduced LongEgoRefer, a new benchmark for egocentric video referring expression comprehension that addresses key limitations of exist￾ing benchmarks. It substantially expands the temporal scale (averaging 45 min￾ute… view at source ↗
Figure 5
Figure 5. Figure 5: Word cloud visualization of referring expressions in LongEgoRefer [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Object class distribution from EgoTracks. A Word Cloud [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of Human-Object Interactions. in our benchmark. The results highlight the linguistic and semantic richness of the dataset, covering a diverse range of egocentric activities. D EgoTracks Split and Annotation Cost We use the train and validation splits of EgoTracks for LongEgoRefer benchmark construction. The total annotation cost amounts to $4,700. All 1,498 curated queries are designed exclusi… view at source ↗
Figure 8
Figure 8. Figure 8: Our training-free baseline. – mvIoU: This metric assesses the spatio-temporal localization accuracy by measuring the overlap between a sequence of predicted and ground-truth bounding boxes. It is calculated by averaging the spatial IoU of the bounding boxes over the frames where both the predicted and ground-truth objects are present. – vIoU@R: Similar to tIoU@R, this metric evaluates the percentage of sam… view at source ↗
Figure 9
Figure 9. Figure 9: Referring Expression Generation Prompt [PITH_FULL_IMAGE:figures/full_fig_p029_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Prompt used in experiments [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Time consumption over the baselines [PITH_FULL_IMAGE:figures/full_fig_p030_11.png] view at source ↗
read the original abstract

Egocentric videos capture rich and diverse human-object interactions and have emerged as a fundamental resource for understanding human activities related to objects. In this context, Video Referring Expression Comprehension (Video REC), the task of localizing the temporal and spatial extent of a referred object in video frames given a natural language query, plays a key role in linking textual descriptions to observed objects in untrimmed egocentric recordings. However, existing egocentric Video REC benchmarks primarily focus on short video clips, where some target object appears densely within frames. Such settings do not reflect real-world egocentric recordings, which are long-form, untrimmed, and characterized by sparse object occurrences and complex activity transitions. To address this limitation, we introduce LongEgoRefer, a novel and challenging benchmark constructed from long-form videos in the Ego4D dataset. LongEgoRefer contains 1,498 referring expressions with an average video duration of 45 minutes. The benchmark exhibits extreme target sparsity, detailed linguistic descriptions, and complex human-object interactions embedded in long, dynamic egocentric narratives. Consequently, it defines a demanding spatio-temporal grounding problem that requires models to identify both when an event occurs and where the referred object appears within extended video sequences. We evaluate existing Video REC approaches, including training-free baselines based on vision-language models combined with Grounded SAM2. Extensive experiments show that even advanced baselines and current state-of-the-art models struggle significantly on LongEgoRefer. These results highlight the intrinsic difficulty of long-form egocentric spatio-temporal grounding and emphasize the need for more robust video understanding models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces LongEgoRefer, a benchmark for long-form egocentric Video Referring Expression Comprehension derived from Ego4D. It comprises 1,498 referring expressions across videos averaging 45 minutes, featuring extreme target sparsity, detailed linguistic descriptions, and complex activity transitions. The authors evaluate training-free baselines (vision-language models with Grounded SAM2) and existing Video REC methods, claiming that even advanced and state-of-the-art models struggle significantly, thereby highlighting the intrinsic difficulty of spatio-temporal grounding in long untrimmed egocentric videos.

Significance. If the benchmark construction is shown to be representative of Ego4D without curation bias, the work would usefully expose gaps in current models' ability to handle sparse object occurrences and long temporal contexts in egocentric video, providing a concrete testbed that could drive progress in robust video understanding.

major comments (1)
  1. [Benchmark construction] Benchmark construction section: the manuscript provides no sampling protocol for selecting the 1,498 expressions and 45-minute segments, no inter-annotator agreement statistics, and no distributional comparisons (e.g., object sparsity histograms or query complexity metrics) against the full Ego4D corpus. This directly undermines the central claim that poor model performance reflects intrinsic long-form difficulty rather than selection effects favoring extreme cases.
minor comments (1)
  1. [Abstract] Abstract: the statement that models 'struggle significantly' is not accompanied by any quantitative metrics, baseline implementation details, or performance numbers, which reduces the abstract's informativeness even though the full experiments section presumably contains them.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment on benchmark construction below and will incorporate the necessary revisions to strengthen the paper.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: the manuscript provides no sampling protocol for selecting the 1,498 expressions and 45-minute segments, no inter-annotator agreement statistics, and no distributional comparisons (e.g., object sparsity histograms or query complexity metrics) against the full Ego4D corpus. This directly undermines the central claim that poor model performance reflects intrinsic long-form difficulty rather than selection effects favoring extreme cases.

    Authors: We agree that the manuscript would benefit from greater transparency in the benchmark construction process. In the revised version, we will expand the relevant section to explicitly describe the sampling protocol used to select the 1,498 referring expressions and associated 45-minute video segments from Ego4D. We will also include inter-annotator agreement statistics for the annotations and add distributional comparisons (including object sparsity histograms and query complexity metrics) relative to the full Ego4D corpus. These additions will help demonstrate that the observed model difficulties arise from the inherent challenges of long-form egocentric spatio-temporal grounding rather than from selection bias. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivation chain

full rationale

The paper introduces LongEgoRefer as a new benchmark constructed by selecting referring expressions and segments from the external Ego4D dataset, then reports direct model evaluations on it. No equations, fitted parameters, predictions, or self-citation load-bearing steps exist. The construction and results are empirical and self-contained against external data; no claim reduces to its own inputs by definition or construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Benchmark introduction paper; no free parameters, invented entities, or non-standard axioms are introduced in the abstract. The work relies on the existing Ego4D dataset as source material.

axioms (1)
  • domain assumption Ego4D provides representative long-form egocentric videos suitable for constructing a referring-expression benchmark.
    The benchmark is constructed directly from Ego4D videos.

pith-pipeline@v0.9.1-grok · 5835 in / 1214 out tokens · 29383 ms · 2026-07-03T15:38:59.833512+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

69 extracted references · 16 canonical work pages · 14 internal anchors

  1. [1]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    An, X., Xie, Y., Yang, K., Zhang, W., Zhao, X., Cheng, Z., Wang, Y., Xu, S., Chen, C., Zhu, D., et al.: Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661 (2025)

  2. [2]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  3. [3]

    Bai, Z., He, T., Mei, H., Wang, P., Gao, Z., Chen, J., Liu, L., Zhang, Z., Shou, M.Z.: One token to seg them all: Language instructed reasoning segmentation in videos. In: Adv. Neural Inform. Process. Syst. pp. 6833–6859 (2024)

  4. [4]

    In: IEEE Conf

    Bärmann, L., Waibel, A.: Where did i leave my keys? — episodic-memory-based question answering on egocentric videos. In: IEEE Conf. Comput. Vis. Pattern Recog. Worksh. pp. 1560–1568 (2022)

  5. [5]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

  6. [6]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Al- wala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)

  7. [7]

    Chandrasegaran, K., Gupta, A., Hadzic, L.M., Kota, T., He, J., Eyzaguirre, C., Durante, Z., Li, M., Wu, J., Li, F.F.: Hourvideo: 1-hour video-language under- standing. In: Adv. Neural Inform. Process. Syst. pp. 53168–53197 (2024)

  8. [8]

    Chen, Y., Xue, F., Li, D., Hu, Q., Zhu, L., Li, X., Fang, Y., Tang, H., Yang, S., Liu, Z., He, E., Yin, H., Molchanov, P., Kautz, J., Fan, L., Zhu, Y., Lu, Y., Han, S.: Longvila: Scaling long-context visual language models for long videos. In: Int. Conf. Learn. Represent. (2025)

  9. [9]

    Damen, D., Doughty, H., Farinella, G.M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., Wray, M.: Scaling egocentric vision: The epic-kitchens dataset. In: Eur. Conf. Comput. Vis. pp. 720–736 (2018) 16 S. Kato et al

  10. [10]

    In: IEEE Conf

    Di, S., Xie, W.: Grounded question-answering in long egocentric videos. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 12934–12943 (2024)

  11. [11]

    Ding, H., Liu, C., He, S., Jiang, X., Loy, C.C.: Mevis: A large-scale benchmark for video segmentation with motion expressions. In: Int. Conf. Comput. Vis. pp. 2694–2703 (2023)

  12. [12]

    Project Aria: A New Tool for Egocentric Multi-Modal AI Research

    Engel, J., Somasundaram, K., Goesele, M., Sun, A., Gamino, A., Turner, A., Ta- lattof, A., Yuan, A., Souti, B., Meredith, B., et al.: Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561 (2023)

  13. [13]

    In: IEEE Conf

    Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., Chen, P., Li, Y., Lin, S., Zhao, S., Li, K., Xu, T., Zheng, X., Chen, E., Shan, C., He, R., Sun, X.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 24108–24118 (2025)

  14. [14]

    In: IEEE Conf

    Gavrilyuk, K., Ghodrati, A., Li, Z., Snoek, C.G.: Actor and action video segmenta- tion from a sentence. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5958–5966 (2018)

  15. [15]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini Team: Gemini 2.5: Pushing the frontier with advanced reasoning, multi- modality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  16. [16]

    In: IEEE Conf

    Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Ham- burger, J., Jiang, H., Liu, M., Liu, X., et al.: Ego4d: Around the world in 3,000 hours of egocentric video. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 18995– 19012 (2022)

  17. [17]

    In: IEEE Conf

    Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., Afouras, T., Ashutosh, K., Baiyya, V., Bansal, S., Boote, B., et al.: Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 19383–19400 (2024)

  18. [18]

    Huang,D.A.,Liao,S.,Radhakrishnan,S.,Yin,H.,Molchanov,P.,Yu,Z.,Kautz,J.: Lita: Language instructed temporal-localization assistant. In: Eur. Conf. Comput. Vis. pp. 202–218 (2024)

  19. [19]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  20. [20]

    In: IEEE Conf

    Jang, Y., Song, Y., Yu, Y., Kim, Y., Kim, G.: Tgif-qa: Toward spatio-temporal rea- soning in visual question answering. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 2758–2766 (2017)

  21. [21]

    In: Conference on Empirical Methods in Natural Language Processing

    Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: Referring to objects in photographs of natural scenes. In: Conference on Empirical Methods in Natural Language Processing. pp. 787–798 (2014)

  22. [22]

    In: Asian Conf

    Khoreva, A., Rohrbach, A., Schiele, B.: Video object segmentation with language referring expressions. In: Asian Conf. Comput. Vis. pp. 123–141 (2018)

  23. [23]

    Kurita, S., Katsura, N., Onami, E.: Refego: Referring expression comprehension dataset from first-person perception of ego4d. In: Int. Conf. Comput. Vis. pp. 15214–15224 (2023)

  24. [24]

    Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., Wang, L., Qiao, Y.: Mvbench: A comprehensive multi-modal video understandingbenchmark.In:IEEEConf.Comput.Vis.PatternRecog.pp.22195– 22206 (2024)

  25. [25]

    Li, X., Wang, Y., Yu, J., Zeng, X., Zhu, Y., Huang, H., Gao, J., Li, K., He, Y., Wang, C., Qiao, Y., Wang, Y., Wang, L.: Videochat-flash: Hierarchical compression for long-context video modeling. In: Int. Conf. Learn. Represent. (2026) LongEgoRefer 17

  26. [26]

    In: IEEE Conf

    Li, Z., Tao, R., Gavves, E., Snoek, C.G.M., Smeulders, A.W.M.: Tracking by natu- ral language specification. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 6495– 6503 (2017)

  27. [27]

    Liang, S., Zhong, Y., Hu, Z.Y., Tao, Y., Wang, L.: Fine-grained spatiotemporal grounding on egocentric videos. In: Int. Conf. Comput. Vis. pp. 9385–9395 (2025)

  28. [28]

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: Eur. Conf. Comput. Vis. pp. 38–55 (2024)

  29. [29]

    In: IEEE Conf

    Liu, Y., Liu, Y., Jiang, C., Lyu, K., Wan, W., Shen, H., Liang, B., Fu, Z., Wang, H., Yi,L.:HOI4D:A4degocentricdatasetforcategory-levelhuman-objectinteraction. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21013–21022 (2022)

  30. [30]

    Mangalam, K., Akshulakov, R., Malik, J.: Egoschema: A diagnostic benchmark for very long-form video language understanding. In: Adv. Neural Inform. Process. Syst. pp. 46212–46244 (2023)

  31. [31]

    In: IEEE Conf

    Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 11–20 (2016)

  32. [32]

    Pan, X., Charron, N., Yang, Y., Peters, S., Whelan, T., Kong, C., Parkhi, O., New- combe, R., Ren, Y.C.: Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In: Int. Conf. Comput. Vis. pp. 20133–20143 (2023)

  33. [33]

    In: IEEE Conf

    Perrett, T., Darkhalil, A., Sinha, S., Emara, O., Pollard, S., Parida, K.K., Liu, K., Gatti, P., Bansal, S., Flanagan, K., Chalk, J., Zhu, Z., Guerrier, R., Abdelazim, F., Zhu, B., Moltisanti, D., Wray, M., Doughty, H., Damen, D.: HD-EPIC: A highly- detailed egocentric video dataset. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 23901–23913 (2025)

  34. [34]

    Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazeb- nik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Int. Conf. Comput. Vis. pp. 2641–2649 (2015)

  35. [35]

    In: IEEE Conf

    Ramakrishnan, S.K., Al-Halah, Z., Grauman, K.: Naq: Leveraging narrations as queries to supervise episodic memory. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 6694–6703 (2023)

  36. [36]

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. In: Int. Conf. Learn. Represent. (2025)

  37. [37]

    arXiv preprint arXiv:2405.08813 (2024)

    Rawal, R., Saifullah, K., Basri, R., Jacobs, D., Somepalli, G., Goldstein, T.: Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813 (2024)

  38. [38]

    In: IEEE Conf

    Ren, S., Yao, L., Li, S., Sun, X., Hou, L.: Timechat: A time-sensitive multimodal large language model for long video understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14313–14323 (2024)

  39. [39]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024)

  40. [40]

    Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. In: Eur. Conf. Comput. Vis. pp. 817–834 (2016)

  41. [41]

    In: IEEE Conf

    Sener, F., Chatterjee, D., Shelepov, D., He, K., Singhania, D., Wang, R., Yao, A.: Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21096–21106 (2022) 18 S. Kato et al

  42. [42]

    Seo, S., Lee, J.Y., Han, B.: Urvos: Unified referring video object segmentation network with a large-scale benchmark. In: Eur. Conf. Comput. Vis. pp. 208–223 (2020)

  43. [43]

    In: ACM Int

    Shang, X., Li, Y., Xiao, J., Ji, W., Chua, T.S.: Video visual relation detection via iterative inference. In: ACM Int. Conf. Multimedia. pp. 3654–3663 (2021)

  44. [44]

    In: ACM Int

    Shang, X., Ren, T., Guo, J., Zhang, H., Chua, T.S.: Video visual relation detection. In: ACM Int. Conf. Multimedia. pp. 1300–1308 (2017)

  45. [45]

    OpenAI GPT-5 System Card

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025)

  46. [46]

    In: IEEE Conf

    Song, E., Chai, W., Wang, G., Zhang, Y., Zhou, H., Wu, F., Chi, H., Guo, X., Ye, T., Zhang, Y., Lu, Y., Hwang, J.N., Wang, G.: Moviechat: From dense token to sparse memory for long video understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 18221–18232 (2024)

  47. [47]

    In: AAAI

    Tan,X.,Luo,Y.,Ye,Y.,Liu,F.,Cai,Z.:Allvb:All-in-onelongvideounderstanding benchmark. In: AAAI. pp. 7211–7219 (2025)

  48. [48]

    Tang, H., Liang, K.J., Grauman, K., Feiszli, M., Wang, W.: Egotracks: A long-term egocentric visual object tracking dataset. In: Adv. Neural Inform. Process. Syst. pp. 75716–75739 (2023)

  49. [49]

    IEEE Trans

    Tang, Z., Liao, Y., Liu, S., Li, G., Jin, X., Jiang, H., Yu, Q., Xu, D.: Human- centric spatio-temporal video grounding with visual transformers. IEEE Trans. Circuit Syst. Video Technol.32(12), 8238–8249 (2021)

  50. [50]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

  51. [51]

    Wang,X.,Zhao,K.,Liu,F.,Wang,J.,Zhao,G.,Bao,X.,Zhu,Z.,Zhang,Y.,Wang, X.: Egovid-5m: A large-scale video-action dataset for egocentric video generation. In: Adv. Neural Inform. Process. Syst. (2025)

  52. [52]

    Wang, X., Kwon, T., Rad, M., Pan, B., Chakraborty, I., Andrist, S., Bohus, D., Feniello, A., Tekin, B., Frujeri, F.V., Joshi, N., Pollefeys, M.: Holoassist: an ego- centric human interaction dataset for interactive ai assistants in the real world. In: Int. Conf. Comput. Vis. pp. 20270–20281 (2023)

  53. [53]

    InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

    Wang, Y., Li, X., Yan, Z., He, Y., Yu, J., Zeng, X., Wang, C., Ma, C., Huang, H., Gao, J., Dou, M., Chen, K., Wang, W., Qiao, Y., Wang, Y., Wang, L.: Intern- Video2.5: Empowering video mllms with long and rich context modeling. arXiv preprint arXiv:2501.12386 (2025)

  54. [54]

    Wu, H., Li, D., Chen, B., Li, J.: Longvideobench: A benchmark for long-context interleaved video-language understanding. In: Adv. Neural Inform. Process. Syst. pp. 28828–28857 (2024)

  55. [55]

    In: IEEE Conf

    Xiao,J.,Shang,X.,Yao,A.,Chua,T.S.:Next-qa:Nextphaseofquestion-answering to explaining temporal actions. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9777–9786 (2021)

  56. [56]

    arXiv preprint arXiv:2506.03569 (2025)

    Xiaomi LLM-Core Team: Mimo-vl technical report. arXiv preprint arXiv:2506.03569 (2025)

  57. [57]

    In: ACM Int

    Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., Zhuang, Y.: Video question answering via gradually refined attention over appearance and motion. In: ACM Int. Conf. Multimedia. pp. 1645–1653 (2017)

  58. [58]

    Xu, N., Yang, L., Fan, Y., Yang, J., Yue, D., Liang, Y., Price, B., Cohen, S., Huang, T.S.: Youtube-vos: Sequence-to-sequence video object segmentation. In: Eur. Conf. Comput. Vis. pp. 585–601 (2018) LongEgoRefer 19

  59. [59]

    Ye, J., Xu, H., Liu, H., Hu, A., Yan, M., Qian, Q., Zhang, J., Huang, F., Zhou, J.: mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. In: Int. Conf. Learn. Represent. (2025)

  60. [60]

    Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Eur. Conf. Comput. Vis. pp. 69–85 (2016)

  61. [61]

    In: AAAI

    Yu, Z., Xu, D., Yu, J., Yu, T., Zhao, Z., Zhuang, Y., Tao, D.: Activitynet-qa: A dataset for understanding complex web videos via question answering. In: AAAI. pp. 9127–9134 (2019)

  62. [62]

    Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

    Yuan, H., Li, X., Zhang, T., Sun, Y., Huang, Z., Xu, S., Ji, S., Tong, Y., Qi, L., Feng, J., Yang, M.H.: Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001 (2025)

  63. [63]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)

  64. [64]

    Long Context Transfer from Language to Vision

    Zhang, P., Zhang, K., Li, B., Zeng, G., Yang, J., Zhang, Y., Wang, Z., Tan, H., Li, C., Liu, Z.: Long context transfer from language to vision. arXiv preprint arXiv:2406.16852 (2024)

  65. [65]

    Zhang, Z., Zhao, Z., Zhao, Y., Wang, Q., Liu, H., Gao, L.: Where does it exist: Spatio-temporalvideogroundingformulti-formsentences.In:IEEEConf.Comput. Vis. Pattern Recog. pp. 10668–10677 (2020)

  66. [66]

    In: IEEE Conf

    Zhou, J., Shu, Y., Zhao, B., Wu, B., Liang, Z., Xiao, S., Qin, M., Yang, X., Xiong, Y., Zhang, B., Huang, T., Liu, Z.: Mlvu: Benchmarking multi-task long video un- derstanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 13691–13701 (2025)

  67. [67]

    placed,” “lifted,

    Zhou, W., Cao, K., Zheng, H., Liu, Y., Zheng, X., Liu, M., Kristensson, P.O., Mayol-Cuevas, W.W., Zhang, F., Lin, W., Shen, J.: X-LeBench: A benchmark for extremely long egocentric video understanding. In: Findings of the Association for Computational Linguistics: EMNLP 2025. pp. 15206–15222 (2025) 20 S. Kato et al. Fig.5:Word cloud visualization of refer...

  68. [68]

    ‘start_time‘: The timestamp (in MM:SS format) when the described event begins

  69. [69]

    IMPORTANT: Do not use or process any audio information from the video

    ‘end_time‘: The timestamp (in MM:SS format) when the described event concludes. IMPORTANT: Do not use or process any audio information from the video. Only analyze the visual content (video frames/images) to identify temporal segments. Ignore all audio tracks completely. Fig.10:Prompt used in experiments. Fig.11:Time consumption over the baselines