arxiv: 2510.13016 · v3 · pith:TTZNEXN6new · submitted 2025-10-14 · 💻 cs.CV

SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

Pith reviewed 2026-05-18 07:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video grounding, spatio-temporal localization, action recognition, multi-instance, benchmark, natural language, object tracking

The pith

SVAG-Bench introduces a task requiring models to detect, track, and temporally localize every object matching a language query in multi-actor video scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that progress in embodied AI requires benchmarks that test the combined skills of spatial detection, temporal localization, and tracking of multiple actors from natural language descriptions. Existing video benchmarks examine these capabilities separately and therefore cannot assess their integration in realistic scenes. The authors respond by defining the Spatio-temporal Video Action Grounding task and releasing SVAG-Bench, a dataset of 688 videos containing 19,590 annotations and 903 action verbs drawn from urban, wildlife, and traffic footage. With an average of 28.5 queries per video, the benchmark supports detailed measurement of how models handle actor disambiguation and action composition. The work also supplies a standardized evaluation toolkit and a modular baseline model.
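The density figure can be checked from the two counts quoted above. This is only a quick arithmetic sketch, assuming each of the 19,590 annotations counts as one query; the variable names are illustrative and not taken from the paper.

```python
# Counts quoted in the summary above.
num_videos = 688
num_annotations = 19_590

# Average number of annotated queries per video, assuming one query per annotation.
queries_per_video = num_annotations / num_videos
print(f"{queries_per_video:.1f}")  # prints 28.5
```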

Core claim

The paper presents Spatio-temporal Video Action Grounding (SVAG) as the task of simultaneously detecting, tracking, and temporally localizing all objects that satisfy a natural language query in complex, multi-actor scenes, supported by the SVAG-Bench dataset containing 688 videos, 19,590 annotations, and 903 unique action verbs.

What carries the argument

The SVAG task, which unifies spatial detection, object tracking, and temporal localization for multiple instances based on language queries in crowded scenes.
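To make the shape of the task output concrete, here is a minimal sketch of what one grounding result might look like. The class and field names (`Track`, `GroundingResult`, `boxes`) and the box values are hypothetical and are not taken from the paper or its toolkit; the query and frame range are the example quoted in the Figure 3 caption.

```python
from dataclasses import dataclass, field


@dataclass
class Track:
    """One referred instance: a temporal span plus per-frame boxes (hypothetical layout)."""
    track_id: int
    start_frame: int
    end_frame: int  # inclusive
    # frame index -> (x, y, width, height) in pixels
    boxes: dict[int, tuple[float, float, float, float]] = field(default_factory=dict)


@dataclass
class GroundingResult:
    """All instances that satisfy one language query in one video."""
    video_id: str
    query: str
    tracks: list[Track] = field(default_factory=list)


result = GroundingResult(
    video_id="example_video",  # placeholder identifier
    query="A person is dancing in the open area",
    tracks=[
        Track(track_id=1, start_frame=2543, end_frame=2782,
              boxes={2543: (100.0, 50.0, 40.0, 90.0)}),  # made-up box
    ],
)

# Length of the localized segment in frames (inclusive bounds).
span = result.tracks[0].end_frame - result.tracks[0].start_frame + 1
print(len(result.tracks), span)
```

Keeping a list of tracks per query, rather than a single track, is what distinguishes the multi-instance setting: a query may match zero, one, or many actors.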

Load-bearing premise

The annotation pipeline of expert manual labeling combined with automated paraphrase augmentation and human verification produces linguistically diverse and factually correct labels that accurately capture multi-actor disambiguation without systematic bias or error.

What would settle it

An independent review of a sample of annotations that identifies consistent mismatches between the language queries and the actual actors or timings present in the videos.

Figures

Figures reproduced from arXiv: 2510.13016 by Aljoša Ošep, Jindong Gu, Laura Leal-Taixé, Mark Weber, Rajat Koner, Shuaicong Wu, Suprosanna Shit, Tanveer Hannan, Thomas Seidl.

Figure 1. Comparison of existing video grounding paradigms with our proposed …

Figure 2. Statistics of SVAG-Bench. The majority of queries fall within the range of 6 to 10 words.
    … discrimination, by featuring multiple visually similar instances of the same category engaged in distinct actions. The dataset sources videos from established multi-object tracking benchmarks (MOT17 [4], MOT20 [5], and OVIS [22]), selecting sequences where objects of similar appearance perform different actions. Human an…

Figure 3. Overview of the SVAGFormer pipeline for Spatio-temporal Video Action Grounding (SVAG). Given a natural language query (e.g., “A person is dancing in the open area”), the model first performs temporal grounding to localize the relevant video segment (Start: 2543 → End: 2782), followed by spatial grounding to identify the target person across frames.
    … the usefulness of this dataset for training and evaluating…

Figure 4. Flowchart for processing evaluation. Spatial and temporal evaluations are conducted separately on …

Figure 5. A qualitative visualization example of a zebra performing a fine-grained action: tilting its head to the …

Figure 6. Statistics on OVIS dataset. High performance on categories like Dog, Rabbit, and Airplane. Dynamic actions like move, fight lead successful sequences.
    … develop models to tackle the SVAG task. The competition is hosted on the Codabench platform, with SVAGEval as the official evaluation benchmark. Several teams have submitted their results, and a summary leaderboard is presented in Tab. 4. We conducted seve…
Original abstract

A truly capable AI system must do more than detect objects or recognize activities in isolation. It must form unified, grounded representations of who is acting, what they are doing, and when and where these actions unfold. These representations provide the perceptual bedrock for high-level reasoning, planning, and embodied interaction in the real world. Building such agents is central to long-horizon goals in embodied AI and robotics. Current video benchmarks evaluate fragments of these capabilities in isolation. They focus on either spatial grounding, object tracking, or temporal localization. As a result, they cannot rigorously measure progress on their joint, multi-instance integration. We introduce Spatio-temporal Video Action Grounding (SVAG), a task and benchmark that explicitly targets this unified competence by requiring models to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query in complex, multi-actor scenes. To support this task, we construct SVAG-Bench. It comprises 688 videos, 19,590 verified annotations, and 903 unique action verbs drawn from crowded urban environments, wildlife, and traffic surveillance. Each video has on average 28.5 action-centric queries. This yields the densest annotation among comparable video grounding benchmarks and enables fine-grained evaluation of multi-actor disambiguation, temporal overlap, and action compositionality. Annotations are produced by a pipeline that combines expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification to ensure both linguistic diversity and correctness. We further release SVAGEval, a standardized multi-referent evaluation toolkit. We also introduce SVAGFormer, a strong modular baseline architecture for SVAG.
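Temporal localization quality is commonly scored with the intersection-over-union of predicted and ground-truth intervals. The sketch below shows that generic measure for inclusive frame intervals; it is an illustration of the standard metric, not the SVAGEval implementation, and the function name `temporal_iou` and the predicted interval are assumptions.

```python
def temporal_iou(pred: tuple[int, int], gt: tuple[int, int]) -> float:
    """IoU of two inclusive frame intervals given as (start, end)."""
    inter_start = max(pred[0], gt[0])
    inter_end = min(pred[1], gt[1])
    # Number of shared frames; zero when the intervals do not overlap.
    intersection = max(0, inter_end - inter_start + 1)
    union = (pred[1] - pred[0] + 1) + (gt[1] - gt[0] + 1) - intersection
    return intersection / union


# Hypothetical prediction that starts 43 frames too early.
iou = temporal_iou(pred=(2500, 2782), gt=(2543, 2782))
print(round(iou, 3))
```

In a multi-instance setting each matched actor would get its own interval score, so a model cannot score well by localizing only one of several actors performing the queried action.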

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the Spatio-temporal Video Action Grounding (SVAG) task, which requires models to jointly detect, track, and temporally localize all instances satisfying a natural language query in multi-actor scenes. It presents SVAG-Bench (688 videos, 19,590 verified annotations, 903 verbs from urban/wildlife/traffic domains, average 28.5 queries per video) as the densest such benchmark, produced via expert labeling + GPT-3.5 paraphrasing + human verification, plus the SVAGEval toolkit and SVAGFormer baseline.

Significance. If the annotations prove reliable, the benchmark would fill a clear gap by enabling rigorous evaluation of integrated spatio-temporal grounding rather than isolated subtasks, supporting progress toward embodied AI. The annotation density, multi-referent focus, and release of a standardized evaluation toolkit are concrete strengths that could facilitate reproducible comparisons.

major comments (1)
  1. [Abstract] The central claim that the pipeline of expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification yields linguistically diverse and factually correct labels for multi-actor disambiguation rests on an unquantified assertion. No inter-annotator agreement scores, verification error rates, or failure-mode analysis for temporal overlap or instance disambiguation are reported, directly affecting the reliability of the 'densest annotation' and evaluation claims.
minor comments (1)
  1. Provide explicit details on video selection criteria, exact train/val/test splits, and how the 903 verbs were curated to allow independent reproduction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of SVAG-Bench in advancing integrated spatio-temporal grounding research. We address the concern regarding annotation reliability below and commit to revisions that directly strengthen the supporting evidence.

  1. Referee: [Abstract] The central claim that the pipeline of expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification yields linguistically diverse and factually correct labels for multi-actor disambiguation rests on an unquantified assertion. No inter-annotator agreement scores, verification error rates, or failure-mode analysis for temporal overlap or instance disambiguation are reported, directly affecting the reliability of the 'densest annotation' and evaluation claims.

    Authors: We agree that the abstract's assertion would be more robust with explicit quantitative backing. Section 3.2 of the manuscript describes the three-stage pipeline (expert manual labeling by trained annotators, GPT-3.5 paraphrase augmentation for linguistic diversity, and independent human verification for factual accuracy), yet no numerical metrics are supplied. We will revise the paper to add: (1) inter-annotator agreement scores computed via Fleiss' kappa on a 10% random subset of videos, covering verb selection, temporal interval boundaries, and instance ID assignment; (2) verification error rates, defined as the fraction of GPT-generated paraphrases rejected or corrected during the final human review; and (3) a concise failure-mode analysis in the annotation section that discusses observed challenges with concurrent actions (temporal overlap) and crowded-scene disambiguation, supported by representative examples from the urban and wildlife subsets. These additions will appear in a dedicated subsection on annotation quality and will be referenced in the abstract and introduction. revision: yes
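The two quantitative checks promised in the rebuttal can be sketched as follows. This is a plain implementation of the standard Fleiss' kappa formula and a simple rejection fraction; the function names, the toy rating table, and the input layout are assumptions made for illustration, not the authors' code or data.

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    annotators who assigned item i to category j.
    Assumes every item is rated by the same number of annotators (>= 2)
    and that more than one category is actually used."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items

    # Share of all assignments falling in each category, and chance agreement.
    p_cats = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)

    return (p_bar - p_e) / (1 - p_e)


def verification_error_rate(num_rejected_or_corrected: int, num_reviewed: int) -> float:
    """Fraction of reviewed paraphrases that were rejected or corrected."""
    return num_rejected_or_corrected / num_reviewed


# Toy table: 4 items, 3 annotators, 2 categories (made-up numbers).
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa_toy = fleiss_kappa(toy)
print(round(kappa_toy, 3))
```

For the categorical decisions named in the rebuttal (verb selection, instance ID assignment) a table like this applies directly; temporal interval boundaries are continuous, so they would first need to be discretized or scored with a different agreement measure.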

Circularity Check

0 steps flagged

No circularity in task definition or benchmark construction

The paper defines a new task (SVAG) requiring simultaneous detection, tracking, and temporal localization of all objects matching a natural language query, then constructs SVAG-Bench via expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification. No equations, parameter fitting, or predictions appear that reduce to inputs by construction. The contribution is self-contained as an independent task and dataset release (688 videos, 19,590 annotations, 903 verbs) without load-bearing self-citations, uniqueness theorems from prior author work, or renaming of known results. Annotation pipeline details address quality but do not create circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard computer vision assumptions about the validity of human-verified annotations and the utility of the proposed task for downstream embodied AI. It introduces no free parameters or invented entities, and its single axiom is a domain assumption.

axioms (1)
  • domain assumption: Unified grounded representations of actor, action, time, and location are required for high-level reasoning in embodied AI.
    Stated directly in the opening of the abstract as the perceptual bedrock.

pith-pipeline@v0.9.0 · 5873 in / 1175 out tokens · 35609 ms · 2026-05-18T07:02:35.278046+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We introduce Spatio-temporal Video Action Grounding (SVAG), a task and benchmark that explicitly targets this unified competence by requiring models to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query in complex, multi-actor scenes.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

    cs.CV · 2026-05 · unverdicted · novelty 8.0

    VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Flashvtg: Feature layering and adaptive score handling network for video temporal grounding

    Zhuo Cao, Bingqing Zhang, Heming Du, Xin Yu, Xue Li, and Sen Wang. Flashvtg: Feature layering and adaptive score handling network for video temporal grounding. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 9208–9218, February

  2. [2]

    End-to-end multi-modal video temporal grounding

    Yi-Wen Chen, Yi-Hsuan Tsai, and Ming-Hsuan Yang. End-to-end multi-modal video temporal grounding. Advances in Neural Information Processing Systems, 34:28442–28453, 2021.

  3. [3]

    Tao: A large-scale benchmark for tracking any object

    Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, and Deva Ramanan. Tao: A large-scale benchmark for tracking any object. In European conference on computer vision, pages 436–454. Springer, 2020.

  4. [4]

    Motchallenge: A benchmark for single-camera multiple target tracking

    Patrick Dendorfer, Aljosa Osep, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, Stefan Roth, and Laura Leal-Taixé. Motchallenge: A benchmark for single-camera multiple target tracking. International Journal of Computer Vision, 129:845–881, 2021.

  5. [5]

    Mot20: A benchmark for multi object tracking in crowded scenes

    Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003, 2020.

  6. [6]

    Lasot: A high-quality benchmark for large-scale single object tracking

    Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5374–5383, 2019.

  7. [7]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017.

  8. [8]

    Context-guided spatio-temporal video grounding

    Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, and Libo Zhang. Context-guided spatio-temporal video grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18330–18339, 2024.

  9. [9]

    Prior knowledge integration via llm encoding and pseudo event regulation for video moment retrieval

    Yiyang Jiang, Wengyu Zhang, Xulu Zhang, Xiao-Yong Wei, Chang Wen Chen, and Qing Li. Prior knowledge integration via llm encoding and pseudo event regulation for video moment retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 7249–7258, 2024.

  10. [10]

    Video object segmentation with language referring expressions

    Anna Khoreva, Anna Rohrbach, and Bernt Schiele. Video object segmentation with language referring expressions. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part IV 14, pages 123–141. Springer, 2019.

  11. [11]

    Enhancing temporal action localization: Advanced s6 modeling with recurrent mechanism

    Sangyoun Lee, Juho Jung, Changdae Oh, and Sunghee Yun. Enhancing temporal action localization: Advanced s6 modeling with recurrent mechanism. arXiv preprint arXiv:2407.13078,

  12. [12]

    Detecting moments and highlights in videos via natural language queries

    Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34:11846–11858,

  13. [13]

    Tvqa+: Spatio-temporal grounding for video question answering

    Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering. arXiv preprint arXiv:1904.11574, 2019.

  14. [14]

    Univtg: Towards unified video-language temporal grounding

    Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023.

  15. [15]

    Stvgformer: Spatio-temporal video grounding with static-dynamic cross-modal understanding

    Zihang Lin, Chaolei Tan, Jian-Fang Hu, Zhi Jin, Tiancai Ye, and Wei-Shi Zheng. Stvgformer: Spatio-temporal video grounding with static-dynamic cross-modal understanding. In Proceedings of the 4th on Person in Context Workshop, pages 1–5, 2022.

  16. [16]

    End-to-end temporal action detection with 1b parameters across 1000 frames

    Shuming Liu, Chen-Lin Zhang, Chen Zhao, and Bernard Ghanem. End-to-end temporal action detection with 1b parameters across 1000 frames. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18591–18601, 2024.

  17. [17]

    Hota: A higher order metric for evaluating multi-object tracking

    Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixé, and Bastian Leibe. Hota: A higher order metric for evaluating multi-object tracking. International journal of computer vision, 129:548–578, 2021.

  18. [18]

    Spatial–temporal video grounding with cross-modal understanding and enhancement

    Shu Luo, Jingyu Pan, Da Cao, Jiawei Wang, Yuquan Le, and Meng Liu. Spatial–temporal video grounding with cross-modal understanding and enhancement. Expert Systems with Applications, 271:126650, 2025.

  19. [19]

    Modeling context between objects for referring expression understanding

    Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Modeling context between objects for referring expression understanding. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 792–807. Springer, 2016.

  20. [20]

    Type-to-track: Retrieve any object via prompt-based tracking

    Pha Nguyen, Kha Gia Quach, Kris Kitani, and Khoa Luu. Type-to-track: Retrieve any object via prompt-based tracking. Advances in Neural Information Processing Systems, 36:3205–3219,

  21. [21]

    OpenAI. https://chatgpt.com/, 2023.

  22. [22]

    Occluded video instance segmentation: A benchmark

    Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip HS Torr, and Song Bai. Occluded video instance segmentation: A benchmark. International Journal of Computer Vision, 130(8):2022–2039, 2022.

  23. [23]

    Chatvtg: Video temporal grounding via chat with video dialogue large language models

    Mengxue Qu, Xiaodong Chen, Wu Liu, Alicia Li, and Yao Zhao. Chatvtg: Video temporal grounding via chat with video dialogue large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1847–1856, 2024.

  24. [24]

    Grounding action descriptions in videos

    Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36, 2013.

  25. [25]

    Urvos: Unified referring video object segmentation network with a large-scale benchmark

    Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16, pages 208–223. Springer, 2020.

  26. [26]

    Annotating objects and relations in user-generated videos

    Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, pages 279–287, 2019.

  27. [27]

    Temporal action localization with enhanced instant discriminability

    Dingfeng Shi, Qiong Cao, Yujie Zhong, Shan An, Jian Cheng, Haogang Zhu, and Dacheng Tao. Temporal action localization with enhanced instant discriminability. arXiv preprint arXiv:2309.05590, 2023.

  28. [28]

    Human-centric spatio-temporal video grounding with visual transformers

    Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, and Dong Xu. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8238–8249, 2021.

  29. [29]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  30. [30]

    Spacevllm: Endowing multimodal large language model with spatio-temporal video grounding capability

    Jiankang Wang, Zhihan Zhang, Zhihang Liu, Yang Li, Jiannan Ge, Hongtao Xie, and Yongdong Zhang. Spacevllm: Endowing multimodal large language model with spatio-temporal video grounding capability. arXiv preprint arXiv:2503.13983, 2025.

  31. [31]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer,

  32. [32]

    Referring multi-object tracking

    Dongming Wu, Wencheng Han, Tiancai Wang, Xingping Dong, Xiangyu Zhang, and Jianbing Shen. Referring multi-object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14633–14642, 2023.

  33. [33]

    Visual relation grounding in videos

    Junbin Xiao, Xindi Shang, Xun Yang, Sheng Tang, and Tat-Seng Chua. Visual relation grounding in videos. In European conference on computer vision, pages 447–464. Springer,

  34. [34]

    Point-supervised video temporal grounding

    Zhe Xu, Kun Wei, Xu Yang, and Cheng Deng. Point-supervised video temporal grounding. IEEE Transactions on Multimedia, 25:6121–6131, 2022.

  35. [35]

    Bootstrapping referring multi-object tracking

    Yani Zhang, Dongming Wu, Wencheng Han, and Xingping Dong. Bootstrapping referring multi-object tracking. arXiv preprint arXiv:2406.05039, 2024.

  36. [36]

    Where does it exist: Spatio-temporal video grounding for multi-form sentences

    Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10668–10677, 2020.