GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models
Pith reviewed 2026-05-10 16:34 UTC · model grok-4.3
The pith
GTASA supplies multi-actor videos with exact per-frame 3D spatial graphs and event mappings to evaluate and train video models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GTASA is a corpus of multi-actor videos with per-frame spatial relation graphs and event-level temporal mappings produced by the GEST-Engine. Human evaluators rate its videos higher in physical validity and semantic alignment than those from neural generators. Training video captioning models on GTASA data yields better results than training on neural-generated videos. Probing four frozen video encoders on 11 tasks enabled by the exact ground truth shows that self-supervised encoders encode spatial structure significantly better than VLM visual encoders.
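The probing protocol implied by this claim (fit a lightweight classifier on frozen encoder features for each task, then compare accuracies across encoders) can be sketched as follows. The ridge-regression probe, feature dimensions, and solver choice below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

def linear_probe(train_feats, train_labels, test_feats, n_classes, lam=1e-2):
    """Fit a ridge-regression linear probe on frozen encoder features.

    train_feats: (N, D) array of per-video embeddings from a frozen encoder.
    train_labels: (N,) integer labels for one spatiotemporal task.
    Returns predicted labels for test_feats.
    """
    X = np.hstack([train_feats, np.ones((len(train_feats), 1))])  # append bias column
    Y = np.eye(n_classes)[train_labels]                           # one-hot targets
    # Closed-form ridge solution: W = (X^T X + lam I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    Xt = np.hstack([test_feats, np.ones((len(test_feats), 1))])
    return (Xt @ W).argmax(axis=1)
```

Under this kind of protocol, the encoder ranking depends only on the features, since the probe itself is identical and cheap for every encoder and task.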
What carries the argument
GEST-Engine, a system that generates videos from graphs of events in space and time to produce exact per-frame 3D spatial relation graphs and event mappings as ground truth.
Load-bearing premise
The assumption that the GEST-Engine's per-frame spatial graphs and event mappings accurately describe the rendered videos, and that physical plausibility and semantic faithfulness are properties human raters can judge reliably.
What would settle it
A blind human evaluation in which raters score physical validity and semantic alignment of GTASA videos no higher than those from neural generators, or video captioning models trained on GTASA show no accuracy gain over models trained on neural-generated videos.
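The analysis step of such a blind evaluation can be sketched with a paired bootstrap over per-video rating differences. The 1-to-5 rating scale, sample sizes, and function names below are hypothetical; the paper's actual statistical procedure is not specified here:

```python
import random

def paired_bootstrap_ci(ratings_a, ratings_b, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the mean paired rating difference.

    ratings_a, ratings_b: per-video scores for two generators (e.g. GTASA
    vs. a neural baseline), paired by video/rater. If the interval excludes
    0, the mean-rating gap is unlikely to be sampling noise.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

A "settling" outcome in the sense above would be an interval that contains or falls below zero.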
Original abstract
Generating complex multi-actor scenario videos remains difficult even for state-of-the-art neural generators, while evaluating them is hard due to the lack of ground truth for physical plausibility and semantic faithfulness. We introduce GTASA, a corpus of multi-actor videos with per-frame spatial relation graphs and event-level temporal mappings, and the system that produced it based on Graphs of Events in Space and Time (GEST): GEST-Engine. We compare our method with both open and closed source neural generators and prove both qualitatively (human evaluation of physical validity and semantic alignment) and quantitatively (via training video captioning models) the clear advantages of our method. Probing four frozen video encoders across 11 spatiotemporal reasoning tasks enabled by GTASA's exact 3D ground truth reveals that self-supervised encoders encode spatial structure significantly better than VLM visual encoders.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents GTASA, a dataset of multi-actor videos accompanied by per-frame spatial relation graphs and event-level temporal mappings generated by the GEST-Engine. It claims to demonstrate the advantages of this approach over open and closed source neural video generators both qualitatively through human evaluations of physical validity and semantic alignment and quantitatively by training video captioning models. Furthermore, by using the exact 3D ground truth to create 11 spatiotemporal reasoning tasks, it probes four frozen video encoders and finds that self-supervised encoders encode spatial structure significantly better than VLM visual encoders.
Significance. If the ground truth annotations prove to be accurate and the evaluations robust, GTASA could serve as an important benchmark for assessing physical plausibility and semantic faithfulness in video generation models, as well as for probing the capabilities of video encoders on spatiotemporal tasks. The distinction between self-supervised and VLM encoders on spatial structure is a potentially useful insight for the field.
major comments (2)
- Abstract: The abstract asserts qualitative and quantitative advantages but supplies no details on the human evaluation protocol, number of raters, statistical tests, or how the 11 tasks were constructed, leaving the central claims without visible supporting evidence.
- GEST-Engine and probing experiments sections: The claim that GTASA supplies 'exact 3D ground truth' for the 11 spatiotemporal tasks requires the per-frame spatial graphs and event mappings to be verifiably accurate, yet no independent check (physics simulation match, real 3D capture comparison, or automated consistency test) is described; this is load-bearing for the reported superiority of self-supervised encoders over VLM encoders.
minor comments (1)
- The description of the spatial relation graphs would benefit from an explicit example or diagram to clarify the per-frame annotation format.
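To make the minor comment concrete, a per-frame record might look like the following. The field names, relation vocabulary, and frame indices are hypothetical, since the dataset's actual schema is not shown here:

```python
# Hypothetical per-frame annotation for one GTASA video (illustrative only;
# the real annotation format may differ).
frame_annotation = {
    "frame": 120,
    "actors": {
        "actor_1": {"position": [1.2, 0.0, 3.4]},  # 3D world coordinates
        "actor_2": {"position": [2.8, 0.0, 3.1]},
    },
    "relations": [  # directed edges of the per-frame spatial relation graph
        ("actor_1", "left_of", "actor_2"),
        ("actor_2", "right_of", "actor_1"),
    ],
    "events": [  # event-level temporal mapping: active during [start, end)
        {"event": "actor_1 walks toward actor_2", "start": 90, "end": 150},
    ],
}

def active_events(ann, frame):
    """Events from the temporal mapping that cover the given frame."""
    return [e["event"] for e in ann["events"] if e["start"] <= frame < e["end"]]
```

A diagram in the manuscript showing one such record next to the rendered frame would resolve the comment.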
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the positive assessment of GTASA's potential as a benchmark. We respond to each major comment below, clarifying points of the manuscript and indicating where revisions will strengthen the presentation.
Point-by-point responses
- Referee: Abstract: The abstract asserts qualitative and quantitative advantages but supplies no details on the human evaluation protocol, number of raters, statistical tests, or how the 11 tasks were constructed, leaving the central claims without visible supporting evidence.
  Authors: We agree that the abstract, constrained by length, does not preview these details. The full manuscript describes the human evaluation protocol (including rater instructions, rating scales for physical validity and semantic alignment, and statistical analysis) in the evaluation section, and details the construction of the 11 spatiotemporal tasks from the per-frame graphs and event mappings in the probing section. To improve the visibility of the supporting evidence, we will revise the abstract to include a concise reference to the human evaluation and task construction while remaining within length limits. Revision: yes
- Referee: GEST-Engine and probing experiments sections: The claim that GTASA supplies 'exact 3D ground truth' for the 11 spatiotemporal tasks requires the per-frame spatial graphs and event mappings to be verifiably accurate, yet no independent check (physics simulation match, real 3D capture comparison, or automated consistency test) is described; this is load-bearing for the reported superiority of self-supervised encoders over VLM encoders.
  Authors: The annotations are exact by construction: the GEST-Engine renders videos from explicit 3D scene parameters, spatial relation graphs, and timed event sequences, so the per-frame graphs and event mappings are the generative inputs rather than post-hoc inferences. This generative process is described in the GEST-Engine section. We did not include external validation against real captures or separate physics engines because the dataset is intentionally synthetic, providing perfect alignment between video and annotations. We acknowledge that an explicit consistency check would strengthen the claim and will add a short discussion of this point plus a simple automated verification (e.g., graph consistency) in the revised manuscript. The encoder probing results are presented with this synthetic ground truth in mind. Revision: partial
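The automated verification the rebuttal proposes could be as simple as checking that every spatial relation edge has its inverse and that event intervals fall inside the clip. A minimal sketch, assuming a hypothetical relation vocabulary (the actual relation names are not given here):

```python
# Hypothetical graph-consistency check, in the spirit of the verification
# proposed in the rebuttal. The relation vocabulary is assumed.
INVERSE = {"left_of": "right_of", "right_of": "left_of",
           "in_front_of": "behind", "behind": "in_front_of"}

def check_graph(relations, n_frames=None, events=None):
    """Return a list of human-readable inconsistencies (empty if clean).

    relations: iterable of (subject, relation, object) edges for one frame.
    events: optional list of {"event", "start", "end"} temporal mappings.
    """
    errors = []
    edge_set = set(relations)
    for subj, rel, obj in relations:
        inv = INVERSE.get(rel)
        if inv and (obj, inv, subj) not in edge_set:
            errors.append(f"missing inverse edge for ({subj}, {rel}, {obj})")
    for ev in events or []:
        if not (0 <= ev["start"] < ev["end"] <= (n_frames or ev["end"])):
            errors.append(f"bad interval for event {ev['event']!r}")
    return errors
```

Because the annotations are generative inputs, such a check mainly guards against export bugs rather than annotation noise, which is consistent with the authors' "exact by construction" argument.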
Circularity Check
No circularity detected in derivation or claims
full rationale
The paper introduces a new dataset GTASA produced by the GEST-Engine, performs external comparisons against other generators via human evaluation and downstream captioning model training, and probes frozen encoders on 11 tasks using the annotations as ground truth. No equations, parameter fitting, self-citations, or ansatzes are present in the provided text that reduce any central claim (advantages of the method or encoder performance differences) to an input by construction. The work is self-contained against external benchmarks and does not rely on self-referential definitions or renamed known results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: GEST-Engine generates videos whose per-frame spatial relation graphs and event mappings constitute accurate ground truth for physical plausibility and semantic faithfulness.
invented entities (1)
- GEST-Engine (no independent evidence)
Reference graph
Works this paper leans on
- [1] Allen, J.F.: Maintaining knowledge about temporal intervals. Communications of the ACM 26(11), 832–843 (1983)
- [2] Alonso, E., Jelley, A., Micheli, V., Kanervisto, A., Storkey, A.J., Pearce, T., Fleuret, F.: Diffusion for world modeling: Visual details matter in Atari. Advances in Neural Information Processing Systems 37, 58757–58791 (2024)
- [3] Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: Semantic propositional image caption evaluation. In: European Conference on Computer Vision. pp. 382–
- [4] Anonymous: [Tiny paper] GEST-Engine: Controllable multi-actor video synthesis with perfect spatiotemporal annotations. In: ICLR 2026 2nd Workshop on World Models: Understanding, Modelling and Scaling (2026), supplied as supplemental material
- [5] Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)
- [6] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
- [7] Ball, P.J., Bauer, J., Belletti, F., Brownfield, B., Ephrat, A., Fruchter, S., Gupta, A., Holsheimer, K., Holynski, A., Hron, J., Kaplanis, C., Limont, M., McGill, M., Oliveira, Y., Parker-Holder, J., Perbet, F., Scully, G., Shar, J., Spencer, S., Tov, O., Villegas, R., Wang, E., Yung, J., et al. (2025)
- [8] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. pp. 65–72 (2005)
- [9] Black, M.J., Patel, P., Tesch, J., Yang, J.: BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8726–8737 (2023)
- [10] Bogolin, S.V., Croitoru, I., Leordeanu, M.: A hierarchical approach to vision-based language generation: from simple sentences to complex natural language. In: Proceedings of the 28th International Conference on Computational Linguistics. pp. 2436–2447 (2020)
- [11] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators
- [12] Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative interactive environments. In: Forty-first International Conference on Machine Learning (2024)
- [13] Cai, M., Tan, R., Zhang, J., Zou, B., Zhang, K., Yao, F., Zhu, F., Gu, J., Zhong, Y., Shang, Y., et al.: TemporalBench: Benchmarking fine-grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818 (2024)
- [14] Chen, D., Kasarla, T., Bang, Y., Shukor, M., Chung, W., Yu, J., Bolourchi, A., Moutakanni, T., Fung, P.: Action100M: A large-scale video action dataset. arXiv preprint arXiv:2601.10592 (2026)
- [15] Damen, D., Doughty, H., Farinella, G.M., Furnari, A., Kazakos, E., Ma, J., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al.: Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision 130(1), 33–55 (2022)
- [16] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems 36, 10088–10115 (2023)
- [17] Ding, J., Zhang, Y., Shang, Y., Zhang, Y., Zong, Z., Feng, J., Yuan, Y., Su, H., Li, N., Sukiennik, N., et al.: Understanding world or predicting future? A comprehensive survey of world models. ACM Computing Surveys 58(3), 1–38 (2025)
- [18] Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: An open urban driving simulator. In: Conference on Robot Learning. pp. 1–16. PMLR (2017)
- [19] El Banani, M., Raj, A., Maninis, K.K., Kar, A., Li, Y., Rubinstein, M., Sun, D., Guibas, L., Johnson, J., Jampani, V.: Probing the 3D awareness of visual foundation models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21795–21806 (2024)
- [20] Google: Veo 3 announcement (2025), https://blog.google/innovation-and-ai/products/generative-media-models-io-2025/#veo-3
- [21] Google: Veo 3 launch (2025), https://cloud.google.com/blog/products/ai-machine-learning/veo-3-fast-available-for-everyone-on-vertex-ai
- [22] Google: Veo 3 model card (2025), https://storage.googleapis.com/deepmind-media/Model-Cards/Veo-3-Model-Card.pdf, accessed: March 04, 2026
- [23] Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al.: The "something something" video database for learning and evaluating visual common sense. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5842–5850 (2017)
- [24] Ha, D., Schmidhuber, J.: World models. arXiv preprint arXiv:1803.10122 (2018)
- [25] HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., et al.: LTX-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026)
- [26] HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024)
- [27] Hafner, D., Pasukonis, J., Ba, J., Lillicrap, T.: Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104 (2023)
- [28]
- [29] Ji, J., Krishna, R., Fei-Fei, L., Niebles, J.C.: Action Genome: Actions as compositions of spatio-temporal scene graphs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10236–10247 (2020)
- [30] Jia, X., Berry, A., Johnston, A.: The evolutionary disruption: A paradigm shift in film and animation industry driven by real-time rendering and virtual production. Convergence p. 13548565251356932 (2025)
- [31] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
- [32] Krippendorff, K.: Computing Krippendorff's alpha-reliability (2011)
- [33] LeCun, Y.: A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review 62(1), 1–62 (2022)
- [34] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81 (2004)
- [35] Mangalam, K., Akshulakov, R., Malik, J.: EgoSchema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems 36, 46212–46244 (2023)
- [36] Masala, M., Cudlenco, N., Rebedea, T., Leordeanu, M.: Explaining vision and language through graphs of events in space and time. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2826–2831 (2023)
- [37] Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2630–2640 (2019)
- [38] Paiss, R., Ephrat, A., Tov, O., Zada, S., Mosseri, I., Irani, M., Dekel, T.: Teaching CLIP to count to ten. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3170–3180 (2023)
- [39] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
- [40] Puig, X., Ra, K., Boben, M., Li, J., Wang, T., Fidler, S., Torralba, A.: VirtualHome: Simulating household activities via programs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8494–8502 (2018)
- [41] Qiu, Y., Nagasaki, Y., Hara, K., Kataoka, H., Suzuki, R., Iwata, K., Satoh, Y.: VirtualHome Action Genome: A simulated spatio-temporal scene graph dataset with consistent relationship labels. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3351–3360 (2023)
- [42] Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth from computer games. In: European Conference on Computer Vision. pp. 102–118. Springer (2016)
- [43] Sellam, T., Das, D., Parikh, A.: BLEURT: Learning robust metrics for text generation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 7881–7892 (2020)
- [44] Tong, P., Brown, E., Wu, P., Woo, S., Iyer, A.J.V., Akula, S.C., Yang, S., Yang, J., Middepogu, M., Wang, Z., et al.: Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing Systems 37, 87310–87356 (2024)
- [45] Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: Consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4566–4575 (2015)
- [46] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
- [47] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: VideoMAE V2: Scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14549–14560 (2023)
- [48] Wiedemer, T., Li, Y., Vicol, P., Gu, S.S., Matarese, N., Swersky, K., Kim, B., Jaini, P., Geirhos, R.: Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025)
- [49] Yang, J., Peng, W., Li, X., Guo, Z., Chen, L., Li, B., Ma, Z., Zhou, K., Zhang, W., Loy, C.C., et al.: Panoptic video scene graph generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18675–18685 (2023)
- [50] Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)
- [51] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019)
- [52] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36, 46595–46623 (2023)