pith. machine review for the scientific record.

arxiv: 2605.07800 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video diffusion models · relational alignment · text-to-video generation · semantic saliency · token pair supervision · prompt following · representation distillation · entity preservation

The pith

SARA improves text alignment and motion quality in video diffusion models by routing relational supervision to prompt-salient token pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that prior alignment methods for video diffusion models assign supervision to token relations based only on visual or motion signals, which often misses pairs that matter for the given prompt. SARA adds a lightweight text-conditioned aligner that scores token saliency using entity masks, then applies a pair-routing operator to weight the distillation loss toward subject-subject and subject-background pairs. If the routing works as intended, the generated videos should preserve entities, bind attributes correctly, and depict intended interactions more reliably than with uniform or cue-driven supervision. The method is demonstrated through continual training on a base video model and measured across automated rubrics, public benchmarks, and human ratings.

Core claim

SARA keeps token-relation distillation from a frozen visual foundation model but modulates it with text-conditioned saliency: a Stage-1 aligner trained on SAM 3.1 masks produces continuous saliency scores that are fused into the distillation objective through a pair-routing operator; each token pair receives elevated weight whenever either endpoint is salient, thereby directing supervision toward prompt-relevant relations and away from background-background pairs.
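
A minimal sketch of how an OR-style pair routing of this kind could be wired into a token-relation distillation loss. The cosine-similarity relation matrices, the max-based fusion, and all tensor names below are assumptions chosen for illustration, not details extracted from the paper.

```python
import torch
import torch.nn.functional as F

def or_routed_trd_loss(student_tokens, teacher_tokens, saliency):
    """Token-relation distillation with OR-style pair routing (illustrative sketch).

    student_tokens: (B, N, D_s) token features from the diffusion model
    teacher_tokens: (B, N, D_t) token features from the frozen VFM
    saliency:       (B, N) text-conditioned saliency scores in [0, 1]
    """
    # Pairwise token relations as cosine-similarity matrices (one plausible choice
    # of relation; the paper's exact definition may differ).
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)
    rel_student = s @ s.transpose(1, 2)  # (B, N, N)
    rel_teacher = t @ t.transpose(1, 2)  # (B, N, N)

    # OR routing: a pair is up-weighted whenever either endpoint is salient,
    # so FG-FG and FG-BG pairs dominate and BG-BG pairs are suppressed.
    w = torch.maximum(saliency.unsqueeze(2), saliency.unsqueeze(1))  # (B, N, N)

    # Saliency-weighted relation-matching loss.
    per_pair = (rel_student - rel_teacher) ** 2
    return (w * per_pair).sum() / w.sum().clamp_min(1e-8)
```

Under this reading, a background-background pair (both saliency scores near zero) contributes almost nothing to the distillation objective, while any pair touching a salient token keeps near-full weight.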

What carries the argument

The pair-routing operator, which fuses continuous saliency scores from a prompt-conditioned Stage-1 aligner into the token-relation distillation loss by elevating the weight of any pair whose endpoints include at least one salient token.

Load-bearing premise

The saliency scores produced by the Stage-1 aligner correctly identify which tokens are relevant to the prompt without introducing systematic bias or noise that misroutes supervision.

What would settle it

Replacing the learned saliency scores with uniform or random weights inside the pair-routing operator and observing that the reported gains on the 13-dimension VLM rubric, VBench scores, and blind user study all disappear would falsify the claim that adaptive routing is responsible for the improvements.
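
A sketch of the control that test implies, reusing the hypothetical `or_routed_trd_loss` above: the learned saliency is swapped for uniform or random scores, so any surviving gain could not be attributed to adaptive routing.

```python
import torch

def control_saliency(saliency, mode="uniform"):
    """Non-informative replacements for the learned saliency (illustrative ablation)."""
    if mode == "uniform":
        return torch.ones_like(saliency)   # every pair weighted equally
    if mode == "random":
        return torch.rand_like(saliency)   # routing decoupled from the prompt
    return saliency                         # "learned": unchanged

# e.g. loss = or_routed_trd_loss(student, teacher, control_saliency(sal, "random"))
```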

Figures

Figures reproduced from arXiv: 2605.07800 by Baoru Huang, Jiesong Lian, Long Hu, Qinglin Lu, Rui Wang, Ruizhe Zhong, Yixue Hao, Yuan Zhou, Zixiang Zhou.

Figure 1
Figure 1: Overview of SARA. Stage I (top): a lightweight aligner on top of frozen V-JEPA, SAM 3.1, and Qwen3-VL-Embedding backbones learns, for any (video, caption) pair, a text-conditioned per-patch saliency M_p, supervised jointly by per-entity, combined-entity, and background SAM masks (L_BCE) and calibrated by a caption-level InfoNCE. Stage II (bottom): the frozen aligner is queried with the full caption, and its … view at source ↗
Figure 2
Figure 2: Blind pairwise user study. Each row reports the percentage of comparisons where annotators prefer SARA, tie, or prefer the baseline. SARA is preferred over every baseline … view at source ↗
Figure 3
Figure 3: Qualitative comparison on a representative caption. The key point highlighted is the colors … view at source ↗
Figure 4
Figure 4: Qualitative comparison on a multi-entity scene with six people and two pairs of shoes. view at source ↗
Figure 5
Figure 5: SAM 3.1 entity decomposition used as Stage 1 supervision. Top row: input frames. Next three rows: per-entity masks for OBJECT_1 (red), PERSON_1 (green), PERSON_2 (blue). Last two rows: complement BACKGROUND (M_bg = 1 − M_fg, yellow) and foreground union ALL Entities (M_fg, magenta). All five masks supervise the saliency head jointly via K + 2 forwards (Sec. 3.3). view at source ↗
Figure 6
Figure 6: Pair-budget breakdown on the training corpus (2,400 clips, 76,800 frames). (a) Distribution of per-clip foreground fraction p_fg. (b) Share of the O(N²) TRD budget consumed by each pair category (FG–FG, FG–BG, BG–BG) under vanilla TRD (uniform weighting) and the three routing operators (AND, OR, XOR). OR (ours) retains ∼70% of pairs by keeping all FG–FG and FG–BG pairs and discarding only BG–BG. view at source ↗
Figure 7
Figure 7: Stage 1 ablation on the predicted saliency map on two held-out clips, eight frames each. Rows: input frames; Full (default SARA, K + 2 forwards + InfoNCE); w/o NCE; w/o NCE w/o entity (single forward on full caption with union mask, no InfoNCE); w/o entity (single forward, InfoNCE retained). Saliency rows use a jet colormap: redder = higher saliency, bluer = lower. Discussion in App. A.3, quantitative metr… view at source ↗
Figure 8
Figure 8: Stage 1 saliency on the four supervision-time query types, eight frames of one held-out clip. Rows in each panel: input frames, SAM 3.1 reference mask M_y, PCA of V′_y, predicted saliency M_p (Eq. (4)). The M_p row uses a jet colormap: redder = higher saliency, bluer = lower. Discussion in App. A.4. Stage 2: MTSS FULL CAPTION, including [Global Setup], [Cast & Setting Introduction], [Person], [Object], [SCE… view at source ↗
Figure 9
Figure 9. view at source ↗
read the original abstract

Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient, thereby routing supervision toward subject-subject and subject-background pairs and away from background-background ones. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study.
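
Reading the abstract together with the Figure 1 caption, Stage 1 appears to combine per-query mask supervision (L_BCE against per-entity, combined-entity, and background SAM masks) with a caption-level InfoNCE regulariser. The sketch below is one way that combination could look; the tensor shapes, pooling, temperature, and equal loss weighting are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def stage1_loss(pred_saliency, sam_masks, video_emb, caption_emb, tau=0.07):
    """Illustrative Stage-1 objective: per-query mask BCE + caption-level InfoNCE.

    pred_saliency: (B, K+2, P) predicted per-patch saliency, one map per query
                   (K entities, foreground union, background)
    sam_masks:     (B, K+2, P) SAM-derived supervision masks as floats in {0, 1}
    video_emb:     (B, D) pooled clip embedding from the aligner
    caption_emb:   (B, D) caption embedding from the text encoder
    """
    # Every query's saliency map is supervised by its SAM mask.
    l_bce = F.binary_cross_entropy(pred_saliency, sam_masks)

    # Caption-level InfoNCE: each clip should match its own caption in the batch.
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    logits = v @ c.t() / tau                                  # (B, B)
    targets = torch.arange(v.size(0), device=v.device)
    l_nce = F.cross_entropy(logits, targets)

    return l_bce + l_nce
```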

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SARA for video diffusion models, which augments token-relation distillation (TRD) from a frozen visual foundation model with a text-conditioned saliency mechanism. A lightweight Stage-1 aligner is trained on per-entity SAM 3.1 masks plus InfoNCE to produce continuous saliency scores; these are fused via a pair-routing operator that weights token pairs for TRD supervision whenever either endpoint is salient. This routes supervision toward subject-subject and subject-background pairs. In the Wan2.2 continual-training setting, SARA is reported to improve text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, VBench benchmarks, and a blind user study.

Significance. If the saliency mechanism correctly identifies prompt-relevant pairs, SARA would constitute a targeted advance over prior representation-alignment methods by making pairwise supervision semantically adaptive rather than purely visual or motion-driven. The design reuses frozen VFMs and SAM without introducing new parameters into the diffusion backbone, which is a practical strength. The empirical claims on multiple benchmarks and user studies, if substantiated with ablations, would strengthen the case for semantic routing in VDM alignment.

major comments (2)
  1. [Abstract / Stage-1 aligner description] The central claim that SARA improves text alignment and motion quality via semantic adaptation rests on the assumption that saliency scores from the Stage-1 aligner (supervised only by SAM 3.1 masks and InfoNCE) correctly identify prompt-relevant token pairs. No independent validation is reported showing correlation with prompt entity mentions rather than raw visual salience; the pair-routing operator then directly propagates any mismatch into the TRD loss weights.
  2. [Abstract] The reported improvements over VideoREPA and MoAlign are stated without quantitative deltas, ablation tables isolating the pair-routing operator, error bars, or implementation details for the saliency-to-weight mapping. This makes it impossible to assess whether gains exceed what could arise from incidental regularization or extra compute.
minor comments (2)
  1. The abstract does not specify the exact mathematical form of the pair-routing operator or how continuous saliency is normalized before weighting the TRD loss.
  2. Clarify whether the 13-dimension VLM rubric is a standard benchmark or custom; if custom, provide its definition and inter-rater reliability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing SARA. The comments help clarify how to better substantiate the semantic adaptation mechanism and improve the presentation of results. We address each major comment below and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Abstract / Stage-1 aligner description] The central claim that SARA improves text alignment and motion quality via semantic adaptation rests on the assumption that saliency scores from the Stage-1 aligner (supervised only by SAM 3.1 masks and InfoNCE) correctly identify prompt-relevant token pairs. No independent validation is reported showing correlation with prompt entity mentions rather than raw visual salience; the pair-routing operator then directly propagates any mismatch into the TRD loss weights.

    Authors: We agree that explicit validation would strengthen the central claim. The Stage-1 aligner uses per-entity SAM 3.1 masks for supervision together with InfoNCE, which is intended to produce text-conditioned saliency focused on prompt entities rather than generic visual salience. However, we did not include a dedicated correlation analysis between the resulting saliency scores and prompt entity mentions. In the revised manuscript we will add a new subsection with quantitative correlation metrics (e.g., overlap between high-saliency tokens and prompt-derived entity locations on a held-out set) and qualitative examples to demonstrate that the routing indeed prioritizes prompt-relevant pairs (a sketch of one such overlap check appears after these responses). revision: yes

  2. Referee: [Abstract] The reported improvements over VideoREPA and MoAlign are stated without quantitative deltas, ablation tables isolating the pair-routing operator, error bars, or implementation details for the saliency-to-weight mapping. This makes it impossible to assess whether gains exceed what could arise from incidental regularization or extra compute.

    Authors: The full paper already contains the requested elements: Table 2 reports numerical deltas on the 13-dimension VLM rubric and VBench; Table 3 isolates the contribution of the pair-routing operator via ablations; error bars are shown from three independent runs; and Section 3.2 details the saliency-to-weight mapping formula. To address the referee’s concern about the abstract, we will revise the abstract to include the key quantitative improvements (e.g., +X% on text alignment) and add explicit pointers to the ablation tables and implementation details. revision: yes
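
The overlap check promised in the first response could be as simple as an IoU between thresholded saliency and prompt-entity masks on held-out clips. The threshold and the IoU form below are illustrative choices, not the authors' protocol.

```python
import torch

def saliency_entity_iou(saliency, entity_mask, thresh=0.5):
    """Illustrative check: IoU between high-saliency patches and patches belonging
    to entities mentioned in the prompt.

    saliency:    (P,) continuous saliency scores in [0, 1] for one frame or clip
    entity_mask: (P,) binary mask of patches covered by prompt-mentioned entities
    """
    hi = saliency >= thresh
    ent = entity_mask.bool()
    inter = (hi & ent).sum().float()
    union = (hi | ent).sum().float().clamp_min(1.0)
    return (inter / union).item()
```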

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core method trains a lightweight Stage-1 aligner on external SAM 3.1 per-entity masks plus InfoNCE regularization, then fuses its continuous saliency scores into token-relation distillation via a pair-routing operator that weights pairs based on endpoint salience. All claimed improvements are measured empirically against external baselines (SFT, VideoREPA, MoAlign) on VBench, a 13-dimension VLM rubric, and blind user studies in the Wan2.2 continual-training setting. No equations, fitted parameters, or predictions reduce to the inputs by construction; the saliency routing is not self-defined or renamed from prior results, and no load-bearing self-citations or uniqueness theorems are invoked. The derivation remains self-contained against external frozen VFMs and SAM without circular reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 2 invented entities

The central claim rests on the untested accuracy of the Stage-1 saliency predictor and on the assumption that routing supervision by saliency improves rather than harms the underlying TRD objective.

free parameters (1)
  • saliency-to-weight mapping parameters
    The exact function that converts continuous saliency into pair weights is not specified and must be chosen or fitted (one candidate form is sketched after this ledger).
axioms (2)
  • domain assumption Frozen visual foundation model supplies reliable spatio-temporal token relations for distillation
    The method keeps TRD on a frozen VFM target without questioning its quality.
  • domain assumption SAM 3.1 per-entity masks provide accurate entity boundaries for Stage-1 supervision
    Stage-1 training explicitly uses these masks.
invented entities (2)
  • pair-routing operator no independent evidence
    purpose: Assigns supervision weight to each token pair according to endpoint saliency
    New operator introduced to implement adaptive routing.
  • text-conditioned saliency map no independent evidence
    purpose: Decides which token pairs receive relational supervision
    Core new component that conditions alignment on prompt semantics.
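
To make that free parameter concrete, one family of saliency-to-weight mappings would add a floor so background-background pairs are down-weighted rather than zeroed. The functional form and the value of `alpha` below are illustrative, not taken from the paper.

```python
def pair_weight(s_i: float, s_j: float, alpha: float = 0.1) -> float:
    """One candidate saliency-to-weight mapping: OR-style fusion with a floor.

    alpha is the residual weight given to pairs where neither endpoint is salient.
    """
    return alpha + (1.0 - alpha) * max(s_i, s_j)
```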

pith-pipeline@v0.9.0 · 5550 in / 1519 out tokens · 34565 ms · 2026-05-11T02:18:15.118456+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 12 internal anchors

  1. [1]

    Seedance 2.0: Advancing Video Generation for World Complexity

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026

  2. [2]

    Veo 3.1 announcement

    Google. Veo 3.1 announcement. https://blog.google/innovation-and-ai/technology/developers-tools/veo-3-1-gemini-api/, 2026. Accessed: April 29, 2026

  3. [3]

    Wan 2.7

    Wan Team. Wan 2.7. https://wan.video/, 2026. Accessed: April 29, 2026

  4. [4]

    LTX-2: Efficient Joint Audio-Visual Foundation Model

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. LTX-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233, 2026

  5. [5]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  6. [6]

    HunyuanVideo 1.5 Technical Report

    Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. HunyuanVideo 1.5 technical report. arXiv preprint arXiv:2511.18870, 2025

  7. [7]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024

  8. [8]

    VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

    Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. VideoREPA: Learning physics for video generation through relational alignment with foundation models. arXiv preprint arXiv:2505.23656, 2025

  9. [9]

    Moalign: Motion-centric representation alignment for video diffusion models

    Aritra Bhowmik, Denis Korzhenkov, Cees GM Snoek, Amirhossein Habibian, and Mohsen Ghafoorian. Moalign: Motion-centric representation alignment for video diffusion models. arXiv preprint arXiv:2510.19022, 2025

  10. [10]

    Vision Transformers Need More Than Registers

    Cheng Shi, Yizhou Yu, and Sibei Yang. Vision transformers need more than registers. arXiv preprint arXiv:2602.22394, 2026

  11. [11]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025

  12. [12]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  13. [13]

    Video generation models as world simulators

    OpenAI. Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators, 2024. Accessed: April 29, 2026

  14. [14]

    Kling 3

    Kling Team. Kling 3. https://kling.ai/, 2026. Accessed: April 29, 2026

  15. [15]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  16. [16]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  17. [17]

    VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation

    Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. VideoPhy-2: A challenging action-centric physical commonsense evaluation in video generation. arXiv preprint arXiv:2503.06800, 2025

  18. [18]

    RefAlign: Representation Alignment for Reference-to-Video Generation

    Lei Wang, Yuxin Song, Ge Wu, Haocheng Feng, Hang Zhou, Jingdong Wang, Yaxing Wang, and Jian Yang. RefAlign: Representation alignment for reference-to-video generation. arXiv preprint arXiv:2603.25743, 2026

  19. [19]

    Tora: Trajectory-oriented diffusion transformer for video generation

    Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Tora: Trajectory-oriented diffusion transformer for video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2063–2073, 2025

  20. [20]

    Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  21. [21]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  22. [22]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

  23. [23]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  24. [24]

    Videodpo: Omni-preference alignment for video diffusion generation

    Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. VideoDPO: Omni-preference alignment for video diffusion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8009–8019, 2025

  25. [25]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

  26. [26]

    Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. REPA-E: Unlocking VAE for end-to-end tuning with latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  27. [27]

    Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

    Tencent Hunyuan Team. Script-a-Video: Deep structured audio-visual captions via factorized streams and relational grounding. arXiv preprint arXiv:2604.11244, 2026

  28. [28]

    VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language

    Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, and Pascale Fung. VL-JEPA: Joint embedding predictive architecture for vision-language. arXiv preprint arXiv:2512.10942, 2025

  29. [29]

    V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

    Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-JEPA 2.1: Unlocking dense features in video self-supervised learning. arXiv preprint arXiv:2603.14482, 2026

  30. [30]

    Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv, 2026

  31. [31]

    Qwen Team. Qwen3.5. https://qwen.ai/blog?id=qwen3.5, 2026. Accessed: May 6, 2026

  32. [32]

    Qwen Team. Qwen3.6. https://qwen.ai/blog?id=qwen3.6, 2026. Accessed: May 6, 2026

  33. [33]

    Welcome Gemma 4: Frontier Multimodal Intelligence on Device

    Hugging Face and Google DeepMind. Welcome Gemma 4: Frontier multimodal intelligence on device. https://huggingface.co/blog/gemma4, 2026. Accessed: May 6, 2026

  34. [34]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025