arxiv: 2602.05202 · v2 · pith:VDL56WXQnew · submitted 2026-02-05 · 💻 cs.CV

GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling

Pith reviewed 2026-05-25 07:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords video reward modeling · generative transformers · self-supervised learning · energy-based models · contrastive training · latent perturbations · video quality assessment

The pith

Video generative models can be repurposed as reward models by recasting them as energy-based judges trained on synthetic negatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that existing video generation models can serve as reward models for human preference alignment by treating them as energy-based models that assign low energy to high-quality videos and high energy to degraded ones. If true, this would matter because it reuses models already built to handle temporal structure, avoiding the limitations of vision-language models, which often miss subtle timing issues. Training uses contrastive objectives on synthetic negative videos made by perturbing the model's own latent space with operations such as temporal slicing, feature swapping, and frame shuffling. The reported result is state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations, 6× to 65× fewer than existing VLM-based approaches.
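To make the efficiency claim concrete, the quoted ratios can be turned into implied annotation budgets for the baselines. This is only arithmetic on the figures stated in the abstract (30K annotations, 6× to 65×); the resulting counts are implied by those ratios, not numbers reported on this page, and the variable names are illustrative.

```python
# Implied baseline annotation budgets, derived only from the quoted ratios.
GT_SVJ_ANNOTATIONS = 30_000      # "30K human annotations" (from the abstract)
RATIO_LOW, RATIO_HIGH = 6, 65    # "6x to 65x fewer" (from the abstract)

implied_low = GT_SVJ_ANNOTATIONS * RATIO_LOW    # smallest implied baseline budget
implied_high = GT_SVJ_ANNOTATIONS * RATIO_HIGH  # largest implied baseline budget

print(f"Implied baseline range: {implied_low:,} to {implied_high:,} annotations")
```

Under this reading the baselines would sit somewhere between roughly 180K and 1.95M annotations, which is the scale the referee's third major comment asks to see tabulated per baseline.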

Core claim

Generative video models can be transformed into temporally-aware reward models by viewing them as energy-based models that assign low energy to high-quality videos and high energy to degraded ones, trained via contrastive objectives on synthetic negatives created through controlled latent-space perturbations such as temporal slicing, feature swapping, and frame shuffling.
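The three perturbations are named here but not defined on this page, so the following is a minimal NumPy sketch under stated assumptions rather than the authors' implementation. It assumes a video latent of shape (T, C, H, W) and interprets temporal slicing as looping one contiguous window of frames, feature swapping as exchanging a subset of channels with another video's latent, and frame shuffling as permuting the time axis. All function names and parameters (`window`, `frac`) are hypothetical.

```python
import numpy as np

def frame_shuffle(z, rng):
    """Permute the temporal axis of a latent of shape (T, C, H, W)."""
    perm = rng.permutation(z.shape[0])
    return z[perm]

def temporal_slice(z, rng, window=4):
    """Keep one contiguous window of frames and repeat it to fill length T.

    Assumed interpretation of 'temporal slicing'; the page does not define it.
    """
    T = z.shape[0]
    window = min(window, T)
    start = rng.integers(0, T - window + 1)   # upper bound is exclusive
    idx = start + (np.arange(T) % window)     # loop over the kept window
    return z[idx]

def feature_swap(z, z_other, rng, frac=0.5):
    """Replace a random subset of channels in z with those of z_other.

    Assumed interpretation of 'feature swapping'; the page does not define it.
    """
    C = z.shape[1]
    k = max(1, int(frac * C))
    channels = rng.choice(C, size=k, replace=False)
    out = z.copy()
    out[:, channels] = z_other[:, channels]
    return out

# Usage on random placeholder latents (not real model latents).
rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 4, 2, 2))
z_b = rng.normal(size=(8, 4, 2, 2))
negatives = [frame_shuffle(z_a, rng), temporal_slice(z_a, rng), feature_swap(z_a, z_b, rng)]
```

Each function preserves the latent's shape, so the perturbed tensors can be scored by the same model as the originals. Whether degradations of this kind resemble the ones humans penalize is exactly the load-bearing premise discussed further down.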

What carries the argument

The reformulation of video generative models as energy-based models, trained contrastively on latent-space perturbed synthetic negatives to force learning of meaningful spatiotemporal quality features.
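The page does not give the exact contrastive objective, so the sketch below substitutes a generic logistic ranking loss on energies to illustrate the idea: the loss is small when clean videos receive lower energy than their perturbed counterparts. The function name and the `margin` parameter are assumptions, and the energies here are placeholder numbers rather than model outputs.

```python
import numpy as np

def contrastive_energy_loss(e_pos, e_neg, margin=0.0):
    """Logistic ranking loss over paired energies.

    e_pos: energies of high-quality (positive) videos.
    e_neg: energies of the matching degraded (negative) videos.
    Returns mean softplus(e_pos - e_neg + margin), computed stably via logaddexp.
    """
    e_pos = np.asarray(e_pos, dtype=float)
    e_neg = np.asarray(e_neg, dtype=float)
    return float(np.mean(np.logaddexp(0.0, e_pos - e_neg + margin)))

# Placeholder energies: the first call ranks positives correctly, the second inverts them.
well_ranked = contrastive_energy_loss([0.0, 0.1], [5.0, 6.0])
inverted = contrastive_energy_loss([5.0, 6.0], [0.0, 0.1])
print(well_ranked < inverted)
```

A pairwise form is used because only energy differences matter for ranking; other contrastive losses (for example an InfoNCE-style objective over several negatives) would fit the same description.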

If this is right

  • Reward modeling for video generation can draw directly on pretrained generative architectures without separate large vision-language model training.
  • Human annotation budgets for video preference data can shrink substantially while still reaching top performance on quality benchmarks.
  • The model is forced to attend to realistic temporal and spatial degradations rather than superficial real-versus-generated differences.
  • Video generator alignment with preferences becomes possible using signals derived internally from the generator itself.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-perturbation technique could create training signals for iterative improvement loops where the judge helps refine the generator that produced it.
  • Similar energy-based reformulations might transfer to other generative settings such as audio or image sequences where temporal or structural consistency matters.
  • Lower dependence on vision-language models for judging could cut overall data and compute demands in preference-based training of generative systems.

Load-bearing premise

That the energy scores produced by the trained model will match human judgments of video quality instead of merely detecting the specific kinds of perturbations used during training.

What would settle it

Human preference ratings collected on videos containing degradations created by methods outside the three perturbation techniques used in training, then checked for alignment with the model's energy assignments.
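One simple way to run that check is pairwise agreement between energies and human ratings on the held-out degradations. The sketch below is illustrative only: the function name is hypothetical, human ties are skipped, and energy ties are counted as misses, both of which are design assumptions rather than anything the paper specifies.

```python
def pairwise_agreement(energies, human_scores):
    """Fraction of video pairs where lower energy coincides with the higher human score."""
    agree, total = 0, 0
    n = len(energies)
    for i in range(n):
        for j in range(i + 1, n):
            if human_scores[i] == human_scores[j]:
                continue  # no human preference for this pair
            total += 1
            if energies[i] == energies[j]:
                continue  # model expresses no preference: counted as a miss
            human_prefers_i = human_scores[i] > human_scores[j]
            model_prefers_i = energies[i] < energies[j]
            if human_prefers_i == model_prefers_i:
                agree += 1
    return agree / total if total else float("nan")

# Placeholder values, not measurements from the paper.
energies = [0.1, 0.9, 0.5]
human_scores = [5, 3, 1]
print(pairwise_agreement(energies, human_scores))
```

Agreement near chance (0.5) on unseen degradation types would suggest the model detects its training perturbations rather than perceived quality.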

Figures

Figures reproduced from arXiv: 2602.05202 by Mehrab Tanjim, Raghavendra Addanki, Shivanshu Shekhar, Somdeb Sarkhel, Tong Zhang, Uttaran Bhattacharya.

Figure 1
Figure 2
Figure 2: Overview of the proposed GT-SVJ framework. The framework consists of two stages: (top) Training a discriminative model, where the video generative model (CogVideoX) is adapted using a contrastive energy-based objective with real, generated, and perturbed videos, and (middle and bottom) Training a reward model, where the discriminative model (DM) is aligned with human ratings through aspect-wise prediction …
Figure 3
Figure 3: Illustration of energy trajectories predicted by our energy-based model. For the real video in (a), energy trajectory across the time steps is smooth and stable, indicating consistent temporal dynamics. In contrast, for the generated videos in (b) and (c), the energy values fluctuate erratically, reflecting spatial and temporal inconsistencies such as implausible scene lighting and motions.
Figure 4
Figure 4: Effect of LoRA placement within the backbone transformer. We compare applying LoRA to the initial third, middle third, and last third of the transformer layers. The middle-layer configuration achieves the best overall performance, while the last-layer configuration provides faster training with minimal loss in accuracy.
Figure 5
Figure 5: Effect of the discriminative model. Initializing the reward model with the trained discriminative model leads to lower validation losses and higher validation accuracies throughout training.
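The LoRA placement comparison in Figure 4 amounts to choosing which contiguous third of the transformer layers receives adapters. A minimal sketch of that index split follows; the helper name is hypothetical and `NUM_LAYERS` is a placeholder, not the depth of CogVideoX or any other specific backbone.

```python
def layer_thirds(num_layers):
    """Split layer indices 0..num_layers-1 into initial, middle, and last thirds."""
    if num_layers < 3:
        raise ValueError("need at least 3 layers to form thirds")
    a = num_layers // 3
    b = 2 * num_layers // 3
    idx = list(range(num_layers))
    return {"initial": idx[:a], "middle": idx[a:b], "last": idx[b:]}

# NUM_LAYERS is a placeholder value chosen for illustration only.
NUM_LAYERS = 12
groups = layer_thirds(NUM_LAYERS)
lora_layers = groups["middle"]  # the configuration the caption reports as best overall
print(lora_layers)
```

When the depth is not divisible by three, this split gives the extra layers to the later groups; the paper's own rounding rule is not stated on this page.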
Original abstract

Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human-annotations: 6× to 65× fewer than existing VLM-based approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes GT-SVJ, which repurposes video generative transformers as energy-based models (EBMs) for reward modeling. Generative models are trained via contrastive objectives on synthetic negative videos created by latent-space perturbations (temporal slicing, feature swapping, frame shuffling) to assign low energy to high-quality videos and high energy to degraded ones. This yields a temporally-aware judge that achieves SOTA on GenAI-Bench and MonteBench using only 30K human annotations (6×–65× fewer than VLM-based methods).

Significance. If the EBM reformulation and perturbation-based training produce energies that reliably rank videos according to human preference, the approach would offer a more annotation-efficient and temporally-sensitive alternative to VLM reward models for video generation alignment.

major comments (3)
  1. The central claim that the generative transformer's implicit density can be directly treated as an EBM energy function (assigning low energy to high-quality videos) lacks an explicit derivation or likelihood computation; without this, it is unclear whether the contrastive objective on perturbed latents produces energies aligned with human judgments rather than synthetic artifacts.
  2. The SOTA result on GenAI-Bench and MonteBench with 30K annotations rests on the unverified assumption that the chosen perturbations (temporal slicing, feature swapping, frame shuffling) block shortcut solutions while generalizing beyond the synthetic distribution; the manuscript must include ablations isolating each perturbation's contribution and showing that performance does not collapse when any one is removed.
  3. Table or figure reporting benchmark results: the efficiency advantage (6×–65× fewer annotations) is load-bearing for the main contribution, yet no direct head-to-head comparison is described that controls for annotation quality, model scale, or training compute between GT-SVJ and the VLM baselines.
minor comments (2)
  1. Notation for the energy function and contrastive loss should be introduced with explicit equations rather than descriptive prose only.
  2. The abstract references external benchmarks without stating the exact metrics (e.g., ranking accuracy, correlation with human scores) used to declare SOTA.
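The leave-one-out ablation requested in major comment 2 can be enumerated mechanically. The sketch below only lists the training configurations such a study would need (the full set plus each set with one perturbation removed); the identifiers are illustrative labels for the three perturbations named in the paper, and no results are implied.

```python
from itertools import combinations

PERTURBATIONS = ("temporal_slicing", "feature_swapping", "frame_shuffling")

def ablation_configs(perturbations=PERTURBATIONS):
    """Return the full set plus every leave-one-out subset, as (name, kept) pairs."""
    configs = [("all", tuple(perturbations))]
    for kept in combinations(perturbations, len(perturbations) - 1):
        removed = next(p for p in perturbations if p not in kept)
        configs.append((f"without_{removed}", kept))
    return configs

for name, kept in ablation_configs():
    print(name, kept)
```

Four runs cover the request as worded; a stricter design would also train on each perturbation alone to separate individual from joint effects.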

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with clarifications and commit to revisions that strengthen the technical presentation without altering the core claims.

Point-by-point responses
  1. Referee: The central claim that the generative transformer's implicit density can be directly treated as an EBM energy function (assigning low energy to high-quality videos) lacks an explicit derivation or likelihood computation; without this, it is unclear whether the contrastive objective on perturbed latents produces energies aligned with human judgments rather than synthetic artifacts.

    Authors: We agree an explicit derivation would improve rigor. The reformulation follows the standard EBM equivalence where energy equals negative log-density of the generative model. In revision we will add a formal derivation section deriving the energy function from the transformer's likelihood and showing how the contrastive objective on controlled perturbations encourages alignment with quality. Benchmark correlations with human preferences on GenAI-Bench indicate the energies capture more than synthetic artifacts alone. revision: yes

  2. Referee: The SOTA result on GenAI-Bench and MonteBench with 30K annotations rests on the unverified assumption that the chosen perturbations (temporal slicing, feature swapping, frame shuffling) block shortcut solutions while generalizing beyond the synthetic distribution; the manuscript must include ablations isolating each perturbation's contribution and showing that performance does not collapse when any one is removed.

    Authors: We concur that isolating each perturbation's role is necessary. The current text motivates the three perturbations but omits full isolating ablations. We will add these in the revision, reporting accuracy drops when each perturbation is removed individually to confirm they collectively prevent shortcuts and support generalization beyond the synthetic set. revision: yes

  3. Referee: Table or figure reporting benchmark results: the efficiency advantage (6×–65× fewer annotations) is load-bearing for the main contribution, yet no direct head-to-head comparison is described that controls for annotation quality, model scale, or training compute between GT-SVJ and the VLM baselines.

    Authors: The reported factor derives directly from annotation counts published for the VLM baselines. A fully controlled head-to-head comparison that retrains all baselines under matched conditions exceeds the scope of this work. In revision we will add a summary table collating available annotation counts, model scales, and compute details from the source papers, together with a discussion of the practical annotation-efficiency benefit demonstrated by our results. revision: partial
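The "standard EBM equivalence" invoked in the first response can be written out explicitly. The identity below is the usual Boltzmann form and holds for any normalized density; how the paper obtains or approximates the log-density of a video transformer is not stated on this page, so the symbols E_θ, p_θ, and Z_θ are generic notation, not the authors'.

```latex
% Standard EBM identity (Boltzmann form); \theta denotes model parameters.
p_\theta(x) = \frac{\exp\bigl(-E_\theta(x)\bigr)}{Z_\theta},
\qquad
Z_\theta = \int \exp\bigl(-E_\theta(x')\bigr)\, dx'
\quad\Longrightarrow\quad
E_\theta(x) = -\log p_\theta(x) - \log Z_\theta .

% Z_\theta cancels when two videos x_1 and x_2 are compared:
E_\theta(x_1) - E_\theta(x_2) = -\log \frac{p_\theta(x_1)}{p_\theta(x_2)} .
```

Because the partition function cancels in the difference, ranking two videos by energy is equivalent to ranking them by model likelihood. That is why the referee's question matters: high likelihood under the generator and high human-perceived quality are not guaranteed to coincide.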

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmarks rather than self-referential definitions or fits.

full rationale

The paper's core claim is that video generative transformers can be repurposed as EBM reward models via contrastive training on latent perturbations, yielding SOTA results on the external benchmarks GenAI-Bench and MonteBench with only 30K annotations. No equations, self-citations, or internal derivations are shown that reduce this performance to quantities defined by the paper's own fitted parameters or prior author work. The approach is validated against independent external test sets, satisfying the criterion for a self-contained, non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that generative models encode temporal quality signals usable as energy functions and on the ad-hoc design of three specific perturbation types to create training negatives; no free parameters or invented entities are named in the abstract.

axioms (2)
  • domain assumption: Generative models can be reformulated as energy-based models that assign low energy to high-quality videos.
    Stated as the key insight enabling discrimination of video quality.
  • ad hoc to paper: Synthetic negatives from temporal slicing, feature swapping, and frame shuffling force learning of meaningful spatiotemporal features rather than superficial artifacts.
    Explicit design choice to prevent exploitation of real-vs-generated differences.

pith-pipeline@v0.9.0 · 5780 in / 1456 out tokens · 34896 ms · 2026-05-25T07:12:23.642929+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    controlled perturbations ... temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
