GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling

Mehrab Tanjim; Raghavendra Addanki; Shivanshu Shekhar; Somdeb Sarkhel; Tong Zhang; Uttaran Bhattacharya

arxiv: 2602.05202 · v2 · pith:VDL56WXQnew · submitted 2026-02-05 · 💻 cs.CV

Pith reviewed 2026-05-25 07:12 UTC · model grok-4.3

keywords: video reward modeling, generative transformers, self-supervised learning, energy-based models, contrastive training, latent perturbations, video quality assessment

The pith

Video generative models can be repurposed as reward models by recasting them as energy-based judges trained on synthetic negatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing video generation models can serve as reward models for human preference alignment when treated as energy-based models that assign low energy to high-quality videos and high energy to degraded ones. If the claim holds, it matters because these models are already built to handle temporal structure, which sidesteps the limitations of vision-language models that often miss subtle timing issues. Training uses contrastive objectives on synthetic negative videos made by perturbing the model's own latent space with operations such as temporal slicing, feature swapping, and frame shuffling. The reported result is state-of-the-art scores on GenAI-Bench and MonteBench using only 30K human annotations, 6× to 65× fewer than prior methods.

Core claim

Generative video models can be transformed into temporally-aware reward models by viewing them as energy-based models that assign low energy to high-quality videos and high energy to degraded ones, trained via contrastive objectives on synthetic negatives created through controlled latent-space perturbations such as temporal slicing, feature swapping, and frame shuffling.
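
To make the three perturbations in the core claim concrete, here is a minimal NumPy sketch. This page does not give the paper's exact definitions, so each function is one plausible reading and not the authors' implementation. The latent layout `(frames, channels, height, width)` and the names `temporal_slice`, `feature_swap`, and `frame_shuffle` are assumptions made for illustration.

```python
import numpy as np

def temporal_slice(latent, start, length):
    """Assumed reading: keep one contiguous window of frames and loop it
    until the clip is back to its original number of frames."""
    num_frames = latent.shape[0]
    window = latent[start:start + length]
    assert window.shape[0] > 0, "window must contain at least one frame"
    repeats = -(-num_frames // window.shape[0])  # ceiling division
    return np.concatenate([window] * repeats, axis=0)[:num_frames]

def feature_swap(latent, other, channels):
    """Assumed reading: overwrite the chosen latent channels with the same
    channels taken from another video's latent."""
    out = latent.copy()
    out[:, channels] = other[:, channels]
    return out

def frame_shuffle(latent, rng):
    """Randomly permute the frame order, leaving each frame's content intact."""
    return latent[rng.permutation(latent.shape[0])]

rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 4, 6, 6))  # hypothetical latent sizes
other = rng.standard_normal((8, 4, 6, 6))

sliced = temporal_slice(latent, start=2, length=3)
swapped = feature_swap(latent, other, channels=[0])
shuffled = frame_shuffle(latent, rng)
```

All three outputs keep the original shape, so a perturbed latent can be fed to the same model as a clean one. That is what makes such negatives hard to tell apart on superficial grounds.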

What carries the argument

The reformulation of video generative models as energy-based models, trained contrastively on latent-space perturbed synthetic negatives to force learning of meaningful spatiotemporal quality features.
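
A contrastive objective of this kind can be sketched as a pairwise logistic loss on energies. The paper's actual loss is not reproduced on this page, so the form below, mean softplus of `E_pos - E_neg`, is a generic stand-in and the function name `contrastive_energy_loss` is hypothetical.

```python
import numpy as np

def contrastive_energy_loss(e_pos, e_neg):
    """Mean softplus(E_pos - E_neg).

    The loss shrinks as clean (positive) videos receive lower energy than
    their perturbed (negative) counterparts. Generic stand-in, not the
    paper's stated objective.
    """
    diff = np.asarray(e_pos, dtype=float) - np.asarray(e_neg, dtype=float)
    # logaddexp(0, d) computes log(1 + exp(d)) in a numerically stable way.
    return float(np.mean(np.logaddexp(0.0, diff)))

well_ordered = contrastive_energy_loss([0.1, 0.2], [2.0, 3.0])
mis_ordered = contrastive_energy_loss([2.0, 3.0], [0.1, 0.2])
```

With these inputs `well_ordered` is smaller than `mis_ordered`, which is the ordering pressure the training signal is meant to apply.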

If this is right

Reward modeling for video generation can draw directly on pretrained generative architectures without separate large vision-language model training.
Human annotation budgets for video preference data can shrink substantially while still reaching top performance on quality benchmarks.
The model is forced to attend to realistic temporal and spatial degradations rather than superficial real-versus-generated differences.
Video generator alignment with preferences becomes possible using signals derived internally from the generator itself.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-perturbation technique could create training signals for iterative improvement loops where the judge helps refine the generator that produced it.
Similar energy-based reformulations might transfer to other generative settings such as audio or image sequences where temporal or structural consistency matters.
Lower dependence on vision-language models for judging could cut overall data and compute demands in preference-based training of generative systems.

Load-bearing premise

That the energy scores produced by the trained model will match human judgments of video quality instead of merely detecting the specific kinds of perturbations used during training.

What would settle it

Human preference ratings collected on videos containing degradations created by methods outside the three perturbation techniques used in training, then checked for alignment with the model's energy assignments.
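
One simple way to run that check is pairwise agreement between human ratings and model energies on the held-out degradations. The sketch below assumes higher human scores mean better videos and lower energy means better videos. The function name `pairwise_agreement`, the choice to skip human ties, and the convention that an energy tie favors the second video are all illustrative choices, not taken from the paper.

```python
import numpy as np

def pairwise_agreement(energies, human_scores):
    """Fraction of video pairs where the lower-energy video is also the one
    humans rated higher. Pairs with tied human scores are skipped."""
    energies = np.asarray(energies, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    agree = 0
    total = 0
    for i in range(len(energies)):
        for j in range(i + 1, len(energies)):
            if human_scores[i] == human_scores[j]:
                continue  # no human preference to compare against
            total += 1
            human_prefers_i = human_scores[i] > human_scores[j]
            # Strict "<": an energy tie counts as preferring video j.
            model_prefers_i = energies[i] < energies[j]
            agree += int(human_prefers_i == model_prefers_i)
    return agree / total if total else float("nan")

score = pairwise_agreement(energies=[0.1, 0.9, 0.5], human_scores=[5, 3, 1])
```

In this toy case two of the three pairs agree, so `score` is 2/3. A value near 0.5 on unseen degradation types would suggest the energies track the training perturbations and not human judgment.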

Figures

Figures reproduced from arXiv: 2602.05202 by Mehrab Tanjim, Raghavendra Addanki, Shivanshu Shekhar, Somdeb Sarkhel, Tong Zhang, Uttaran Bhattacharya.

**Figure 2.** Overview of the proposed GT-SVJ framework. The framework consists of two stages: (top) Training a discriminative model, where the video generative model (CogVideoX) is adapted using a contrastive energy-based objective with real, generated, and perturbed videos, and (middle and bottom) Training a reward model, where the discriminative model (DM) is aligned with human ratings through aspect-wise prediction …

**Figure 3.** Illustration of energy trajectories predicted by our energy-based model. For the real video in (a), the energy trajectory across the time steps is smooth and stable, indicating consistent temporal dynamics. In contrast, for the generated videos in (b) and (c), the energy values fluctuate erratically, reflecting spatial and temporal inconsistencies such as implausible scene lighting and motions.

**Figure 4.** Effect of LoRA placement within the backbone transformer. We compare applying LoRA to the initial third, middle third, and last third of the transformer layers. The middle-layer configuration achieves the best overall performance, while the last-layer configuration provides faster training with minimal loss in accuracy.

**Figure 5.** Effect of the discriminative model. Initializing the reward model with the trained discriminative model leads to lower validation losses and higher validation accuracies throughout training.
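
The LoRA placement comparison in Figure 4 amounts to choosing which block of transformer layers receives adapters. The helper below sketches that selection. The layer count of 30 is a placeholder and not a figure from the paper, and `lora_layer_indices` is a hypothetical name. When the depth is not divisible by three, this sketch gives the leftover layers to the middle block.

```python
def lora_layer_indices(num_layers, placement):
    """Return the layer indices that would receive LoRA adapters.

    placement is one of "initial", "middle", or "last". The initial and last
    blocks each take num_layers // 3 layers; the middle block takes the rest.
    """
    third = num_layers // 3
    if placement == "initial":
        return list(range(0, third))
    if placement == "middle":
        return list(range(third, num_layers - third))
    if placement == "last":
        return list(range(num_layers - third, num_layers))
    raise ValueError(f"unknown placement: {placement!r}")

NUM_LAYERS = 30  # placeholder depth, assumed for illustration
middle = lora_layer_indices(NUM_LAYERS, "middle")
last = lora_layer_indices(NUM_LAYERS, "last")
```

A plausible reason the last-third configuration trains faster is that gradients need not flow back through the earlier, frozen layers, although the caption does not say so.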

read the original abstract

Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: 6× to 65× fewer than existing VLM-based approaches.
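
As a quick sanity check on the efficiency claim, the stated factors can be turned into implied baseline annotation counts. This assumes "6× to 65× fewer" is a plain ratio against the 30K figure. The baseline counts themselves are not listed on this page, so the outputs are implied values and not reported ones.

```python
def implied_baseline_annotations(own_annotations, reduction_factor):
    """Annotation count a baseline would need if ours is `reduction_factor` times smaller."""
    return own_annotations * reduction_factor

GT_SVJ_ANNOTATIONS = 30_000  # "30K" as stated in the abstract

low = implied_baseline_annotations(GT_SVJ_ANNOTATIONS, 6)
high = implied_baseline_annotations(GT_SVJ_ANNOTATIONS, 65)
print(f"Implied baseline range: {low:,} to {high:,} annotations")
# prints "Implied baseline range: 180,000 to 1,950,000 annotations"
```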

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The core idea of turning generative video models into self-supervised judges via EBM reformulation and latent perturbations is novel, but the abstract gives no way to verify whether it actually works.

The main takeaway is that this paper suggests repurposing video generative transformers as reward models by reformulating them as energy-based models trained contrastively on negatives from latent perturbations such as temporal slicing, feature swapping, and frame shuffling. It claims state-of-the-art results on GenAI-Bench and MonteBench with only 30K annotations, far fewer than VLM approaches. What is new is the controlled use of those specific perturbations to create challenging synthetic negatives that force the model to focus on spatiotemporal quality rather than easy distinctions. This builds on the generative model's strength in handling temporal dynamics, which VLMs often miss. The self-supervised angle avoids heavy reliance on human labels beyond the initial 30K.

The paper does a reasonable job framing the problem and proposing a different path from VLM-based reward modeling. If the method works, the data efficiency would be a clear win for aligning video generators. However, the abstract provides no equations for the EBM reformulation, no details on how energy is extracted or how the contrastive loss is applied, and no results, ablations, or implementation specifics. This leaves the central claim uncheckable. The stress-test concern is valid: without evidence that the trained energies align with human judgments rather than just the synthetic perturbations, the SOTA and efficiency claims rest on unverified steps. The weakest assumption, that the perturbations produce generalizable signals, remains a real gap here.

This is for people working on video reward modeling and alignment who want options beyond VLMs. It deserves a serious referee because the idea is distinct and the efficiency potential is relevant, even if the current presentation is thin on evidence. A full review could clarify whether the approach delivers. I would recommend sending it to peer review to evaluate the full technical details and experiments.

Referee Report

3 major / 2 minor

Summary. The paper proposes GT-SVJ, which repurposes video generative transformers as energy-based models (EBMs) for reward modeling. Generative models are trained via contrastive objectives on synthetic negative videos created by latent-space perturbations (temporal slicing, feature swapping, frame shuffling) to assign low energy to high-quality videos and high energy to degraded ones. This yields a temporally-aware judge that achieves SOTA on GenAI-Bench and MonteBench using only 30K human annotations (6×–65× fewer than VLM-based methods).

Significance. If the EBM reformulation and perturbation-based training produce energies that reliably rank videos according to human preference, the approach would offer a more annotation-efficient and temporally-sensitive alternative to VLM reward models for video generation alignment.

major comments (3)

The central claim that the generative transformer's implicit density can be directly treated as an EBM energy function (assigning low energy to high-quality videos) lacks an explicit derivation or likelihood computation; without this, it is unclear whether the contrastive objective on perturbed latents produces energies aligned with human judgments rather than synthetic artifacts.
The SOTA result on GenAI-Bench and MonteBench with 30K annotations rests on the unverified assumption that the chosen perturbations (temporal slicing, feature swapping, frame shuffling) block shortcut solutions while generalizing beyond the synthetic distribution; the manuscript must include ablations isolating each perturbation's contribution and showing that performance does not collapse when any one is removed.
Table or figure reporting benchmark results: the efficiency advantage (6×–65× fewer annotations) is load-bearing for the main contribution, yet no direct head-to-head comparison is described that controls for annotation quality, model scale, or training compute between GT-SVJ and the VLM baselines.
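
The ablation requested in the second major comment can be enumerated as leave-one-out configurations over the three perturbations. The sketch below only builds the list of configurations to train. The string labels and the helper name `leave_one_out_configs` are illustrative and not taken from the paper.

```python
from itertools import combinations

PERTURBATIONS = ("temporal_slicing", "feature_swapping", "frame_shuffling")

def leave_one_out_configs(perturbations=PERTURBATIONS):
    """Return the full set followed by every configuration that drops exactly one perturbation."""
    full = tuple(perturbations)
    configs = [full]
    configs.extend(combinations(full, len(full) - 1))
    return configs

for config in leave_one_out_configs():
    removed = sorted(set(PERTURBATIONS) - set(config)) or ["none"]
    print(f"train with {config}; removed: {removed[0]}")
```

Comparing benchmark accuracy across these four runs would show whether any single perturbation is doing most of the work.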

minor comments (2)

Notation for the energy function and contrastive loss should be introduced with explicit equations rather than descriptive prose only.
The abstract references external benchmarks without stating the exact metrics (e.g., ranking accuracy, correlation with human scores) used to declare SOTA.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with clarifications and commit to revisions that strengthen the technical presentation without altering the core claims.

Referee: The central claim that the generative transformer's implicit density can be directly treated as an EBM energy function (assigning low energy to high-quality videos) lacks an explicit derivation or likelihood computation; without this, it is unclear whether the contrastive objective on perturbed latents produces energies aligned with human judgments rather than synthetic artifacts.

Authors: We agree an explicit derivation would improve rigor. The reformulation follows the standard EBM equivalence where energy equals negative log-density of the generative model. In revision we will add a formal derivation section deriving the energy function from the transformer's likelihood and showing how the contrastive objective on controlled perturbations encourages alignment with quality. Benchmark correlations with human preferences on GenAI-Bench indicate the energies capture more than synthetic artifacts alone. revision: yes
Referee: The SOTA result on GenAI-Bench and MonteBench with 30K annotations rests on the unverified assumption that the chosen perturbations (temporal slicing, feature swapping, frame shuffling) block shortcut solutions while generalizing beyond the synthetic distribution; the manuscript must include ablations isolating each perturbation's contribution and showing that performance does not collapse when any one is removed.

Authors: We concur that isolating each perturbation's role is necessary. The current text motivates the three perturbations but omits full isolating ablations. We will add these in the revision, reporting accuracy drops when each perturbation is removed individually to confirm they collectively prevent shortcuts and support generalization beyond the synthetic set. revision: yes
Referee: Table or figure reporting benchmark results: the efficiency advantage (6×–65× fewer annotations) is load-bearing for the main contribution, yet no direct head-to-head comparison is described that controls for annotation quality, model scale, or training compute between GT-SVJ and the VLM baselines.

Authors: The reported factor derives directly from annotation counts published for the VLM baselines. A fully controlled head-to-head re-training all baselines under matched conditions exceeds the scope of this work. In revision we will add a summary table collating available annotation counts, model scales, and compute details from the source papers, together with a discussion of the practical annotation-efficiency benefit demonstrated by our results. revision: partial
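
For reference, the "standard EBM equivalence" invoked in the first response can be written out as below. This is the textbook relation between a normalized density and an energy function. How the paper obtains a likelihood or an energy from the video transformer is not stated on this page, so the symbols $E_\theta$, $p_\theta$, and $Z_\theta$ are generic notation and not the authors' own.

```latex
\begin{aligned}
p_\theta(x) &= \frac{\exp\!\big(-E_\theta(x)\big)}{Z_\theta},
\qquad Z_\theta = \int \exp\!\big(-E_\theta(x')\big)\,dx' \\[4pt]
E_\theta(x) &= -\log p_\theta(x) - \log Z_\theta \\[4pt]
E_\theta(x_1) - E_\theta(x_2) &= \log p_\theta(x_2) - \log p_\theta(x_1)
\end{aligned}
```

The last line shows why ranking two videos by energy does not require the partition function: $\log Z_\theta$ cancels in the difference, so lower energy corresponds to higher model density.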

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmarks rather than self-referential definitions or fits.

The paper's core claim is that video generative transformers can be repurposed as EBM reward models via contrastive training on latent perturbations, yielding SOTA results on the external benchmarks GenAI-Bench and MonteBench with only 30K annotations. No equations, self-citations, or internal derivations are shown that reduce this performance to quantities defined by the paper's own fitted parameters or prior author work. The approach is validated against independent external test sets, satisfying the criterion for a self-contained, non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that generative models encode temporal quality signals usable as energy functions and on the ad-hoc design of three specific perturbation types to create training negatives; no free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption: Generative models can be reformulated as energy-based models that assign low energy to high-quality videos.
Stated as the key insight enabling discrimination of video quality.
ad hoc to paper: Synthetic negatives from temporal slicing, feature swapping, and frame shuffling force learning of meaningful spatiotemporal features rather than superficial artifacts.
Explicit design choice to prevent exploitation of real-vs-generated differences.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · echoes

controlled perturbations ... temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Energy matching: Unifying flow matching and energy-based models for gener- ative modeling, 2025

Michal Balcerak, Tamaz Amiranashvili, Antonio Terpin, Suprosanna Shit, Lea Bogensperger, Sebastian Kaltenbach, Petros Koumoutsakos, and Bjoern Menze. Energy matching: Unifying flow matching and energy-based models for gener- ative modeling, 2025. 3

work page 2025

[2] [2]

Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired compar- isons.Biometrika, 39(3/4):324–345, 1952. 3

work page 1952

[3] [3]

Dreamina.https://dreamina.capcut

CapCut. Dreamina.https://dreamina.capcut. com/ai-tool/home, 2024. AI-tool for image and video generation. 1

work page 2024

[4] [4]

Videocrafter2: Overcoming data limita- tions for high-quality video diffusion models.arXiv preprint arXiv:2401.09047, 2024

Haoxin Chen, Zangwei Zheng, Xiangyu Peng, Hang Chen, Jiahui Huang, et al. Videocrafter2: Overcoming data limita- tions for high-quality video diffusion models.arXiv preprint arXiv:2401.09047, 2024. 6

work page arXiv 2024

[5] [5]

Brown, Miljan Martic, Shane Legg, and Dario Amodei

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learn- ing from human preferences, 2023. 3

work page 2023

[6] [6]

Introduction to latent variable energy-based models: a path toward autonomous machine intelligence.Journal of Statistical Mechanics: Theory and Experiment, 2024(10):104011, 2024

Anna Dawid and Yann LeCun. Introduction to latent variable energy-based models: a path toward autonomous machine intelligence.Journal of Statistical Mechanics: Theory and Experiment, 2024(10):104011, 2024. 3

work page 2024

[7] Google DeepMind. Veo 3. https://aistudio.google.com/models/veo-3, 2025. Text-to-video model generating 4–8 second clips with native audio. 1

[8] Daniel Deutsch, George Foster, and Markus Freitag. Ties matter: Meta-evaluating modern metrics with pairwise accuracy and tie calibration, 2023. 6

[9] Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf, 2024. 3

[10] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models, 2020. 3

[11] Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, and Wenhu Chen. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation, 2024. 1, 3, 6

[12] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. 4

[13] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models,

[14] Adobe Inc. Firefly. https://www.adobe.com/products/firefly.html, 2024. Generative AI tool for image/video/audio creative workflows. 1

[15] Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, and Wenhu Chen. Genai arena: An open evaluation platform for generative models. arXiv preprint arXiv:2406.04485, 2024. 2, 6, 7

[16] Kuaishou. Kling ai. https://klingai.kuaishou.com/, 2024. Video-generation model up to 1080p at 30fps. 1

[17] Luma Labs. Dream machine. https://lumalabs.ai/dream-machine, 2024. AI model for generating high-quality videos from text/images. 1, 2, 6

[18] Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: A comprehensive survey on llm-based evaluation methods,

[19] Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. Generative judge for evaluating alignment, 2023. 2

[20] Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization, 2025. 3

[21] Jinsong Liu, Dongdong Ge, and Ruihao Zhu. Reward learning from preference with ties, 2024. 3

[22] Jiahe Liu, Youran Qu, Qi Yan, Xiaohui Zeng, Lele Wang, and Renjie Liao. Fréchet video motion distance: A metric for evaluating motion consistency in videos, 2024. 2

[23] Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, and Wanli Ouyang. Improving video generation with human feedback, 2025. 1, 2, 3, 5, 6, 7

[24] Ming Liu and Wensheng Zhang. Is your video language model a reliable judge?, 2025. 2

[25] Runway ML. Gen-3. https://runwayml.com/,

[26] Next-generation foundation model for multimodal video/image generation. 1, 2, 6

[27] OpenAI. Sora 2. https://openai.com/index/sora-2/, 2025. Video and audio generation model with synchronized sound and advanced world-simulation. 1, 2

[28] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

[29] Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming-Chang Yang, and Jiaya Jia. Controlnext: Powerful and efficient control for image and video generation. arXiv preprint arXiv:2408.06070, 2024. 4

[30] PixVerse. Pixverse. https://pixverse.ai/, 2024. AI video creation tool from text/photos. 1

[31] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024. 3

[32] Shivanshu Shekhar and Tong Zhang. Rocm: Rlhf on consistency models, 2025. 1

[33] Shivanshu Shekhar, Shreyas Singh, and Tong Zhang. See-dpo: Self entropy enhanced direct preference optimization,

[34] Yang Song and Diederik P. Kingma. How to train your energy-based models, 2021. 3

[35] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges, 2019. 2

[36] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization, 2023. 1, 3

[37] Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, and Hao Li. Lift: Leveraging human feedback for text-to-video model alignment, 2025. 3

[38] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, page 681–688, Madison, WI, USA,

[39] Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, and Yuxiao Dong. Visionreward: Fine-grained multi-dimensional human preference learning for image and video gene...

[40] Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model, 2024. 1, 3

[41] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer, 2025. 2, 4, 5, 6

[42] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think, 2025. 7

[43] Zangwei Zheng, Haoxin Chen, Jiahui Huang, Hang Chen, Yifan He, Xiangyu Peng, et al. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024. 5, 6