RefAlign: Representation Alignment for Reference-to-Video Generation

Ge Wu; Hang Zhou; Haocheng Feng; Jian Yang; Jingdong Wang; Lei Wang; Yaxing Wang; Yuxin Song

arxiv: 2603.25743 · v2 · pith:MPHVRMY2new · submitted 2026-03-26 · 💻 cs.CV

RefAlign: Representation Alignment for Reference-to-Video Generation

Lei Wang , YuXin Song , Ge Wu , Haocheng Feng , Hang Zhou , Jingdong Wang , Yaxing Wang , Jian Yang This is my paper

classification 💻 cs.CV

keywords featuresreferencealignmentrefalignsemanticgenerationrepresentationlatent

0 comments

read the original abstract

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy--paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
cs.CV 2026-05 unverdicted novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.
Bernini: Latent Semantic Planning for Video Diffusion
cs.CV 2026-05 unverdicted novelty 5.0

Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.