HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation

Cong Wang; Hongmei Wang; Jiarong Ou; Qinglin Lu; Rui Chen; Weicong Liang; Yuan Zhou; Zhentao Yu; Zilin Yang; Zixiang Zhou

arxiv: 2606.10839 · v2 · pith:KTLR35JWnew · submitted 2026-06-09 · 💻 cs.CV

Cong Wang, Zhentao Yu, Hongmei Wang, Weicong Liang, Zixiang Zhou, Zilin Yang, Jiarong Ou, Rui Chen, Yuan Zhou, Qinglin Lu

Pith reviewed 2026-06-27 13:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords: identity-consistent video generation · multi-view references · diffusion transformers · feature injection · progressive curriculum · viewpoint changes · proxy tokens

The pith

HarmoView integrates multi-view references into video generators to keep subject appearance stable across large viewpoint shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current methods lose identity fidelity when camera angles change substantially during video synthesis. HarmoView tackles this by feeding multiple reference views into a diffusion transformer through three targeted changes plus a gradual training schedule. Multi-level Feature Injection supplies low-level appearance anchors from frontal references, proxy tokens standardize varying numbers of input views, and Jump-RoPE isolates features to limit crosstalk. A four-stage curriculum with view dropout transitions the model from ordinary text-to-video generation to spatial multi-view reasoning without erasing prior capabilities, supported by a newly built large-scale multi-view dataset.

Core claim

HarmoView claims that multi-view reference inputs can be harmonized for identity-consistent video generation through four pieces: injecting raw ViT features at multiple levels, unifying heterogeneous layouts with learnable proxy tokens, isolating identity features via Jump-RoPE, and training with a four-stage Progressive View Curriculum that uses view dropout, all while preserving the base model's generative priors. The abstract reports that the resulting model exceeds open-source baselines and matches leading closed-source systems on a 100-case benchmark spanning 52 identities.

What carries the argument

Multi-level Feature Injection anchors low-level appearance by routing frontal ViT features through cross-attention alongside text tokens, paired with learnable proxy tokens that unify single- and multi-view layouts and Jump-RoPE that isolates identity-wise features.
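To make the routing concrete, here is a minimal sketch of cross-attention whose context is the text tokens followed by frontal-reference ViT tokens. It is an illustration under stated assumptions, not the paper's implementation: the function names, the single head, the absence of learned projections, and simple concatenation as the way the two token sets are combined are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head dot-product attention with no learned projections (illustrative only)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        )
    return outputs


def inject_reference_features(video_tokens, text_tokens, vit_tokens):
    # Assumption: raw ViT tokens from the frontal reference are appended to the
    # text tokens, so one cross-attention call attends to both.
    context = text_tokens + vit_tokens
    return cross_attention(video_tokens, context)


video = [[1.0, 0.0], [0.0, 1.0]]   # toy video latents (queries)
text = [[0.5, 0.5]]                # toy text token
vit = [[1.0, 0.0], [0.0, 1.0]]     # toy frontal-reference ViT tokens
out = inject_reference_features(video, text, vit)
```

In this toy setup each video token draws most strongly on the reference token it aligns with, which is the sense in which the reference features act as an appearance anchor alongside the text.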

If this is right

Models can accept any combination of one or more reference views without layout conflicts.
Identity preservation holds under large viewpoint changes that previously caused drift.
Staged training with view dropout enables stable addition of multi-view capability.
A dedicated multi-view dataset removes the main data bottleneck for this task.
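The staged training with view dropout mentioned above can be sketched as follows. This page does not report the paper's schedule, so the per-stage drop probabilities, the rule that the frontal view is never dropped once references are enabled, and the name `apply_view_dropout` are assumptions made for illustration only.

```python
import random

# Hypothetical per-stage drop probabilities; the paper's actual values are not given here.
STAGE_DROP_PROB = {1: 1.0, 2: 0.7, 3: 0.4, 4: 0.1}


def apply_view_dropout(views, stage, rng, anchor_view="front"):
    """Return the subset of reference views kept for one training sample."""
    drop_prob = STAGE_DROP_PROB[stage]
    if drop_prob >= 1.0:
        return {}  # stage 1: no references at all, i.e. plain text-to-video
    kept = {}
    for name, view in views.items():
        # Assumption: the anchor (frontal) view always survives once references are on.
        if name == anchor_view or rng.random() >= drop_prob:
            kept[name] = view
    return kept


rng = random.Random(0)
views = {"front": "F", "left": "L", "right": "R", "back": "B"}
stage1 = apply_view_dropout(views, 1, rng)
stage4 = apply_view_dropout(views, 4, rng)
```

Lowering the drop probability stage by stage exposes the model to more views gradually, which is one way a curriculum could add multi-view capability without an abrupt shift away from the base model's training distribution.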

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same injection and proxy mechanisms might transfer to other conditional generation settings that require multiple reference images.
Open models could close the gap with closed-source video engines on identity tasks if similar curricula become standard.
View dropout during fine-tuning may help other consistency problems such as temporal coherence or style transfer.

Load-bearing premise

The new modules and curriculum can be added to the base model without degrading its original text-to-video performance, and the constructed multi-view dataset supplies enough variety to train effective spatial reasoning.

What would settle it

The claim would be refuted if, on the paper's 100-case multi-view benchmark, HarmoView produced identity-consistency scores no higher than those of current open-source baselines.

Original abstract

Current identity-consistent video generation methods struggle to preserve appearance fidelity under large viewpoint changes. While introducing multi-view reference input offers a natural solution, progress remains constrained by the lack of effective frameworks for multi-view inputs and the scarcity of multi-view data. We address these challenges by proposing HarmoView, a robust framework for identity-consistent video generation that effectively integrates multi-view cues through three architectural refinements complemented by a staged training curriculum. Specifically, we first introduce Multi-level Feature Injection to anchor identity fidelity; by injecting raw ViT features from frontal references alongside text tokens via cross-attention, MFI provides persistent low-level appearance anchors that complement the high-level identity features within DiT blocks, leading to enhanced identity preservation. Then, we employ learnable proxy tokens to unify heterogeneous reference layouts across single-/multi-view settings while simultaneously resolving the reference-view mismatch problem. Jump-RoPE is further developed for identity-wise feature isolation to reduce identity crosstalk. To activate these structural capabilities while preserving the original generative priors, we propose the Progressive View Curriculum. This four-stage training strategy employs view dropout to facilitate a stable transition from vanilla T2V generation to high-fidelity, identity-persistent spatial reasoning. Furthermore, we construct a large-scale multi-view dataset to address the issue of data scarcity. Extensive evaluation on our multi-view benchmark, comprising 100 manually-curated cases spanning 52 unique identities, demonstrates that HarmoView significantly outperforms open-source baselines and matches leading closed-source engines, achieving state-of-the-art performance in identity-consistent video generation.
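The abstract does not say how Jump-RoPE isolates identities. One plausible reading, sketched below purely as an assumption, is that each identity's reference tokens receive rotary position indices in a block separated from the video tokens and from other identities by a large gap. The function `jump_positions`, the gap size, and the layout are hypothetical; `rope_angles` uses the standard rotary formula, angle_i = position / base^(2i/dim).

```python
def jump_positions(num_video_tokens, ref_token_counts, jump=1000):
    """Assign position indices with a gap ("jump") before each identity's block.

    Hypothetical layout: video tokens take 0..N-1, then every identity's
    reference tokens start `jump` positions after the previous block ends.
    """
    video_pos = list(range(num_video_tokens))
    ref_pos = []
    start = num_video_tokens + jump
    for count in ref_token_counts:
        ref_pos.append(list(range(start, start + count)))
        start += count + jump
    return video_pos, ref_pos


def rope_angles(position, dim, base=10000.0):
    """Standard rotary angles for one position: position / base**(2i/dim)."""
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]


video_pos, ref_pos = jump_positions(4, [2, 3], jump=10)
# video_pos == [0, 1, 2, 3]; ref_pos == [[14, 15], [26, 27, 28]]
angles = rope_angles(ref_pos[0][0], dim=4)
```

Under this reading, tokens from different identities never share or neighbor a position index, so the relative-position signal seen by attention differs sharply across identities. Whether the paper does this is not confirmed by the abstract.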

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

HarmoView adds some engineering tweaks to DiT video models for multi-view identity consistency, but the evaluation details are missing so the SOTA claim stays untested.

The paper introduces four named pieces—Multi-level Feature Injection from ViT features, learnable proxy tokens to handle varying reference counts, Jump-RoPE for isolating identity features, and a four-stage Progressive View Curriculum with view dropout—plus a new multi-view dataset. These target identity drift under viewpoint change while trying to keep the base model's priors intact. The curriculum approach of starting from plain text-to-video and slowly introducing views is a reasonable way to avoid breaking what already works.
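As a rough picture of how proxy tokens could handle varying reference counts, the sketch below fills a fixed set of view slots and substitutes a placeholder token wherever a view is missing. The slot names, the one-token-per-slot simplification, and the function `build_reference_layout` are assumptions; in a real model the proxy tokens would be learned parameters rather than fixed lists.

```python
# Hypothetical fixed slot order; the paper's actual layout is not described on this page.
VIEW_SLOTS = ("front", "left", "right", "back")


def build_reference_layout(views, proxy_tokens):
    """Return a fixed-length token layout plus a mask marking real views.

    views:        dict mapping slot name -> token (list of floats) for supplied views
    proxy_tokens: dict mapping slot name -> stand-in token used when a view is absent
    """
    layout, is_real = [], []
    for slot in VIEW_SLOTS:
        if slot in views:
            layout.append(views[slot])
            is_real.append(True)
        else:
            layout.append(proxy_tokens[slot])
            is_real.append(False)
    return layout, is_real


proxies = {slot: [0.0, 0.0] for slot in VIEW_SLOTS}  # stand-ins for learned parameters
single_view = {"front": [0.9, 0.1]}
layout, is_real = build_reference_layout(single_view, proxies)
```

With this kind of layout a single-view request and a four-view request produce sequences of the same length, so downstream blocks see one consistent structure.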

The motivation matches real problems in the area: single-view methods lose appearance on big camera moves, and multi-view data is scarce. The proxy tokens and feature injection look like direct attempts to fix layout mismatch and low-level anchoring. No obvious internal contradictions show up in the argument.

The soft spot is the evidence. The abstract says it beats open-source baselines and matches closed-source engines on a 100-case benchmark with 52 identities, yet supplies no numbers, no listed baselines, no ablation results, and no error bars. Without those, it is hard to know whether the components actually deliver or whether the test set is forgiving. The full paper may contain the tables, but the current write-up leaves the central performance claim unverified.

This is for people already working on controllable video generation who need multi-view handling. A reader deep in DiT architectures could extract the specific tricks if the numbers hold up later. It deserves a serious referee because the problem is practical and the proposed fixes are coherent, even if the results section needs expansion and verification before publication.

Referee Report

1 major / 0 minor

Summary. The manuscript presents HarmoView, a framework for identity-consistent video generation from multi-view references. It introduces Multi-level Feature Injection (MFI) to inject raw ViT features from frontal references via cross-attention, learnable proxy tokens to unify single-/multi-view layouts and resolve reference-view mismatch, Jump-RoPE for identity-wise feature isolation to reduce crosstalk, and a four-stage Progressive View Curriculum with view dropout to transition from vanilla T2V to multi-view generation while preserving generative priors. A large-scale multi-view dataset is constructed to address data scarcity. The method is evaluated on a manually-curated 100-case benchmark spanning 52 identities and is claimed to significantly outperform open-source baselines while matching leading closed-source engines.

Significance. If the performance claims hold with rigorous evidence, the work would address a key challenge in video generation—maintaining appearance fidelity under large viewpoint changes—through a combination of architectural refinements and a staged curriculum. The explicit design to avoid harming base model priors via the curriculum and the construction of a multi-view dataset are positive elements that could support broader adoption if substantiated.

major comments (1)

[Abstract / Evaluation] The central claim of SOTA performance on the 100-case benchmark (Abstract) is unsupported by any reported metrics, baseline comparisons, ablation studies, error bars, or statistical tests. This absence makes it impossible to verify the effectiveness of MFI, proxy tokens, Jump-RoPE, or the Progressive View Curriculum and is load-bearing for the paper's primary contribution.
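Because the 100 cases span only 52 identities, cases that share an identity are not independent, and error bars should reflect that. The sketch below shows one common option, a percentile bootstrap that resamples identities rather than individual cases. It is offered as an example of what the requested error bars might look like, not as the authors' procedure; the function name and the toy scores are hypothetical.

```python
import random
from collections import defaultdict


def identity_bootstrap_ci(case_scores, case_identities,
                          n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean score, resampling whole identities."""
    by_identity = defaultdict(list)
    for score, ident in zip(case_scores, case_identities):
        by_identity[ident].append(score)
    identities = sorted(by_identity)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = []
        # Draw identities with replacement and keep all of each identity's cases.
        for ident in rng.choices(identities, k=len(identities)):
            sample.extend(by_identity[ident])
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data only; these are not results from the paper.
scores = [0.8, 0.6, 0.9, 0.7, 0.5, 0.85]
ids = ["a", "a", "b", "c", "c", "d"]
lower, upper = identity_bootstrap_ci(scores, ids, n_resamples=500)
```

Resampling at the identity level typically yields wider intervals than resampling cases, which is the more honest choice when several test cases reuse the same subject.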

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for identifying the critical need for quantitative substantiation of our performance claims. We address the comment below and will revise the manuscript accordingly.

Referee: [Abstract / Evaluation] The central claim of SOTA performance on the 100-case benchmark (Abstract) is unsupported by any reported metrics, baseline comparisons, ablation studies, error bars, or statistical tests. This absence makes it impossible to verify the effectiveness of MFI, proxy tokens, Jump-RoPE, or the Progressive View Curriculum and is load-bearing for the paper's primary contribution.

Authors: We agree that the current version of the manuscript does not include the requested quantitative elements. The abstract states the SOTA claim based on the 100-case benchmark, but no numerical metrics, tables, ablations, error bars, or statistical tests are reported to support it. In the revised manuscript we will add a dedicated quantitative evaluation section containing: (1) tables with identity-consistency metrics (e.g., ArcFace cosine similarity, CLIP-I), video quality metrics, and user-study results against open-source baselines and closed-source engines; (2) component-wise ablations for MFI, proxy tokens, Jump-RoPE, and the Progressive View Curriculum; (3) error bars across multiple runs where appropriate; and (4) statistical significance tests. These additions will make the claims verifiable and directly demonstrate the contribution of each proposed technique.

Revision: yes
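For readers unfamiliar with embedding-based identity metrics such as the ArcFace cosine similarity named in the rebuttal, the sketch below shows the usual shape of the computation: compare a reference embedding against each generated frame's embedding and average. The embeddings here are plain lists supplied by the caller; no face encoder is invoked, and the averaging rule and the name `identity_consistency` are assumptions rather than the authors' protocol.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def identity_consistency(reference_embedding, frame_embeddings):
    """Mean cosine similarity between the reference and every frame embedding."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    sims = [cosine_similarity(reference_embedding, f) for f in frame_embeddings]
    return sum(sims) / len(sims)


# Toy 2-D embeddings; real face embeddings are much higher-dimensional.
reference = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0]]
score = identity_consistency(reference, frames)
```

A per-frame average like this can hide a few badly drifted frames, so reporting the minimum or a low percentile alongside the mean is often informative under large viewpoint changes.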

Circularity Check

0 steps flagged

No significant circularity; derivation is descriptive and self-contained

The paper presents a descriptive framework of architectural modules (MFI, proxy tokens, Jump-RoPE) and a staged curriculum plus a new dataset. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text. Performance claims rest on external empirical evaluation against baselines on a manually curated benchmark rather than any internal reduction by construction. This is the normal non-circular case for an applied CV methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are specified in sufficient detail to populate the ledger.
