DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

Cheng Chen; Junwen Pan; Nan Chen; Rongchang Xie; Weinan Jia; Wenhan Luo; Wen Zhou; Yiyang Cai; Zhenbang Sun; Zhuowei Chen

arxiv: 2606.26058 · v1 · pith:KZNFBUUNnew · submitted 2026-06-24 · 💻 cs.CV


Pith reviewed 2026-06-25 18:57 UTC · model grok-4.3


keywords: subject-driven text-to-video · open domain generation · video personalization · domain decoupling · subject fidelity · generative flexibility · text-to-video synthesis


The pith

DomainShuttle decouples domain features from subject identity to support both exact retention and flexible cross-domain video edits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current subject-driven text-to-video systems maximize fidelity to a reference image when the output stays in the same domain, but this choice restricts how freely the video can adopt new styles, settings, or semantic combinations demanded by the text prompt. The paper proposes that an ideal system should move between domains without forcing a choice between accuracy and adaptability. DomainShuttle introduces three components that separate domain-specific modeling from core subject properties and enforce consistency on intrinsic features. If the separation works, the same model can produce both near-identical copies of the reference subject and versions placed in entirely different visual domains. Readers would care because personalized video tools are limited today by this fidelity-flexibility trade-off in open-domain use cases.

Core claim

DomainShuttle achieves high fidelity and generative flexibility for open domain video personalization by introducing three components. Domain-MoT decouples video and reference features and adds a domain-aware AdaLN for domain-specific modeling of reference images. The Video-Reference DualRoPE scheme places reference image tokens and video tokens in separate RoPE spaces for precise subject-level spatial modeling. Cross-Pair Consistent Loss extracts intrinsic subject features that are unaffected by subject-irrelevant features. According to the abstract, extensive experiments show significant improvements over existing methods in both in-domain and cross-domain scenarios.

What carries the argument

Domain-MoT decoupling with domain-aware AdaLN, Video-Reference DualRoPE placement, and Cross-Pair Consistent Loss, which together isolate intrinsic subject features from domain variation.
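The abstract does not specify how the domain-aware AdaLN is built. As a rough illustration only, the sketch below assumes it follows the adaptive LayerNorm pattern used in diffusion transformers, where a conditioning signal sets a scale and shift applied after normalization. The `DOMAIN_MODULATION` table, the domain labels, and all function names are hypothetical and are not taken from the paper.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Hypothetical modulation table: domain label -> (scale, shift).
# Assumption: a real model would predict these from a learned domain embedding.
DOMAIN_MODULATION = {
    "in_domain": (0.0, 0.0),
    "cross_domain": (0.5, -0.1),
}

def domain_aware_adaln(x, domain):
    """Apply LayerNorm, then a domain-conditioned affine: y = (1 + scale) * LN(x) + shift."""
    scale, shift = DOMAIN_MODULATION[domain]
    return [(1.0 + scale) * v + shift for v in layer_norm(x)]

# The same reference token is modulated differently depending on the domain label.
ref_token = [0.2, -1.0, 0.7, 1.5]
same = domain_aware_adaln(ref_token, "in_domain")
cross = domain_aware_adaln(ref_token, "cross_domain")
```

The point of the sketch is only that one set of reference features can be pushed through different normalization statistics per domain; whether the paper conditions on a discrete label or a continuous signal is not stated in the abstract.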

If this is right

The same model can handle both in-domain retention of subject features and cross-domain changes driven by text prompts.
Diverse application scenarios become feasible, including novel styles, semantic combinations, and domain attributes.
The trade-off between subject fidelity and editability is reduced in subject-driven video generation.
Performance gains appear across multiple open-domain test cases without separate models for each scenario.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of domain and identity signals could be tested in image-only or audio-conditioned generation to check whether the pattern generalizes.
If the loss successfully isolates intrinsic features, it might reduce the need for heavy reference-image conditioning in future video models.
Extending the DualRoPE scheme to longer sequences or multi-subject videos would be a direct next measurement of spatial control limits.

Load-bearing premise

The proposed decoupling, token placement, and consistency loss will isolate intrinsic subject features without introducing artifacts or lowering fidelity in practice.

What would settle it

A quantitative evaluation on a held-out set of cross-domain prompts. The claim would be undermined if the generated videos either lose measurable subject identity (for example, on face or object recognition scores) or fail to match the target domain attributes more often than videos from prior methods.
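One common way to put a number on subject identity is the mean cosine similarity between per-frame embeddings and a reference embedding. The sketch below shows that computation on toy vectors. The embeddings are made up for illustration; in a real evaluation they would come from a face or object recognition encoder, and nothing here reproduces the paper's actual protocol.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def subject_similarity(frame_embeddings, reference_embedding):
    """Average cosine similarity of every frame embedding to the reference embedding."""
    scores = [cosine(f, reference_embedding) for f in frame_embeddings]
    return sum(scores) / len(scores)

# Toy embeddings (hypothetical): one video stays close to the reference, one drifts.
reference = [1.0, 0.0, 0.0]
faithful_video = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]
drifted_video = [[0.1, 0.9, 0.2], [0.0, 1.0, 0.0]]

faithful_score = subject_similarity(faithful_video, reference)
drifted_score = subject_similarity(drifted_video, reference)
```

A score like this only covers the identity half of the test. Matching the target domain attributes would need a second measurement, such as a text-video alignment score.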

Original abstract

Open domain subject-driven text-to-video (S2V) generation has drawn significant interest in academia and industry. Open domain S2V mainly involves two scenarios: in-domain, which requires retaining the reference subject features as much as possible, and cross-domain, which preserves the intrinsic features of the subject while allowing subject-irrelevant properties to vary flexibly according to the text prompt. Existing methods primarily focus on maximizing subject fidelity in in-domain scenarios, which limits their editability and adaptability in cross-domain scenarios, such as novel styles, semantic combinations, or domain attributes. In this study, we propose that an ideal S2V method should flexibly shuttle between different domains, achieving strong performance in both in-domain and cross-domain scenarios. To this end, we propose DomainShuttle, which could achieve high fidelity and generative flexibility for open domain video personalization. Specifically, we introduce Domain-MoT, which decouples videos and reference features and introduces the domain-aware AdaLN for domain-specific modeling of reference images. We then introduce the Video-Reference DualRoPE scheme, which places reference image tokens and video tokens in separate RoPE spaces to enable precise subject-level spatial modeling, and Cross-Pair Consistent Loss, which aims to extract intrinsic subject features unaffected by irrelevant features. Extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods, exhibiting high subject fidelity and generative flexibility across diverse open domain application scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DomainShuttle names three specific components for balancing subject fidelity and cross-domain flexibility in text-to-video but supplies zero metrics or implementation details to check the claims.


The paper's main contribution is a named method, DomainShuttle, that tries to handle both in-domain fidelity and cross-domain editability in subject-driven video generation. It does this through Domain-MoT with domain-aware AdaLN for decoupling reference and video features, Video-Reference DualRoPE to keep reference tokens in their own positional space, and Cross-Pair Consistent Loss to pull out subject-intrinsic features. The problem framing is direct and the component choices are concrete enough to be tried by others.

What the work does reasonably is identify why prior methods trade off one goal against the other and propose targeted fixes rather than generic fine-tuning. The DualRoPE idea in particular looks like a straightforward way to avoid spatial interference between reference and generated frames.
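To make the DualRoPE remark concrete, here is a minimal sketch of one possible reading. It assumes that "separate RoPE spaces" means reference tokens and video tokens draw their position indices from disjoint ranges, which the abstract does not confirm. The 1D rotary embedding is the standard formulation; `REF_OFFSET`, `video_pos`, and `ref_pos` are hypothetical names introduced for illustration.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (standard 1D RoPE)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

NUM_VIDEO_TOKENS = 8
NUM_REF_TOKENS = 4
REF_OFFSET = 1000  # hypothetical start of the index range reserved for reference tokens

def video_pos(i):
    """Position index of the i-th video token."""
    return i

def ref_pos(j):
    """Position index of the j-th reference token, kept outside the video range."""
    return REF_OFFSET + j

q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.8, -0.2, 0.1]

# RoPE logits depend only on the position difference, so these two match
# up to floating-point error (both pairs are 3 positions apart).
logit_a = dot(rope(q, 2), rope(k, 5))
logit_b = dot(rope(q, 12), rope(k, 15))

# A video query attending to a reference key sees a large, distinct offset,
# so the reference never aliases with a position inside the video.
cross_logit = dot(rope(q, video_pos(2)), rope(k, ref_pos(0)))
```

Under this reading, the model can always tell a reference token from a video token by its relative position alone. The paper may instead use separate frequency bases or separate axes, which this sketch does not cover.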

The soft spot is that every performance claim rests on unshown experiments. The abstract states significant improvements and extensive testing across scenarios, yet gives no subject similarity scores, no baseline tables, no ablation results, and no equations or pseudocode for the new modules or loss. Without those, the central assertion that the three pieces deliver both high fidelity and flexibility cannot be evaluated.

This is for groups already working on video personalization or subject-driven generation who might want to test similar decoupling tricks. A reader gets limited value until the numbers appear. It does not yet deserve peer review because the evidence link is missing; the architecture ideas alone do not justify referee time when the results are absent.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes DomainShuttle for open domain subject-driven text-to-video generation. It introduces Domain-MoT decoupling with domain-aware AdaLN for reference image modeling, a Video-Reference DualRoPE scheme to place reference and video tokens in separate RoPE spaces, and a Cross-Pair Consistent Loss to extract intrinsic subject features. The abstract asserts that these components enable high subject fidelity alongside generative flexibility across in-domain and cross-domain scenarios, with extensive experiments demonstrating significant performance improvements over existing methods.

Significance. The targeted problem of balancing subject fidelity with cross-domain editability in S2V is relevant. If the proposed decoupling mechanisms and loss were shown to work without introducing artifacts, the framework could offer a practical advance. However, the manuscript supplies no quantitative results, ablations, or implementation details, so no determination of significance is possible.

major comments (2)

[Abstract] The claim that 'extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods' is unsupported by any metrics, baseline comparisons, ablation tables, subject similarity scores, or failure-case analysis.
[Abstract] The descriptions of Domain-MoT, domain-aware AdaLN, DualRoPE token placement, and Cross-Pair Consistent Loss remain high-level only, with no equations, pseudocode, or architectural specifications provided to allow assessment of whether they achieve the claimed decoupling and consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the review and the identification of issues in the abstract. We address each point below and will revise the manuscript accordingly.


Referee: [Abstract] The claim that 'extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods' is unsupported by any metrics, baseline comparisons, ablation tables, subject similarity scores, or failure-case analysis.

Authors: We agree that the abstract claim requires explicit support from results. The manuscript contains an experiments section with quantitative comparisons, but the abstract does not reference specific metrics. We will revise the abstract to remove or qualify the claim until the results are more clearly summarized, or add a concise reference to key metrics such as subject fidelity scores if space permits.

revision: yes

Referee: [Abstract] The descriptions of Domain-MoT, domain-aware AdaLN, DualRoPE token placement, and Cross-Pair Consistent Loss remain high-level only, with no equations, pseudocode, or architectural specifications provided to allow assessment of whether they achieve the claimed decoupling and consistency.

Authors: Abstracts conventionally provide high-level summaries. The main text supplies the requested equations for the loss and AdaLN, the DualRoPE formulation, and architectural details. To address the concern directly, we will move one or two key equations into the abstract or add a pointer to the relevant sections and supplementary pseudocode in the revision.

revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on new architectural components and experiments


The paper introduces Domain-MoT decoupling with domain-aware AdaLN, Video-Reference DualRoPE token placement, and Cross-Pair Consistent Loss as novel mechanisms for subject-driven video generation. Performance improvements are asserted via 'extensive experiments' rather than any derivation, equation, or fitted parameter that reduces outputs to inputs by construction. No self-citations, uniqueness theorems, or renamings of known results appear in the provided text, and the central claims do not collapse into self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, training details, or parameter counts are provided, so no free parameters, axioms, or invented entities can be identified.




[1] Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, and Jie Zhang. Id-animator: Zero-shot identity-preserving human video generation. arXiv preprint arXiv:2404.15275, 2024.

[2] Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-video generation by frequency decomposition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12978–12988, 2025.

[3] Hengjia Li, Haonan Qiu, Shiwei Zhang, Xiang Wang, Yujie Wei, Zekun Li, Yingya Zhang, Boxi Wu, and Deng Cai. Personalvideo: High id-fidelity video customization without dynamic and semantic degradation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19406–19416, October 2025.

[4] Hengjia Li, Lifan Jiang, Xi Xiao, Tianyang Wang, Hongwei Yi, Boxi Wu, and Deng Cai. Magicid: Hybrid preference optimization for id-consistent and dynamic-preserved video customization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12737–12746, October 2025.

[5] Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14951–14961, October 2025.

[6] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025.

[7] Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, and Zhiyong Wu. Humo: Human-centric video generation via collaborative multi-modal conditioning. arXiv preprint arXiv:2509.08519, 2025.

[8] Yufan Deng, Yuanyang Yin, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Shenghai Yuan, Yiding Yang, Angtian Wang, Bo Liu, Haibin Huang, and Chongyang Ma. MAGREF: Masked guidance for any-reference video generation with subject disentanglement. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Nbl43eAVaE

[9] Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermüller, Brandon Y Feng, and Yiannis Aloimonos. First frame is the place to go for video content customization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9243–9252, 2026.

[10] Zhaoyang Li, Dongjun Qian, Kai Su, Qishuai Diao, Xiangyang Xia, Chang Liu, Wenfei Yang, Tianzhu Zhang, and Zehuan Yuan. Bindweave: Subject-consistent video generation via cross-modal integration. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=FP2XNyV9WL

[11] Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, and Qinglin Lu. Hunyuancustom: A multimodal-driven architecture for customized video generation. arXiv preprint arXiv:2505.04512, 2025.

[12] Yufan Deng, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Angtian Wang, Shenghai Yuan, Yiding Yang, Bo Liu, Haibin Huang, and Chongyang Ma. Cinema: Coherent multi-subject video generation via mllm-based guidance. arXiv preprint arXiv:2503.10391, 2025.

[13] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.

[14] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

[15] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Fx2SbBgcte

[16] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

[17] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[18] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

[19] Team Seedance, Siyan Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Xuyan Chi, Jian Cong, Qinpeng Cui, Qide Dong, Junliang Fan, et al. Seedance 1.5 pro: A native audio-visual joint generation foundation model. arXiv preprint arXiv:2512.13507, 2025.

[20] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[21] Zixuan Ye, Xuanhua He, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qifeng Chen, and Wenhan Luo. Unified in-context video editing. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Vb4nE3WWf5

[22] Zixuan Ye, Quande Liu, Cong Wei, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhan Luo. Visual-aware cot: Achieving high-fidelity visual consistency in unified models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9116–9126, 2026.

[23] Yiyang Cai, Zhengkai Jiang, Yulong Liu, Chunyang Jiang, Wei Xue, Yike Guo, and Wenhan Luo. Foundation cures personalization: Improving personalized models’ prompt consistency via hidden foundation knowledge. Advances in Neural Information Processing Systems, 38:12776–12814, 2026.

[24] Zixuan Ye, Huijuan Huang, Xintao Wang, Pengfei Wan, Di Zhang, and Wenhan Luo. Stylemaster: Stylize your video with artistic generation and translation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2630–2640, 2025.

[25] Shaoshu Yang, Zhe Kong, Feng Gao, Meng Cheng, Xiangyu Liu, Yong Zhang, Zhuoliang Kang, Wenhan Luo, Xunliang Cai, Ran He, et al. Infinitetalk: Audio-driven video generation for sparse-frame video dubbing. arXiv preprint arXiv:2508.14033, 2025.

[26] Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset. arXiv preprint arXiv:2510.15742, 2025.

[27] Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, and Lei Zhang. Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16692–16701, 2025.

[29] Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, et al. Skyreels-a2: Compose anything in video diffusion transformers. arXiv preprint arXiv:2504.02436, 2025.

[30] Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, and Kun Gai. Conceptmaster: Multi-concept video customization on diffusion transformer models without test-time tuning. arXiv preprint arXiv:2501.04698, 2025.

[31] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, and Sergey Tulyakov. Multi-subject open-set personalization in video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6099–6110, June 2025.

[32] Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. arXiv preprint arXiv:2304.09151, 2023.

[33] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t

[34] Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, and Qian He. Less-to-more generalization: Unlocking more controllability by in-context generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18682–18692, 2025.

[35] Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, et al. Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation. arXiv preprint arXiv:2508.09987, 2025.

[36] Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yunsheng Jiang, Gen Li, Xinghui Li, Liyang Chen, SiYu Zhou, Qian He, and Xinglong Wu. Phantom-data: Towards a general subject-consistent video generation dataset. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=IjqKXnzUXx

[37] Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. arXiv preprint arXiv:2505.20292, 2025.

[38] Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, and Tao Mei. Region-constraint in-context generation for instructional video editing. arXiv preprint arXiv:2512.17650, 2025.

[39] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.

[40] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

[41] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

[42] Kling. Kling api. https://klingai.com/global/, 2025. Accessed: 2026-01-25.

[43] Debang Li, Zhengcong Fei, Tuanhui Li, Yikun Dou, Zheng Chen, Jiangping Yang, Mingyuan Fan, Jingtao Xu, Jiahua Wang, Baoxuan Gu, Mingshan Chang, Yuqiang Xie, Binjie Mao, Youqiang Zhang, Nuo Pang, Hao Zhang, Yuzhe Jin, Zhiheng Xu, Dixuan Lin, Guibin Chen, and Yahui Zhou. Skyreels-v3 technique report. arXiv preprint arXiv:2601.17323, 2026.

[44] Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Gme: Improving universal multimodal retrieval by multimodal llms. arXiv preprint arXiv:2412.16855, 2024.

[45] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[47] Google. Nano banana pro. https://gemini.google/au/overview/image-generation/?hl=en-AU, 2025. Accessed: 2026-02-02.

[48] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

[49] OpenAI. Gpt-5.2. https://platform.openai.com/docs/models/gpt-5.2, 2025. Accessed: 2025-12-30.

[50] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

[51] Zinan Guo, Pengze Zhang, Yanze Wu, Chong Mou, Songtao Zhao, and Qian He. Musar: Exploring multi-subject customization from single-subject dataset via attention routing. arXiv preprint arXiv:2505.02823, 2025.

Appendix

In the supplementary materials, we provide the construction of the training set in section A, and present more experimental setup and resul...

Assign a score between 1 and 5

3: Achieving cross-domain transformation for part key features (such as human body), but domain transformation is not achieved for the remaining key and minor features
