
arxiv: 2606.26058 · v1 · pith:KZNFBUUNnew · submitted 2026-06-24 · 💻 cs.CV

DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

Pith reviewed 2026-06-25 18:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords subject-driven text-to-video · open domain generation · video personalization · domain decoupling · subject fidelity · generative flexibility · text-to-video synthesis

The pith

DomainShuttle decouples domain features from subject identity to support both exact retention and flexible cross-domain video edits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current subject-driven text-to-video systems maximize fidelity to a reference image when the output stays in the same domain, but this choice restricts how freely the video can adopt new styles, settings, or semantic combinations demanded by the text prompt. The paper proposes that an ideal system should move between domains without forcing a choice between accuracy and adaptability. DomainShuttle introduces three components that separate domain-specific modeling from core subject properties and enforce consistency on intrinsic features. If the separation works, the same model can produce both near-identical copies of the reference subject and versions placed in entirely different visual domains. Readers would care because personalized video tools are limited today by this fidelity-flexibility trade-off in open-domain use cases.

Core claim

DomainShuttle targets both high fidelity and generative flexibility in open domain video personalization through three components. Domain-MoT decouples video and reference features and adds a domain-aware AdaLN for domain-specific modeling of reference images. The Video-Reference DualRoPE scheme places reference image tokens and video tokens in separate RoPE spaces for precise subject-level spatial modeling. Cross-Pair Consistent Loss aims to extract intrinsic subject features that are unaffected by irrelevant features. The abstract reports that extensive experiments show significant improvements over existing methods in both in-domain and cross-domain scenarios.
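To make the "domain-aware AdaLN" idea concrete, here is a minimal sketch of adaptive layer normalization in which the scale and shift are selected by a domain label. Only the general AdaLN pattern (normalize, then apply `x * (1 + scale) + shift`) is standard; the `DOMAIN_MODULATION` table, its two domain names, and the scalar modulation values are hypothetical and are not taken from the paper, which is not available beyond its abstract here. In a real model the scale and shift would typically be vectors predicted from a conditioning embedding.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Hypothetical per-domain modulation values (illustration only, not from the paper).
DOMAIN_MODULATION = {
    "in_domain": {"scale": 0.0, "shift": 0.0},
    "cross_domain": {"scale": 0.5, "shift": -0.1},
}

def domain_adaln(x, domain):
    """AdaLN-style modulation: LN(x) * (1 + scale) + shift, chosen by domain label."""
    mod = DOMAIN_MODULATION[domain]
    return [v * (1.0 + mod["scale"]) + mod["shift"] for v in layer_norm(x)]

if __name__ == "__main__":
    features = [0.2, 1.5, -0.7, 3.0]
    print(domain_adaln(features, "in_domain"))
    print(domain_adaln(features, "cross_domain"))
```

With zero scale and shift the layer reduces to plain layer normalization, so the same reference features can be passed through unchanged in one domain and re-modulated in another.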

What carries the argument

Domain-MoT decoupling with domain-aware AdaLN, Video-Reference DualRoPE placement, and Cross-Pair Consistent Loss, which together isolate intrinsic subject features from domain variation.
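One way to picture "separate RoPE spaces" is to give video tokens and reference tokens position indices drawn from disjoint ranges before the rotary embedding is applied. The sketch below assumes exactly that and nothing more: `rope_rotate` is a standard 1-D rotary embedding, while `assign_positions` and its `ref_offset` parameter are hypothetical names for an offset-based split. The paper's actual DualRoPE formulation is not given in the abstract and may differ.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply a 1-D rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos / (base ** (i / dim))  # one frequency per coordinate pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def assign_positions(num_video_tokens, num_ref_tokens, ref_offset):
    """Hypothetical split: video tokens use 0..N-1, reference tokens start at ref_offset."""
    if ref_offset < num_video_tokens:
        raise ValueError("ref_offset must lie beyond the video position range")
    video_pos = list(range(num_video_tokens))
    ref_pos = [ref_offset + i for i in range(num_ref_tokens)]
    return video_pos, ref_pos

def dot(a, b):
    """Plain dot product, standing in for one attention logit."""
    return sum(x * y for x, y in zip(a, b))

if __name__ == "__main__":
    video_pos, ref_pos = assign_positions(4, 2, ref_offset=100)
    q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
    # Logit between the first video token (query) and the first reference token (key).
    print(dot(rope_rotate(q, video_pos[0]), rope_rotate(k, ref_pos[0])))
```

Because rotary attention logits depend only on the difference between positions, keeping the two index ranges disjoint means no reference token shares a relative offset pattern with the video tokens it is compared against.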

If this is right

  • The same model can handle both in-domain retention of subject features and cross-domain changes driven by text prompts.
  • Diverse application scenarios become feasible, including novel styles, semantic combinations, and domain attributes.
  • The trade-off between subject fidelity and editability is reduced in subject-driven video generation.
  • Performance gains appear across multiple open-domain test cases without separate models for each scenario.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of domain and identity signals could be tested in image-only or audio-conditioned generation to check whether the pattern generalizes.
  • If the loss successfully isolates intrinsic features, it might reduce the need for heavy reference-image conditioning in future video models.
  • Extending the DualRoPE scheme to longer sequences or multi-subject videos would be a direct next measurement of spatial control limits.

Load-bearing premise

The proposed decoupling, token placement, and consistency loss will isolate intrinsic subject features without introducing artifacts or lowering fidelity in practice.

What would settle it

A quantitative evaluation on a held-out set of cross-domain prompts would settle it. The claim fails if, compared with prior methods, the generated videos lose measurable subject identity (via face or object recognition scores) or fail to match the target domain attributes.
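A test of this kind could be scored along two axes, identity and domain match, using cosine similarity between embeddings. The sketch below assumes that per-frame embeddings, a subject embedding, and a target-domain embedding have already been produced by some encoder; the function names and the two-score layout are hypothetical and do not reproduce the paper's evaluation protocol.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def mean_frame_similarity(frame_embs, target_emb):
    """Average cosine similarity of every frame embedding to one target embedding."""
    if not frame_embs:
        raise ValueError("at least one frame embedding is required")
    return sum(cosine(f, target_emb) for f in frame_embs) / len(frame_embs)

def score_video(frame_embs, subject_emb, domain_emb):
    """Hypothetical two-axis score: subject identity and target-domain match."""
    return {
        "identity": mean_frame_similarity(frame_embs, subject_emb),
        "domain": mean_frame_similarity(frame_embs, domain_emb),
    }

if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real encoder outputs.
    frames = [[1.0, 0.0], [0.8, 0.6]]
    print(score_video(frames, subject_emb=[1.0, 0.0], domain_emb=[0.0, 1.0]))
```

Reporting the two scores separately makes the failure modes distinguishable: a method can keep identity while ignoring the target domain, or the reverse.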

Original abstract

Open domain subject-driven text-to-video (S2V) generation has drawn significant interest in academia and industry. Open domain S2V mainly involves two scenarios: in-domain, which requires retaining the reference subject features as much as possible, and cross-domain, which preserves the intrinsic features of the subject while allowing subject-irrelevant properties to vary flexibly according to the text prompt. Existing methods primarily focus on maximizing subject fidelity in in-domain scenarios, which limits their editability and adaptability in cross-domain scenarios, such as novel styles, semantic combinations, or domain attributes. In this study, we propose that an ideal S2V method should flexibly shuttle between different domains, achieving strong performance in both in-domain and cross-domain scenarios. To this end, we propose DomainShuttle, which could achieve high fidelity and generative flexibility for open domain video personalization. Specifically, we introduce Domain-MoT, which decouples videos and reference features and introduces the domain-aware AdaLN for domain-specific modeling of reference images. We then introduce the Video-Reference DualRoPE scheme, which places reference image tokens and video tokens in separate RoPE spaces to enable precise subject-level spatial modeling, and Cross-Pair Consistent Loss, which aims to extract intrinsic subject features unaffected by irrelevant features. Extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods, exhibiting high subject fidelity and generative flexibility across diverse open domain application scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes DomainShuttle for open domain subject-driven text-to-video generation. It introduces Domain-MoT decoupling with domain-aware AdaLN for reference image modeling, a Video-Reference DualRoPE scheme to place reference and video tokens in separate RoPE spaces, and a Cross-Pair Consistent Loss to extract intrinsic subject features. The abstract asserts that these components enable high subject fidelity alongside generative flexibility across in-domain and cross-domain scenarios, with extensive experiments demonstrating significant performance improvements over existing methods.

Significance. The targeted problem of balancing subject fidelity with cross-domain editability in S2V is relevant. If the proposed decoupling mechanisms and loss were shown to work without introducing artifacts, the framework could offer a practical advance. However, the manuscript supplies no quantitative results, ablations, or implementation details, so no determination of significance is possible.

major comments (2)
  1. [Abstract] The claim that 'extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods' is unsupported by any metrics, baseline comparisons, ablation tables, subject similarity scores, or failure-case analysis.
  2. [Abstract] The descriptions of Domain-MoT, domain-aware AdaLN, DualRoPE token placement, and Cross-Pair Consistent Loss remain high-level only, with no equations, pseudocode, or architectural specifications provided to allow assessment of whether they achieve the claimed decoupling and consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the review and the identification of issues in the abstract. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods' is unsupported by any metrics, baseline comparisons, ablation tables, subject similarity scores, or failure-case analysis.

    Authors: We agree that the abstract claim requires explicit support from results. The manuscript contains an experiments section with quantitative comparisons, but the abstract does not reference specific metrics. We will revise the abstract to remove or qualify the claim until the results are more clearly summarized, or add a concise reference to key metrics such as subject fidelity scores if space permits.

    Revision: yes

  2. Referee: [Abstract] The descriptions of Domain-MoT, domain-aware AdaLN, DualRoPE token placement, and Cross-Pair Consistent Loss remain high-level only, with no equations, pseudocode, or architectural specifications provided to allow assessment of whether they achieve the claimed decoupling and consistency.

    Authors: Abstracts conventionally provide high-level summaries. The main text supplies the requested equations for the loss and AdaLN, the DualRoPE formulation, and architectural details. To address the concern directly, we will move one or two key equations into the abstract or add a pointer to the relevant sections and supplementary pseudocode in the revision.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on new architectural components and experiments

Full rationale

The paper introduces Domain-MoT decoupling with domain-aware AdaLN, Video-Reference DualRoPE token placement, and Cross-Pair Consistent Loss as novel mechanisms for subject-driven video generation. Performance improvements are asserted via 'extensive experiments' rather than any derivation, equation, or fitted parameter that reduces outputs to inputs by construction. No self-citations, uniqueness theorems, or renamings of known results appear in the provided text, and the central claims do not collapse into self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, training details, or parameter counts are provided, so no free parameters, axioms, or invented entities can be identified.


