pith. machine review for the scientific record.

arxiv: 2604.19473 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation

Pith reviewed 2026-05-10 02:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-video generation · multi-event video · attention mechanism · temporal consistency · training-free method · story video generation · diffusion models

The pith

TS-Attn separates attention across time to resolve misalignment and conflicts when generating videos from multi-event text prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-video models often fail on prompts with multiple sequential actions because video content drifts out of sync with the text and because attention mixes up different motions with their text conditions. The paper introduces a training-free Temporal-wise Separable Attention module that rearranges attention weights to keep each time step aware of its place in the sequence while preserving overall scene coherence. Dropped into Wan2.1-T2V-14B and Wan2.2-T2V-A14B, it raises StoryEval-Bench scores by 33.5 percent and 16.4 percent respectively while adding only about two percent to inference time. The same module also works plug-and-play for multi-event image-to-video generation.

Core claim

Temporal-wise Separable Attention dynamically rearranges attention distribution across frames so that each time step aligns correctly with its corresponding text condition and motion objects do not compete for the same attention slots, thereby restoring both prompt fidelity and temporal consistency in multi-event generation.

What carries the argument

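Temporal-wise Separable Attention (TS-Attn), which replaces the original cross-attention in early denoising stages. It pairs a motion region extraction module that identifies motion-related tokens with an event-aware attention modulation module that adjusts how those tokens attend to each event over time, enforcing temporal awareness while preserving global coherence.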
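
A minimal sketch of what this kind of per-frame re-weighting could look like, assuming a simple additive bias on cross-attention logits before the softmax. The function name, the `boost` and `suppress` values, and the way text tokens are labelled with event ids are all hypothetical; the paper may implement the modulation differently.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_cross_attention_weights(logits, frame_of_query, motion_mask,
                                      event_of_token, event_of_frame,
                                      boost=2.0, suppress=2.0):
    """Toy event-aware re-weighting of cross-attention (illustrative only).

    logits[q][t]      raw attention logit of visual query q to text token t
    frame_of_query[q] frame index of query q
    motion_mask[q]    True if query q lies in a motion region
    event_of_token[t] event id of text token t, or -1 for non-event tokens
    event_of_frame[f] event id assigned to frame f
    """
    weights = []
    for qi, row in enumerate(logits):
        target_event = event_of_frame[frame_of_query[qi]]
        biased = []
        for ti, logit in enumerate(row):
            bias = 0.0
            # Only motion-region queries are modulated; background is untouched.
            if motion_mask[qi] and event_of_token[ti] >= 0:
                bias = boost if event_of_token[ti] == target_event else -suppress
            biased.append(logit + bias)
        weights.append(softmax(biased))
    return weights


# Two frames, three queries, three text tokens (one neutral, two event tokens).
weights = modulated_cross_attention_weights(
    logits=[[0.0, 0.0, 0.0]] * 3,
    frame_of_query=[0, 1, 1],
    motion_mask=[True, True, False],
    event_of_token=[-1, 0, 1],
    event_of_frame=[0, 1],
)
```

In this toy setup the motion query in frame 0 puts most of its weight on the first event token, the motion query in frame 1 on the second, and the background query keeps a uniform distribution. Each row still sums to one because the bias is applied before normalization.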

If this is right

  • Pre-trained text-to-video models can generate coherent multi-event videos from a single complex prompt without retraining or sequential prompting.
  • The same module works for image-to-video tasks that contain multiple sequential actions.
  • Inference cost rises by only about two percent while benchmark scores rise substantially on two different large models.
  • The method can be inserted into a variety of existing diffusion-based text-to-video pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests that attention coupling is a primary bottleneck for long-horizon video generation and may be easier to fix than retraining entire models.
  • Future work could test whether the same separation principle improves consistency in longer videos or in models that also generate audio.
  • If the gains hold across many base models, prompt engineering for story videos might become less necessary.
  • An ablation that disables only the temporal re-weighting while keeping other changes would isolate exactly how much the separation contributes.

Load-bearing premise

That the measured gains on StoryEval-Bench arise specifically from fixing temporal misalignment and attention coupling rather than from other side effects of the method.

What would settle it

Run the identical prompts and base models with and without the TS-Attn rearrangement and measure whether the StoryEval-Bench improvement disappears when the temporal separation is removed.
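
One way to score that comparison is a paired bootstrap over per-prompt score differences. The sketch below assumes each prompt yields one scalar score per condition; the function name and the resampling settings are illustrative choices, not part of the paper or the benchmark.

```python
import random
import statistics


def paired_bootstrap_ci(with_scores, without_scores,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-prompt gain and a percentile bootstrap interval (illustrative)."""
    if len(with_scores) != len(without_scores) or not with_scores:
        raise ValueError("need equal-length, non-empty per-prompt scores")
    # Pairing by prompt removes between-prompt difficulty from the comparison.
    diffs = [a - b for a, b in zip(with_scores, without_scores)]
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi
```

If the interval for the full method excludes zero while the interval for a variant with the temporal separation disabled does not, that would support the attribution the paper makes.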

Figures

Figures reproduced from arXiv: 2604.19473 by Bo Li, Daquan Zhou, Hongyu Zhang, Peng-Tao Jiang, Qibin Hou, Yufan Deng, Zhen Dong, Zhiyang Dou, Zilin Pan.

Figure 1
We present TS-Attn, a training-free attention mechanism, which enhances multi-event video generation through alleviating attention conflicts across multi-event conditions. (a) Qualitative results across subjects and scenes. (b) Quantitative comparison on StoryEval-Bench. (c) Latency-performance tradeoff analysis.
Figure 2
Comparison of attention maps along the temporal sequence between TS-Attn and vanilla cross-attention. TS-Attn strengthens motion-event alignment and reduces cross-event interference, ensuring accurate attention distribution among multiple events.
Figure 3
The overall framework of TS-Attn. TS-Attn replaces the original cross-attention in early denoising stages to incorporate motion information with temporal awareness. It consists of a motion region extraction module to identify motion-related tokens and an event-aware attention modulation module to adjust their attention distribution across multiple events.
Figure 4
Qualitative comparison results on multi-event T2V generation. The list in the top-left corner, evaluated jointly by GPT-4o and humans, indicates the completion status of events.
Figure 5
Qualitative comparison results on multi-event I2V generation. The list in the top-left corner, evaluated jointly by GPT-4o and humans, indicates the completion rates. SkyReels-V2-14B generates actions that defy the laws of physics, resulting in a completion score of zero for all events.
Figure 6
Construction pipeline of StoryEval-Bench-I2V. Due to the absence of a dedicated multi-event I2V benchmark, we construct a new evaluation framework to assess the generalization ability of TS-Attn on I2V tasks. StoryEval-Bench (Wang et al., 2025b), as a representative benchmark for multi-event text-to-video generation, has undergone peer review and features a large scale of prompts with high data diversity. …
Figure 7
Ablation results on the effect of motion region mask. Not restricting attention modulation to motion-related regions can, in some cases, lead to background flickering, which ultimately degrades the overall video quality. Additionally, it hinders the motion regions from effectively responding to individual events.
Figure 8
The prompt template for temporal segmentation using the LLM API.
Figure 9
More comparison of attention maps along the temporal sequence between TS-Attn and vanilla cross-attention.
Figure 10
More qualitative comparison results on multi-event generation.
Figure 11
More qualitative comparison results on multi-event generation.
Figure 12
More qualitative comparison results on multi-event generation.
Figure 13
More qualitative comparison results on multi-event generation.
Figure 14
More qualitative comparison results on multi-event generation.
Figure 15
More qualitative comparison results on multi-event generation.
Figure 16
More qualitative comparison results with Wan2.1-14B.
Figure 17
More qualitative comparison results with Wan2.1-14B.
Figure 18
More qualitative comparison results with multi-prompt methods.
Figure 19
More qualitative results on multi-event generation with multiple subjects. The mask diagram on the right side of the figure briefly illustrates how attention rearrangement regulates the temporal attention intensity of each subject to different events under each prompt.
Figure 20
More qualitative comparison results on scene-level multi-event generation.
Figure 21
More qualitative comparison results on interactive long video generation.
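
Figure 8 refers to temporal segmentation of a prompt into events. As a rough illustration of how per-event duration weights could be turned into frame ranges, here is a minimal sketch. The function name, the 4:3:3 weights, and the 20-frame clip length are hypothetical and are not taken from the paper.

```python
def frames_per_event(num_frames, weights):
    """Split [0, num_frames) into contiguous half-open ranges, one per event.

    weights are relative durations (any positive numbers); illustrative only.
    """
    if num_frames <= 0 or not weights or any(w <= 0 for w in weights):
        raise ValueError("need a positive frame count and positive weights")
    total = sum(weights)
    bounds = [0]
    cumulative = 0
    for w in weights:
        cumulative += w
        bounds.append(round(cumulative * num_frames / total))
    bounds[-1] = num_frames  # guard against rounding drift at the end
    return [(bounds[i], bounds[i + 1]) for i in range(len(weights))]


# Three events sharing a 20-frame clip in a 4:3:3 ratio.
segments = frames_per_event(20, [4, 3, 3])
```

Each range could then label its frames with an event id for any later per-frame attention adjustment. Very small weights can round to an empty range, which a real pipeline would need to handle.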
Original abstract

Generating high-quality videos from complex temporal descriptions that contain multiple sequential actions is a key unsolved problem. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt-following capability. We attribute this problem to two primary causes: 1) temporal misalignment between video content and the prompt, and 2) conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and project page are available at https://github.com/Hong-yu-Zhang/TS-Attn.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Temporal-wise Separable Attention (TS-Attn), a training-free plug-and-play attention module for pre-trained text-to-video diffusion models. It targets two issues in multi-event generation—temporal misalignment between video frames and complex prompts, and conflicting attention between motion objects and text conditions—by dynamically rearranging attention distributions. The authors report that TS-Attn integrates into models such as Wan2.1-T2V-14B and Wan2.2-T2V-A14B, yielding 33.5% and 16.4% gains on StoryEval-Bench while adding only 2% inference time, and extends the approach to image-to-video settings. Code is released at the provided GitHub link.

Significance. If the reported benchmark lifts can be shown to arise specifically from TS-Attn’s temporal rearrangement rather than confounding factors, the method would offer a lightweight, training-free route to improve temporal coherence and prompt adherence in existing large video models. The open-source code release is a clear strength that supports reproducibility and adoption.

major comments (2)
  1. [§4 (Experiments), main results table] The 33.5% and 16.4% StoryEval-Bench improvements on Wan2.1-T2V-14B and Wan2.2-T2V-A14B are presented without ablations that hold prompt phrasing, integration layer, and random attention baselines fixed. This leaves open the possibility that equivalent gains could arise from any attention perturbation or model-specific prompt sensitivity rather than the claimed temporal-wise separability and resolution of conflicting attention coupling.
  2. [§3 (Method)] The description of TS-Attn as dynamically rearranging attention to ensure temporal awareness lacks a precise algorithmic specification or pseudocode that would allow readers to verify how the rearrangement differs from standard cross-attention and directly corrects the two stated failure modes.
minor comments (2)
  1. [Abstract / §1] The abstract and introduction refer to StoryEval-Bench without providing its definition, task construction, or citation; a brief description or reference should be added for readers unfamiliar with the benchmark.
  2. [Figures] Figure captions for attention visualizations (if present) should explicitly label which maps correspond to baseline vs. TS-Attn and which frames illustrate the claimed reduction in temporal misalignment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of results and methodological details.

Point-by-point responses
  1. Referee: [§4 (Experiments), main results table] The 33.5% and 16.4% StoryEval-Bench improvements on Wan2.1-T2V-14B and Wan2.2-T2V-A14B are presented without ablations that hold prompt phrasing, integration layer, and random attention baselines fixed. This leaves open the possibility that equivalent gains could arise from any attention perturbation or model-specific prompt sensitivity rather than the claimed temporal-wise separability and resolution of conflicting attention coupling.

    Authors: We agree that controlled ablations are necessary to isolate the specific contribution of the temporal-wise separable rearrangement. The current experiments demonstrate gains relative to the base models and other approaches, but we acknowledge the absence of the suggested fixed-prompt, fixed-layer, and random-perturbation controls. In the revised manuscript we will add these ablations, keeping prompt phrasing and integration layers identical while comparing against a random attention baseline. This will provide direct evidence that the reported improvements arise from the targeted correction of temporal misalignment and conflicting attention coupling rather than generic perturbations. revision: yes

  2. Referee: [§3 (Method)] The description of TS-Attn as dynamically rearranging attention to ensure temporal awareness lacks a precise algorithmic specification or pseudocode that would allow readers to verify how the rearrangement differs from standard cross-attention and directly corrects the two stated failure modes.

    Authors: We accept that a more formal specification would improve verifiability. Section 3 currently describes the dynamic rearrangement process at a conceptual level. In the revised manuscript we will insert explicit pseudocode that details the separation of temporal attention maps, the event-aligned reordering step, and the subsequent fusion, together with a side-by-side comparison to standard cross-attention. This will make clear how the mechanism directly mitigates the two identified failure modes. revision: yes
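
The random-perturbation control mentioned in the first response could be set up in several ways. One hypothetical option, sketched below, keeps the number of frames per event fixed but shuffles which frames are tied to which event, so only the temporal correspondence is destroyed while the amount of perturbation stays the same. The function name and the list-based frame labelling are assumptions for illustration, not the authors' protocol.

```python
import random


def shuffled_event_assignment(event_of_frame, seed=0, max_tries=100):
    """Return a permutation of the frame-to-event labels as a control condition.

    The multiset of labels is preserved; only their temporal order changes.
    """
    rng = random.Random(seed)
    original = list(event_of_frame)
    control = original[:]
    for _ in range(max_tries):
        rng.shuffle(control)
        if control != original:
            break  # found an assignment that actually differs
    return control


# Nine frames, three events in order; the control scrambles that order.
ordered = [0, 0, 0, 1, 1, 1, 2, 2, 2]
control = shuffled_event_assignment(ordered)
```

If scores under the shuffled assignment match those under the ordered one, the gain would look like a generic perturbation effect rather than a result of temporal alignment.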

Circularity Check

0 steps flagged

No significant circularity; empirical module proposal stands independently

full rationale

The paper introduces TS-Attn as a training-free attention rearrangement module to mitigate temporal misalignment and conflicting coupling in multi-event video generation. Its core claims consist of a descriptive mechanism plus measured benchmark lifts (33.5% and 16.4% on StoryEval-Bench) when plugged into existing pre-trained models. No equations, uniqueness theorems, or first-principles derivations are presented that reduce by construction to fitted parameters, self-citations, or renamed inputs. The reported gains are external empirical outcomes rather than statistical artifacts of the method's own definition, and no load-bearing self-citation chain or ansatz smuggling is required for the central argument. The derivation chain is therefore self-contained against the provided benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard transformer attention assumptions plus the unproven claim that the identified misalignment and coupling problems are the dominant failure modes. No new physical entities or fitted constants are introduced.

axioms (1)
  • domain assumption: Standard scaled dot-product attention can be rearranged along the temporal axis without breaking the model's learned weights.
    Invoked when describing the training-free integration into pre-trained models.
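
For reference, the assumption can be written against the standard scaled dot-product form. The additive term $B$ below is an illustrative assumption about how a training-free rearrangement could enter the computation; it is not notation from the paper.

```latex
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
\qquad\longrightarrow\qquad
\mathrm{Attn}_{B}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + B\right) V
```

Under this reading the projections that produce $Q$, $K$, and $V$ are left untouched and every row of the softmax remains a probability distribution, which is what makes the "without breaking the learned weights" premise plausible but still something to verify empirically.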

pith-pipeline@v0.9.0 · 5557 in / 1325 out tokens · 23588 ms · 2026-05-10T02:42:13.849490+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.

Reference graph

Works this paper leans on

30 extracted references · 20 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Talc: Time-aligned captions for multi-scene text-to-video generation. arXiv preprint arXiv:2405.04682, 2024

    URL https://arxiv.org/abs/2405.04682. Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a. Andreas Blattmann, Robin Rombach, Huan Li...

  2. [2]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074,

  3. [3]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047,

  4. [4]

    Cinema: Coherent multi-subject video generation via mllm-based guidance. arXiv preprint arXiv:2503.10391, 2025

    Yufan Deng, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Angtian Wang, Shenghai Yuan, Yiding Yang, Bo Liu, Haibin Huang, and Chongyang Ma. Cinema: Coherent multi-subject video generation via mllm-based guidance. arXiv preprint arXiv:2503.10391, 2025a. Yufan Deng, Yuanyang Yin, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Shenghai Yuan, Yiding Yang, Angtian Wang,...

  5. [5]

    Vchitect-2.0: Parallel transformer for scaling up video diffusion models

    Weichen Fan, Chenyang Si, Junhao Song, Zhenyu Yang, Yinan He, Long Zhuo, Ziqi Huang, Ziyue Dong, Jingwen He, Dongwei Pan, et al. Vchitect-2.0: Parallel transformer for scaling up video diffusion models. arXiv preprint arXiv:2501.08453,

  6. [6]

    I2v-adapter: A general image-to-video adapter for diffusion models

    Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan Deng, Pengfei Wan, Di Zhang, Yufan Liu, Weiming Hu, Zhengjun Zha, et al. I2v-adapter: A general image-to-video adapter for diffusion models. In ACM SIGGRAPH 2024 Conference Papers, pp. 1–12,

  7. [7]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103,

  8. [8]

    Hailuo. https://hailuoai.video/

    Published as a conference paper at ICLR 2026. HailuoAI. Hailuo. https://hailuoai.video/,

  9. [9]

    Conceptattention: Diffusion transformers learn highly interpretable features. doi:10.48550/arXiv.2502.04320

    Alec Helbling, Tuna Han Salih Meral, Ben Hoover, Pinar Yanardag, and Duen Horng Chau. Conceptattention: Diffusion transformers learn highly interpretable features. arXiv preprint arXiv:2502.04320,

  10. [10]

    Hunyuancustom: A multimodal-driven architecture for customized video generation. arXiv preprint arXiv:2505.04512, 2025

    URL https://arxiv.org/abs/2505.04512. Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954,

  11. [11]

    Tuning-free multi-event long video generation via synchronized coupled sampling. arXiv preprint arXiv:2503.08605

    Subin Kim, Seoung Wug Oh, Jui-Hsien Wang, Joon-Young Lee, and Jinwoo Shin. Tuning-free multi-event long video generation via synchronized coupled sampling. arXiv preprint arXiv:2503.08605,

  12. [12]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603,

  13. [13]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326,

  14. [14]

    Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131,

  15. [15]

    Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning. arXiv preprint arXiv:2309.15091

    Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning. arXiv preprint arXiv:2309.15091,

  16. [16]

    Latte: Latent Diffusion Transformer for Video Generation

    Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048,

  17. [17]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241. Springer,

  18. [18]

    Longcat-video technical report

    Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, and Tong Zhang. Longcat-video technical report,

  19. [19]

    Longcat-video technical report

    URL https://arxiv.org/abs/2510.22200. Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211,

  20. [20]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025a. Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via ...

  21. [21]

    Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation

    Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 13629–13638, 2025b. Zun Wang, Jialu Li, Han Lin, Jae...

  22. [22]

    Mavin: Multi-action video generation with diffusion models via transition video infilling. arXiv preprint arXiv:2405.18003

    Bowen Zhang, Xiaofei Xie, Haotian Lu, Na Ma, Tianlin Li, and Qing Guo. Mavin: Multi-action video generation with diffusion models via transition video infilling. arXiv preprint arXiv:2405.18003,

  23. [23]

    Magiccomp: Training-free dual-phase refinement for compositional video generation

    Hongyu Zhang, Yufan Deng, Shenghai Yuan, Peng Jin, Zesen Cheng, Yian Zhao, Chang Liu, and Jie Chen. Magiccomp: Training-free dual-phase refinement for compositional video generation. arXiv preprint arXiv:2503.14428, 2025a. Lvmin Zhang and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation. arXiv preprint arXi...

  24. [24]

    Open-Sora: Democratizing Efficient Video Production for All

    Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, and Yansong Tang. Flexiact: Towards flexible action control in heterogeneous scenarios. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pp. 1–11, 2025b. Zangwei Zheng, Xiangyu Peng...

  25. [25]

    Therefore, this section provides a supplementary explanation for scenarios involving multiple subjects

    Appendix A. TS-Attn for Multiple Subjects. For brevity of description, we introduce TS-Attn in the main text using a single subject and its corresponding event list. Therefore, this section provides a supplementary explanation for scen...

  26. [26]

    Relying solely on attention reinforcement reduces TS-Attn to a mere attention enhancement mechanism for event tokens, lacking temporal correspondence

    It can be observed that removing attention rearrangement leads to a significant performance drop, further demonstrating that the more critical aspect of TS-Attn is the temporal redistribution of cross-attention distributions. Relying solely on attention reinforcement reduces TS-Attn to a mere attention enhancement mechanism for event tokens, lacking tempo...

  27. [27]

    As shown in Figure 9, the attention distributions of different actions in TS-Attn are clearly separated along the temporal sequence

    F. More Attention Visualization Results. We present additional attention analysis to further elaborate on the insights of TS-Attn. As shown in Figure 9, the attention distributions of different actions in TS-Attn are clearly separated along the temporal sequence. Meanwhile, each event exhibits a strong response to the motion regions of its corresponding frames...

  28. [28]

    I. More Diverse Applications of TS-Attn. In this section, we present more potential application scenarios of TS-Attn, including multi-event generation involving multiple subjects, scene-level multi-event generation, and enhancing the potential for interactive long-video generation. Multi-subject multi-event generation. As shown in Figure 19, multi-event gener...

  29. [29]

    (2025), which natively supports video continuity

    To handle more events, we applied TS-Attn to the recently proposed LongCat-Video-13.6B model Team et al. (2025), which natively supports video continuity. This enables us to distribute a larger number of events across multiple clips. For example, 9 events can be divided into 3 clips for generation while maintaining temporal consistency. As illustrated in ...

  30. [30]

    These results highlight the potential of TS-Attn for both interactive and long-form video generation

    For a fixed number of clips, TS-Attn effectively manages more intricate temporal descriptions. These results highlight the potential of TS-Attn for both interactive and long-form video generation. J. The Use of Large Language Models. We use large language models (LLMs) solely for polishing sentence structures and refining the language throughout the manuscr...