Recognition: 2 theorem links
· Lean TheoremCogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Pith reviewed 2026-05-11 12:17 UTC · model grok-4.3
The pith
CogVideo generates videos from text by inheriting weights from a text-to-image model and applying multi-frame-rate hierarchical training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large-scale pretrained transformers have created milestones in text and text-to-image generation, yet video generation faces huge computation costs and scarce relevant datasets. We present the 9B-parameter transformer CogVideo, trained by inheriting a pretrained text-to-image model CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models at a large margin in machine and human evaluations.
What carries the argument
Weight inheritance from the CogView2 text-to-image model plus multi-frame-rate hierarchical training, which transfers static image understanding to dynamic video while aligning text semantics across frame rates.
If this is right
- Generated videos exhibit stronger alignment between text descriptions and complex movements.
- An open-source model at this scale becomes available for further research and applications.
- Video generation can be scaled without full from-scratch training on massive video corpora.
- The approach demonstrates transfer of capabilities from image to video domains via staged alignment training.
Where Pith is reading between the lines
- The same inheritance-plus-hierarchy pattern might support longer or higher-resolution videos if compute budgets increase.
- Fine-tuning on domain-specific video sets could adapt the model for tasks such as animation or simulation.
- Combining the output with audio or 3D models could extend the system toward richer multimedia generation.
Load-bearing premise
That inheriting weights from a text-to-image model plus multi-frame-rate hierarchical training is enough to overcome scarce text-video data and the high cost of training video models from scratch.
What would settle it
Blind human preference tests or standard video quality metrics such as FVD in which CogVideo does not show a clear margin over other publicly released text-to-video systems.
read the original abstract
Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Its application to video generation is still facing many challenges: The potential huge computation cost makes the training from scratch unaffordable; The scarcity and weak relevance of text-video datasets hinder the model understanding complex movement semantics. In this work, we present 9B-parameter transformer CogVideo, trained by inheriting a pretrained text-to-image model, CogView2. We also propose multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models at a large margin in machine and human evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CogVideo, a 9B-parameter transformer for text-to-video generation. It inherits weights from the pretrained CogView2 text-to-image model to reduce compute costs and applies a multi-frame-rate hierarchical training strategy to improve text-video alignment despite limited relevant data. The authors claim CogVideo is likely the first open-source large-scale pretrained text-to-video model and outperforms all publicly available models by a large margin in both machine and human evaluations.
Significance. If the performance claims hold under rigorous controls, this would be a meaningful early contribution to text-to-video generation by showing how weight inheritance from image models and hierarchical training can scale to 9B parameters. The open release of the model is a clear strength that could enable follow-on work, analogous to the role of early large text and image models. However, the significance is reduced because the central empirical claim depends on unshown evidence that the proposed techniques, rather than model scale or dataset choices alone, drive the gains.
major comments (3)
- [§4 (Experiments)] §4 (Experiments): No ablation is presented that isolates the effect of inheriting weights from CogView2 versus random initialization at 9B scale. This is load-bearing for the introduction's claim that inheritance overcomes text-video data scarcity; without it, observed gains could be explained by capacity or data alone.
- [§4.1 (Evaluation protocol)] §4.1 (Evaluation protocol): The multi-frame-rate hierarchical training is not compared against a single-rate baseline in controlled experiments. This weakens the assertion that the hierarchical schedule is responsible for improved text-video alignment, as required to support the 'large margin' superiority claim.
- [§4 (Experiments)] §4 (Experiments): The manuscript supplies no quantitative metrics (e.g., specific FID, CLIP-score, or human preference percentages), named baselines, or dataset statistics to substantiate the 'outperforms all publicly available models at a large margin' statement. These details are necessary to evaluate the central empirical claim.
minor comments (2)
- [Abstract] The abstract would be improved by including at least one concrete quantitative result to support the performance claims.
- Figure captions should be expanded to be self-contained, especially for any qualitative generation examples.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing honest clarifications on our design choices and empirical claims while noting where revisions are feasible.
read point-by-point responses
-
Referee: §4 (Experiments): No ablation is presented that isolates the effect of inheriting weights from CogView2 versus random initialization at 9B scale. This is load-bearing for the introduction's claim that inheritance overcomes text-video data scarcity; without it, observed gains could be explained by capacity or data alone.
Authors: We agree that a controlled ablation isolating weight inheritance at the full 9B scale would strengthen the claim regarding data scarcity. However, training a 9B-parameter model from random initialization requires prohibitive compute (estimated >10,000 GPU-hours per run), which exceeded our resources. Our approach follows established transfer-learning practices from image to video models, with performance gains shown via overall machine and human evaluations. In revision, we will expand Section 4 with additional discussion of this limitation and any supporting evidence from smaller-scale pretraining experiments. revision: partial
-
Referee: §4.1 (Evaluation protocol): The multi-frame-rate hierarchical training is not compared against a single-rate baseline in controlled experiments. This weakens the assertion that the hierarchical schedule is responsible for improved text-video alignment, as required to support the 'large margin' superiority claim.
Authors: We acknowledge that a direct single-rate baseline comparison would better isolate the hierarchical strategy's contribution. The multi-frame-rate approach was introduced to address varying motion speeds and improve alignment under data constraints, with benefits visible in qualitative results and overall metrics. Due to compute limits, this specific ablation was not performed. We will revise the manuscript to elaborate on the design rationale, add qualitative comparisons where possible, and list the missing ablation as a limitation and future direction. revision: partial
-
Referee: §4 (Experiments): The manuscript supplies no quantitative metrics (e.g., specific FID, CLIP-score, or human preference percentages), named baselines, or dataset statistics to substantiate the 'outperforms all publicly available models at a large margin' statement. These details are necessary to evaluate the central empirical claim.
Authors: We will revise the experiments section to report the specific quantitative metrics (FID, CLIP-score, human preference percentages), explicitly name all public baselines compared, and include dataset statistics. These details were available from our evaluations but omitted for brevity in the initial submission; adding them will allow direct assessment of the performance claims. revision: yes
- Full 9B-scale ablation isolating weight inheritance from random initialization
- Controlled ablation comparing multi-frame-rate hierarchical training to single-rate baseline
Circularity Check
No significant circularity; empirical performance claim is independently evaluated
full rationale
The paper's central claim is an empirical statement that CogVideo outperforms public baselines after inheriting weights from CogView2 and applying multi-frame-rate hierarchical training. This is supported by machine and human evaluations on external benchmarks rather than any derivation that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, or load-bearing self-citations that forbid alternatives appear in the provided abstract or described methodology. The inheritance from CogView2 and the training strategy are presented as engineering choices whose effectiveness is measured externally, not assumed or defined into the result. This is a standard self-contained empirical ML paper with no circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A pretrained text-to-image transformer can be effectively adapted to video by adding temporal training
Forward citations
Cited by 42 Pith papers
-
MusicLM: Generating Music From Text
MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
-
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
-
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
-
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
-
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.
-
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.
-
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
-
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
-
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...
-
MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production
MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...
-
LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
Detecting AI-Generated Videos with Spiking Neural Networks
MAST with spiking neural networks achieves 93.14% mean accuracy detecting AI-generated videos from 10 unseen generators by exploiting smoother pixel residuals and compact semantic trajectories.
-
Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditi...
-
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
-
PhyCo: Learning Controllable Physical Priors for Generative Motion
PhyCo adds continuous physical control to video diffusion models via physics-supervised fine-tuning on a large simulation dataset and VLM-guided rewards, yielding measurable gains in physical realism on the Physics-IQ...
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization
A new regularizer transfers frequency awareness from state-space models into image tokenizers, yielding more compact latents that improve diffusion-model generation quality with little reconstruction penalty.
-
ELT: Elastic Looped Transformers for Visual Generation
Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
-
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
-
GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.
-
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.
-
Make-A-Video: Text-to-Video Generation without Text-Video Data
Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
-
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
-
Embody4D: A Generalist 4D World Model for Embodied AI
Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.
-
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
-
Controllable Video Object Insertion via Multiview Priors
A multi-view prior-based framework for video object insertion that uses dual-path conditioning and an integration-aware consistency module to improve appearance stability and occlusion handling.
-
Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation
PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.
-
Not all tokens contribute equally to diffusion learning
DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.
-
Open-Sora: Democratizing Efficient Video Production for All
Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...
-
Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
ModelScope Text-to-Video Technical Report
ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
Reference graph
Works this paper leans on
-
[1]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[2]
J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 6299–6308, 2017
work page 2017
- [3]
-
[4]
A. Clark, J. Donahue, and K. Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019
-
[5]
M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34, 2021
work page 2021
- [6]
- [7]
-
[8]
C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. Advances in neural information processing systems , 29, 2016
work page 2016
- [9]
-
[10]
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial networks.arXiv preprint arXiv:1406.2661, 2014
work page internal anchor Pith review arXiv 2014
-
[11]
J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022
work page internal anchor Pith review arXiv 2022
-
[12]
A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014
work page 2014
-
[13]
J. Lin, C. Gan, and S. Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 7083–7093, 2019
work page 2019
-
[14]
Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021
work page 2021
- [15]
- [16]
-
[17]
R. Rakhimov, D. V olkhonskiy, A. Artemov, D. Zorin, and E. Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020
-
[18]
Zero-Shot Text-to-Image Generation
A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. V oss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021
work page internal anchor Pith review arXiv 2021
- [19]
- [20]
-
[21]
T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2234–2242, 2016
work page 2016
-
[22]
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[23]
I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In ICML’11, page 1017–1024, 2011
work page 2011
- [24]
-
[25]
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015
work page 2015
-
[26]
S. Tulyakov, M.-Y . Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018
work page 2018
-
[27]
Towards Accurate Generative Models of Video: A New Metric & Challenges
T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. To- wards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[28]
A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems , pages 6309–6318, 2017
work page 2017
-
[29]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[30]
C. V ondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics.Advances in neural information processing systems , 29, 2016
work page 2016
-
[31]
X. Wang, J. Wu, J. Chen, L. Li, Y .-F. Wang, and W. Y . Wang. Vatex: A large-scale, high- quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591, 2019
work page 2019
-
[32]
Y . Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. Advances in neural information processing systems, 30, 2017
work page 2017
-
[33]
Scaling autoregressive video models
D. Weissenborn, O. Täckström, and J. Uszkoreit. Scaling autoregressive video models. arXiv preprint arXiv:1906.02634, 2019
- [34]
- [35]
-
[36]
W. Yan, Y . Zhang, P. Abbeel, and A. Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021
work page internal anchor Pith review arXiv 2021
-
[37]
S. Yu, J. Tack, S. Mo, H. Kim, J. Kim, J.-W. Ha, and J. Shin. Generating videos with dynamics- aware implicit generative adversarial networks. arXiv preprint arXiv:2202.10571, 2022. A Attention Analysis To explore the attention mechanism of dual-channel attention, we visualize (1) the attention distribu- tion in the temporal channel and (2) the mixture fa...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.