AVTok is a unified tokenizer that converts audio-video pairs into a compact 1D latent representation via dual-stream transformer and hierarchical training for improved reconstruction and cross-modal generation.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation
AVTok is a unified tokenizer that converts audio-video pairs into a compact 1D latent representation via dual-stream transformer and hierarchical training for improved reconstruction and cross-modal generation.