NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
hub
arXiv preprint arXiv:2404.02905 , year =
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.
TRACE is an autoregressive EEG pre-training framework using temporally adaptive cross-channel expert routing to learn transferable representations, achieving best results on several of eight downstream benchmarks.
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
NOVA reformulates video generation as non-quantized autoregressive frame-by-frame temporal prediction combined with set-by-set spatial prediction, outperforming prior AR video models and some diffusion models in efficiency and quality.
TaTok is a theoretically grounded adaptive tokenization method that uses global tokens and cumulative conditional entropy filtering to reduce redundancy while improving reconstruction quality over fixed-rate patch tokenization.
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
citing papers explorer
-
Normalizing Trajectory Models
NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
-
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.
-
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
Vision Foundation Models as Generalist Tokenizers for Image Generation
VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.
-
TRACE: Temporal Routing with Autoregressive Cross-channel Experts for EEG Representation Learning
TRACE is an autoregressive EEG pre-training framework using temporally adaptive cross-channel expert routing to learn transferable representations, achieving best results on several of eight downstream benchmarks.
-
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
Autoregressive Video Generation without Vector Quantization
NOVA reformulates video generation as non-quantized autoregressive frame-by-frame temporal prediction combined with set-by-set spatial prediction, outperforming prior AR video models and some diffusion models in efficiency and quality.
-
Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice
TaTok is a theoretically grounded adaptive tokenization method that uses global tokens and cumulative conditional entropy filtering to reduce redundancy while improving reconstruction quality over fixed-rate patch tokenization.
-
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.
-
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
- SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation