hub

arXiv preprint arXiv:2404.02905 , year =

Keyu Tian et al · 2024 · arXiv 2404.02905

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 other 1

citation-polarity summary

unclear 2 background 1

representative citing papers

Normalizing Trajectory Models

cs.CV · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

cs.CV · 2025-03-10 · unverdicted · novelty 7.0

Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

cs.CV · 2024-10-17 · unverdicted · novelty 7.0

Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

cs.CV · 2024-06-10 · conditional · novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

Vision Foundation Models as Generalist Tokenizers for Image Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.

TRACE: Temporal Routing with Autoregressive Cross-channel Experts for EEG Representation Learning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

TRACE is an autoregressive EEG pre-training framework using temporally adaptive cross-channel expert routing to learn transferable representations, achieving best results on several of eight downstream benchmarks.

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

eess.AS · 2026-04-21 · unverdicted · novelty 6.0

Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

Generative Refinement Networks for Visual Synthesis

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

Autoregressive Video Generation without Vector Quantization

cs.CV · 2024-12-18 · unverdicted · novelty 6.0

NOVA reformulates video generation as non-quantized autoregressive frame-by-frame temporal prediction combined with set-by-set spatial prediction, outperforming prior AR video models and some diffusion models in efficiency and quality.

Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

cs.CV · 2026-05-11 · unverdicted · novelty 5.0

TaTok is a theoretically grounded adaptive tokenization method that uses global tokens and cumulative conditional entropy filtering to reduce redundancy while improving reconstruction quality over fixed-rate patch tokenization.

CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models

cs.CV · 2026-01-29 · unverdicted · novelty 5.0

CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.

MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

cs.CV · 2026-04-21 · unverdicted · novelty 4.0

MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.

SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation

cs.CV · 2026-05-18

citing papers explorer

Showing 13 of 13 citing papers.

Normalizing Trajectory Models cs.CV · 2026-05-08 · unverdicted · none · ref 19 · 2 links
NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation cs.CV · 2025-03-10 · unverdicted · none · ref 44
Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation cs.CV · 2024-10-17 · unverdicted · none · ref 79
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation cs.CV · 2024-06-10 · conditional · none · ref 33
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
Vision Foundation Models as Generalist Tokenizers for Image Generation cs.CV · 2026-05-18 · unverdicted · none · ref 75
VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.
TRACE: Temporal Routing with Autoregressive Cross-channel Experts for EEG Representation Learning cs.LG · 2026-05-12 · unverdicted · none · ref 13
TRACE is an autoregressive EEG pre-training framework using temporally adaptive cross-channel expert routing to learn transferable representations, achieving best results on several of eight downstream benchmarks.
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation eess.AS · 2026-04-21 · unverdicted · none · ref 28
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
Generative Refinement Networks for Visual Synthesis cs.CV · 2026-04-14 · unverdicted · none · ref 52
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
Autoregressive Video Generation without Vector Quantization cs.CV · 2024-12-18 · unverdicted · none · ref 24
NOVA reformulates video generation as non-quantized autoregressive frame-by-frame temporal prediction combined with set-by-set spatial prediction, outperforming prior AR video models and some diffusion models in efficiency and quality.
Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice cs.CV · 2026-05-11 · unverdicted · none · ref 55
TaTok is a theoretically grounded adaptive tokenization method that uses global tokens and cumulative conditional entropy filtering to reduce redundancy while improving reconstruction quality over fixed-rate patch tokenization.
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models cs.CV · 2026-01-29 · unverdicted · none · ref 25
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings cs.CV · 2026-04-21 · unverdicted · none · ref 34
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation cs.CV · 2026-05-18 · unreviewed · ref 22

arXiv preprint arXiv:2404.02905 , year =

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer