OccDirector uses a VLM-guided Spatio-Temporal MMDiT model with history anchoring to generate physically plausible 4D occupancy from language scripts, supported by the new OccInteract-85k dataset.
A Note on the Inception Score
12 Pith papers cite this work. Polarity classification is still indexing.
abstract
Deep generative models are powerful tools that have produced impressive results in recent years. These advances have been for the most part empirically driven, making it essential that we use high quality evaluation metrics. In this paper, we provide new insights into the Inception Score, a recently proposed and widely used evaluation metric for generative models, and demonstrate that it fails to provide useful guidance when comparing models. We discuss both suboptimalities of the metric itself and issues with its application. Finally, we call for researchers to be more systematic and careful when evaluating and comparing generative models, as the advancement of the field depends upon it.
citation-role summary
citation-polarity summary
representative citing papers
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
BigGANs achieve state-of-the-art class-conditional synthesis on ImageNet 128x128 with Inception Score 166.5 and FID 7.4 by scaling GANs and applying orthogonal regularization plus truncation.
Distribution-wise rewards with subset-replace strategy and post-hoc merging improve FID-50K on SiT (8.30 to 5.77) and EDM2 (3.74 to 3.52) while preserving diversity.
DiT-Pruning introduces an energy-based saliency metric balancing weights and activations plus clustering-aware granularity for post-training pruning of DiTs, showing near-zero CLIP score degradation at 50% sparsity on FLUX.1-dev.
RMMD simultaneously distills diffusion models and optimizes rewards, yielding better FID-reward trade-offs on ImageNet than DI++, DRaFT and HyperNoise, and a 7.5x faster GenCast model that beats its teacher on 93% of weather variables while improving calibration.
UAT presents a diffusion-centric framework coupling continuous latent diffusion for audio with masked discrete diffusion for text in a shared dual-stream backbone to enable unified generation, editing, and captioning.
GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.
Speculative Coupled Decoding stabilizes draft sampling in Speculative Jacobi Decoding via an information-theoretic coupling step, delivering up to 4.2x image and 13.6x video speedups with no quality loss or training.
TC-AE improves reconstruction and generative performance in deep compression by decomposing token-to-latent compression into two stages and using joint self-supervised training.
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.
citing papers explorer
-
Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.