Mixed citations

Taming Transformers for High-Resolution Image Synthesis, 2021

Patrick Esser, Robin Rombach, Bj ¨orn Ommer · 2021 · arXiv 2012.09841

Mixed citation behavior. Most common role is background (57%).

11 Pith papers citing it

Background 57% of classified citations

read on arXiv browse 11 citing papers

citation-role summary

background 4 method 3

citation-polarity summary

background 4 use method 2 extend 1

representative citing papers

Hierarchical Text-Conditional Image Generation with CLIP Latents

cs.CV · 2022-04-13 · accept · novelty 7.0

A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

High-Resolution Image Synthesis with Latent Diffusion Models

cs.CV · 2021-12-20 · conditional · novelty 7.0

Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrained autoencoders with cross-attention conditioning, while cutting computational and

Consistency Training while Mitigating Obfuscation via Rate Matching

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.

GPIC: A Giant Permissive Image Corpus for Visual Generation

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

GPIC is a new 28-trillion-pixel permissively licensed image corpus with 100M training examples for visual generative modeling.

Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation

cs.CV · 2026-05-24 · unverdicted · novelty 6.0

LEASE achieves state-of-the-art unified performance on ImageNet-1K by combining masked token reconstruction and codebook contrast losses in a one-time precomputed discrete token space.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

Emu3: Next-Token Prediction is All You Need

cs.CV · 2024-09-27 · unverdicted · novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

cs.CV · 2023-11-25 · conditional · novelty 6.0

Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.

Mapping Whisper Representations to Human ECoG Responses with Interpretable Time-Resolved Neural Encoding

q-bio.NC · 2026-06-01 · unverdicted · novelty 5.0

The paper introduces a time-resolved neural encoder combining Whisper embeddings with recurrent temporal modeling and soft attention to predict ECoG responses, finding strongest alignment in intermediate layers and anatomically coherent phoneme organization in electrodes.

CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

physics.ins-det · 2026-05-12 · unverdicted · novelty 5.0

CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditional flow matching.

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

cs.CV · 2022-05-29 · unverdicted · novelty 5.0

CogVideo is a large-scale transformer pretrained for text-to-video generation that outperforms public models in evaluations.

citing papers explorer

Showing 1 of 1 citing paper after filters.

High-Resolution Image Synthesis with Latent Diffusion Models cs.CV · 2021-12-20 · conditional · none · ref 23
Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrained autoencoders with cross-attention conditioning, while cutting computational and

Taming Transformers for High-Resolution Image Synthesis, 2021

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer