WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.
hub
The unreasonable effectiveness of deep features as a perceptual metric
11 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
fields
cs.CV 11roles
background 1polarities
background 1representative citing papers
A feature-space method that erases usable identity information from face images via learnable perturbations and a Face Revive Generator, rendering them ineffective for deepfake swapping while preserving visual quality.
Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.
A transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.
VAGS adapts the CFG scale at each ODE step using velocity alignment signals to raise structural fidelity in editing and sample quality in generation over fixed-scale baselines.
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
DiffST delivers state-of-the-art real-world space-time video super-resolution with 17x faster inference than prior diffusion methods by using one-step sampling, cross-frame context aggregation, and video representation guidance.
PRISM improves text image super-resolution by rectifying global priors with flow-matching and modeling local structural uncertainty in a single diffusion pass, achieving SOTA results at millisecond inference.
Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.
LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.
citing papers explorer
-
WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.
-
ID-Eraser: Proactive Defense Against Face Swapping via Identity Perturbation
A feature-space method that erases usable identity information from face images via learnable perturbations and a Face Revive Generator, rendering them ineffective for deepfake swapping while preserving visual quality.
-
World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks
Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.
-
PIXLRelight: Controllable Relighting via Intrinsic Conditioning
A transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.
-
ReactiveGWM: Steering NPC in Reactive Game World Models
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
-
DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution
DiffST delivers state-of-the-art real-world space-time video super-resolution with 17x faster inference than prior diffusion methods by using one-step sampling, cross-frame context aggregation, and video representation guidance.
-
PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution
PRISM improves text image super-resolution by rectifying global priors with flow-matching and modeling local structural uncertainty in a single diffusion pass, achieving SOTA results at millisecond inference.
-
Seeing Fast and Slow: Learning the Flow of Time in Videos
Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.
-
LPM 1.0: Video-based Character Performance Model
LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
-
Back to Basics: Let Denoising Generative Models Denoise
Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.