Godiva: Generating open-domain videos from natural descriptions

Godiva: Generating open-domain videos from natural descriptions · 2021 · arXiv 2104.14806

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline: 2 · background: 1

citation-polarity summary

baseline: 2 · background: 1

representative citing papers

TwinQuant: Learnable Subspace Decomposition for 4-Bit LLM Quantization

cs.DC · 2026-06-01 · unverdicted · novelty 7.0

TwinQuant learns quantization-friendly subspaces for 4-bit LLM weights via manifold optimization and a fused kernel, preserving near-FP16 accuracy with up to 1.8x speedup on LLaMA3 and Qwen3 models.

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

cs.SD · 2025-12-30 · unverdicted · novelty 7.0 · 2 refs

PhyAVBench provides the first systematic benchmark and metric for audio-physics grounding in T2AV, I2AV, and V2A models using controlled prompt pairs and real video ground truth.

Beyond the Frame: Generating 360 Panoramic Videos from Perspective Videos

cs.CV · 2025-04-10 · unverdicted · novelty 7.0

A generative model produces realistic and coherent 360 panoramic videos from in-the-wild perspective videos via curated online data and geometry-motion aware operations.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

Phenaki: Variable Length Video Generation From Open Domain Textual Description

cs.CV · 2022-10-05 · unverdicted · novelty 7.0

Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images and videos.

StreamEdit: Training-Free Video Editing via Few-Step Streaming Video Generation

cs.CV · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

StreamEdit enables high-quality training-free video editing by adapting streaming video generation models with dual-branch fast sampling, self-attention bridge, cross-attention grounding, source-oriented guidance, and visual prompting, outperforming prior methods in few-step regimes.

VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation

cs.CV · 2026-05-15 · accept · novelty 6.0

VAGS adapts the CFG scale at each ODE step using velocity alignment signals to raise structural fidelity in editing and sample quality in generation over fixed-scale baselines.

Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

cs.CV · 2023-11-25 · conditional · novelty 6.0

Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

cs.CV · 2023-08-16 · unverdicted · novelty 6.0

DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

Causal Forcing++ applies causal consistency distillation to enable scalable frame-wise 1-2 step autoregressive video generation, outperforming prior 4-step chunk-wise methods on quality metrics while halving first-frame latency.

Stable and Near-Reversible Diffusion ODE Solvers for Image Editing

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

Near-reversible Runge-Kutta ODE solvers combined with vector-field smoothing deliver more stable and higher-fidelity text-guided edits in diffusion models than exactly reversible schemes.

DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

DirectEdit eliminates reconstruction error in flow-based image editing by aligning forward paths and applying attention feature injection with mask-guided noise blending.

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

cs.CV · 2022-05-29 · unverdicted · novelty 5.0

CogVideo is a large-scale transformer pretrained for text-to-video generation that outperforms public models in evaluations.

ModelScope Text-to-Video Technical Report

cs.CV · 2023-08-12 · unverdicted · novelty 4.0

ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

cs.CV · 2026-02-02 · 2 refs

citing papers explorer

Showing 17 of 17 citing papers.

TwinQuant: Learnable Subspace Decomposition for 4-Bit LLM Quantization · cs.DC · 2026-06-01 · unverdicted · none · ref 24
MBench: A Comprehensive Benchmark on Memory Capability for Video World Models · cs.CV · 2026-05-30 · unverdicted · none · ref 80
PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation · cs.SD · 2025-12-30 · unverdicted · none · ref 46 · 2 links
Beyond the Frame: Generating 360 Panoramic Videos from Perspective Videos · cs.CV · 2025-04-10 · unverdicted · none · ref 53
Learning Interactive Real-World Simulators · cs.AI · 2023-10-09 · conditional · none · ref 174
Phenaki: Variable Length Video Generation From Open Domain Textual Description · cs.CV · 2022-10-05 · unverdicted · none · ref 52
StreamEdit: Training-Free Video Editing via Few-Step Streaming Video Generation · cs.CV · 2026-05-20 · unverdicted · none · ref 79 · 2 links
VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation · cs.CV · 2026-05-15 · accept · none · ref 41
Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing · cs.CV · 2026-04-22 · unverdicted · none · ref 36
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets · cs.CV · 2023-11-25 · conditional · none · ref 103
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory · cs.CV · 2023-08-16 · unverdicted · none · ref 27
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation · cs.CV · 2026-05-14 · unverdicted · none · ref 37
Stable and Near-Reversible Diffusion ODE Solvers for Image Editing · cs.CV · 2026-05-12 · unverdicted · none · ref 28
DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing · cs.CV · 2026-05-04 · unverdicted · none · ref 34
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers · cs.CV · 2022-05-29 · unverdicted · none · ref 34
ModelScope Text-to-Video Technical Report · cs.CV · 2023-08-12 · unverdicted · none · ref 61
Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation · cs.CV · 2026-02-02 · unreviewed · ref 40 · 2 links
