hub Canonical reference

Animate-a-story: Storytelling with retrieval-augmented video generation

Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al · 2023 · arXiv 2307.06940

Canonical reference. 100% of citing Pith papers cite this work as background.

10 Pith papers citing it

Background 100% of classified citations

read on arXiv browse 10 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

cs.MM · 2026-04-16 · unverdicted · novelty 7.0

MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts with voiceovers.

Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

eess.IV · 2026-03-30 · unverdicted · novelty 6.0

Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

cs.CV · 2025-10-23 · unverdicted · novelty 6.0

RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

cs.CV · 2025-03-27 · accept · novelty 6.0

VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

cs.CV · 2025-01-07 · conditional · novelty 6.0

Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.

Retrieval-Augmented Generation for AI-Generated Content: A Survey

cs.CV · 2024-02-29 · accept · novelty 5.0

A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.

Character-Centered Dialogue Generation from Scene-Level Prompts

cs.CV · 2025-05-22 · unverdicted · novelty 4.0

A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.

Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling

cs.CV · 2025-03-08 · unverdicted · novelty 3.0

A prompt fusion approach combines bidirectional time-weighted latent blending, dynamics-informed prompt weighting via CLIP, and semantic action representations to produce temporally consistent long videos from text without retraining.

Evolution of Video Generative Foundations

cs.CV · 2026-04-07 · unverdicted · novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

cs.CV · 2024-02-27 · unverdicted · novelty 2.0

The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

citing papers explorer

Showing 10 of 10 citing papers.

MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production cs.MM · 2026-04-16 · unverdicted · none · ref 9
MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts with voiceovers.
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms eess.IV · 2026-03-30 · unverdicted · none · ref 245
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling cs.CV · 2025-10-23 · unverdicted · none · ref 63
RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness cs.CV · 2025-03-27 · accept · none · ref 17
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos cs.CV · 2025-01-07 · conditional · none · ref 33
Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.
Retrieval-Augmented Generation for AI-Generated Content: A Survey cs.CV · 2024-02-29 · accept · none · ref 174
A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.
Character-Centered Dialogue Generation from Scene-Level Prompts cs.CV · 2025-05-22 · unverdicted · none · ref 25
A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.
Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling cs.CV · 2025-03-08 · unverdicted · none · ref 14
A prompt fusion approach combines bidirectional time-weighted latent blending, dynamics-informed prompt weighting via CLIP, and semantic action representations to produce temporally consistent long videos from text without retraining.
Evolution of Video Generative Foundations cs.CV · 2026-04-07 · unverdicted · none · ref 184
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models cs.CV · 2024-02-27 · unverdicted · none · ref 150
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Animate-a-story: Storytelling with retrieval-augmented video generation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer