RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

· 2025 · cs.CV · arXiv 2510.20206

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present \textbf{RAPO++}, a cross-stage prompt optimization framework that unifies training-data--aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In \textbf{Stage 1}, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. \textbf{Stage 2} introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback -- including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow -- yielding progressively improved video generation quality. \textbf{Stage 3} leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.

representative citing papers

MAVEN A Multi-Agent Framework for Multicultural Text-to-Video Generation

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

MAVEN introduces a multi-agent system for refining prompts in multicultural text-to-video generation and releases a benchmark of 243 prompts and 972 videos showing improved cultural relevance via parallel agent specialization.

Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning

cs.CV · 2026-06-06 · unverdicted · novelty 5.0

A survey of test-time scaling for multimodal foundation models that introduces a three-way taxonomy of sampling, feedback, and search approaches along with applications and benchmarks.

citing papers explorer

Showing 2 of 2 citing papers after filters.

MAVEN A Multi-Agent Framework for Multicultural Text-to-Video Generation cs.CV · 2026-05-16 · unverdicted · none · ref 4 · internal anchor
MAVEN introduces a multi-agent system for refining prompts in multicultural text-to-video generation and releases a benchmark of 243 prompts and 972 videos showing improved cultural relevance via parallel agent specialization.
Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning cs.CV · 2026-06-06 · unverdicted · none · ref 21 · internal anchor
A survey of test-time scaling for multimodal foundation models that introduces a three-way taxonomy of sampling, feedback, and search approaches along with applications and benchmarks.

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

fields

years

verdicts

representative citing papers

citing papers explorer