LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation

Chengyue Wu; Kehai Chen; Min Zhang; Xiangqing Zheng

arxiv: 2510.26412 · v3 · pith:CU7MP3CZnew · submitted 2025-10-30 · 💻 cs.CV · cs.AI

LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation

Xiangqing Zheng , Chengyue Wu , Kehai Chen , Min Zhang This is my paper

classification 💻 cs.CV cs.AI

keywords generationqualityalignmentchallengecharacterconsistencylocot2v-benchlong-form

0 comments

read the original abstract

Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under complex textual inputs remains a significant challenge. In response to this challenge, we present LoCoT2V-Bench, a benchmark for long video generation (LVG) featuring multi-scene prompts with hierarchical metadata (e.g., character settings and camera behaviors), constructed from collected real-world videos. We further propose LoCoT2V-Eval, a multi-dimensional framework covering perceptual quality, text-video alignment, temporal quality, dynamic quality, and Human Expectation Realization Degree (HERD), with an emphasis on aspects such as fine-grained text-video alignment and temporal character consistency. Experiments on 17 representative LVG models reveal pronounced capability disparities across evaluation dimensions, with strong perceptual quality and background consistency but markedly weaker fine-grained text-video alignment and character consistency. These findings suggest that improving prompt faithfulness and identity preservation remains a key challenge for long-form video generation. Our code and data are released at https://github.com/XqZeppelinhead0702/LoCoT2V-Bench

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mitigating Multimodal Hallucination via Phase-wise Self-reward
cs.CV 2026-04 unverdicted novelty 6.0

PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation
cs.CV 2026-04 unverdicted novelty 6.0

Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.
LPM 1.0: Video-based Character Performance Model
cs.CV 2026-04 unverdicted novelty 6.0

LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.