Five: A fine-grained video editing benchmark for evaluating emerging diffusion and rectified flow models

Minghan Li, Chenxi Xie, Yichen Wu, Lei Zhang, Mengyu Wang · 2025 · arXiv 2503.13684

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 1 dataset 1

citation-polarity summary

background 1 use dataset 1

representative citing papers

LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents

cs.CV · 2025-12-19 · unverdicted · novelty 7.0

LangDriveCTRL decomposes driving videos into 3D scene graphs and uses an agentic pipeline with specialized multi-modal agents to perform language-controlled object and behavior edits, achieving nearly 2x higher instruction alignment than prior state-of-the-art methods.

VideoCoF: Unified Video Editing with Temporal Reasoner

cs.CV · 2025-12-08 · unverdicted · novelty 7.0

VideoCoF adds an explicit reasoning step using edit-region latents in video diffusion models to enable precise mask-free editing and motion alignment with only 50k training pairs.

StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

StreamGVE enables high-quality training-free video editing by converting the task to noise-to-data streaming generation with dual-branch fast sampling, self-attention bridges, cross-attention grounding, source-oriented guidance, and visual prompting.

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

cs.CV · 2026-04-17 · unverdicted · novelty 6.0

VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.

Bernini: Latent Semantic Planning for Video Diffusion

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.

Context Unrolling in Omni Models

cs.CV · 2026-04-23 · unverdicted · novelty 5.0

Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

citing papers explorer

Showing 6 of 6 citing papers.

LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents cs.CV · 2025-12-19 · unverdicted · none · ref 23
LangDriveCTRL decomposes driving videos into 3D scene graphs and uses an agentic pipeline with specialized multi-modal agents to perform language-controlled object and behavior edits, achieving nearly 2x higher instruction alignment than prior state-of-the-art methods.
VideoCoF: Unified Video Editing with Temporal Reasoner cs.CV · 2025-12-08 · unverdicted · none · ref 17
VideoCoF adds an explicit reasoning step using edit-region latents in video diffusion models to enable precise mask-free editing and motion alignment with only 50k training pairs.
StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation cs.CV · 2026-05-20 · unverdicted · none · ref 42
StreamGVE enables high-quality training-free video editing by converting the task to noise-to-data streaming generation with dual-branch fast sampling, self-attention bridges, cross-attention grounding, source-oriented guidance, and visual prompting.
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects cs.CV · 2026-04-17 · unverdicted · none · ref 14
VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
Bernini: Latent Semantic Planning for Video Diffusion cs.CV · 2026-05-21 · unverdicted · none · ref 37
Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.
Context Unrolling in Omni Models cs.CV · 2026-04-23 · unverdicted · none · ref 25
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

Five: A fine-grained video editing benchmark for evaluating emerging diffusion and rectified flow models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer