VideoCoF: Unified Video Editing with Temporal Reasoner

· 2025 · cs.CV · arXiv 2512.07469

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

open full Pith review browse 10 citing papers arXiv PDF

abstract

Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a ``see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weight, data are available at https://github.com/knightyxp/VideoCoF.

citation-role summary

dataset 1

citation-polarity summary

background 1

representative citing papers

OmniTryOn: Video Try-On Anything at Once!

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.

Aurora: Unified Video Editing with a Tool-Using Agent

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.

GeoEdit: Geometry-Aware Object Editing via Dual-Branch Denoising

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

GeoEdit introduces a Lift-Manipulate-Render-Denoise pipeline with dual-branch denoising and variance-homogeneous injection for 3D-consistent object editing in single photos.

LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

LiveEdit distills a bidirectional video foundation model into a unidirectional streaming editor via three-stage training plus mask caching to reach 12.66 FPS with stable edits.

SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

cs.CV · 2026-05-24 · unverdicted · novelty 6.0

SpongeBob introduces the first end-to-end audio-visual joint editing framework using sync-aware bidirectional attention and context-aware modules, plus a new dataset and benchmark, claiming 30% Sync-C and 12.5% Ctx-F1 gains over baselines.

MiVE: Multiscale Vision-language features for reference-guided video Editing

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer for reference-guided video editing, claiming top human preference scores over prior methods.

LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing

cs.CV · 2026-04-18 · unverdicted · novelty 6.0

LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.

Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

A new keyframe selection framework combines structural, tracking, and semantic criteria to select reliable anchor frames for diffusion-based video editing under occlusion.

EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.

Measuring AI Reasoning: A Guide for Researchers

cs.AI · 2026-05-04 · unverdicted · novelty 4.0

Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

citing papers explorer

Showing 10 of 10 citing papers after filters.

OmniTryOn: Video Try-On Anything at Once! cs.CV · 2026-06-07 · unverdicted · none · ref 56 · internal anchor
OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.
Aurora: Unified Video Editing with a Tool-Using Agent cs.CV · 2026-05-18 · unverdicted · none · ref 39 · internal anchor
Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.
GeoEdit: Geometry-Aware Object Editing via Dual-Branch Denoising cs.CV · 2026-06-29 · unverdicted · none · ref 75 · internal anchor
GeoEdit introduces a Lift-Manipulate-Render-Denoise pipeline with dual-branch denoising and variance-homogeneous injection for 3D-consistent object editing in single photos.
LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing cs.CV · 2026-06-25 · unverdicted · none · ref 67 · internal anchor
LiveEdit distills a bidirectional video foundation model into a unidirectional streaming editor via three-stage training plus mask caching to reach 12.66 FPS with stable edits.
SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing cs.CV · 2026-05-24 · unverdicted · none · ref 13 · internal anchor
SpongeBob introduces the first end-to-end audio-visual joint editing framework using sync-aware bidirectional attention and context-aware modules, plus a new dataset and benchmark, claiming 30% Sync-C and 12.5% Ctx-F1 gains over baselines.
MiVE: Multiscale Vision-language features for reference-guided video Editing cs.CV · 2026-05-14 · unverdicted · none · ref 5 · internal anchor
MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer for reference-guided video editing, claiming top human preference scores over prior methods.
LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing cs.CV · 2026-04-18 · unverdicted · none · ref 51 · internal anchor
LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.
Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing cs.CV · 2026-05-22 · unverdicted · none · ref 121 · internal anchor
A new keyframe selection framework combines structural, tracking, and semantic criteria to select reliable anchor frames for diffusion-based video editing under occlusion.
EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation cs.CV · 2026-05-21 · unverdicted · none · ref 65 · internal anchor
EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.
Measuring AI Reasoning: A Guide for Researchers cs.AI · 2026-05-04 · unverdicted · none · ref 59 · internal anchor
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

VideoCoF: Unified Video Editing with Temporal Reasoner

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer