Beyond Rigid: Benchmarking Non-Rigid Video Editing

Bingzheng Qu; Kehai Chen; Min Zhang; Xuefeng Bai

arxiv: 2601.18340 · v2 · pith:3CIVISXYnew · submitted 2026-01-26 · 💻 cs.CV

Bingzheng Qu, Xuefeng Bai, Kehai Chen, Min Zhang

keywords: editing, video, non-rigid, alignment, appearance, beyond, distinct, dynamics

As video generation models are increasingly expected to manipulate physical dynamics, there is a growing need to move evaluation beyond appearance fidelity and semantic alignment. Non-rigid video editing offers a uniquely revealing testbed, where distinct materials impose distinct physical constraints. In this paper, we introduce NRVBench, a diagnostic benchmark for non-rigid video editing, where the task is to modify deformable motion while preserving irrelevant regions and maintaining material-specific plausibility. NRVBench contains 180 curated videos across six physics-grounded categories, 2,340 fine-grained editing instructions, 360 multiple-choice questions, and pixel-accurate masks. We further propose NRVE-Acc, a structured VLM-based protocol that decomposes editing success into instruction following, material-aware deformation plausibility, and temporal coherence with motion cues. Experiments on representative inference-time video editing methods reveal a clear mismatch between conventional metrics and physics-aware perceptual editing success: methods that preserve appearance or achieve strong global alignment may still fail under non-rigid dynamics. We additionally introduce VM-Edit, a simple region-conditioned editing baseline that frees the foreground while locking the background, exposing the stability-plasticity trade-off.
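
The idea of freeing the foreground while locking the background can be illustrated with a mask-gated composite. The sketch below is not the authors' VM-Edit implementation, which the abstract does not detail; it only assumes that a per-frame foreground mask is available (as the benchmark's pixel-accurate masks suggest) and that an edited video has already been produced by some editor. The function names `composite_locked_background` and `background_drift` are hypothetical.

```python
import numpy as np


def composite_locked_background(original, edited, mask):
    """Take edited pixels inside the mask and original pixels outside it.

    original, edited: float arrays of shape (T, H, W, C).
    mask: array of shape (T, H, W) with values in [0, 1]; 1 marks foreground.
    This is an illustrative assumption, not the paper's method.
    """
    if original.shape != edited.shape:
        raise ValueError("original and edited must have the same shape")
    if mask.shape != original.shape[:-1]:
        raise ValueError("mask must have shape (T, H, W)")
    # Add a trailing channel axis so the mask broadcasts over C.
    m = mask.astype(original.dtype)[..., None]
    return m * edited + (1.0 - m) * original


def background_drift(original, output, mask):
    """Mean absolute change over background pixels (mask < 0.5)."""
    background = mask < 0.5
    if not background.any():
        return 0.0
    return float(np.abs(output - original)[background].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.random((4, 8, 8, 3))
    edited = rng.random((4, 8, 8, 3))      # stand-in for any editor's output
    mask = np.zeros((4, 8, 8))
    mask[:, 2:6, 2:6] = 1.0                # hypothetical foreground region
    output = composite_locked_background(original, edited, mask)
    print("drift without locking:", background_drift(original, edited, mask))
    print("drift with locking:", background_drift(original, output, mask))
```

A hard composite like this makes the background perfectly stable by construction, while the region allowed to deform is limited to the mask; that is one simple way to picture the stability-plasticity trade-off the abstract mentions.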

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey
cs.CV 2026-04 unverdicted novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.