{"paper":{"title":"MiVE: Multiscale Vision-language features for reference-guided video Editing","license":"http://creativecommons.org/licenses/by/4.0/","headline":"MiVE pulls multiscale features from a single vision-language model to guide accurate reference-based video edits.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Chengjing Wu, Luoqi Liu, Meng Zou, Ting Liu, Tong Wang, Xiaochao Qu, Xiaolin Hu","submitted_at":"2026-05-14T10:19:19Z","abstract_excerpt":"Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchicall"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video 
editing.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"MiVE pulls multiscale features from a single vision-language model to guide accurate reference-based video edits.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"92d605ebcdfa646fa449facea177c9b6fa83ce9f227dcef248e072f872b203c2"},"source":{"id":"2605.14664","kind":"arxiv","version":1},"verdict":{"id":"6cc176bd-ca3d-4252-a9f8-09686389c125","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T05:19:37.778019Z","strongest_claim":"Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.","one_line_summary":"MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension.","pith_extraction_headline":"MiVE pulls multiscale features from a single vision-language model to guide accurate reference-based video edits."},"references":{"count":34,"sample":[{"doi":"10.48550/arxiv.2503.07598","year":2025,"title":"VACE: All-in-One Video Creation and Editing","work_id":"c68efbde-3431-4655-a337-87e2871ad6a3","ref_index":1,"cited_arxiv_id":"2503.07598","is_internal_anchor":true},{"doi":"10.1145/3721238.3730673","year":2025,"title":"VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control","work_id":"78615cd4-c278-4978-a0f6-d3afbc8233f2","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"2025 , url =","work_id":"2fc4524e-07d0-4e9e-80a7-763c7eafe6b5","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.48550/arxiv.2512.02933","year":2025,"title":"CoRR , volume =","work_id":"a99c19f6-9776-4fd6-ad91-9f6268407f2e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.48550/arxiv.2512.07469","year":2025,"title":"VideoCoF: Unified Video Editing with Temporal Reasoner","work_id":"41eadf73-66bd-4058-9dc2-d9d37bc8f31b","ref_index":5,"cited_arxiv_id":"2512.07469","is_internal_anchor":true}],"resolved_work":34,"snapshot_sha256":"de5cf2273ec99fe5e306df8f9d17b2ea6c9955f2bf3593a0ca4b0b4a826763c7","internal_anchors":9},"formal_canon":{"evidence_count":2,"snapshot_sha256":"593fd33e42c26674ba2507b358353c6eb1e3c1fbad57963139b2ce21b8c80320"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}