
arxiv: 2606.11670 · v1 · pith:ZGXYBSQXnew · submitted 2026-06-10 · 💻 cs.CV · cs.AI

ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

keywords: identity, argus, mosaic, video, generation, smii, stacked, subject-preserving

Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and conflicts among text, first-frame, and identity references. We argue that the central bottleneck is the point-reference paradigm, which collapses identity into a single static observation entangled with pose, accessories, lighting, background, and camera statistics. We introduce Argus, a Wan-based framework centered on Stacked Multi-View Identity Mosaic Injection (SMII). SMII converts MLLM-selected image/video identity evidence into a 3×3 stacked mosaic, synchronizes the mosaic with the current diffusion time, and injects it as negative-time read-only memory in Wan's native token space. This turns identity from an external clean adapter or a single reference image into a compact dynamic distribution. Around SMII, an MLLM Identity Director selects informative identity moments and resolves condition conflicts, while no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance improve robustness without paired subject-video supervision. We further release HardID-Celeb, a public-figure identity-stress benchmark, and introduce YawScore and OccScore to probe large-yaw and first-frame-occlusion robustness. Argus achieves state-of-the-art results on OpenS2V-Eval Human-Domain, reaching 64.38 Total Score, 71.86 FaceSim, 51.62 NexusScore, and 79.14 NaturalScore. On HardID-Celeb, Argus obtains 76.80 FaceSim and improves YawScore and OccScore by 12.60 and 15.10 points over the strongest baselines, demonstrating that dynamic identity memory and large-scale counterfactual self-supervision are highly effective for subject-preserving video generation.
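
As a rough illustration of the mosaic step only, the sketch below tiles nine equally sized identity crops into one 3×3 grid in row-major order. The abstract does not specify the tile order, resolution, or how "stacking" is realized, so the function name `build_mosaic`, the row-major layout, and the list-of-lists image type are all assumptions made for illustration. It does not model diffusion-time synchronization or the negative-time memory injection.

```python
from typing import List, Sequence

# Hypothetical image type: a grayscale image as rows of pixel values.
Image = List[List[int]]


def build_mosaic(tiles: Sequence[Image], grid: int = 3) -> Image:
    """Tile grid*grid equally sized images into one mosaic (row-major).

    Assumption: row-major placement; the paper's actual layout is not
    stated in the abstract.
    """
    if len(tiles) != grid * grid:
        raise ValueError(f"expected {grid * grid} tiles, got {len(tiles)}")

    height = len(tiles[0])
    width = len(tiles[0][0])
    for tile in tiles:
        if len(tile) != height or any(len(row) != width for row in tile):
            raise ValueError("all tiles must share the same height and width")

    mosaic: Image = []
    for grid_row in range(grid):
        # The tiles that sit side by side in this row of the grid.
        row_tiles = tiles[grid_row * grid:(grid_row + 1) * grid]
        for y in range(height):
            line: List[int] = []
            for tile in row_tiles:
                line.extend(tile[y])  # concatenate pixel rows left to right
            mosaic.append(line)
    return mosaic


# Usage: nine 2x2 tiles, each filled with its own index.
tiles = [[[i] * 2 for _ in range(2)] for i in range(9)]
mosaic = build_mosaic(tiles)
```

With 2×2 tiles the result is a 6×6 grid whose top-left block comes from tile 0 and whose bottom-right block comes from tile 8.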



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

    cs.CV 2026-06 unverdicted novelty 7.0

    DRIVE-CHOREO uses three LLM agents to create a unified position-aware token sequence co-compressed with multi-view video, achieving SOTA BEV mAP of 21.6 and +2.4 NDS improvement on nuScenes.

  2. OrthoMotion: Disentangling Camera and Subject Motion via Geometry Semantics Orthogonal Attention

    cs.CV 2026-06 unverdicted novelty 6.0

    OrthoMotion disentangles camera and subject motion in video generation by splitting attention into algebraically complementary geometric (RoPE rotation) and semantic (gated value) channels driven to orthogonality by a...

  3. ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number

    cs.CV 2026-06 unverdicted novelty 6.0

    ParaScale extracts a gauge-invariant Parallax Number from a reference video and re-realizes the same parallax against the target scene's depth map to achieve scale-calibrated camera motion transfer.

  4. TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

    cs.LG 2026-06 unverdicted novelty 5.0

    TRIDENT is a MARL framework using Richardson-Romberg gradient correction, Lyapunov-constrained trust-region updates, and a physics-informed residual critic that claims O(1/sqrt(K)) convergence to constrained Nash equi...