Concept-master: Multi-concept video customization on diffusion transformer models without test-time tuning

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 1

citation-polarity summary

fields: cs.CV 5

years: 2026 4, 2025 1

verdicts: UNVERDICTED 5

roles: background 1

polarities: background 1

representative citing papers

Generate Your Talking Avatar from Video Reference

cs.CV · 2026-04-30 · unverdicted · novelty 6.0

TAVR generates high-fidelity talking avatars from cross-scene video references via token selection and three-stage training (same-scene pretraining, cross-scene fine-tuning, identity RL), outperforming baselines on a new 158-pair benchmark.

Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute

cs.CV · 2025-04-23 · unverdicted · novelty 6.0

A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on CogVideoX-5B to match prior performance at 1% compute.

Evolution of Video Generative Foundations

cs.CV · 2026-04-07 · unverdicted · novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

citing papers explorer

Showing 5 of 5 citing papers.

  • Generate Your Talking Avatar from Video Reference · cs.CV · 2026-04-30 · unverdicted · none · ref 22

    TAVR generates high-fidelity talking avatars from cross-scene video references via token selection and three-stage training (same-scene pretraining, cross-scene fine-tuning, identity RL), outperforming baselines on a new 158-pair benchmark.

  • MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation · cs.CV · 2026-04-21 · unverdicted · none · ref 9

    MMControl adds multi-modal controls for identity, timbre, pose, and layout to unified audio-video diffusion models via dual-stream injection and adjustable guidance scaling.

  • Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute · cs.CV · 2025-04-23 · unverdicted · none · ref 24

    A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on CogVideoX-5B to match prior performance at 1% compute.

  • Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation · cs.CV · 2026-05-17 · unverdicted · none · ref 28

    Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.

  • Evolution of Video Generative Foundations · cs.CV · 2026-04-07 · unverdicted · none · ref 200

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.