CustomX: Unified Character, Action, and Scene Customization in Video World Models

Bo Dai; Fangyun Wei; Hongyang Zhang; Yan Lu; Yitong Wang

arxiv: 2512.17796 · v2 · pith:K3GAERATnew · submitted 2025-12-18 · 💻 cs.CV · cs.AI



keywords character, models, video, world, actions, customx, environment, generation


Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable-entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment. In this work, we introduce CustomX, which leverages the realism and structural grounding of static world generation while extending controllable-entity models to support user-specified characters capable of performing open-ended actions. Users provide a 3D Gaussian Splatting (3DGS) scene and a character, then use natural language to direct the character through diverse behaviors, ranging from basic locomotion to object-centric interactions, while freely exploring the environment. We formulate the task as conditional autoregressive video generation: CustomX synthesizes temporally coherent video clips that preserve visual fidelity to the provided scene and character. CustomX builds on a pre-trained video generator, and our training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters. Our evaluation covers visual quality, character consistency, action controllability, and long-horizon coherence.
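
The abstract frames the task as conditional autoregressive video generation but gives no implementation details. The sketch below is only a rough illustration of what such a rollout loop could look like, under stated assumptions: each new clip is conditioned on the most recent frames, a fixed scene and character, and the current text instruction. Every name in it (`Conditions`, `rollout`, `toy_generator`, `context_len`, `clip_len`) is hypothetical and is not taken from the paper, and the toy generator merely stands in for a learned video model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# A frame is a list of floats here; a real system would use image tensors.
Frame = List[float]


@dataclass
class Conditions:
    """Fixed conditioning inputs (hypothetical stand-ins, not the paper's API)."""
    scene: str      # placeholder handle for the user-provided 3DGS scene
    character: str  # placeholder handle for the user-provided character


ClipGenerator = Callable[[Sequence[Frame], Conditions, str], List[Frame]]


def rollout(
    generate_clip: ClipGenerator,
    cond: Conditions,
    prompts: Sequence[str],
    first_frame: Frame,
    context_len: int = 2,
) -> List[Frame]:
    """Generate a long video clip by clip.

    Each clip is conditioned on the last `context_len` frames produced so far,
    the fixed scene/character conditions, and the current text instruction.
    """
    video: List[Frame] = [first_frame]
    for prompt in prompts:
        context = video[-context_len:]  # most recent frames only
        video.extend(generate_clip(context, cond, prompt))
    return video


def toy_generator(
    context: Sequence[Frame], cond: Conditions, prompt: str, clip_len: int = 3
) -> List[Frame]:
    """Deterministic stand-in for a learned model (assumption for illustration).

    It continues from the last context frame, shifting values by an amount
    derived from the prompt length; `cond` is accepted but unused.
    """
    last = context[-1]
    step = float(len(prompt))
    return [[v + step * (i + 1) for v in last] for i in range(clip_len)]


if __name__ == "__main__":
    cond = Conditions(scene="example_scene", character="example_character")
    video = rollout(toy_generator, cond, ["walk", "sit"], first_frame=[0.0])
    print(len(video), video[-1])
```

The design point the sketch tries to capture is that long-horizon generation reuses previously generated frames as context, so each instruction continues from wherever the previous clip ended rather than starting afresh.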

