TextMesh4D: Zero-shot Text-to-4D Mesh Generation
Large-scale, high-quality dynamic 3D (4D) assets are essential for learning physically grounded representations, but remain costly to capture and annotate at scale. This limits the viability of supervised 4D learning and motivates zero-shot text-to-4D generation leveraging pretrained diffusion priors. To model complex dynamics, prior methods typically adopt implicit 3D representations (e.g., NeRFs or 3DGS) for their deformation capacity. However, their implicit nature provides limited control over surface topology, which hinders high-fidelity geometry and makes temporally coherent surface reconstruction challenging. To address these limitations, we explore zero-shot text-to-4D mesh generation. However, a structural mismatch arises when combining diffusion-based guidance with topology-constrained meshes: the guidance is noisy and spatially inconsistent, while meshes impose severe topological constraints, making direct vertex-level deformation unstable. In this paper, we introduce TextMesh4D, the first zero-shot framework for text-to-4D that directly generates dynamic meshes by addressing the above challenge at two complementary levels. Geometrically, we shift deformation modeling from vertices to faces via a Jacobian Deformation Field (JDF), enabling topology-aware surface reconstruction through an integrability-enforcing integration formulation. Semantically, we propose a Local-Global Semantic Regularizer (LGSR) that preserves identity over time by jointly constraining local deformation plausibility and global shape consistency. Extensive experiments demonstrate state-of-the-art temporal consistency, structural fidelity, and visual quality, while remaining efficient on a single 24GB GPU.
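The face-based idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the solver, so the code below assumes a simplified planar (2D) triangle mesh and a plain area-weighted least-squares solve. The function name `integrate_jacobians` and its parameters are hypothetical. Given one 2x2 Jacobian per face, it recovers vertex positions whose per-face gradients match those Jacobians as closely as possible. Because the unknowns are shared vertices, the result is always a connected mesh with the original connectivity, even when the per-face targets disagree with each other.

```python
import numpy as np

def integrate_jacobians(verts, faces, jacobians, pin_index=0, pin_pos=None):
    """Recover 2D vertex positions from per-face Jacobians (illustrative sketch).

    verts:     (V, 2) rest-pose vertex positions
    faces:     (F, 3) vertex indices of each (non-degenerate) triangle
    jacobians: (F, 2, 2) target Jacobian per face
    One vertex is pinned to remove the translation ambiguity.
    """
    verts = np.asarray(verts, dtype=float)
    faces = np.asarray(faces, dtype=int)
    jacobians = np.asarray(jacobians, dtype=float)
    n_v, n_f = len(verts), len(faces)

    # D maps the three vertex rows of a triangle to its two edge vectors.
    D = np.array([[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]])

    A = np.zeros((2 * n_f + 1, n_v))
    b = np.zeros((2 * n_f + 1, 2))
    for f, tri in enumerate(faces):
        p0, p1, p2 = verts[tri]
        E = np.column_stack([p1 - p0, p2 - p0])  # rest-pose edge matrix
        area = 0.5 * abs(np.linalg.det(E))
        # G is the per-face gradient operator: G @ Q_f equals J_f^T,
        # where Q_f holds the deformed positions of the face's vertices.
        G = (D @ np.linalg.inv(E)).T
        w = np.sqrt(area)  # area weighting of the least-squares residual
        A[2 * f:2 * f + 2, tri] = w * G
        b[2 * f:2 * f + 2] = w * jacobians[f].T

    # Pin one vertex so the solution is unique.
    A[-1, pin_index] = 1.0
    b[-1] = verts[pin_index] if pin_pos is None else pin_pos

    new_verts, *_ = np.linalg.lstsq(A, b, rcond=None)
    return new_verts

# Usage: a unit square split into two triangles, with identity Jacobians.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
tris = np.array([[0, 1, 2], [0, 2, 3]])
identity = np.repeat(np.eye(2)[None], len(tris), axis=0)
recovered = integrate_jacobians(square, tris, identity)
```

With identity Jacobians the solve returns the rest pose. Replacing them with any single affine matrix reproduces that affine map of the mesh. A 3D surface version would work with Jacobians restricted to each face's tangent plane and would typically use a sparse solver rather than a dense `lstsq` call.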
Forward citations
Cited by 1 Pith paper
- Follow Your Track: Precise Skeleton Animation Controlled by 3D Trajectories
ACT is a trajectory-conditioned framework for topology-general skeletal animation that injects 3D point trajectories from monocular video into skeletons via a Routed Trajectory Injector for improved fidelity and tempo...