hub Mixed citations

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, et al · 2025 · cs.CV · arXiv 2501.12202

Mixed citation behavior. Most common role is background (67%).

48 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 48 citing papers arXiv PDF

abstract

We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including the open-source models and closed-source models in geometry details, condition alignment, texture quality, and etc. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 baseline 1 method 1 other 1

citation-polarity summary

background 8 unclear 2 baseline 1 use method 1

representative citing papers

UnfoldArt: Zero-Shot Recovery of Full Articulated 3D Objects from Text or Image

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

UnfoldArt uses multi-agent debate grounded in vision-language and video models to infer articulation parameters and reconstruct full 3D objects including occluded parts from text or image inputs.

GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.

CelloCut: Constructive Watertight Remeshing via Tetrahedral Cell Cuts

cs.GR · 2026-05-18 · unverdicted · novelty 7.0

CelloCut formulates watertight remeshing as binary labeling on a Delaunay tetrahedral partition solved by graph-cut minimization with one-sided constraints to guarantee volumetrically consistent solids.

InverseDraping: Recovering Sewing Patterns from 3D Garment Surfaces via BoxMesh Bridging

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

A two-stage autoregressive framework centered on BoxMesh recovers parametric sewing patterns from 3D garment surfaces, claiming state-of-the-art results on benchmarks and generalization to real scans and single-view images.

PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation

cs.CV · 2026-02-04 · unverdicted · novelty 7.0

PerpetualWonder introduces a closed-loop generative simulator with a unified physical-visual representation for long-horizon action-conditioned 4D scene generation from one image.

ATATA: One Algorithm to Align Them All

cs.CV · 2026-01-16 · unverdicted · novelty 7.0

ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.

LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents

cs.CV · 2025-12-19 · unverdicted · novelty 7.0

LangDriveCTRL decomposes driving videos into 3D scene graphs and uses an agentic pipeline with specialized multi-modal agents to perform language-controlled object and behavior edits, achieving nearly 2x higher instruction alignment than prior state-of-the-art methods.

PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models

cs.CV · 2025-05-28 · unverdicted · novelty 7.0

PacTure uses view packing and next-scale autoregressive prediction to generate consistent multi-view PBR textures faster than prior sequential or cross-attention methods.

OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

OmniFit uses a conditional transformer decoder to predict dense body landmarks from multi-modal inputs for scale-agnostic SMPL-X fitting, outperforming prior methods by 57-81% and reaching millimeter accuracy on CAPE and 4D-DRESS benchmarks.

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.

Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

GenLCA enables scalable training of a 3D diffusion model for photorealistic, animatable full-body avatars by tokenizing large-scale real-world videos with a pretrained reconstructor and applying visibility-aware diffusion training to handle partial observations.

HiFiVe: High-Fidelity Vehicle Generation Leveraging Auto-Regressive 2D Generative Priors

cs.CV · 2026-06-24 · unverdicted · novelty 6.0

HiFiVe is a training-free framework using an auto-regressive texture refinement pipeline with depth-based warping, multi-view fusion, and symmetry to enhance both texture and geometry fidelity in vehicle generation from 2D priors.

FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

FoundObj uses foundation-model priors as RL rewards to discover multi-class 3D objects from point clouds without scene-level labels.

Helix4D: Complex 4D Mesh Generation

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

Helix4D generates high-quality dynamic 4D meshes from videos by extending Trellis2 with sliding-window cross-frame attention anchored on the first frame and a repurposed 4D temporal encoding.

ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

ROAR-3D adds a token-wise view router and dual-stream attention to pretrained single-view 3D generators so they can use arbitrary unposed images for higher-fidelity output.

TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

TOPOS creates high-fidelity 3D heads with fixed industry topology from single images via a specialized VAE with Perceiver Resampler and a rectified flow transformer.

Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

HA-HOI produces physically plausible 4D HOI animations from monocular videos by anchoring object reconstruction to human motion and refining the result in a physics-based humanoid-object simulator.

Pixal3D: Pixel-Aligned 3D Generation from Images

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

GenMed: A Pairwise Generative Reformulation of Medical Diagnostic Tasks

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

GenMed uses diffusion models to capture P(X,Y) for medical tasks and performs inference via gradient-based test-time optimization, supporting arbitrary observation combinations without retraining.

MeshReGen: A Unified 3D Geometry Regeneration Framework

cs.CV · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

MeshReGen introduces a conditioned 3D geometry regenerator with VecSet that learns a regeneration prior via self-supervision and reports state-of-the-art results on controllable generation tasks.

Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

cs.CV · 2026-04-13 · unverdicted · novelty 6.0 · 2 refs

Pair2Scene generates complex 3D scenes beyond training data by training a network on local object-pair placement rules and applying them recursively with collision-aware sampling.

UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

cs.CV · 2026-04-01 · unverdicted · novelty 6.0

UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.

citing papers explorer

Showing 48 of 48 citing papers.

UnfoldArt: Zero-Shot Recovery of Full Articulated 3D Objects from Text or Image cs.CV · 2026-06-29 · unverdicted · none · ref 22 · internal anchor
UnfoldArt uses multi-agent debate grounded in vision-language and video models to infer articulation parameters and reconstruct full 3D objects including occluded parts from text or image inputs.
GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction cs.CV · 2026-05-22 · unverdicted · none · ref 22 · internal anchor
GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.
CelloCut: Constructive Watertight Remeshing via Tetrahedral Cell Cuts cs.GR · 2026-05-18 · unverdicted · none · ref 42 · internal anchor
CelloCut formulates watertight remeshing as binary labeling on a Delaunay tetrahedral partition solved by graph-cut minimization with one-sided constraints to guarantee volumetrically consistent solids.
InverseDraping: Recovering Sewing Patterns from 3D Garment Surfaces via BoxMesh Bridging cs.CV · 2026-04-03 · unverdicted · none · ref 14 · internal anchor
A two-stage autoregressive framework centered on BoxMesh recovers parametric sewing patterns from 3D garment surfaces, claiming state-of-the-art results on benchmarks and generalization to real scans and single-view images.
PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation cs.CV · 2026-02-04 · unverdicted · none · ref 62 · internal anchor
PerpetualWonder introduces a closed-loop generative simulator with a unified physical-visual representation for long-horizon action-conditioned 4D scene generation from one image.
ATATA: One Algorithm to Align Them All cs.CV · 2026-01-16 · unverdicted · none · ref 64 · internal anchor
ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.
LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents cs.CV · 2025-12-19 · unverdicted · none · ref 61 · internal anchor
LangDriveCTRL decomposes driving videos into 3D scene graphs and uses an agentic pipeline with specialized multi-modal agents to perform language-controlled object and behavior edits, achieving nearly 2x higher instruction alignment than prior state-of-the-art methods.
PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models cs.CV · 2025-05-28 · unverdicted · none · ref 6 · internal anchor
PacTure uses view packing and next-scale autoregressive prediction to generate consistent multi-view PBR textures faster than prior sequential or cross-attention methods.
OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction cs.CV · 2026-04-23 · unverdicted · none · ref 67
OmniFit uses a conditional transformer decoder to predict dense body landmarks from multi-modal inputs for scale-agnostic SMPL-X fitting, outperforming prior methods by 57-81% and reaching millimeter accuracy on CAPE and 4D-DRESS benchmarks.
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches cs.CV · 2026-04-15 · unverdicted · none · ref 59
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors cs.CV · 2026-04-14 · unverdicted · none · ref 36
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 96
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos cs.CV · 2026-04-08 · unverdicted · none · ref 11
GenLCA enables scalable training of a 3D diffusion model for photorealistic, animatable full-body avatars by tokenizing large-scale real-world videos with a pretrained reconstructor and applying visibility-aware diffusion training to handle partial observations.
HiFiVe: High-Fidelity Vehicle Generation Leveraging Auto-Regressive 2D Generative Priors cs.CV · 2026-06-24 · unverdicted · none · ref 5 · internal anchor
HiFiVe is a training-free framework using an auto-regressive texture refinement pipeline with depth-based warping, multi-view fusion, and symmetry to enhance both texture and geometry fidelity in vehicle generation from 2D priors.
FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation cs.CV · 2026-05-26 · unverdicted · none · ref 89 · internal anchor
FoundObj uses foundation-model priors as RL rewards to discover multi-class 3D objects from point clouds without scene-level labels.
Helix4D: Complex 4D Mesh Generation cs.CV · 2026-05-25 · unverdicted · none · ref 44 · internal anchor
Helix4D generates high-quality dynamic 4D meshes from videos by extending Trellis2 with sliding-window cross-frame attention anchored on the first frame and a repurposed 4D temporal encoding.
ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation cs.CV · 2026-05-20 · unverdicted · none · ref 77 · internal anchor
ROAR-3D adds a token-wise view router and dual-stream attention to pretrained single-view 3D generators so they can use arbitrary unposed images for higher-fidelity output.
TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation cs.CV · 2026-05-14 · unverdicted · none · ref 61 · internal anchor
TOPOS creates high-fidelity 3D heads with fixed industry topology from single images via a specialized VAE with Perceiver Resampler and a rectified flow transformer.
Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos cs.CV · 2026-05-14 · unverdicted · none · ref 13 · internal anchor
HA-HOI produces physically plausible 4D HOI animations from monocular videos by anchoring object reconstruction to human motion and refining the result in a physics-based humanoid-object simulator.
Pixal3D: Pixel-Aligned 3D Generation from Images cs.CV · 2026-05-11 · unverdicted · none · ref 6 · internal anchor
Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.
GenMed: A Pairwise Generative Reformulation of Medical Diagnostic Tasks cs.CV · 2026-05-11 · unverdicted · none · ref 45 · internal anchor
GenMed uses diffusion models to capture P(X,Y) for medical tasks and performs inference via gradient-based test-time optimization, supporting arbitrary observation combinations without retraining.
MeshReGen: A Unified 3D Geometry Regeneration Framework cs.CV · 2026-04-30 · unverdicted · none · ref 86 · 2 links · internal anchor
MeshReGen introduces a conditioned 3D geometry regenerator with VecSet that learns a regeneration prior via self-supervision and reports state-of-the-art results on controllable generation tasks.
Pair2Scene: Learning Local Object Relations for Procedural Scene Generation cs.CV · 2026-04-13 · unverdicted · none · ref 33 · 2 links · internal anchor
Pair2Scene generates complex 3D scenes beyond training data by training a network on local object-pair placement rules and applying them recursively with collision-aware sampling.
UniRecGen: Unifying Multi-View 3D Reconstruction and Generation cs.CV · 2026-04-01 · unverdicted · none · ref 109 · internal anchor
UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.
MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation cs.CV · 2026-03-12 · unverdicted · none · ref 39 · internal anchor
MV-SAM3D adds multi-view fusion via multi-diffusion with attention-entropy and visibility weighting plus physics-aware optimization to improve fidelity and physical plausibility in layout-aware 3D generation.
Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D Diffusion cs.CV · 2026-05-06 · unverdicted · none · ref 10
DiLAST optimizes 3D latents via guidance from a 2D diffusion model to enable generalizable style transfer for OOD styles in 3D asset generation.
Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers cs.CV · 2026-04-23 · unverdicted · none · ref 65
Sculpt4D generates temporally coherent 4D shapes by integrating a block sparse attention mechanism with time-decaying mask into a pretrained 3D diffusion transformer, achieving SOTA results with 56% less computation.
MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling cs.CV · 2026-04-19 · unverdicted · none · ref 5
MetaEarth3D is the first generative foundation model for spatially consistent, unbounded 3D scene generation at planetary scale using optical Earth observation data.
PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios cs.CV · 2026-04-15 · unverdicted · none · ref 55
PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional loss plus geometric priors to preserve correct component relationships.
Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data cs.CV · 2026-04-15 · unverdicted · none · ref 93
BVE framework enables text-guided 3D editing beyond voxel limits by combining self-constructed data, lightweight semantic injection, and annotation-free masking to preserve local invariance.
CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation cs.CV · 2026-05-18 · unverdicted · none · ref 15 · internal anchor
CMAG combines 3D concept scaffolding, prompt decomposition, taxonomy routing, hybrid retrieval, and agentic VLM verification to assemble topologically consistent avatars from catalog assets given free-form text prompts.
From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation cs.GR · 2026-04-26 · unverdicted · none · ref 6 · 2 links · internal anchor
The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models cs.CV · 2026-01-29 · unverdicted · none · ref 41 · internal anchor
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.
DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation cs.CV · 2025-09-09 · unverdicted · none · ref 3 · internal anchor
LGAA is a modular adapter framework that lifts multi-view diffusion models to produce 2D Gaussian Splats with PBR channels for high-quality relightable 3D mesh extraction using data-efficient finetuning on 69k instances.
Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation cs.CV · 2026-04-20 · unverdicted · none · ref 30
Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples sparse-view multiview generation with 3D Gaussian lifting.
"From remembering to shaping": Narrating Shared Experiences by Co-Designing Cultural Heritage Artifacts in Collaborative VR cs.HC · 2026-04-16 · unverdicted · none · ref 90
A collaborative VR workflow with GenAI lets users merge prompts and creatively repurpose outputs to co-create 3D artifacts that narrate shared cultural heritage experiences.
Controllable Video Object Insertion via Multiview Priors cs.CV · 2026-04-16 · unverdicted · none · ref 39
A multi-view prior-based framework for video object insertion that uses dual-path conditioning and an integration-aware consistency module to improve appearance stability and occlusion handling.
AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation cs.CV · 2026-04-29 · unverdicted · none · ref 47 · internal anchor
AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantically accurate, temporally coherent animations in seconds.
Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation cs.GR · 2026-04-22 · unverdicted · none · ref 23 · internal anchor
Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.
Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details cs.CV · 2025-06-19 · unverdicted · none · ref 23 · internal anchor
Hunyuan3D 2.5's LATTICE model with 10B parameters generates detailed 3D shapes from images and uses multi-view PBR for textures, outperforming prior methods in fidelity and mesh quality.
Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material cs.CV · 2025-06-18 · unverdicted · none · ref 36 · internal anchor
Hunyuan3D 2.1 is a two-part system with DiT for shape generation and Paint for texture synthesis that produces high-fidelity 3D assets with PBR materials.
Stream3D: Sequential Multi-View 3D Generation via Evidential Memory cs.CV · 2026-05-20 · unreviewed · ref 86 · internal anchor
QuadLink: Autoregressive Quad-Dominant Mesh Generation via Point-Relation Learning cs.GR · 2026-05-16 · unreviewed · ref 124 · internal anchor
VecSet-Edit: Unleashing Pre-trained LRM for Mesh Editing from Single Image cs.CV · 2026-02-04 · unreviewed · ref 17 · internal anchor
DVD: Discrete Voxel Diffusion for 3D Generation and Editing cs.CV · 2026-05-08 · unreviewed · ref 5
CADFit: Precise Mesh-to-CAD Program Generation with Hybrid Optimization cs.CV · 2026-05-02 · unreviewed · ref 5
AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly cs.RO · 2026-04-10 · unreviewed · ref 50
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models cs.CV · 2026-04-06 · unreviewed · ref 162

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer