Uni3D: Exploring Unified 3D Representation at Scale
Citation behavior is mixed. The most common citation role is method (60%).
citing papers explorer
- 3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models
  By inserting per-region markers and reserved vocabulary tokens before frozen encoder patches and refining them via MSR, 3D-PLOT-LLM adds part-level addressing to 3D LLMs, outperforming baselines on PartVerse-QA and 3DCoMPaT-GrIn with minimal new parameters.
- VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection
  VoxAfford fuses multi-scale voxel features into MLLM output tokens using cross-attention with a learned compatibility gate, achieving SOTA open-vocabulary 3D affordance detection with a ~8% mIoU gain and zero-shot robot transfer.
- CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
  Introduces the CompassAD benchmark and the CompassNet framework for intent-driven affordance prediction on the appropriate object within multi-object 3D point clouds, conditioned on natural-language intent.
- POMA-3D: The Point Map Way to 3D Scene Understanding
  POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks, including navigation and scene retrieval.
- HiFiVe: High-Fidelity Vehicle Generation Leveraging Auto-Regressive 2D Generative Priors
  HiFiVe is a training-free framework that uses an auto-regressive texture refinement pipeline with depth-based warping, multi-view fusion, and symmetry to enhance both texture and geometry fidelity in vehicle generation from 2D priors.
- Helix4D: Complex 4D Mesh Generation
  Helix4D generates high-quality dynamic 4D meshes from videos by extending Trellis2 with sliding-window cross-frame attention anchored on the first frame and a repurposed 4D temporal encoding.
- REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
  REVIVE 3D generates voluminous 3D assets from flat 2D images via an inflated prior construction followed by latent-space refinement, and proposes new metrics for volume and flatness validated by a user study.
- Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
  Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.
- CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining
  CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then uses them to initialize diffusion policies for more sample-efficient fine-tuning on robotic tasks.
- Native and Compact Structured Latents for 3D Generation
  Introduces the O-Voxel omni-voxel representation and a Sparse Compression VAE for structured native 3D latents, enabling efficient training of large flow-matching models that produce higher-quality geometry and materials than prior methods.
- The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
  Contrastive Fusion (ConFu) adds a fused-modality contrastive term to jointly align individual modalities and their combinations, capturing higher-order dependencies such as XOR relations while preserving pairwise alignments.
- SAM 3D: 3Dfy Anything in Images
  SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose, using data annotated at scale by humans and models together with synthetic-to-real training, and achieves a 5:1 win rate in human preference tests.
- DoReMi: Bridging 3D Domains via Topology-Aware Domain-Representation Mixture of Experts
  DoReMi uses self-supervised pre-training on topological and texture variations plus domain-aware experts with spatial-guided routing and entropy-controlled allocation to reach 80.1% mIoU on ScanNet and 77.2% mIoU on S3DIS.
- Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation
  Applies hypergraph reasoning with geometric-aware prototypes to novel class discovery in point cloud segmentation.
- SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
  SGSoft introduces a template-guided pipeline that fuses semantic and geometric features to learn dense correspondences across deformable 3D shapes, with claimed SOTA generalization and real-time efficiency.
- SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing
  The SynVA toolkit generates realistic vascular meshes and anatomically plausible aneurysms, and releases 50,000 labeled samples for medical vision tasks.
- Pose-Aware Diffusion for 3D Generation
  PAD synthesizes 3D geometry in observation space, using depth unprojection as an anchor to eliminate pose ambiguity in image-to-3D generation.
- R3D: Revisiting 3D Policy Learning
  A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
- CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
  CG-MLLM is a multimodal LLM that uses a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components, integrated with a pre-trained vision-language backbone and a 3D VAE, to enable 3D captioning and high-fidelity generation.
- Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details
  Hunyuan3D 2.5's LATTICE model, with 10B parameters, generates detailed 3D shapes from images and uses multi-view PBR for textures, outperforming prior methods in fidelity and mesh quality.
- Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
  Hunyuan3D 2.0 scales flow-based diffusion transformers and texture synthesis models to generate high-resolution textured 3D assets that outperform prior state-of-the-art in geometry, alignment, and texture quality.
- TORA: Topological Representation Alignment for 3D Shape Assembly
- RGB-Pointmap Pretraining for Unified 3D Scene Understanding