Accelerating 3D Deep Learning with PyTorch3D
Mixed citation behavior. Most common role is method (62%).
abstract
Deep learning has significantly improved 2D image recognition. Extending into 3D may advance many new applications including autonomous vehicles, virtual and augmented reality, authoring 3D content, and even improving 2D recognition. However, despite growing interest, 3D deep learning remains relatively underexplored. We believe that some of this disparity is due to the engineering challenges involved in 3D deep learning, such as efficiently processing heterogeneous data and reframing graphics operations to be differentiable. We address these challenges by introducing PyTorch3D, a library of modular, efficient, and differentiable operators for 3D deep learning. It includes a fast, modular differentiable renderer for meshes and point clouds, enabling analysis-by-synthesis approaches. Compared with other differentiable renderers, PyTorch3D is more modular and efficient, allowing users to more easily extend it while also gracefully scaling to large meshes and images. We compare the PyTorch3D operators and renderer with other implementations and demonstrate significant speed and memory improvements. We also use PyTorch3D to improve the state-of-the-art for unsupervised 3D mesh and point cloud prediction from 2D images on ShapeNet. PyTorch3D is open-source and we hope it will help accelerate research in 3D deep learning.
citing papers explorer
- Still Camouflage, Moving Illusion: View-Induced Trajectory Manipulation in Autonomous Driving
  Static adversarial camouflage exploits natural view-angle changes during relative motion to induce consistent feature drift in AV perception, leading to incorrect trajectory predictions and unnecessary braking.
- Meschers: Geometry Processing of Impossible Objects
  Meschers are a new mesh representation for impossible geometric objects grounded in discrete exterior calculus that supports full discrete geometry processing including inverse rendering.
- Human face perception reflects inverse-generative and naturalistic discriminative objectives
  Human face perception aligns with neural networks trained on inverse-generative and naturalistic discriminative tasks, as these best predict human dissimilarity judgments on controversial and random face pairs.
- Profile-Specific 3DMM Regression from a Single Lateral Face Image
  Introduces the ProfileSynth dataset and a profile-specific FLAME 3DMM regression baseline with visibility-aware jawline regularization for 3D reconstruction from single lateral face images.
- LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image
  LEXIS-Flow uses VQ-VAE-learned interaction signatures to guide diffusion-based reconstruction of 3D human-object meshes and dense proximity fields from single RGB images, outperforming SOTA on benchmarks.
- Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
  A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.
- InverseDraping: Recovering Sewing Patterns from 3D Garment Surfaces via BoxMesh Bridging
  A two-stage autoregressive framework centered on BoxMesh recovers parametric sewing patterns from 3D garment surfaces, claiming state-of-the-art results on benchmarks and generalization to real scans and single-view images.
- AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation
  AGILE generates complete object meshes via VLM-guided synthesis and tracks poses with anchor-and-track plus contact-aware optimization to achieve robust hand-object reconstruction from video.
- UIKA: Fast Universal Head Avatar from Pose-Free Images
  UIKA is a feed-forward animatable Gaussian head model using UV-guided correspondence estimation and learnable UV tokens with dual-level attention, trained on large-scale synthetic data to handle pose-free inputs.
- NeuralBoneReg: An Instance-Specific Label-Free Point Cloud-Based Method for Multi-Modal Bone Surface Registration
  NeuralBoneReg is a self-supervised instance-specific method using neural UDF and MLP-based point cloud registration that matches supervised SOTA accuracy on CT-US and CT-RGB-D bone datasets without inter-subject training data.
- Objaverse-XL: A Universe of 10M+ 3D Objects
  Objaverse-XL supplies over 10 million diverse 3D objects that, when used to render 100 million views, improve zero-shot novel-view synthesis in models such as Zero123.
- ObjView-Bench: Rethinking Difficulty and Deployment for Object-Centric View Planning
  ObjView-Bench disentangles omnidirectional self-occlusion, saturation difficulty, and set-cover planning difficulty, then shows that budget regimes and reachable-view constraints change planner rankings and failure modes across classical, learned, and hybrid methods.
- Learning a Delighting Prior for Facial Appearance Capture in the Wild
  A delighting network trained via Dataset Latent Modulation on heterogeneous OLAT and Light Stage data enables high-quality in-the-wild facial reflectance capture from video and produces the NeRSemble-Scan dataset.
- Multimodal Diffusion to Mutually Enhance Polarized Light and Low Resolution EBSD Data
  A multimodal diffusion model trained on synthetic data enhances low-resolution EBSD and corrupted polarized light data, achieving near full-resolution performance with only 25% EBSD data.
- TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches
  TouchAnything reconstructs accurate 3D object geometries from only a few tactile contacts by optimizing for consistency with a pretrained visual diffusion prior.
- Visually-grounded Humanoid Agents
  A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.
- Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas
  Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.
- MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
  MAGICIAN uses Imagined Gaussians from occupancy networks for efficient coverage gain computation in tree-search based long-horizon planning for active mapping, achieving SOTA results on indoor and outdoor benchmarks.
- ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
  ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
- DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
  DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
- Shap-E: Generating Conditional 3D Implicit Functions
  Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.
- Human Interaction-Aware 3D Reconstruction from a Single Image
  HUG3D uses group-instance multi-view diffusion and physics-based optimization to create physically plausible 3D reconstructions of interacting people from a single image.
- Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints
  An unsupervised SfT approach using image observations and mesh inextensibility constraints reconstructs deforming 3D shapes 400x faster than prior unsupervised methods while handling severe occlusions better.
- A Systematic Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation
  A systematic literature survey that categorizes deep learning architectures for point cloud classification, part segmentation, and semantic segmentation, evaluates them on benchmarks, and discusses innovations, limitations, and future directions.
- Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation
  Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.
- HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
  HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claiming open-source SOTA performance.