Accelerating 3D Deep Learning with PyTorch3D
Mixed citation behavior. Most common role is method (62%).
abstract
Deep learning has significantly improved 2D image recognition. Extending into 3D may advance many new applications including autonomous vehicles, virtual and augmented reality, authoring 3D content, and even improving 2D recognition. However, despite growing interest, 3D deep learning remains relatively underexplored. We believe that some of this disparity is due to the engineering challenges involved in 3D deep learning, such as efficiently processing heterogeneous data and reframing graphics operations to be differentiable. We address these challenges by introducing PyTorch3D, a library of modular, efficient, and differentiable operators for 3D deep learning. It includes a fast, modular differentiable renderer for meshes and point clouds, enabling analysis-by-synthesis approaches. Compared with other differentiable renderers, PyTorch3D is more modular and efficient, allowing users to more easily extend it while also gracefully scaling to large meshes and images. We compare the PyTorch3D operators and renderer with other implementations and demonstrate significant speed and memory improvements. We also use PyTorch3D to improve the state-of-the-art for unsupervised 3D mesh and point cloud prediction from 2D images on ShapeNet. PyTorch3D is open-source and we hope it will help accelerate research in 3D deep learning.
representative citing papers
AimTrap is an end-to-end system using Adversarial Camouflage Textures (ACT) and Adversarial Honeypot Textures (AHT) synthesized via differentiable rendering to defend against and detect visual aimbots, with reported success rates of 85.1% and 96.9% and negligible overhead.
XPR is an extensible cross-platform framework for point-based differentiable rendering that decomposes the pipeline into modular operations compilable by XLA, demonstrated with 3DGS, 3DGUT and LinPrim in a few hundred lines of Python each.
Meschers are a new mesh representation for impossible geometric objects grounded in discrete exterior calculus that supports full discrete geometry processing including inverse rendering.
Human face perception aligns with neural networks trained on inverse-generative and naturalistic discriminative tasks, as these best predict human dissimilarity judgments on controversial and random face pairs.
Introduces the ProfileSynth dataset and a profile-specific FLAME 3DMM regression baseline with visibility-aware jawline regularization for 3D reconstruction from single lateral face images.
LEXIS-Flow uses VQ-VAE-learned interaction signatures to guide diffusion-based reconstruction of 3D human-object meshes and dense proximity fields from single RGB images, outperforming SOTA on benchmarks.
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.
A two-stage autoregressive framework centered on BoxMesh recovers parametric sewing patterns from 3D garment surfaces, claiming state-of-the-art results on benchmarks and generalization to real scans and single-view images.
AGILE generates complete object meshes via VLM-guided synthesis and tracks poses with anchor-and-track plus contact-aware optimization to achieve robust hand-object reconstruction from video.
UIKA is a feed-forward animatable Gaussian head model using UV-guided correspondence estimation and learnable UV tokens with dual-level attention, trained on large-scale synthetic data to handle pose-free inputs.
Objaverse-XL supplies over 10 million diverse 3D objects that, when used to render 100 million views, improve zero-shot novel-view synthesis in models such as Zero123.
Derives an optimal control-based variational optimization framework for test-time diffusion synchronization to enhance collaborative generation across modalities.
TextHOI-3D generates text-conditioned 3D hand-object meshes using a VQ token space and CLIP-conditioned autoregressive multi-view prediction followed by joint mesh optimization, reporting large reductions in object CD and penetration volume versus single-view baselines on HO3D-derived data.
ObjView-Bench disentangles omnidirectional self-occlusion, saturation difficulty, and set-cover planning difficulty, then shows that budget regimes and reachable-view constraints change planner rankings and failure modes across classical, learned, and hybrid methods.
A delighting network trained via Dataset Latent Modulation on heterogeneous OLAT and Light Stage data enables high-quality in-the-wild facial reflectance capture from video and produces the NeRSemble-Scan dataset.
A multimodal diffusion model trained on synthetic data enhances low-resolution EBSD and corrupted polarized light data, achieving near full-resolution performance with only 25% EBSD data.
TouchAnything reconstructs accurate 3D object geometries from only a few tactile contacts by optimizing for consistency with a pretrained visual diffusion prior.
A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.
Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.
MAGICIAN uses Imagined Gaussians from occupancy networks for efficient coverage gain computation in tree-search based long-horizon planning for active mapping, achieving SOTA results on indoor and outdoor benchmarks.
ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.
citing papers explorer
- Still Camouflage, Moving Illusion: View-Induced Trajectory Manipulation in Autonomous Driving
  Static adversarial camouflage exploits natural view-angle changes during relative motion to induce consistent feature drift in AV perception, leading to incorrect trajectory predictions and unnecessary braking.
- Shoot the Honey, Cloak the Player: Towards Zero-Runtime-Overhead Proactive Defense and Detection for Visual Game Cheating
  AimTrap is an end-to-end system using Adversarial Camouflage Textures (ACT) and Adversarial Honeypot Textures (AHT) synthesized via differentiable rendering to defend against and detect visual aimbots, with reported success rates of 85.1% and 96.9% and negligible overhead.
- XPR: An Extensible Cross-Platform Point-Based Differentiable Renderer
  XPR is an extensible cross-platform framework for point-based differentiable rendering that decomposes the pipeline into modular operations compilable by XLA, demonstrated with 3DGS, 3DGUT and LinPrim in a few hundred lines of Python each.
- Meschers: Geometry Processing of Impossible Objects
  Meschers are a new mesh representation for impossible geometric objects grounded in discrete exterior calculus that supports full discrete geometry processing including inverse rendering.
- Human face perception reflects inverse-generative and naturalistic discriminative objectives
  Human face perception aligns with neural networks trained on inverse-generative and naturalistic discriminative tasks, as these best predict human dissimilarity judgments on controversial and random face pairs.
- Profile-Specific 3DMM Regression from a Single Lateral Face Image
  Introduces the ProfileSynth dataset and a profile-specific FLAME 3DMM regression baseline with visibility-aware jawline regularization for 3D reconstruction from single lateral face images.
- LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image
  LEXIS-Flow uses VQ-VAE-learned interaction signatures to guide diffusion-based reconstruction of 3D human-object meshes and dense proximity fields from single RGB images, outperforming SOTA on benchmarks.
- Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
  A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.
- InverseDraping: Recovering Sewing Patterns from 3D Garment Surfaces via BoxMesh Bridging
  A two-stage autoregressive framework centered on BoxMesh recovers parametric sewing patterns from 3D garment surfaces, claiming state-of-the-art results on benchmarks and generalization to real scans and single-view images.
- AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation
  AGILE generates complete object meshes via VLM-guided synthesis and tracks poses with anchor-and-track plus contact-aware optimization to achieve robust hand-object reconstruction from video.
- Variational Test-time Optimization for Diffusion Synchronization
  Derives an optimal control-based variational optimization framework for test-time diffusion synchronization to enhance collaborative generation across modalities.
- TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization
  TextHOI-3D generates text-conditioned 3D hand-object meshes using a VQ token space and CLIP-conditioned autoregressive multi-view prediction followed by joint mesh optimization, reporting large reductions in object CD and penetration volume versus single-view baselines on HO3D-derived data.
- ObjView-Bench: Rethinking Difficulty and Deployment for Object-Centric View Planning
  ObjView-Bench disentangles omnidirectional self-occlusion, saturation difficulty, and set-cover planning difficulty, then shows that budget regimes and reachable-view constraints change planner rankings and failure modes across classical, learned, and hybrid methods.
- Learning a Delighting Prior for Facial Appearance Capture in the Wild
  A delighting network trained via Dataset Latent Modulation on heterogeneous OLAT and Light Stage data enables high-quality in-the-wild facial reflectance capture from video and produces the NeRSemble-Scan dataset.
- Multimodal Diffusion to Mutually Enhance Polarized Light and Low Resolution EBSD Data
  A multimodal diffusion model trained on synthetic data enhances low-resolution EBSD and corrupted polarized light data, achieving near full-resolution performance with only 25% EBSD data.
- TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches
  TouchAnything reconstructs accurate 3D object geometries from only a few tactile contacts by optimizing for consistency with a pretrained visual diffusion prior.
- Visually-grounded Humanoid Agents
  A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.
- Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas
  Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.
- MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
  MAGICIAN uses Imagined Gaussians from occupancy networks for efficient coverage gain computation in tree-search based long-horizon planning for active mapping, achieving SOTA results on indoor and outdoor benchmarks.
- ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
  ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
- Mirror Illusion Art
  AutoMIA is an automated pipeline for inverse design of 3D mirror illusion objects that jointly optimizes shape and color using four stabilization mechanisms.
- Human Interaction-Aware 3D Reconstruction from a Single Image
  HUG3D uses group-instance multi-view diffusion and physics-based optimization to create physically plausible 3D reconstructions of interacting people from a single image.
- Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints
  An unsupervised SfT approach using image observations and mesh inextensibility constraints reconstructs deforming 3D shapes 400x faster than prior unsupervised methods while handling severe occlusions better.
- Accelerated Likelihood Maximization for Diffusion-based Versatile Content Generation
  ALM integrates likelihood maximization and acceleration into diffusion reverse sampling to enable globally coherent generation from incomplete inputs.
- Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation
  Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.
- HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
  HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claiming open-source SOTA performance.
- A Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation
  A survey that categorizes deep learning models for point cloud tasks by backbone architecture, evaluates benchmark performance, and outlines challenges and future research directions.