SAM3D: Segment Anything in 3D Scenes
11 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 11
citation roles: background 1
citation polarities: background 1
citing papers explorer
- 3AM: 3egment Anything with Geometric Consistency in Videos
  3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.
- Scenes as Objects, Not Primitives: Instance-Structured 3D Tokenization from Unposed Views
  A feed-forward framework learns instance-structured 3D token groups from unposed multi-view images via differentiable rendering, enabling native object-level segmentation, editing, and retrieval without 3D supervision.
- Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting
  Ilov3Splat learns view-consistent CLIP and instance feature fields on 3D Gaussians to support open-vocabulary object selection and segmentation without category labels.
- PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion
  PanoSAMic modifies SAM with multi-stage feature encoding, spatio-modal fusion, spherical attention, and dual-view fusion to achieve SOTA panoramic semantic segmentation on public RGB and RGB-D datasets.
- ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding
  ShelfGaussian achieves state-of-the-art zero-shot semantic occupancy prediction on Occ3D-nuScenes by jointly supervising Gaussian representations with vision foundation model features at 2D image and 3D scene levels.
- ESAM++: Efficient Online 3D Perception on the Edge
  ESAM++ introduces a 3D Sparse Feature Pyramid Network for efficient online 3D scene perception on edge devices, claiming competitive accuracy with up to 3x faster inference and 2x smaller model size than ESAM on four benchmarks.
- AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models
  AgentGrounder performs zero-shot 3D visual grounding on colored point clouds via an offline object lookup table and an online agent that selectively retrieves, scores geometrically, and renders images on demand, reporting gains over SeeGround on ScanRefer and Nr3D.
- CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model
  CAR-SAM introduces MatMul-Aware Compensation and Joint Cross-Attention Reconstruction to enable stable 4-bit post-training quantization of SAM, outperforming prior PTQ methods by 14.6% mAP on SAM-B and 6.6% on SAM-L.
- Distill, Diffuse, and Semanticize (DDS): Annotation-Free 3D Scene Understanding Based on Multi-Granularity Distillation and Graph-Diffusion-Based Segmentation
  DDS combines multi-granularity distillation from projected 2D features with graph diffusion on superpoints to deliver region-consistent semantic labels for 3D scenes without any dense annotations.
- MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
  MV3DIS uses 3D-guided mask matching and depth consistency to produce more consistent multi-view 2D masks that refine into accurate zero-shot 3D instances.
- GraspSense: Physically Grounded Grasp and Grip Planning for a Dexterous Robotic Hand via Language-Guided Perception and Force Maps
  GraspSense computes force maps from object geometry to select mechanically safe grasp regions and regulate grip forces for dexterous hands.