OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and a minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene graph alignment, backed by a new 700k-sample ScanNet-SG dataset.
hub
Fast Segment Anything
20 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
LAGO achieves state-of-the-art zero-shot performance with fewer image regions by using class-agnostic object discovery followed by confidence-controlled language-guided refinement and dual-channel aggregation.
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
Boxes2Pixels distills noisy SAM pseudo-masks into a compact DINOv2-based student with auxiliary localization and one-sided self-correction, delivering +6.97 anomaly mIoU and +9.71 binary IoU gains over baselines on wind turbine data with 80% fewer parameters.
OmniOVCD uses SAM 3's decoupled outputs and an SFID strategy to achieve state-of-the-art IoU scores of 67.2, 66.5, 24.5, and 27.1 on four OVCD benchmarks, surpassing prior methods.
P2DNav proposes a three-part hierarchical framework (panorama-to-downview reasoning, sliding-window dialogue memory, and reflective reorientation) that reports large success-rate gains on the R2R-CE zero-shot VLN benchmark.
SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.
StateScribe uses a dual-layer memory architecture for episodic scenes and object-centric changes to deliver live and historical descriptions, achieving an F1 score of 83.1% across revisits in evaluations and user studies with BLV participants.
GRAIL autonomously grounds relational concepts in NeSy-RL by using LLM weak supervision followed by interaction-based refinement, matching or exceeding manually defined concepts on Atari games.
H-SPAM produces accurate, regular, and perfectly nested hierarchical superpixels that outperform prior hierarchical methods and match recent non-hierarchical state-of-the-art.
A deformable soft conical hand is modeled in physics simulation and its scooping trajectories are optimized via evolutionary search, enabling effective contact-rich granular tasks validated in both simulation and physical robot experiments.
AIM-CoT enhances interleaved multimodal chain-of-thought reasoning by adding context-enhanced attention generation, active visual probing via information foraging, and dynamic attention-shift triggering.
Terra produces a lightweight task-agnostic metric-semantic 3D scene graph for outdoor environments using terrain-aware place nodes and hierarchically organized regions.
TinySAM 2 reaches 90% of SAM 2.1 performance on DAVIS and SA-V using 7% of the memory tokens and 3% of the training data via frame selection, spatial average pooling, temporal similarity-based token pruning, and a RepViT image encoder.
FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable accurate semantic mapping.
A scale-robust lightweight CNN for glottis segmentation achieves 92.9% mDice at over 170 FPS with a 19 MB model size on three datasets.
Permutation-COMQ is a new post-training quantization algorithm that reorders weights within layers and uses only dot-product and rounding steps to deliver the highest reported accuracy for 2-, 4-, and 8-bit medical foundation models.
A survey that reviews efficient variants of the Segment Anything Model, categorizes acceleration strategies, and provides a unified hardware evaluation on benchmarks.
MobileSAM is a 60x smaller distilled version of SAM that matches original performance and runs 5x faster than concurrent FastSAM while supporting CPU inference.
Semantic-Fast-SAM matches prior SAM-based semantic segmentation accuracy on Cityscapes and ADE20K while running about 20 times faster by combining FastSAM with SSA labeling and CLIP for open-vocabulary cases.
citing papers explorer
LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment