GaussLite conditions 3D Gaussian Splatting seeding density, gradient flow, and scaling on task-relevance masks derived from LLM-parsed natural language and open-vocabulary detection, yielding ROI PSNR gains of +2.72 dB on Replica and +2.23 dB on real hardware at a fixed budget.
arXiv preprint arXiv:2306.12156 (2023) 31
33 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene graph alignment, backed by a new 700k-sample ScanNet-SG dataset.
LAGO achieves state-of-the-art zero-shot performance with fewer image regions by using class-agnostic object discovery followed by confidence-controlled language-guided refinement and dual-channel aggregation.
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
Boxes2Pixels distills noisy SAM pseudo-masks into a compact DINOv2-based student with auxiliary localization and one-sided self-correction, delivering +6.97 anomaly mIoU and +9.71 binary IoU gains over baselines on wind turbine data with 80% fewer parameters.
OmniOVCD uses SAM 3's decoupled outputs and an SFID strategy to achieve state-of-the-art IoU scores of 67.2, 66.5, 24.5, and 27.1 on four OVCD benchmarks, surpassing prior methods.
SegFS is a dual-path architecture that uses sparse keyframe open-vocabulary predictions to condition a fast feature-space network for efficient temporal instance segmentation in videos.
MV-GEL localizes fine-grained geometric entities on 3D meshes from natural language by ranking informative views with GELviews, applying VLM segmentation, and lifting masks via geometry-aware ray casting, reporting up to 1.7x face IoU and 4.5x edge F1 gains over baselines.
FAT decomposes structured prediction into specialist hypothesis generation and foundation-model proxy reasoning, yielding consistent gains over baselines on detection, trajectory, and segmentation tasks.
Presents the MMIOC-1M benchmark, with 1M+ samples across 14 super-categories, and RTVPNet, which combines domain projection, sparse sampling, and bidirectional interaction, claiming SOTA on MMIOC-1M, LVIS, and COCO.
Meridian matches metric-semantic primitives across aerial and ground views for training-free global localization in diverse natural environments, reporting 2.4 m average trajectory error over 19 km.
COTRATE is an online self-supervised framework that uses proprioceptive terrain assessment to supervise visual traversability estimation with alignment loss and diversity-aware replay for continual robot-agnostic learning.
Introduces Embodied Tool Protocol and tool externalization to improve embodied AI performance on perception and cognition tasks, with measured gains but limits on execution capabilities.
InstructSAM uses learnable queries in a VLM to condition SAM3 for single-pass multi-instance segmentation from arbitrary instructions, with a new Inst2Seg benchmark.
RepSAM applies CKA-guided rank allocation in PEFT plus multi-modal fusion to adapt SAM, reaching 97.9% of full fine-tuning mIoU with 158x fewer parameters on robotic benchmarks.
P2DNav proposes a three-part hierarchical framework (panorama-to-downview reasoning, sliding-window dialogue memory, and reflective reorientation) that reports large success-rate gains on the R2R-CE zero-shot VLN benchmark.
SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.
StateScribe uses a dual-layer memory architecture for episodic scenes and object-centric changes to deliver live and historical descriptions, achieving an 83.1% F1 score across revisits in evaluations and user studies with BLV participants.
GRAIL autonomously grounds relational concepts in NeSy-RL by using LLM weak supervision followed by interaction-based refinement, matching or exceeding manually defined concepts on Atari games.
H-SPAM produces accurate, regular, and perfectly nested hierarchical superpixels that outperform prior hierarchical methods and match recent non-hierarchical state-of-the-art.
A deformable soft conical hand is modeled in physics simulation and its scooping trajectories are optimized via evolutionary search, enabling effective contact-rich granular tasks validated in both simulation and physical robot experiments.
AIM-CoT enhances interleaved multimodal chain-of-thought reasoning by adding context-enhanced attention generation, active visual probing via information foraging, and dynamic attention-shift triggering.
Terra produces a lightweight task-agnostic metric-semantic 3D scene graph for outdoor environments using terrain-aware place nodes and hierarchically organized regions.
CucumberVision compares five 3D length methods on 48 RGB-D captures of seven cucumbers and shows a novel medial-axis cubic spline with trapezoidal integration achieves the lowest 4.13% MAPE, outperforming baselines at corrected significance.
citing papers explorer
A Real-time Scale-robust Network for Glottis Segmentation in Nasal Transnasal Intubation
A scale-robust lightweight CNN for glottis segmentation achieves 92.9% mDice at over 170 FPS with a 19 MB model size on three datasets.