{"total":20,"items":[{"citing_arxiv_id":"2605.19634","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation","primary_cat":"cs.CV","submitted_at":"2026-05-19T10:18:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"P2DNav proposes a three-part hierarchical framework (panorama-to-downview reasoning, sliding-window dialogue memory, and reflective reorientation) that reports large success-rate gains on the R2R-CE zero-shot VLN benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18013","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model","primary_cat":"cs.CV","submitted_at":"2026-05-18T08:05:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TinySAM 2 reaches 90% of SAM 2.1 performance on DAVIS and SA-V using 7% of the memory tokens and 3% of the training data via frame selection, spatial average pooling, temporal similarity-based token pruning, and a RepViT image encoder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17633","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SparseSAM: Structured Sparsification of Activations in Segment Anything Models","primary_cat":"cs.CV","submitted_at":"2026-05-17T19:54:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency 
MLP.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10484","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenSGA: Efficient 3D Scene Graph Alignment in the Open World","primary_cat":"cs.CV","submitted_at":"2026-05-11T12:44:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene graph alignment, backed by a new 700k-sample ScanNet-SG dataset.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To train the global scene embedding in 4.2, we rely on a triplet margin contrastive loss with negative hard-mining. Given a processed batch, we select the positive pair and the hardest negative scan, the one that is closest in embedding space to $c^{a}_{\\mathrm{CLS}}$. To train our samples, we use the following loss function $d_{ap} = \\lVert c^{a}_{\\mathrm{CLS}} - c^{p}_{\\mathrm{CLS}} \\rVert_2$, $d_{an} = \\lVert c^{a}_{\\mathrm{CLS}} - c^{n}_{\\mathrm{CLS}} \\rVert_2$ (25) $L_{\\mathrm{triplet}} = \\max(d_{ap} - d_{an} + m, 0)$ (26) where $c^{a}_{\\mathrm{CLS}}$ is the global embedding associated to the graph from the image, $c^{p}_{\\mathrm{CLS}}$ is the embedding of the associated scan graph and $c^{n}_{\\mathrm{CLS}}$ is the embedding of the mined graph. 
5 Scene Graph Building and Dataset Construction This section first introduces our pipeline used to construct 3D scene graphs with the object features described in Section 3."},{"citing_arxiv_id":"2605.03669","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers","primary_cat":"cs.RO","submitted_at":"2026-05-05T12:08:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable accurate semantic mapping.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08156","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment","primary_cat":"cs.CV","submitted_at":"2026-05-04T09:07:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LAGO achieves state-of-the-art zero-shot performance with fewer image regions by using class-agnostic object discovery followed by confidence-controlled language-guided refinement and dual-channel aggregation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"introduced carefully to avoid the prediction loop. This motivates our confidence-aware two-stage design, which delays semantic conditioning until a stable visual initialization has been established. 3 Method As shown in Figure 2, the pipeline consists of four stages. In the Preprocessing stage (§3.1), LAGO extracts full-image features, generates object proposals with FastSAM [40], and encodes LLM-generated class descriptions as text features. 
The Visual-Only Diverse Search stage (§3.2) implements class-agnostic object-centric initialization, constructing a compact and diverse set of proposal-centered visual regions. The Ensemble Prediction & Confidence Estimation and Adaptive Text-guided Refinement blocks implement confidence-aware two-stage region discovery (§3."},{"citing_arxiv_id":"2604.27383","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Real-time Scale-robust Network for Glottis Segmentation in Nasal Transnasal Intubation","primary_cat":"eess.IV","submitted_at":"2026-04-30T03:51:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A scale-robust lightweight CNN for glottis segmentation achieves 92.9% mDice at over 170 FPS with a 19 MB model size on three datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23749","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StateScribe: Towards Accessible Change Awareness Across Real-World Revisits","primary_cat":"cs.HC","submitted_at":"2026-04-26T14:57:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StateScribe uses a dual-layer memory architecture for episodic scenes and object-centric changes to deliver live and historical descriptions, achieving 83.1% F1 accuracy across revisits in evaluations and user studies with BLV participants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20169","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Semantic-Fast-SAM: Efficient Semantic 
Segmenter","primary_cat":"cs.CV","submitted_at":"2026-04-22T04:18:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Semantic-Fast-SAM matches prior SAM-based semantic segmentation accuracy on Cityscapes and ADE20K while running about 20 times faster by combining FastSAM with SSA labeling and CLIP for open-vocabulary cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16871","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GRAIL: Autonomous Concept Grounding for Neuro-Symbolic Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-18T06:41:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRAIL autonomously grounds relational concepts in NeSy-RL by using LLM weak supervision followed by interaction-based refinement, matching or exceeding manually defined concepts on Atari games.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11231","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection","primary_cat":"cs.CV","submitted_at":"2026-04-13T09:35:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on 
SECOND.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11218","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"H-SPAM: Hierarchical Superpixel Anything Model","primary_cat":"cs.CV","submitted_at":"2026-04-13T09:19:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"H-SPAM produces accurate, regular, and perfectly nested hierarchical superpixels that outperform prior hierarchical methods and match recent non-hierarchical state-of-the-art.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11162","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks","primary_cat":"cs.CV","submitted_at":"2026-04-13T08:25:45+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Boxes2Pixels distills noisy SAM pseudo-masks into a compact DINOv2-based student with auxiliary localization and one-sided self-correction, delivering +6.97 anomaly mIoU and +9.71 binary IoU gains over baselines on wind turbine data with 80% fewer parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07674","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Weight Group-wise Post-Training Quantization for Medical Foundation Model","primary_cat":"cs.CV","submitted_at":"2026-04-09T00:34:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Permutation-COMQ is a new post-training quantization algorithm that reorders weights within layers and uses only dot-product and rounding steps to deliver the highest 
reported accuracy for 2-, 4-, and 8-bit medical foundation models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05531","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Simulation-Driven Evolutionary Motion Parameterization for Contact-Rich Granular Scooping with a Soft Conical Robotic Hand","primary_cat":"cs.RO","submitted_at":"2026-04-07T07:30:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A deformable soft conical hand is modeled in physics simulation and its scooping trajectories are optimized via evolutionary search, enabling effective contact-rich granular tasks validated in both simulation and physical robot experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.13895","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3","primary_cat":"cs.CV","submitted_at":"2026-01-20T12:25:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OmniOVCD uses SAM 3's decoupled outputs and an SFID strategy to achieve state-of-the-art IoU scores of 67.2, 66.5, 24.5, and 27.1 on four OVCD benchmarks, surpassing prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.25699","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AIM-CoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning","primary_cat":"cs.CV","submitted_at":"2025-09-30T02:57:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIM-CoT enhances interleaved 
multimodal chain-of-thought reasoning by adding context-enhanced attention generation, active visual probing via information foraging, and dynamic attention-shift triggering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.19579","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Terra: Hierarchical Terrain-Aware 3D Scene Graph for Task-Agnostic Outdoor Mapping","primary_cat":"cs.RO","submitted_at":"2025-09-23T21:05:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Terra produces a lightweight task-agnostic metric-semantic 3D scene graph for outdoor environments using terrain-aware place nodes and hierarchically organized regions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.04960","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On Efficient Variants of Segment Anything Model: A Survey","primary_cat":"cs.CV","submitted_at":"2024-10-07T11:59:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"fine structures, leading to imprecise boundaries; and 2) it is not real-time and remains resource-intensive, particularly when using heavy image encoders like ViT-H. To address these issues, some works aim to improve mask quality by utilizing high-resolution images [104, 105], integrating images and prompts [106], or introducing optimized prompts [107, 108], while others [41, 43, 46, 109] focus on creating more efficient architectures to reduce SAM's time and resource consumption. Some recent works, such as [110], have also started to find a better balance between model's accuracy and efficiency. 
Previous surveys [20, 31, 85] have explored recent advances in enhancing SAM for higher-quality results. In this survey, we focus"},{"citing_arxiv_id":"2306.14289","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Faster Segment Anything: Towards Lightweight SAM for Mobile Applications","primary_cat":"cs.CV","submitted_at":"2023-06-25T16:37:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MobileSAM is a 60x smaller distilled version of SAM that matches original performance and runs 5x faster than concurrent FastSAM while supporting CPU inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}