super hub Mixed citations

Segment Anything

Alexander Kirillov, Chloe Rolland, Eric Mintun, Hanzi Mao, Laura Gustafson, Nikhila Ravi · 2023 · cs.CV · arXiv 2304.02643

Mixed citation behavior. Most common role is background (54%).

123 Pith papers citing it

Background 54% of classified citations

open full Pith review browse 123 citing papers more from Alexander Kirillov arXiv PDF

abstract

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 12 method 8 other 3 dataset 1

citation-polarity summary

background 13 use method 8 unclear 3

claims ledger

abstract We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releas

authors

Alexander Kirillov Chloe Rolland Eric Mintun Hanzi Mao Laura Gustafson Nikhila Ravi

co-cited works

representative citing papers

SelvaBox: A high-resolution dataset for tropical tree crown detection

cs.CV · 2025-06-30 · accept · novelty 8.0

SelvaBox is the largest open high-resolution dataset for tropical tree crown detection, with benchmarks showing that higher resolution improves accuracy and models trained on it generalize competitively to other unseen tropical datasets.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Vision Harnessing Agent for Open Ad-hoc Segmentation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.

Functionalization via Structure Completion and Motion Rectification

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.

MedCore: Boundary-Preserving Medical Core Pruning for MedSAM

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

MedCore achieves 60% parameter and 58.4% FLOP reduction on MedSAM with Dice 0.9549 and preserved boundary metrics via dual-intervention pruning and a new boundary leverage principle.

Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

cs.CV · 2026-05-05 · unverdicted · novelty 7.0

HeadsUp maps multi-view captures to UV-parameterized 3D Gaussians on a template via an encoder-decoder, achieving state-of-the-art quality and generalization after training on more than 10,000 subjects.

Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting

cs.CV · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

Text-guided class-agnostic counting models exhibit significant weaknesses in grounding textual prompts to visual objects, as demonstrated by new negative-label and distractor tests on a multi-category dataset.

IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

A contract-based multi-agent system maintains a claim-level semantic memory for long videos, enabling targeted corrections that raise VQA accuracy from 0.71 to 0.79 and cut human arbitration cost by 4.8x on VidOR.

A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

A progressive prompting framework on 3D SAM with text, dose-box, and click prompts plus small-target loss achieves reliable multi-task segmentation of osteoradionecrosis, cerebral edema, and cerebral radiation necrosis on a new limited-data dataset and outperforms prior methods.

Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection

cs.CV · 2026-04-13 · conditional · novelty 7.0

Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.

Off-the-shelf Vision Models Benefit Image Manipulation Localization

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

ReVi adapter enables off-the-shelf vision models to localize image manipulations by separating and enhancing manipulation cues from semantic features without full model retraining.

Training a Student Expert via Semi-Supervised Foundation Model Distillation

cs.CV · 2026-04-04 · conditional · novelty 7.0

A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.

TSegAgent: Zero-Shot Tooth Segmentation via Geometry-Aware Vision-Language Agents

cs.CV · 2026-03-20 · unverdicted · novelty 7.0

TSegAgent achieves accurate zero-shot tooth segmentation on 3D dental scans via geometry-aware vision-language reasoning without task-specific training.

OPTED: Open Preprocessed Trachoma Eye Dataset Using Zero-Shot SAM 3 Segmentation

cs.CV · 2026-03-06 · accept · novelty 7.0

OPTED is a publicly released preprocessed trachoma eye image dataset generated via zero-shot SAM 3 segmentation of the tarsal conjunctiva with an optimal text prompt and quality filtering.

PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning

cs.RO · 2026-02-23 · unverdicted · novelty 7.0

PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.

Descriptor: Parasitoid Wasps and Associated Hymenoptera Dataset (DAPWH)

cs.CV · 2026-02-20 · unverdicted · novelty 7.0

Releases the DAPWH dataset of 3556 wasp images including 1739 COCO-annotated examples to enable AI models for identifying Ichneumonoidea and associated families.

SAM 2++: Tracking Anything at Any Granularity

cs.CV · 2025-10-21 · conditional · novelty 7.0

SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.

ASTRA: Let Arbitrary Subjects Transform in Video Editing

cs.CV · 2025-10-01 · unverdicted · novelty 7.0

ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.

Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

cs.CV · 2025-07-02 · unverdicted · novelty 7.0

Presents Reason50K dataset and ReasonBrain framework for hypothetical instruction-based image editing that requires physical, temporal, causal, and story reasoning.

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

cs.AI · 2024-10-06 · unverdicted · novelty 7.0

PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

cs.RO · 2024-09-03 · conditional · novelty 7.0

ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.

Deep Time Series Models: A Comprehensive Survey and Benchmark

cs.LG · 2024-07-18 · unverdicted · novelty 7.0

This survey and benchmark of deep time series models using the released TSLib library finds that models with specific structures perform well only on distinct analysis tasks.

citing papers explorer

Showing 50 of 123 citing papers.

SelvaBox: A high-resolution dataset for tropical tree crown detection cs.CV · 2025-06-30 · accept · none · ref 41 · internal anchor
SelvaBox is the largest open high-resolution dataset for tropical tree crown detection, with benchmarks showing that higher resolution improves accuracy and models trained on it generalize competitively to other unseen tropical datasets.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution cs.CL · 2023-09-28 · unverdicted · none · ref 179 · internal anchor
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Vision Harnessing Agent for Open Ad-hoc Segmentation cs.CV · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
Functionalization via Structure Completion and Motion Rectification cs.CV · 2026-05-18 · unverdicted · none · ref 171 · internal anchor
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
MedCore: Boundary-Preserving Medical Core Pruning for MedSAM cs.CV · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
MedCore achieves 60% parameter and 58.4% FLOP reduction on MedSAM with Dice 0.9549 and preserved boundary metrics via dual-intervention pruning and a new boundary leverage principle.
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding cs.CV · 2026-05-08 · unverdicted · none · ref 10 · internal anchor
Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 36 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures cs.CV · 2026-05-05 · unverdicted · none · ref 35 · internal anchor
HeadsUp maps multi-view captures to UV-parameterized 3D Gaussians on a template via an encoder-decoder, achieving state-of-the-art quality and generalization after training on more than 10,000 subjects.
Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting cs.CV · 2026-05-04 · unverdicted · none · ref 35 · 2 links · internal anchor
Text-guided class-agnostic counting models exhibit significant weaknesses in grounding textual prompts to visual objects, as demonstrated by new negative-label and distractor tests on a multi-category dataset.
IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory cs.CV · 2026-04-22 · unverdicted · none · ref 56 · internal anchor
A contract-based multi-agent system maintains a claim-level semantic memory for long videos, enabling targeted corrections that raise VQA accuracy from 0.71 to 0.79 and cut human arbitration cost by 4.8x on VidOR.
A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings cs.CV · 2026-04-15 · unverdicted · none · ref 11 · internal anchor
A progressive prompting framework on 3D SAM with text, dose-box, and click prompts plus small-target loss achieves reliable multi-task segmentation of osteoradionecrosis, cerebral edema, and cerebral radiation necrosis on a new limited-data dataset and outperforms prior methods.
Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection cs.CV · 2026-04-13 · conditional · none · ref 35 · internal anchor
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
Off-the-shelf Vision Models Benefit Image Manipulation Localization cs.CV · 2026-04-10 · unverdicted · none · ref 18 · internal anchor
ReVi adapter enables off-the-shelf vision models to localize image manipulations by separating and enhancing manipulation cues from semantic features without full model retraining.
Training a Student Expert via Semi-Supervised Foundation Model Distillation cs.CV · 2026-04-04 · conditional · none · ref 24 · internal anchor
A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
TSegAgent: Zero-Shot Tooth Segmentation via Geometry-Aware Vision-Language Agents cs.CV · 2026-03-20 · unverdicted · none · ref 8 · internal anchor
TSegAgent achieves accurate zero-shot tooth segmentation on 3D dental scans via geometry-aware vision-language reasoning without task-specific training.
OPTED: Open Preprocessed Trachoma Eye Dataset Using Zero-Shot SAM 3 Segmentation cs.CV · 2026-03-06 · accept · none · ref 13 · internal anchor
OPTED is a publicly released preprocessed trachoma eye image dataset generated via zero-shot SAM 3 segmentation of the tarsal conjunctiva with an optimal text prompt and quality filtering.
PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning cs.RO · 2026-02-23 · unverdicted · none · ref 32 · internal anchor
PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.
Descriptor: Parasitoid Wasps and Associated Hymenoptera Dataset (DAPWH) cs.CV · 2026-02-20 · unverdicted · none · ref 31 · internal anchor
Releases the DAPWH dataset of 3556 wasp images including 1739 COCO-annotated examples to enable AI models for identifying Ichneumonoidea and associated families.
SAM 2++: Tracking Anything at Any Granularity cs.CV · 2025-10-21 · conditional · none · ref 37 · internal anchor
SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.
ASTRA: Let Arbitrary Subjects Transform in Video Editing cs.CV · 2025-10-01 · unverdicted · none · ref 14 · internal anchor
ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.
Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning cs.CV · 2025-07-02 · unverdicted · none · ref 8 · internal anchor
Presents Reason50K dataset and ReasonBrain framework for hypothetical instruction-based image editing that requires physical, temporal, causal, and story reasoning.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark cs.AI · 2024-10-06 · unverdicted · none · ref 22 · internal anchor
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation cs.RO · 2024-09-03 · conditional · none · ref 132 · internal anchor
ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.
Deep Time Series Models: A Comprehensive Survey and Benchmark cs.LG · 2024-07-18 · unverdicted · none · ref 198 · internal anchor
This survey and benchmark of deep time series models using the released TSLib library finds that models with specific structures perform well only on distinct analysis tasks.
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? cs.CV · 2024-03-21 · conditional · none · ref 29 · internal anchor
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
Project Aria: A New Tool for Egocentric Multi-Modal AI Research cs.HC · 2023-08-24 · accept · none · ref 11 · internal anchor
Project Aria presents a new wearable egocentric multi-modal recording device and software tools to accelerate AI research for augmented reality applications.
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension cs.CL · 2023-07-30 · unverdicted · none · ref 29 · internal anchor
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models cs.RO · 2023-07-12 · unverdicted · none · ref 118 · internal anchor
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
Objaverse-XL: A Universe of 10M+ 3D Objects cs.CV · 2023-07-11 · accept · none · ref 31 · internal anchor
Objaverse-XL supplies over 10 million diverse 3D objects that, when used to render 100 million views, improve zero-shot novel-view synthesis in models such as Zero123.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention cs.CV · 2023-03-28 · conditional · none · ref 279 · internal anchor
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
SR-Ground: Image Quality Grounding for Super-Resolved Content cs.CV · 2026-05-20 · unverdicted · none · ref 18 · internal anchor
The paper releases SR-Ground, a crowdsourced dataset for pixel-level segmentation of six artifact types in super-resolved images, and shows its use for training grounded IQA models and artifact-reducing fine-tuning.
RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting cs.CV · 2026-05-18 · unverdicted · none · ref 20 · internal anchor
RT-Splatting adds a disentangled occupancy-opacity factorization and specular-aware gradient gating to 3D Gaussian Splatting, enabling joint high-fidelity reflection and transmission in real-time novel view synthesis.
ASIP-Planner: Adaptive Planning for UAV Surface Inspection in Partially Known Indoor Environments cs.RO · 2026-05-11 · unverdicted · none · ref 26 · internal anchor
ASIP-Planner achieves near-complete surface coverage and shorter trajectories in partially known indoor environments by clustering inspection targets globally and adapting viewing angles locally to handle occlusions.
HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions cs.RO · 2026-05-11 · unverdicted · none · ref 24 · 2 links · internal anchor
A task-conditioned two-stage system decouples grasp localization from interaction trajectory planning using specialized foundation models to improve generalization across heterogeneous object types.
ClickSeg3D: Few-Click Interactive Segmentation via Semantic Embeddings cs.CV · 2026-05-09 · unverdicted · none · ref 25 · 2 links · internal anchor
ClickSeg3D uses a point Transformer encoder and hierarchical mask decoder with semantic embeddings to enable single-pass multi-object 3D interactive segmentation from sparse points, reporting over 20% mIoU gains versus baselines and 8-10% cross-dataset improvements with one click per instance.
Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping cs.CV · 2026-05-07 · conditional · none · ref 39 · internal anchor
Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.
YOTOnet: Zero-Shot Cross-Domain Fault Diagnosis via Domain-Conditioned Mixture of Experts cs.LG · 2026-05-06 · unverdicted · none · ref 8 · internal anchor
YOTOnet achieves improved zero-shot cross-domain fault diagnosis on bearing datasets by combining a physics-aware invariant feature distiller with domain-conditioned sparse experts, showing performance scaling as more training domains are added.
Approaching human parity in the quality of automated organoid image segmentation cs.CV · 2026-05-04 · conditional · none · ref 22 · internal anchor
A composite SAM-based method segments organoid images with accuracy matching or approaching inter-observer variability among human annotators.
Learning Equivariant Neural-Augmented Object Dynamics From Few Interactions cs.RO · 2026-05-04 · unverdicted · none · ref 17 · internal anchor
PIEGraph augments a spring-mass particle model with an equivariant GNN and novel action representation to predict accurate object dynamics for robotic manipulation from few interactions.
GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning cs.RO · 2026-04-28 · unverdicted · none · ref 26 · internal anchor
GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for consistent environments.
DiffuSAM: Diffusion-Based Prompt-Free SAM2 for Few-Shot and Source-Free Medical Image Segmentation cs.CV · 2026-04-27 · unverdicted · none · ref 7 · internal anchor
DiffuSAM synthesizes SAM2-compatible mask embeddings via a diffusion prior conditioned on prior slices to enable accurate prompt-free medical image segmentation under SF-UDA and few-shot settings.
AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents cs.HC · 2026-04-22 · unverdicted · none · ref 25 · internal anchor
AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.
SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces cs.RO · 2026-04-20 · unverdicted · none · ref 4 · internal anchor
SpaceDex achieves 63% success grasping unseen objects in tiered workspaces via VLM spatial planning and arm-hand feature separation, beating a 39% tabletop baseline in 100 real trials.
Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction cs.RO · 2026-04-18 · unverdicted · none · ref 7 · internal anchor
COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning due to gaps between vision and execution.
One-Shot Cross-Geometry Skill Transfer through Part Decomposition cs.RO · 2026-04-16 · unverdicted · none · ref 8 · internal anchor
Part decomposition with generative shape models allows one-shot robot skill transfer across unfamiliar object geometries in simulation and real settings.
From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation cs.CV · 2026-04-16 · unverdicted · none · ref 6 · internal anchor
Petro-SAM adapts SAM via a Merge Block for polarized views plus multi-scale fusion and color-entropy priors to jointly achieve grain-edge and lithology segmentation in petrographic images.
Granularity-Aware Transfer for Tree Instance Segmentation in Synthetic and Real Forests cs.CV · 2026-04-15 · unverdicted · none · ref 9 · internal anchor
Granularity-aware distillation improves tree instance segmentation accuracy on real forest images by merging logits and unifying masks from fine-grained synthetic teachers despite coarse real labels.
GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality cs.CV · 2026-04-14 · unverdicted · none · ref 17 · internal anchor
GTPBD-MM is the first multimodal benchmark for global terraced parcel extraction, integrating image, text, and DEM data with experiments showing that textual and terrain cues improve delineation accuracy over image-only approaches.
Self-supervised Pretraining of Cell Segmentation Models cs.CV · 2026-04-12 · unverdicted · none · ref 13 · internal anchor
DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.
Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents cs.CV · 2026-04-10 · unverdicted · none · ref 19 · internal anchor
Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.

Segment Anything

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer