{"work":{"id":"2bbf46ca-720a-45a1-8e9c-10c33fbeada0","openalex_id":null,"doi":null,"arxiv_id":"2304.02643","raw_key":null,"title":"Segment Anything","authors":null,"authors_text":"Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson","year":2023,"venue":"cs.CV","abstract":"We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.","external_url":"https://arxiv.org/abs/2304.02643","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-24T09:04:15.017608+00:00","pith_arxiv_id":"2304.02643","created_at":"2026-05-08T18:54:01.867645+00:00","updated_at":"2026-05-24T09:04:15.017608+00:00","title_quality_ok":false,"display_title":"Segment Anything","render_title":"Segment Anything"},"hub":{"state":{"work_id":"2bbf46ca-720a-45a1-8e9c-10c33fbeada0","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":123,"external_cited_by_count":null,"distinct_field_count":10,"first_pith_cited_at":"2023-03-28T17:59:12+00:00","last_pith_cited_at":"2026-05-20T14:33:13+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-25T03:45:21.924667+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":12},{"context_role":"method","n":8},{"context_role":"other","n":3},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"background","n":13},{"context_polarity":"use_method","n":8},{"context_polarity":"unclear","n":3}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Segment Anything","claims":[{"claim_text":"We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releas","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Figure 1: Grounded SAM can simultaneously detect and segment corresponding regions within images based on arbitrary text inputs provided by users. And it can seamlessly integrate with other Open-World models to accomplish more intricate visual tasks Abstract We introduce Grounded SAM , which uses Grounding DINO [38] as an open-set object detector to combine with the segment anything model (SAM) [ 25]. This integration enables the detection and segmentation of any regions based on arbitrary text ","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"For each LMP, we include 5-20 example queries and corresponding responses as part of the prompt. An example can be found in Fig. 2 (simplified for clarity). Full prompts are in Appendix. VLMs and Perception. Given an object/part query from LLMs, we first invoke open-vocab detector OWL-ViT [15] to obtain a bounding box, then feed it into Segment Anything [118] to obtain a mask, and finally track the mask using video tracker XMEM [119]. The tracked mask is used with RGB-D observation to reconstruc","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Scene generation methods target full rooms or environments rather than isolated objects. [61] introduces layout-driven factorization to separate structural layout from scene appear- ance in generation. Ctrl-Room[65] adds explicit layout constraints for text-to-3D room mesh 7 generation, while ControlRoom3D[215] uses semantic proxy rooms to provide controllable room synthesis with semantic structure. [116] focuses on indoor scene synthesis using recon- structed RGB-D cues, and iControl3D[126] pro","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Figure 3: Experiment tasks and visualization of optimization results. Seven tasks are designed to validate different aspects of our system, including in-the-wild specification with commonsense knowledge, multi-stage tasks with spatio-temporal dependencies, bimanual coordination with geometric awareness, and reactiveness when collaborating with humans and under disturbances. (SAM) [132]. For each mask j, we cluster the masked features Finterp[mj] using k-means with k = 5 with a cosine-similarity ","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"ities, typically including instruction following [5], [6], In- Context Learning (ICL) [7], and Chain of Thought (CoT) [8]. Although LLMs have demonstrated surprising zero/few- shot reasoning performance on most Natural Language Processing (NLP) tasks, they are inherently \"blind\" to vision since they can only understand discrete text. Concurrently, Large Vision Models (LVMs) can see clearly [9], [10], [11], [12], but commonly lag in reasoning. In light of this complementarity, LLM and LVM run tow","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"label-dependent and semantic-label-free. Semantic-label-dependent frameworks:These meth- ods require accurate, usually human-annotated semantic la- bel as well as explicit 3D supervision such as camera poses and depth maps as the supervision signals to train the frame- work. GARField [11] learns a scale-conditioned 3D affin- ity field by lifting multi-view SAM [12] masks via con- trastive learning. SAGA [2] and Gaussian Grouping [44] extend this framework to Gaussian primitives, where each 3D Ga","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Segment Anything because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (10 contexts).","role_counts":[{"n":10,"context_role":"background"},{"n":8,"context_role":"method"},{"n":2,"context_role":"other"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-20T20:02:18.715501+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"93a549ae-6113-4764-bd45-bc06e1f1b5dc","orcid":null,"display_name":"Alexander Kirillov"},{"id":"6288e03d-c28c-46ef-8300-bb22e5620929","orcid":null,"display_name":"Eric Mintun"},{"id":"1f88c20e-7ef8-4409-974f-ec1302ffc976","orcid":null,"display_name":"Nikhila Ravi"},{"id":"5771153d-99f7-49ad-99e7-cf8de413e9b1","orcid":null,"display_name":"Hanzi Mao"},{"id":"f7e5318b-b8a5-47aa-bdf8-2dfa1b1e0b41","orcid":null,"display_name":"Chloe Rolland"},{"id":"53e6dcef-d171-4472-9144-9892d29449bb","orcid":null,"display_name":"Laura Gustafson"}]},"error":null,"updated_at":"2026-05-20T20:02:19.583644+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T13:00:43.900160+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"SAM 2: Segment Anything in Images and Videos","work_id":"acc13f66-d814-44f9-9688-375688bf2d4a","shared_citers":14},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":8},{"title":"Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection","work_id":"3757dc8f-79d5-4beb-a03b-eb4c9a33427d","shared_citers":7},{"title":"SAM 3: Segment Anything with Concepts","work_id":"4a72a006-2592-4554-aad0-a9c41a9f952d","shared_citers":6},{"title":"U-Net: Convolutional Networks for Biomedical Image Segmentation","work_id":"5c6b13d6-e704-4bf4-9df7-3a3a4d3b6950","shared_citers":5},{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":4},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":4},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":4},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":4},{"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","shared_citers":4},{"title":"Emerging Properties in Self-Supervised Vision Transformers","work_id":"6b124bd1-c9f1-4251-96c1-2683f7f17a64","shared_citers":3},{"title":"Fast segment anything","work_id":"feed3d9f-cc9f-42db-90e6-e9ff051cef57","shared_citers":3},{"title":"Langsplat: 3d language gaussian splatting","work_id":"0c7a20d6-4a48-41eb-912b-9002a07212d9","shared_citers":3},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":3},{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":3},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":3},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":3},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":3},{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","shared_citers":3},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":2},{"title":"Conceptfusion: Open-set multimodal 3d mapping","work_id":"9190d1f8-03e9-478a-80a1-8d74bcec4ce9","shared_citers":2},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":2},{"title":"Depth pro: Sharp monocular metric depth in less than a second","work_id":"0b67883b-1901-45f1-9d58-1ef7a928df23","shared_citers":2},{"title":"Diffusion Policy: Visuomotor Policy Learning via Action Diffusion","work_id":"2dce18e6-f07a-4f57-8a81-e71c3e6a293c","shared_citers":2}],"time_series":[{"n":3,"year":2023},{"n":3,"year":2024},{"n":1,"year":2025},{"n":44,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T13:10:51.090119+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T13:00:51.380710+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Segment Anything","claims":[{"claim_text":"We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releas","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Figure 1: Grounded SAM can simultaneously detect and segment corresponding regions within images based on arbitrary text inputs provided by users. And it can seamlessly integrate with other Open-World models to accomplish more intricate visual tasks Abstract We introduce Grounded SAM , which uses Grounding DINO [38] as an open-set object detector to combine with the segment anything model (SAM) [ 25]. This integration enables the detection and segmentation of any regions based on arbitrary text ","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"For each LMP, we include 5-20 example queries and corresponding responses as part of the prompt. An example can be found in Fig. 2 (simplified for clarity). Full prompts are in Appendix. VLMs and Perception. Given an object/part query from LLMs, we first invoke open-vocab detector OWL-ViT [15] to obtain a bounding box, then feed it into Segment Anything [118] to obtain a mask, and finally track the mask using video tracker XMEM [119]. The tracked mask is used with RGB-D observation to reconstruc","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Scene generation methods target full rooms or environments rather than isolated objects. [61] introduces layout-driven factorization to separate structural layout from scene appear- ance in generation. Ctrl-Room[65] adds explicit layout constraints for text-to-3D room mesh 7 generation, while ControlRoom3D[215] uses semantic proxy rooms to provide controllable room synthesis with semantic structure. [116] focuses on indoor scene synthesis using recon- structed RGB-D cues, and iControl3D[126] pro","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Figure 3: Experiment tasks and visualization of optimization results. Seven tasks are designed to validate different aspects of our system, including in-the-wild specification with commonsense knowledge, multi-stage tasks with spatio-temporal dependencies, bimanual coordination with geometric awareness, and reactiveness when collaborating with humans and under disturbances. (SAM) [132]. For each mask j, we cluster the masked features Finterp[mj] using k-means with k = 5 with a cosine-similarity ","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"ities, typically including instruction following [5], [6], In- Context Learning (ICL) [7], and Chain of Thought (CoT) [8]. Although LLMs have demonstrated surprising zero/few- shot reasoning performance on most Natural Language Processing (NLP) tasks, they are inherently \"blind\" to vision since they can only understand discrete text. Concurrently, Large Vision Models (LVMs) can see clearly [9], [10], [11], [12], but commonly lag in reasoning. In light of this complementarity, LLM and LVM run tow","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"label-dependent and semantic-label-free. Semantic-label-dependent frameworks:These meth- ods require accurate, usually human-annotated semantic la- bel as well as explicit 3D supervision such as camera poses and depth maps as the supervision signals to train the frame- work. GARField [11] learns a scale-conditioned 3D affin- ity field by lifting multi-view SAM [12] masks via con- trastive learning. SAGA [2] and Gaussian Grouping [44] extend this framework to Gaussian primitives, where each 3D Ga","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Segment Anything because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (10 contexts).","role_counts":[{"n":10,"context_role":"background"},{"n":8,"context_role":"method"},{"n":2,"context_role":"other"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-20T20:02:18.719085+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Segment Anything","claims":[{"claim_text":"We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releas","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Segment Anything because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T13:10:54.291874+00:00"}},"summary":{"title":"Segment Anything","claims":[{"claim_text":"We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releas","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Segment Anything because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"SAM 2: Segment Anything in Images and Videos","work_id":"acc13f66-d814-44f9-9688-375688bf2d4a","shared_citers":14},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":8},{"title":"Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection","work_id":"3757dc8f-79d5-4beb-a03b-eb4c9a33427d","shared_citers":7},{"title":"SAM 3: Segment Anything with Concepts","work_id":"4a72a006-2592-4554-aad0-a9c41a9f952d","shared_citers":6},{"title":"U-Net: Convolutional Networks for Biomedical Image Segmentation","work_id":"5c6b13d6-e704-4bf4-9df7-3a3a4d3b6950","shared_citers":5},{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":4},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":4},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":4},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":4},{"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","shared_citers":4},{"title":"Emerging Properties in Self-Supervised Vision Transformers","work_id":"6b124bd1-c9f1-4251-96c1-2683f7f17a64","shared_citers":3},{"title":"Fast segment anything","work_id":"feed3d9f-cc9f-42db-90e6-e9ff051cef57","shared_citers":3},{"title":"Langsplat: 3d language gaussian splatting","work_id":"0c7a20d6-4a48-41eb-912b-9002a07212d9","shared_citers":3},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":3},{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":3},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":3},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":3},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":3},{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","shared_citers":3},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":2},{"title":"Conceptfusion: Open-set multimodal 3d mapping","work_id":"9190d1f8-03e9-478a-80a1-8d74bcec4ce9","shared_citers":2},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":2},{"title":"Depth pro: Sharp monocular metric depth in less than a second","work_id":"0b67883b-1901-45f1-9d58-1ef7a928df23","shared_citers":2},{"title":"Diffusion Policy: Visuomotor Policy Learning via Action Diffusion","work_id":"2dce18e6-f07a-4f57-8a81-e71c3e6a293c","shared_citers":2}],"time_series":[{"n":3,"year":2023},{"n":3,"year":2024},{"n":1,"year":2025},{"n":44,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"93a549ae-6113-4764-bd45-bc06e1f1b5dc","orcid":null,"display_name":"Alexander Kirillov","source":"manual","import_confidence":0.72},{"id":"f7e5318b-b8a5-47aa-bdf8-2dfa1b1e0b41","orcid":null,"display_name":"Chloe Rolland","source":"manual","import_confidence":0.72},{"id":"6288e03d-c28c-46ef-8300-bb22e5620929","orcid":null,"display_name":"Eric Mintun","source":"manual","import_confidence":0.72},{"id":"5771153d-99f7-49ad-99e7-cf8de413e9b1","orcid":null,"display_name":"Hanzi Mao","source":"manual","import_confidence":0.72},{"id":"53e6dcef-d171-4472-9144-9892d29449bb","orcid":null,"display_name":"Laura Gustafson","source":"manual","import_confidence":0.72},{"id":"1f88c20e-7ef8-4409-974f-ec1302ffc976","orcid":null,"display_name":"Nikhila Ravi","source":"manual","import_confidence":0.72}]}}