Segment Anything
Pith reviewed 2026-05-11 06:08 UTC · model grok-4.3
The pith
A promptable model trained on a billion masks enables zero-shot segmentation that often matches supervised results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date, with over 1 billion masks on 11M images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results.
What carries the argument
The promptable Segment Anything Model (SAM) that takes user-provided prompts such as points, boxes, or coarse masks and outputs object segmentation masks.
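Concretely, the released segment_anything package exposes this interface through a predictor that embeds an image once and then accepts point, box, or mask prompts. A minimal sketch follows, assuming the public package API; the checkpoint and image paths are placeholders and signatures may differ across versions.

```python
# Minimal promptable-inference sketch with the released
# `segment_anything` package; checkpoint and image paths are
# placeholders, and signatures may differ across versions.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # compute the image embedding once

masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # one foreground click
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    box=np.array([100, 80, 540, 400]),    # optional coarse box (XYXY)
    multimask_output=True,                # 3 candidates for ambiguous prompts
)
best_mask = masks[np.argmax(scores)]      # HxW boolean mask
```

Because the image embedding is computed once, subsequent prompts on the same image are cheap, which is what makes interactive use practical.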
If this is right
- The model can be applied directly to new image types and tasks without collecting new labeled data or retraining.
- Prompt-based interaction becomes a practical way to guide segmentation on arbitrary images.
- The released dataset supports training or fine-tuning of additional vision models at large scale (see the loading sketch after this list).
- Releasing both the model and data lowers the barrier for researchers to experiment with promptable segmentation.
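The released SA-1B annotations are documented as per-image JSON records whose masks use COCO run-length encoding, decodable with pycocotools. Below is a minimal loading sketch; the file path is a placeholder, and the field names follow the dataset card, so verify them against the version you download.

```python
# Sketch of reading one SA-1B record. Field names follow the dataset
# documentation (an "annotations" list with COCO-RLE "segmentation"
# entries); the file path is a placeholder.
import json
import numpy as np
from pycocotools import mask as mask_utils

with open("sa_000000/sa_1.json") as f:  # placeholder path
    record = json.load(f)

masks = [
    mask_utils.decode(ann["segmentation"])  # HxW uint8 binary mask
    for ann in record["annotations"]
]
print(f"{len(masks)} masks; first mask covers {int(masks[0].sum())} pixels")
```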
Where Pith is reading between the lines
- The model-in-the-loop data collection process may offer a template for building large datasets in other vision domains where annotation is expensive.
- Promptable architectures could extend beyond segmentation to tasks like detection or editing with similar zero-shot benefits.
- Interactive tools built on this model might reduce the need for per-task model training in applied settings such as medical imaging or content creation.
- Performance on video sequences or 3D data would test whether the promptable property holds across temporal and spatial dimensions.
Load-bearing premise
That the zero-shot results reflect genuine generalization to new distributions rather than overfitting to the self-collected data or the evaluation tasks.
What would settle it
A controlled test on a fresh image domain and segmentation task where the model's zero-shot accuracy falls clearly below a model trained with full supervision on that same task.
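One concrete form of that test, sketched below under stated assumptions: score the zero-shot model and a fully supervised model per image on the fresh domain, then bootstrap the paired IoU gap. The score arrays are synthetic stand-ins, not results from the paper; the premise fails on that domain if the confidence interval sits entirely below zero.

```python
# Paired bootstrap on the per-image IoU gap between a zero-shot
# promptable model and a fully supervised baseline. The score arrays
# are synthetic stand-ins, not results from the paper.
import numpy as np

rng = np.random.default_rng(0)
zero_shot_iou = rng.uniform(0.50, 0.90, size=200)   # stand-in scores
supervised_iou = rng.uniform(0.55, 0.92, size=200)  # stand-in scores

def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05):
    """Confidence interval for mean(a - b) over resampled image sets."""
    diff = np.asarray(a) - np.asarray(b)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diff.mean(), (lo, hi)

gap, (lo, hi) = paired_bootstrap_ci(zero_shot_iou, supervised_iou)
print(f"mean IoU gap {gap:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
# hi < 0 would mean zero-shot accuracy falls clearly below supervised.
```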
Original abstract
We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Segment Anything (SA) project, comprising a new promptable segmentation task, the Segment Anything Model (SAM), and the SA-1B dataset of over 1 billion masks on 11 million images. The dataset is constructed via a multi-stage data engine that uses an efficient version of the model itself to propose, refine, and automatically generate masks. The central claim is that the resulting promptable model transfers zero-shot to new image distributions and tasks, with performance that is often competitive with or superior to prior fully supervised methods; the model and dataset are released publicly.
Significance. If the zero-shot generalization results are robust, this work would mark a notable advance toward foundation models in computer vision by demonstrating a single promptable model that can handle diverse segmentation tasks without task-specific training or fine-tuning. The unprecedented scale of SA-1B and the open release of both model and data constitute clear strengths that could enable substantial follow-on research.
Major comments (2)
- Data Engine section: The staged collection process (particularly stages 2 and 3) invokes the model to generate and refine masks, so the training distribution is shaped by the model's own inductive biases. This creates a circularity risk that could undermine the zero-shot claim; the manuscript must supply a concrete analysis (e.g., ablation on held-out image sources or comparison of mask statistics before/after the automatic stages) showing that reported gains on external benchmarks are not artifacts of distribution alignment.
- Experiments section: The claim that zero-shot performance is 'often competitive with or even superior to prior fully supervised results' is load-bearing, yet the manuscript provides limited detail on exact metrics, chosen baselines, error bars, and statistical tests across the evaluated tasks. Full per-task tables with variance estimates and explicit comparison protocols are required to substantiate the central performance assertion.
Minor comments (2)
- Abstract: The phrase 'numerous tasks' is vague; a short parenthetical list of the primary evaluation benchmarks would improve immediate clarity.
- Model section (notation): The distinction between the 'efficient' model used in the data engine and the final SAM should be introduced with explicit symbols or subsection headings to avoid reader confusion in later sections.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the data engine and experimental reporting. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.
Point-by-point responses
-
Referee: Data Engine section: The staged collection process (particularly stages 2 and 3) invokes the model to generate and refine masks, so the training distribution is shaped by the model's own inductive biases. This creates a circularity risk that could undermine the zero-shot claim; the manuscript must supply a concrete analysis (e.g., ablation on held-out image sources or comparison of mask statistics before/after the automatic stages) showing that reported gains on external benchmarks are not artifacts of distribution alignment.
Authors: We acknowledge the potential circularity concern arising from the model's role in stages 2 and 3 of the data engine. However, the zero-shot evaluations use entirely external benchmarks and image distributions that were never seen during data collection or training. To directly address this, we will add a new analysis subsection in the revised manuscript that includes: (1) mask statistic comparisons (e.g., size, complexity, and diversity metrics) before and after the automatic stages, and (2) performance ablations on held-out image sources excluded from the data engine. These additions will demonstrate that the reported zero-shot gains on external tasks are not artifacts of distribution alignment; a sketch of such a statistics comparison follows these responses. Revision: yes.
-
Referee: Experiments section: The claim that zero-shot performance is 'often competitive with or even superior to prior fully supervised results' is load-bearing, yet the manuscript provides limited detail on exact metrics, chosen baselines, error bars, and statistical tests across the evaluated tasks. Full per-task tables with variance estimates and explicit comparison protocols are required to substantiate the central performance assertion.
Authors: We agree that the central performance claim requires more granular reporting for full substantiation. In the revised manuscript, we will expand the Experiments section with complete per-task tables that include exact metrics for all evaluated tasks, the specific baselines used, error bars or variance estimates (from multiple seeds or cross-validation where feasible), and results of statistical significance tests. We will also add explicit descriptions of the comparison protocols, including how prompts were generated and how zero-shot transfer was measured against fully supervised methods. Revision: yes.
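To make the first response concrete, here is a minimal sketch of a before/after mask-statistics comparison. The isoperimetric complexity measure and the toy masks are illustrative assumptions, not the authors' chosen metrics.

```python
# Sketch of comparing mask statistics between data-engine stages.
# Masks are HxW boolean arrays; complexity is perimeter^2 / area
# (an isoperimetric ratio), one plausible choice among many.
import numpy as np
import cv2

def mask_stats(mask):
    m = mask.astype(np.uint8)
    area = int(m.sum())
    contours, _ = cv2.findContours(m, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    perimeter = sum(cv2.arcLength(c, closed=True) for c in contours)
    return area, perimeter**2 / max(area, 1)

def summarize(masks):
    areas, cplx = zip(*(mask_stats(m) for m in masks))
    return {"median_area": float(np.median(areas)),
            "median_complexity": float(np.median(cplx))}

rng = np.random.default_rng(0)
manual_stage = [rng.random((64, 64)) > 0.6 for _ in range(8)]  # stand-ins
auto_stage = [rng.random((64, 64)) > 0.7 for _ in range(8)]    # stand-ins
print("manual:", summarize(manual_stage))
print("auto:  ", summarize(auto_stage))
```

A large shift in these summaries after the automatic stages would be evidence that the model's inductive biases are shaping the training distribution.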
Circularity Check
No circularity: empirical zero-shot evaluations remain independent of the data engine
Full rationale
The paper's core claim is empirical: a promptable model trained on SA-1B achieves competitive zero-shot results on external tasks and image distributions. The data engine (staged collection using an efficient model variant) is a practical annotation procedure whose outputs are then used for supervised training; the reported evaluations use separate benchmarks whose ground truth and image sources are not generated by the same loop. No equation, uniqueness theorem, or prediction reduces by construction to a fitted parameter or self-generated input. Self-citations are absent from the load-bearing steps, and the architecture's promptability is justified by design and training rather than by renaming prior results.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: standard deep learning assumptions on generalization from large-scale training data
Forward citations
Cited by 58 Pith papers
-
Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting
Current CAC models often count the wrong objects because they misalign text prompts with visual content, as demonstrated by new negative-label and distractor tests on the MUCCA dataset.
-
MedCore: Boundary-Preserving Medical Core Pruning for MedSAM
MedCore achieves 60% parameter and 58.4% FLOP reduction on MedSAM with Dice 0.9549 and preserved boundary metrics via dual-intervention pruning and a new boundary leverage principle.
-
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures
HeadsUp maps multi-view captures to UV-parameterized 3D Gaussians on a template via an encoder-decoder, achieving state-of-the-art quality and generalization after training on more than 10,000 subjects.
-
Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting
Text-guided class-agnostic counting models exhibit significant weaknesses in grounding textual prompts to visual objects, as demonstrated by new negative-label and distractor tests on a multi-category dataset.
-
IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory
A contract-based multi-agent system maintains a claim-level semantic memory for long videos, enabling targeted corrections that raise VQA accuracy from 0.71 to 0.79 and cut human arbitration cost by 4.8x on VidOR.
-
A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings
A progressive prompting framework on 3D SAM with text, dose-box, and click prompts plus small-target loss achieves reliable multi-task segmentation of osteoradionecrosis, cerebral edema, and cerebral radiation necrosi...
-
Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
-
Off-the-shelf Vision Models Benefit Image Manipulation Localization
ReVi adapter enables off-the-shelf vision models to localize image manipulations by separating and enhancing manipulation cues from semantic features without full model retraining.
-
Training a Student Expert via Semi-Supervised Foundation Model Distillation
A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
ASIP-Planner: Adaptive Planning for UAV Surface Inspection in Partially Known Indoor Environments
ASIP-Planner achieves near-complete surface coverage and shorter trajectories in partially known indoor environments by clustering inspection targets globally and adapting viewing angles locally to handle occlusions.
-
HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions
HeteroGenManip decouples grasp localization from interaction planning using task-conditioned foundation models and multi-model diffusion policies, delivering 31% average gains in broad simulation tasks and 36.7% in fo...
-
HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions
A task-conditioned two-stage system decouples grasp localization from interaction trajectory planning using specialized foundation models to improve generalization across heterogeneous object types.
-
Few-Click-Driven Interactive 3D Segmentation with Semantic Embedding
A point-Transformer interactive 3D instance segmentation model handles multiple clicks jointly in one pass and reports over 20% mIoU gains versus baselines plus 8-10% cross-dataset improvement for one-click-per-instan...
-
Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping
Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.
-
YOTOnet: Zero-Shot Cross-Domain Fault Diagnosis via Domain-Conditioned Mixture of Experts
YOTOnet achieves improved zero-shot cross-domain fault diagnosis on bearing datasets by combining a physics-aware invariant feature distiller with domain-conditioned sparse experts, showing performance scaling as more...
-
Approaching human parity in the quality of automated organoid image segmentation
A composite SAM-based method segments organoid images with accuracy matching or approaching inter-observer variability among human annotators.
-
Learning Equivariant Neural-Augmented Object Dynamics From Few Interactions
PIEGraph augments a spring-mass particle model with an equivariant GNN and novel action representation to predict accurate object dynamics for robotic manipulation from few interactions.
-
GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning
GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for ...
-
DiffuSAM: Diffusion-Based Prompt-Free SAM2 for Few-Shot and Source-Free Medical Image Segmentation
DiffuSAM synthesizes SAM2-compatible mask embeddings via a diffusion prior conditioned on prior slices to enable accurate prompt-free medical image segmentation under SF-UDA and few-shot settings.
-
AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents
AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.
-
SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces
SpaceDex achieves 63% success grasping unseen objects in tiered workspaces via VLM spatial planning and arm-hand feature separation, beating a 39% tabletop baseline in 100 real trials.
-
Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction
COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning du...
-
One-Shot Cross-Geometry Skill Transfer through Part Decomposition
Part decomposition with generative shape models allows one-shot robot skill transfer across unfamiliar object geometries in simulation and real settings.
-
From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation
Petro-SAM adapts SAM via a Merge Block for polarized views plus multi-scale fusion and color-entropy priors to jointly achieve grain-edge and lithology segmentation in petrographic images.
-
Granularity-Aware Transfer for Tree Instance Segmentation in Synthetic and Real Forests
Granularity-aware distillation improves tree instance segmentation accuracy on real forest images by merging logits and unifying masks from fine-grained synthetic teachers despite coarse real labels.
-
GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality
GTPBD-MM is the first multimodal benchmark for global terraced parcel extraction, integrating image, text, and DEM data with experiments showing that textual and terrain cues improve delineation accuracy over image-on...
-
Self-supervised Pretraining of Cell Segmentation Models
DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.
-
Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents
Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.
-
GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy
GESS introduces joint semantic-normal and depth stability prediction heads, the SDAK keypoint mechanism, and the UTCF descriptor fusion module to leverage multi-cue synergy for improved robustness and discriminability.
-
Moondream Segmentation: From Words to Masks
Moondream Segmentation achieves 80.2% cIoU on RefCOCO by autoregressively decoding paths from referring expressions and using RL to refine masks, plus releases a cleaned RefCOCO-M dataset.
-
DeepSeek-OCR: Contexts Optical Compression
DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
-
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
-
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
-
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
-
TD-MPC2: Scalable, Robust World Models for Continuous Control
TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
-
CrackMorph-XAI-Net: A Topology-Preserving and Explainable Framework for Automated Crack Morphology
CrackMorph-XAI-Net extracts crack skeletons with Dice 0.991 and topology preservation in 98.5% of cases, detects junctions with F1 0.887, and computes morphology descriptors with correlations above 0.95 on an extended...
-
FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers
FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable accurate semantic mapping.
-
CoAX: Cognitive-Oriented Attribution eXplanation User Model of Human Understanding of AI Explanations
Cognitive models of user reasoning strategies with XAI methods on tabular data fit human forward-simulation decisions better than ML baselines and support hypothesis testing without new user studies.
-
Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
Semantic Foam unifies spatial Voronoi decomposition with cell-level semantic features to achieve superior object segmentation by enabling direct spatial regularization that avoids occlusion and view-inconsistency artifacts.
-
DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery
DiffuSAM fuses diffusion-based localization cues with SAM models to deliver over 14% higher Acc@0.5 in zero-shot object grounding for remote sensing imagery compared to prior methods.
-
SGP-SAM: Self-Gated Prompting for Transferring 3D Segment Anything Models to Lesion Segmentation
SGP-SAM transfers 3D SAM to lesion segmentation using a self-gated module for conditional multi-scale enhancement and a Zoom Loss, achieving 7.3% mDice gain over fine-tuning on MSD Liver Tumor data.
-
STEP-Parts: Geometric Partitioning of Boundary Representations for Large-Scale CAD Processing
STEP-Parts produces tessellation-robust geometric part labels from STEP B-Reps by deterministic merging of same-primitive faces, enabling consistent supervision on 180k+ models.
-
MyoVision: A Mobile Research Tool and NEATBoost-Attention Ensemble Framework for Real Time Chicken Breast Myopathy Detection
Smartphone transillumination imaging paired with a neuroevolution-tuned ensemble model classifies chicken breast myopathies at 82.4% accuracy on 336 fillets, matching costly hyperspectral systems.
-
Robotic Nanoparticle Synthesis via Solution-based Processes
Screw-based motion planning extracted from single demonstrations enables robots to autonomously execute long-horizon nanoparticle synthesis protocols.
-
FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views
FF3R unifies geometric and semantic 3D reconstruction in a single annotation-free feed-forward network trained solely via RGB and feature rendering supervision.
-
F3G-Avatar : Face Focused Full-body Gaussian Avatar
F3G-Avatar improves full-body Gaussian avatars by adding a dedicated face-focused deformation branch to better preserve facial geometry and expressions from multi-view RGB video.
-
Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey
A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.
-
Label-Efficient School Detection from Aerial Imagery via Weakly Supervised Pretraining and Fine-Tuning
A two-stage weakly supervised pipeline pretrains on auto-generated school labels from sparse points and fine-tunes on only 50 manual examples to achieve strong detection performance in aerial imagery.
-
AMO-ENE: Attention-based Multi-Omics Fusion Model for Outcome Prediction in Extra Nodal Extension and HPV-associated Oropharyngeal Cancer
An attention-based fusion model combining semi-supervised CT segmentation, radiomics, and clinical features predicts metastatic recurrence, overall survival, and disease-free survival in HPV+ oropharyngeal cancer with...
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
-
Semantic-Fast-SAM: Efficient Semantic Segmenter
Semantic-Fast-SAM matches prior SAM-based semantic segmentation accuracy on Cityscapes and ADE20K while running about 20 times faster by combining FastSAM with SSA labeling and CLIP for open-vocabulary cases.