arXiv: 2304.02643 · v1 · submitted 2023-04-05 · 💻 cs.CV · cs.AI · cs.LG

Segment Anything

Pith reviewed 2026-05-11 06:08 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords image segmentation · zero-shot transfer · promptable models · large-scale dataset · foundation models · computer vision · SAM · SA-1B

The pith

A promptable model trained on a billion masks enables zero-shot segmentation that often matches supervised results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a new task, model, and dataset for image segmentation. Using an efficient version of the model to collect data in a loop, the authors assembled the largest segmentation dataset to date, containing over one billion masks across eleven million images. The resulting model is built to accept prompts such as points or boxes, allowing it to generalize zero-shot to new image distributions and tasks without retraining. Evaluations across many tasks show that this zero-shot performance is often competitive with or better than earlier models trained with full supervision for each specific task. The work releases both the model and dataset to support further research on foundation models for vision.

Core claim

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date, with over 1 billion masks on 11M images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results.

What carries the argument

The promptable Segment Anything Model (SAM) that takes user-provided prompts such as points, boxes, or coarse masks and outputs object segmentation masks.

If this is right

  • The model can be applied directly to new image types and tasks without collecting new labeled data or retraining.
  • Prompt-based interaction becomes a practical way to guide segmentation on arbitrary images.
  • The released dataset supports training or fine-tuning of additional vision models at large scale.
  • Releasing both the model and data lowers the barrier for researchers to experiment with promptable segmentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The model-in-the-loop data collection process may offer a template for building large datasets in other vision domains where annotation is expensive.
  • Promptable architectures could extend beyond segmentation to tasks like detection or editing with similar zero-shot benefits.
  • Interactive tools built on this model might reduce the need for per-task model training in applied settings such as medical imaging or content creation.
  • Performance on video sequences or 3D data would test whether the promptable property holds across temporal and spatial dimensions.

Load-bearing premise

That the zero-shot results reflect genuine generalization to new distributions rather than overfitting to the self-collected data or the evaluation tasks.

What would settle it

A controlled test on a fresh image domain and segmentation task where the model's zero-shot accuracy falls clearly below a model trained with full supervision on that same task.
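One way such a test could be scored is sketched below. This is a minimal illustration, not the paper's evaluation protocol: it assumes mean intersection-over-union (mIoU) as the accuracy measure, which the summary above does not specify, and the names `iou`, `mean_iou`, and `zero_shot_gap` are hypothetical helpers introduced only for this example.

```python
from typing import List, Sequence

# A binary mask as rows of 0/1 values (assumed representation for this sketch).
Mask = List[List[int]]


def iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: Sequence[Mask], truths: Sequence[Mask]) -> float:
    """Average IoU over a test set of paired predictions and ground truths."""
    if len(preds) != len(truths) or not preds:
        raise ValueError("need equally sized, non-empty prediction and truth lists")
    return sum(iou(p, t) for p, t in zip(preds, truths)) / len(preds)


def zero_shot_gap(
    zero_shot_preds: Sequence[Mask],
    supervised_preds: Sequence[Mask],
    truths: Sequence[Mask],
) -> float:
    """Supervised mIoU minus zero-shot mIoU on the same test images.

    A clearly positive value means the supervised baseline is ahead.
    """
    return mean_iou(supervised_preds, truths) - mean_iou(zero_shot_preds, truths)
```

Both models would need to be scored on the same held-out images and ground-truth masks for the gap to be meaningful; what counts as "clearly below" is a separate judgment this sketch does not make.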

Original abstract

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Segment Anything (SA) project, comprising a new promptable segmentation task, the Segment Anything Model (SAM), and the SA-1B dataset of over 1 billion masks on 11 million images. The dataset is constructed via a multi-stage data engine that uses an efficient version of the model itself to propose, refine, and automatically generate masks. The central claim is that the resulting promptable model transfers zero-shot to new image distributions and tasks, with performance that is often competitive with or superior to prior fully supervised methods; the model and dataset are released publicly.

Significance. If the zero-shot generalization results are robust, this work would mark a notable advance toward foundation models in computer vision by demonstrating a single promptable model that can handle diverse segmentation tasks without task-specific training or fine-tuning. The unprecedented scale of SA-1B and the open release of both model and data constitute clear strengths that could enable substantial follow-on research.

major comments (2)
  1. [Data Engine section] Data Engine section: The staged collection process (particularly stages 2 and 3) invokes the model to generate and refine masks, so the training distribution is shaped by the model's own inductive biases. This creates a circularity risk that could undermine the zero-shot claim; the manuscript must supply a concrete analysis (e.g., ablation on held-out image sources or comparison of mask statistics before/after the automatic stages) showing that reported gains on external benchmarks are not artifacts of distribution alignment.
  2. [Experiments section] Experiments section: The claim that zero-shot performance is 'often competitive with or even superior to prior fully supervised results' is load-bearing, yet the manuscript provides limited detail on exact metrics, chosen baselines, error bars, and statistical tests across the evaluated tasks. Full per-task tables with variance estimates and explicit comparison protocols are required to substantiate the central performance assertion.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'numerous tasks' is vague; a short parenthetical list of the primary evaluation benchmarks would improve immediate clarity.
  2. [Model section] Notation: The distinction between the 'efficient' model used in the data engine and the final SAM should be introduced with explicit symbols or subsection headings to avoid reader confusion in later sections.
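
The mask-statistics comparison requested in major comment 1 could be sketched roughly as follows. This is an illustration under stated assumptions, not an analysis from the manuscript: the two statistics (relative mask area and the fraction of foreground pixels on the boundary) are stand-ins chosen for simplicity, masks are assumed to be rectangular grids of 0/1 values, and every function name here is hypothetical.

```python
from statistics import mean, median
from typing import Dict, List, Sequence

# A binary mask as a rectangular grid of 0/1 values (assumed representation).
Mask = List[List[int]]


def relative_area(mask: Mask) -> float:
    """Fraction of image pixels covered by the mask."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask has no pixels")
    covered = sum(1 for row in mask for value in row if value)
    return covered / total


def boundary_fraction(mask: Mask) -> float:
    """Fraction of foreground pixels that touch background or the image edge.

    Uses the 4-neighbourhood; a rough proxy for shape complexity.
    """
    height = len(mask)
    foreground = 0
    boundary = 0
    for y, row in enumerate(mask):
        width = len(row)
        for x, value in enumerate(row):
            if not value:
                continue
            foreground += 1
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                outside = ny < 0 or ny >= height or nx < 0 or nx >= width
                if outside or not mask[ny][nx]:
                    boundary += 1
                    break
    return 0.0 if foreground == 0 else boundary / foreground


def summarize(masks: Sequence[Mask]) -> Dict[str, float]:
    """Aggregate the per-mask statistics for one collection stage."""
    areas = [relative_area(m) for m in masks]
    edges = [boundary_fraction(m) for m in masks]
    return {
        "mean_area": mean(areas),
        "median_area": median(areas),
        "mean_boundary_fraction": mean(edges),
    }


def compare_stages(before: Sequence[Mask], after: Sequence[Mask]) -> Dict[str, float]:
    """Difference (after minus before) in each summary statistic."""
    b, a = summarize(before), summarize(after)
    return {key: a[key] - b[key] for key in b}
```

Large shifts in these summaries between stages would suggest the automatic stages change the mask distribution; small shifts would not by themselves rule out circularity, since they say nothing about benchmark alignment.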

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the data engine and experimental reporting. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Data Engine section] Data Engine section: The staged collection process (particularly stages 2 and 3) invokes the model to generate and refine masks, so the training distribution is shaped by the model's own inductive biases. This creates a circularity risk that could undermine the zero-shot claim; the manuscript must supply a concrete analysis (e.g., ablation on held-out image sources or comparison of mask statistics before/after the automatic stages) showing that reported gains on external benchmarks are not artifacts of distribution alignment.

    Authors: We acknowledge the potential circularity concern arising from the model's role in stages 2 and 3 of the data engine. However, the zero-shot evaluations use entirely external benchmarks and image distributions that were never seen during data collection or training. To directly address this, we will add a new analysis subsection in the revised manuscript that includes: (1) mask statistic comparisons (e.g., size, complexity, and diversity metrics) before and after the automatic stages, and (2) performance ablations on held-out image sources excluded from the data engine. These additions will demonstrate that the reported zero-shot gains on external tasks are not artifacts of distribution alignment. revision: yes

  2. Referee: [Experiments section] Experiments section: The claim that zero-shot performance is 'often competitive with or even superior to prior fully supervised results' is load-bearing, yet the manuscript provides limited detail on exact metrics, chosen baselines, error bars, and statistical tests across the evaluated tasks. Full per-task tables with variance estimates and explicit comparison protocols are required to substantiate the central performance assertion.

    Authors: We agree that the central performance claim requires more granular reporting for full substantiation. In the revised manuscript, we will expand the Experiments section with complete per-task tables that include exact metrics for all evaluated tasks, the specific baselines used, error bars or variance estimates (from multiple seeds or cross-validation where feasible), and results of statistical significance tests. We will also add explicit descriptions of the comparison protocols, including how prompts were generated and how zero-shot transfer was measured against fully supervised methods. revision: yes
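
The variance estimates promised in the second response could take several forms. The sketch below shows one common option, a paired percentile bootstrap over per-image scores; it is an assumption about how such error bars might be computed, not a description of what the authors did or will do. The function name, the default of 2000 resamples, and the use of per-image IoU as the score are all hypothetical choices.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    zero_shot: Sequence[float],
    supervised: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap interval for the mean per-image score difference.

    Inputs are per-image scores (for example IoU) for the two models on the
    same images, in the same order. Returns (mean difference, lower bound,
    upper bound), where the difference is zero-shot minus supervised.
    """
    if len(zero_shot) != len(supervised) or not zero_shot:
        raise ValueError("need equally sized, non-empty score lists")

    # Pairing by image removes between-image variance from the comparison.
    diffs = [z - s for z, s in zip(zero_shot, supervised)]
    rng = random.Random(seed)
    n = len(diffs)

    # Resample images with replacement and record the mean difference each time.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))

    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(diffs), resampled[lower_index], resampled[upper_index]
```

An interval that excludes zero would indicate a consistent difference between the two models on that benchmark; repeating this per task would give the per-task error bars the referee asks for.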

Circularity Check

0 steps flagged

No circularity: empirical zero-shot evaluations remain independent of data engine

Full rationale

The paper's core claim is empirical: a promptable model trained on SA-1B achieves competitive zero-shot results on external tasks and image distributions. The data engine (staged collection using an efficient model variant) is a practical annotation procedure whose outputs are then used for supervised training; the reported evaluations use separate benchmarks whose ground truth and image sources are not generated by the same loop. No equation, uniqueness theorem, or prediction reduces by construction to a fitted parameter or self-generated input. Self-citations are absent from the load-bearing steps, and the architecture's promptability is justified by design and training rather than by renaming prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities beyond standard deep-learning training assumptions and the effectiveness of the promptable design.

axioms (1)
  • domain assumption: Standard deep learning assumptions on generalization from large-scale training data
    The zero-shot transfer claim rests on the model learning general segmentation capabilities from the collected data.

pith-pipeline@v0.9.0 · 5466 in / 1071 out tokens · 36264 ms · 2026-05-11T06:08:52.324361+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting

    cs.CV 2026-05 conditional novelty 8.0

    Current CAC models often count the wrong objects because they misalign text prompts with visual content, as demonstrated by new negative-label and distractor tests on the MUCCA dataset.

  2. SelvaBox: A high-resolution dataset for tropical tree crown detection

    cs.CV 2025-06 accept novelty 8.0

    SelvaBox is the largest open high-resolution dataset for tropical tree crown detection, with benchmarks showing that higher resolution improves accuracy and models trained on it generalize competitively to other unsee...

  3. Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

    cs.CL 2023-09 unverdicted novelty 8.0

    Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

  4. Vision Harnessing Agent for Open Ad-hoc Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.

  5. Functionalization via Structure Completion and Motion Rectification

    cs.CV 2026-05 unverdicted novelty 7.0

    Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture wi...

  6. MedCore: Boundary-Preserving Medical Core Pruning for MedSAM

    cs.CV 2026-05 unverdicted novelty 7.0

    MedCore achieves 60% parameter and 58.4% FLOP reduction on MedSAM with Dice 0.9549 and preserved boundary metrics via dual-intervention pruning and a new boundary leverage principle.

  7. Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.

  8. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  9. Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

    cs.CV 2026-05 unverdicted novelty 7.0

    HeadsUp maps multi-view captures to UV-parameterized 3D Gaussians on a template via an encoder-decoder, achieving state-of-the-art quality and generalization after training on more than 10,000 subjects.

  10. Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting

    cs.CV 2026-05 unverdicted novelty 7.0

    Text-guided class-agnostic counting models exhibit significant weaknesses in grounding textual prompts to visual objects, as demonstrated by new negative-label and distractor tests on a multi-category dataset.

  11. IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

    cs.CV 2026-04 unverdicted novelty 7.0

    A contract-based multi-agent system maintains a claim-level semantic memory for long videos, enabling targeted corrections that raise VQA accuracy from 0.71 to 0.79 and cut human arbitration cost by 4.8x on VidOR.

  12. A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings

    cs.CV 2026-04 unverdicted novelty 7.0

    A progressive prompting framework on 3D SAM with text, dose-box, and click prompts plus small-target loss achieves reliable multi-task segmentation of osteoradionecrosis, cerebral edema, and cerebral radiation necrosi...

  13. Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection

    cs.CV 2026-04 conditional novelty 7.0

    Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.

  14. Off-the-shelf Vision Models Benefit Image Manipulation Localization

    cs.CV 2026-04 unverdicted novelty 7.0

    ReVi adapter enables off-the-shelf vision models to localize image manipulations by separating and enhancing manipulation cues from semantic features without full model retraining.

  15. Training a Student Expert via Semi-Supervised Foundation Model Distillation

    cs.CV 2026-04 conditional novelty 7.0

    A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.

  16. TSegAgent: Zero-Shot Tooth Segmentation via Geometry-Aware Vision-Language Agents

    cs.CV 2026-03 unverdicted novelty 7.0

    TSegAgent achieves accurate zero-shot tooth segmentation on 3D dental scans via geometry-aware vision-language reasoning without task-specific training.

  17. OPTED: Open Preprocessed Trachoma Eye Dataset Using Zero-Shot SAM 3 Segmentation

    cs.CV 2026-03 accept novelty 7.0

    OPTED is a publicly released preprocessed trachoma eye image dataset generated via zero-shot SAM 3 segmentation of the tarsal conjunctiva with an optimal text prompt and quality filtering.

  18. PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning

    cs.RO 2026-02 unverdicted novelty 7.0

    PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.

  19. Descriptor: Parasitoid Wasps and Associated Hymenoptera Dataset (DAPWH)

    cs.CV 2026-02 unverdicted novelty 7.0

    Releases the DAPWH dataset of 3556 wasp images including 1739 COCO-annotated examples to enable AI models for identifying Ichneumonoidea and associated families.

  20. SAM 2++: Tracking Anything at Any Granularity

    cs.CV 2025-10 conditional novelty 7.0

    SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.

  21. ASTRA: Let Arbitrary Subjects Transform in Video Editing

    cs.CV 2025-10 unverdicted novelty 7.0

    ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.

  22. Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

    cs.CV 2025-07 unverdicted novelty 7.0

    Presents Reason50K dataset and ReasonBrain framework for hypothetical instruction-based image editing that requires physical, temporal, causal, and story reasoning.

  23. Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

    cs.AI 2024-10 unverdicted novelty 7.0

    PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

  24. ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

    cs.RO 2024-09 conditional novelty 7.0

    ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms ...

  25. Deep Time Series Models: A Comprehensive Survey and Benchmark

    cs.LG 2024-07 unverdicted novelty 7.0

    This survey and benchmark of deep time series models using the released TSLib library finds that models with specific structures perform well only on distinct analysis tasks.

  26. MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    cs.CV 2024-03 conditional novelty 7.0

    MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.

  27. Project Aria: A New Tool for Egocentric Multi-Modal AI Research

    cs.HC 2023-08 accept novelty 7.0

    Project Aria presents a new wearable egocentric multi-modal recording device and software tools to accelerate AI research for augmented reality applications.

  28. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    cs.CL 2023-07 unverdicted novelty 7.0

    SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

  29. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  30. Objaverse-XL: A Universe of 10M+ 3D Objects

    cs.CV 2023-07 accept novelty 7.0

    Objaverse-XL supplies over 10 million diverse 3D objects that, when used to render 100 million views, improve zero-shot novel-view synthesis in models such as Zero123.

  31. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  32. RiGS: Rigid-aware 4D Gaussian Splatting from a Single Monocular Video

    cs.CV 2026-05 unverdicted novelty 6.0

    RiGS decomposes scenes into static, rigid, and transient 4D Gaussians with an object-wise dynamic mask and scene flow guidance to model multi-scale motions and achieve SOTA novel view synthesis.

  33. B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    B-GRTO extends GRPO by reusing rollouts to optimize auxiliary segmentation decoder objectives, yielding substantial gains over plain GRPO on referring segmentation tasks.

  34. SR-Ground: Image Quality Grounding for Super-Resolved Content

    cs.CV 2026-05 unverdicted novelty 6.0

    The paper releases SR-Ground, a crowdsourced dataset for pixel-level segmentation of six artifact types in super-resolved images, and shows its use for training grounded IQA models and artifact-reducing fine-tuning.

  35. RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting

    cs.CV 2026-05 unverdicted novelty 6.0

    RT-Splatting adds a disentangled occupancy-opacity factorization and specular-aware gradient gating to 3D Gaussian Splatting, enabling joint high-fidelity reflection and transmission in real-time novel view synthesis.

  36. ASIP-Planner: Adaptive Planning for UAV Surface Inspection in Partially Known Indoor Environments

    cs.RO 2026-05 unverdicted novelty 6.0

    ASIP-Planner achieves near-complete surface coverage and shorter trajectories in partially known indoor environments by clustering inspection targets globally and adapting viewing angles locally to handle occlusions.

  37. HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions

    cs.RO 2026-05 unverdicted novelty 6.0

    HeteroGenManip decouples grasp localization from interaction planning using task-conditioned foundation models and multi-model diffusion policies, delivering 31% average gains in broad simulation tasks and 36.7% in fo...

  38. HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions

    cs.RO 2026-05 unverdicted novelty 6.0

    A task-conditioned two-stage system decouples grasp localization from interaction trajectory planning using specialized foundation models to improve generalization across heterogeneous object types.

  39. ClickSeg3D: Few-Click Interactive Segmentation via Semantic Embeddings

    cs.CV 2026-05 unverdicted novelty 6.0

    A point-Transformer interactive 3D instance segmentation model handles multiple clicks jointly in one pass and reports over 20% mIoU gains versus baselines plus 8-10% cross-dataset improvement for one-click-per-instan...

  40. ClickSeg3D: Few-Click Interactive Segmentation via Semantic Embeddings

    cs.CV 2026-05 unverdicted novelty 6.0

    ClickSeg3D uses a point Transformer encoder and hierarchical mask decoder with semantic embeddings to enable single-pass multi-object 3D interactive segmentation from sparse points, reporting over 20% mIoU gains versu...

  41. Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping

    cs.CV 2026-05 conditional novelty 6.0

    Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.

  42. YOTOnet: Zero-Shot Cross-Domain Fault Diagnosis via Domain-Conditioned Mixture of Experts

    cs.LG 2026-05 unverdicted novelty 6.0

    YOTOnet achieves improved zero-shot cross-domain fault diagnosis on bearing datasets by combining a physics-aware invariant feature distiller with domain-conditioned sparse experts, showing performance scaling as more...

  43. Approaching human parity in the quality of automated organoid image segmentation

    cs.CV 2026-05 conditional novelty 6.0

    A composite SAM-based method segments organoid images with accuracy matching or approaching inter-observer variability among human annotators.

  44. Learning Equivariant Neural-Augmented Object Dynamics From Few Interactions

    cs.RO 2026-05 unverdicted novelty 6.0

    PIEGraph augments a spring-mass particle model with an equivariant GNN and novel action representation to predict accurate object dynamics for robotic manipulation from few interactions.

  45. GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning

    cs.RO 2026-04 unverdicted novelty 6.0

    GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for ...

  46. DiffuSAM: Diffusion-Based Prompt-Free SAM2 for Few-Shot and Source-Free Medical Image Segmentation

    cs.CV 2026-04 unverdicted novelty 6.0

    DiffuSAM synthesizes SAM2-compatible mask embeddings via a diffusion prior conditioned on prior slices to enable accurate prompt-free medical image segmentation under SF-UDA and few-shot settings.

  47. AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents

    cs.HC 2026-04 unverdicted novelty 6.0

    AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.

  48. SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces

    cs.RO 2026-04 unverdicted novelty 6.0

    SpaceDex achieves 63% success grasping unseen objects in tiered workspaces via VLM spatial planning and arm-hand feature separation, beating a 39% tabletop baseline in 100 real trials.

  49. Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction

    cs.RO 2026-04 unverdicted novelty 6.0

    COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning du...

  50. One-Shot Cross-Geometry Skill Transfer through Part Decomposition

    cs.RO 2026-04 unverdicted novelty 6.0

    Part decomposition with generative shape models allows one-shot robot skill transfer across unfamiliar object geometries in simulation and real settings.

  51. From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation

    cs.CV 2026-04 unverdicted novelty 6.0

    Petro-SAM adapts SAM via a Merge Block for polarized views plus multi-scale fusion and color-entropy priors to jointly achieve grain-edge and lithology segmentation in petrographic images.

  52. Granularity-Aware Transfer for Tree Instance Segmentation in Synthetic and Real Forests

    cs.CV 2026-04 unverdicted novelty 6.0

    Granularity-aware distillation improves tree instance segmentation accuracy on real forest images by merging logits and unifying masks from fine-grained synthetic teachers despite coarse real labels.

  53. GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality

    cs.CV 2026-04 unverdicted novelty 6.0

    GTPBD-MM is the first multimodal benchmark for global terraced parcel extraction, integrating image, text, and DEM data with experiments showing that textual and terrain cues improve delineation accuracy over image-on...

  54. Self-supervised Pretraining of Cell Segmentation Models

    cs.CV 2026-04 unverdicted novelty 6.0

    DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.

  55. Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.

  56. GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy

    cs.CV 2026-04 unverdicted novelty 6.0

    GESS introduces joint semantic-normal and depth stability prediction heads, the SDAK keypoint mechanism, and the UTCF descriptor fusion module to leverage multi-cue synergy for improved robustness and discriminability.

  57. Moondream Segmentation: From Words to Masks

    cs.CV 2026-04 unverdicted novelty 6.0

    Moondream Segmentation achieves 80.2% cIoU on RefCOCO by autoregressively decoding paths from referring expressions and using RL to refine masks, plus releases a cleaned RefCOCO-M dataset.

  58. TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization

    cs.CV 2026-03 unverdicted novelty 6.0

    TrianguLang achieves state-of-the-art feed-forward text-guided 3D localization and segmentation by using predicted geometry to gate cross-view semantic correspondences without ground-truth poses.

  59. FG-TreeSeg: Flow-Guided Tree Crown Segmentation without Instance Annotations

    cs.CV 2026-01 unverdicted novelty 6.0

    FG-TreeSeg applies Cellpose-SAM flow fields to model tree crowns as star-convex objects and separate overlapping instances without training or instance annotations.

  60. Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

    cs.CV 2025-12 unverdicted novelty 6.0

    Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.

Reference graph

Works this paper leans on

196 extracted references · 196 canonical work pages · cited by 128 Pith papers · 10 internal anchors

  1. [1]

    On seeing stuff: the perception of materials by humans and machines

    Edward H Adelson. On seeing stuff: the perception of materials by humans and machines. Human vision and electronic imaging VI ,

  2. [2]

    What is an object? CVPR, 2010

    Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. What is an object? CVPR, 2010. 4, 10

  3. [3]

    Contour detection and hierarchical image segmentation

    Pablo Arbel ´aez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. TPAMI, 2010. 4, 10, 21, 28

  4. [4]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv:1607.06450, 2016. 16

  5. [5]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv:2106.08254, 2021. 17

  6. [6]

    ZeroWaste dataset: Towards deformable object segmentation in cluttered scenes

    Dina Bashkirova, Mohamed Abdelfattah, Ziliang Zhu, James Akl, Fadi Alladkani, Ping Hu, Vitaly Ablavsky, Berk Calli, Sarah Adel Bargal, and Kate Saenko. ZeroWaste dataset: Towards deformable object segmentation in cluttered scenes. CVPR, 2022. 9, 20

  7. [7]

    ilastik: interactive machine learning for (bio)image analysis

    Stuart Berg, Dominik Kutra, Thorben Kroeger, Christoph N. Straehle, Bernhard X. Kausler, Carsten Haubold, Martin Schiegg, Janez Ales, Thorsten Beier, Markus Rudy, Kemal Eren, Jaime I. Cervantes, Buote Xu, Fynn Beuttenmueller, Adrian Wolny, Chong Zhang, Ullrich Koethe, Fred A. Hamprecht, and Anna Kreshuk. ilastik: interactive machine learning for (bio)imag...

  8. [8]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv:2108.07258, 2021. 1, 12

  9. [9]

    Iterative interaction training for segmentation editing networks

    Gustav Bredell, Christine Tanner, and Ender Konukoglu. Iterative interaction training for segmentation editing networks. MICCAI,

  10. [10]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

  11. [11]

    Cascade R-CNN: Delving into high quality object detection

    Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. CVPR, 2018. 10

  12. [12]

    Nucleus segmentation across imaging experiments: the 2018 data science bowl

    Juan C. Caicedo, Allen Goodman, Kyle W. Karhohs, Beth A. Cimini, Jeanelle Ackerman, Marzieh Haghighi, CherKeng Heng, Tim Becker, Minh Doan, Claire McQuin, Mohammad Rohban, Shantanu Singh, and Anne E. Carpenter. Nucleus segmentation across imaging experiments: the 2018 data science bowl. Nature Methods,

  13. [13]

    A computational approach to edge detection

    John Canny. A computational approach to edge detection. TPAMI,

  14. [14]

    End-to-end object detection with Transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with Transformers. ECCV, 2020. 5, 16, 17

  15. [15]

    Automatic image colorization via multimodal predictions

    Guillaume Charpiat, Matthias Hofmann, and Bernhard Schölkopf. Automatic image colorization via multimodal predictions. ECCV,

  16. [16]

    Object-proposal evaluation protocol is ‘gameable’

    Neelima Chavali, Harsh Agrawal, Aroma Mahendru, and Dhruv Batra. Object-proposal evaluation protocol is ‘gameable’. CVPR,

  17. [17]

    3D instance segmentation of MVS buildings

    Jiazhou Chen, Yanghui Xu, Shufang Lu, Ronghua Liang, and Liangliang Nan. 3D instance segmentation of MVS buildings. IEEE Transactions on Geoscience and Remote Sensing, 2022. 9, 19, 20, 23, 24

  18. [18]

    FocalClick: towards practical interactive image segmentation

    Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, and Hengshuang Zhao. FocalClick: towards practical interactive image segmentation. CVPR, 2022. 8, 9, 12, 19

  19. [19]

    Masked-attention mask transformer for universal image segmentation

    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. CVPR, 2022. 4

  20. [20]

    Per-pixel classification is not all you need for semantic segmentation

    Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. NeurIPS, 2021. 5, 16, 17

  21. [21]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv:2204.02311, 2022. 1

  22. [22]

    Domain adaptation for traffic density estimation

    Luca Ciampi, Carlos Santiago, Joao Costeira, Claudio Gennaro, and Giuseppe Amato. Domain adaptation for traffic density estimation. International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2021. 9, 20

  23. [23]

    Luca Ciampi, Carlos Santiago, Joao Costeira, Claudio Gennaro, and Giuseppe Amato. Night and day instance segmented park (NDIS-Park) dataset: a collection of images taken by day and by night for vehicle detection, segmentation and counting in parking areas. Zenodo, 2022. 9, 20

  24. [24]

    Semantic segmentation in art paintings

    Nadav Cohen, Yael Newman, and Ariel Shamir. Semantic segmentation in art paintings. Computer Graphics Forum, 2022. 9, 19, 20, 23, 24

  25. [25]

    The Cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. CVPR, 2016. 9, 19, 20

  26. [26]

    Learning parameterized skills

    Bruno da Silva, George Konidaris, and Andrew Barto. Learning parameterized skills. ICML, 2012. 4

  27. [27]

    Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. IJCV, 2022. 9, 20, 23, 24

  28. [28]

    EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations

    Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen. EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations. NeurIPS, 2022. 9, 19, 20, 23, 24

  29. [29]

    Does object recognition work for everyone?

    Terrance De Vries, Ishan Misra, Changhan Wang, and Laurens Van der Maaten. Does object recognition work for everyone? CVPR workshops, 2019. 18

  30. [30]

    CrowdWorkSheets: Accounting for individual and collective identities underlying crowdsourced dataset annotation

    Mark Díaz, Ian Kivlichan, Rachel Rosen, Dylan Baker, Razvan Amironesei, Vinodkumar Prabhakaran, and Emily Denton. CrowdWorkSheets: Accounting for individual and collective identities underlying crowdsourced dataset annotation. ACM Conference on Fairness, Accountability, and Transparency, 2022. 25

  31. [31]

    PhraseClick: toward achieving flexible interactive segmentation by phrase and click

    Henghui Ding, Scott Cohen, Brian Price, and Xudong Jiang. PhraseClick: toward achieving flexible interactive segmentation by phrase and click. ECCV, 2020. 11

  32. [32]

    Fast edge detection using structured forests

    Piotr Dollár and C Lawrence Zitnick. Fast edge detection using structured forests. TPAMI, 2014. 21

  33. [33]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021. 5, 8, 16

  34. [34]

    Alireza Fathi, Xiaofeng Ren, and James M. Rehg. Learning to recognize objects in egocentric activities. CVPR, 2011. 9, 19, 20

  35. [35]

    Efficient graph-based image segmentation

    Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. IJCV, 2004. 10

  36. [36]

    The validity and practicality of sun-reactive skin types I through VI

    Thomas B. Fitzpatrick. The validity and practicality of sun-reactive skin types I through VI. Archives of Dermatology, 1988. 8

  37. [37]

    Getting to 99% accuracy in interactive segmentation

    Marco Forte, Brian Price, Scott Cohen, Ning Xu, and François Pitié. Getting to 99% accuracy in interactive segmentation. arXiv:2003.07932, 2020. 5, 17

  38. [38]

    Instance segmentation for autonomous log grasping in forestry operations

    Jean-Michel Fortin, Olivier Gamache, Vincent Grondin, François Pomerleau, and Philippe Giguère. Instance segmentation for autonomous log grasping in forestry operations. IROS, 2022. 9, 20

  39. [39]

    Datasheets for datasets

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM,

  40. [40]

    Simple copy-paste is a strong data augmentation method for instance segmentation

    Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. CVPR,

  41. [41]

    Rich feature hierarchies for accurate object detection and semantic segmentation

    Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014. 10

  42. [42]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017. 17

  43. [43]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent C...

  44. [44]

    LVIS: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. CVPR, 2019. 2, 6, 7, 9, 10, 11, 19, 20, 21, 24

  45. [45]

    Multiple choice learning: Learning to produce multiple structured outputs

    Abner Guzman-Rivera, Dhruv Batra, and Pushmeet Kohli. Multiple choice learning: Learning to produce multiple structured outputs. NeurIPS, 2012. 5, 17

  46. [46]

    SOCRATES: Introducing depth in visual wildlife monitoring using stereo vision

    Timm Haucke, Hjalmar S. Kühl, and Volker Steinhage. SOCRATES: Introducing depth in visual wildlife monitoring using stereo vision. Sensors, 2022. 9, 20

  47. [47]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. CVPR, 2022. 5, 8, 12, 16, 17

  48. [48]

    Mask R-CNN

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. ICCV, 2017. 10

  49. [49]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016. 16

  50. [50]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv:1606.08415, 2016. 16

  51. [51]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv:2203.15556, 2022. 1

  52. [52]

    TrashCan: A semantically-segmented dataset towards visual detection of marine debris

    Jungseok Hong, Michael Fulton, and Junaed Sattar. TrashCan: A semantically-segmented dataset towards visual detection of marine debris. arXiv:2007.08097, 2020. 9, 19, 20

  53. [53]

    Deep networks with stochastic depth

    Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. ECCV, 2016. 17

  54. [54]

    Oneformer: One transformer to rule universal image segmentation

    Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. Oneformer: One transformer to rule universal image segmentation. arXiv:2211.06220, 2022. 4

  55. [55]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. ICML, 2021. 1

  56. [56]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020. 1

  57. [57]

    Snakes: Active contour models

    Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models. IJCV, 1988. 4

  58. [58]

    Learning open-world object proposals without learning to classify

    Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, and Weicheng Kuo. Learning open-world object proposals without learning to classify. IEEE Robotics and Automation Letters, 2022. 21

  59. [59]

    Panoptic segmentation

    Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. CVPR, 2019. 4

  60. [60]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020. 2, 6, 7, 18, 19

  61. [61]

    Quantifying the Carbon Emissions of Machine Learning

    Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv:1910.09700, 2019. 28

  62. [62]

    Exploring plain vision transformer backbones for object detection

    Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. ECCV,

  63. [63]

    5, 10, 11, 16, 21, 23, 24

  64. [64]

    Yin Li, Zhefan Ye, and James M. Rehg. Delving into egocentric actions. CVPR, 2015. 9, 20

  65. [65]

    Interactive image segmentation with latent diversity

    Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Interactive image segmentation with latent diversity. CVPR, 2018. 5, 17, 19

  66. [66]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. ICCV, 2017. 5, 17

  67. [67]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014. 2, 4, 6, 7, 11, 18, 19, 20

  68. [68]

    SimpleClick: Interactive image segmentation with simple vision transformers

    Qin Liu, Zhenlin Xu, Gedas Bertasius, and Marc Niethammer. SimpleClick: Interactive image segmentation with simple vision transformers. arXiv:2210.11006, 2022. 8, 9, 12, 19

  69. [69]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019. 17

  70. [70]

    Gelatinous zooplankton biomass in the global oceans: geographic variation and environmental drivers

    Cathy H Lucas, Daniel OB Jones, Catherine J Hollyhead, Robert H Condon, Carlos M Duarte, William M Graham, Kelly L Robinson, Kylie A Pitt, Mark Schildhauer, and Jim Regetz. Gelatinous zooplankton biomass in the global oceans: geographic variation and environmental drivers. Global Ecology and Biogeography, 2014. 20

  71. [71]

    Iteratively trained interactive segmentation

    Sabarinath Mahadevan, Paul Voigtlaender, and Bastian Leibe. Iteratively trained interactive segmentation. BMVC, 2018. 4, 17

  72. [72]

    Deep extreme cut: From extreme points to object segmentation

    Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. CVPR, 2018. 6

  73. [73]

    A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics

    David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. ICCV, 2001. 10, 21, 28

  74. [74]

    V-Net: Fully convolutional neural networks for volumetric medical image segmentation

    Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. 3DV, 2016. 5, 17

  75. [75]

    Finely-grained annotated datasets for image-based plant phenotyping

    Massimo Minervini, Andreas Fischbach, Hanno Scharr, and Sotirios A. Tsaftaris. Finely-grained annotated datasets for image-based plant phenotyping. Pattern Recognition Letters, 2016. 9, 20

  76. [76]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. Proceedings of the conference on fairness, accountability, and transparency, 2019. 25, 28

  77. [77]

    Extreme clicking for efficient object annotation

    Dim P Papadopoulos, Jasper RR Uijlings, Frank Keller, and Vittorio Ferrari. Extreme clicking for efficient object annotation. ICCV,

  78. [78]

    Carbon Emissions and Large Neural Network Training

    David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv:2104.10350, 2021. 28

  79. [79]

    Semi-supervised sequence tagging with bidirectional language models

    Matthew E Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017. 18

  80. [80]

    EDTER: Edge detection with transformer

    Mengyang Pu, Yaping Huang, Yuming Liu, Qingji Guan, and Haibin Ling. EDTER: Edge detection with transformer. CVPR,

Showing first 80 references.