SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation

Abderrahmene Boudiaf; Irfan Hussain; Sajid Javed

arxiv: 2605.17630 · v2 · pith:JGV5TOTTnew · submitted 2026-05-17 · 💻 cs.CV

Pith reviewed 2026-05-21 07:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords semantic segmentation · open-vocabulary segmentation · retrieval-augmented generation · point prompts · zero-shot domain transfer · SAM · DINO features · agricultural imaging

The pith

SegRAG derives class-specific point prompts from a distilled DINOv3 feature bank to ground SAM3 segmentation without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SegRAG as a training-free way to improve open-vocabulary segmentation models such as SAM3 when target classes are rare or visually atypical. It works by first building a reference feature bank from annotated images and then using Intra-Class Cohesion Distillation to keep only the most reliable within-class prototypes. At inference, Topographic Similarity Grounding computes similarity maps against those prototypes, locates coherent high-confidence regions, and extracts peak points that are fed to SAM3 together with ordinary text prompts. Large gains appear on standard benchmarks and especially under zero-shot domain transfer to agricultural images where text-only performance collapses.

Core claim

SegRAG extracts dense patch-level DINOv3 descriptors from annotated reference images, retains reliable prototypes through Intra-Class Cohesion Distillation, and at test time applies Topographic Similarity Grounding to produce a cosine-similarity landscape; connected-component analysis and non-maximum suppression then yield point prompts that are supplied jointly with class-name text to SAM3 in one forward pass, consistently raising mIoU over the text-only baseline and producing especially large lifts on AgML data under domain shift.
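The grounding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the max-over-prototypes reduction, the threshold of 0.8, the minimum region size, and the use of 4-connectivity are all assumptions made for illustration, and non-maximum suppression is left out so that each region simply contributes its single best cell.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_landscape(patch_grid, prototypes):
    """H x W map: best cosine similarity of each patch to any prototype.

    Taking the max over prototypes is an assumption of this sketch.
    """
    return [[max(cosine(p, q) for q in prototypes) for p in row] for row in patch_grid]


def coherent_regions(sim_map, threshold=0.8, min_size=2):
    """4-connected components of cells with similarity >= threshold.

    Components smaller than min_size are discarded as incoherent.
    """
    h, w = len(sim_map), len(sim_map[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or sim_map[r][c] < threshold:
                continue
            seen[r][c] = True
            queue, comp = deque([(r, c)]), []
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < h and 0 <= nx < w
                    if inside and not seen[ny][nx] and sim_map[ny][nx] >= threshold:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(comp) >= min_size:
                regions.append(comp)
    return regions


def region_peaks(sim_map, regions):
    """Highest-similarity cell (row, col) of each region."""
    return [max(comp, key=lambda rc: sim_map[rc[0]][rc[1]]) for comp in regions]


# Toy 3 x 3 grid of 2-D "descriptors"; real patch features are far higher-dimensional.
grid = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.95, 0.05], [1.0, 0.2], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
]
prototypes = [[1.0, 0.0]]
sim = similarity_landscape(grid, prototypes)
regions = coherent_regions(sim)
peaks = region_peaks(sim, regions)
print(peaks)
```

In this toy grid the isolated bottom-right cell matches the prototype perfectly but is dropped by the size filter, which is the kind of behavior a coherence requirement is meant to produce.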

What carries the argument

Intra-Class Cohesion Distillation (ICCD) to filter prototypes and Topographic Similarity Grounding (TSG) to extract coherent point prompts from a DINOv3 feature bank.
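The filtering idea behind ICCD can be sketched as follows. The paper's exact coherence score and threshold are not reproduced on this page, so everything concrete here is an assumption: the score is taken to be the fraction of a descriptor's k nearest neighbours in the bank that share its class, and the per-class threshold is taken to be the class mean of that score. The names `coherence_scores` and `distill` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def coherence_scores(bank, k=2):
    """Assumed score: share of each entry's k nearest neighbours with the same label.

    bank is a list of (feature_vector, class_label) pairs.
    """
    scores = []
    for i, (feat, label) in enumerate(bank):
        others = [j for j in range(len(bank)) if j != i]
        others.sort(key=lambda j: cosine(feat, bank[j][0]), reverse=True)
        neighbours = others[:k]
        same = sum(1 for j in neighbours if bank[j][1] == label)
        scores.append(same / max(len(neighbours), 1))
    return scores


def distill(bank, k=2):
    """Keep entries whose score reaches their class mean (assumed adaptive threshold)."""
    rho = coherence_scores(bank, k)
    per_class = {}
    for (_, label), score in zip(bank, rho):
        per_class.setdefault(label, []).append(score)
    kappa = {label: sum(vals) / len(vals) for label, vals in per_class.items()}
    return [entry for entry, score in zip(bank, rho) if score >= kappa[entry[1]]]


# Toy bank: the third "leaf" descriptor looks like soil and should be filtered out.
bank = [
    ([1.0, 0.0], "leaf"),
    ([0.9, 0.1], "leaf"),
    ([0.1, 0.9], "leaf"),
    ([0.0, 1.0], "soil"),
    ([0.05, 1.0], "soil"),
    ([0.2, 0.8], "soil"),
]
kept = distill(bank)
print(len(kept), "of", len(bank), "prototypes retained")
```

A relative, per-class threshold is used here so that a class with generally noisy references is not emptied entirely; whether the paper makes the same choice is not stated on this page.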

If this is right

On LVIS the method yields gains of up to +3.92 mIoU over text-only prompting.
On AgML agricultural benchmarks under zero-shot domain transfer, mean IoU rises from 25.27 to 59.24.
Individual classes that score zero under text prompting recover to over 95 mIoU.
Ablations show that ICCD, TSG, and joint text-plus-point prompting each add independent value and combine constructively.
The framework operates entirely without fine-tuning or additional model parameters.
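For readers who want to check the arithmetic behind these figures, the sketch below computes per-class IoU and its mean from two label maps, then recomputes the AgML gain from the two reported means. Averaging only over classes present in the prediction or the ground truth is a common convention assumed here; the paper's exact evaluation protocol is not given on this page.

```python
def per_class_iou(pred, target, num_classes):
    """IoU per class from two flat label lists; None when a class is absent from both."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else None)
    return ious


def mean_iou(ious):
    """Mean IoU in percent over the classes that actually occur."""
    valid = [v for v in ious if v is not None]
    return 100.0 * sum(valid) / len(valid)


# Toy six-pixel example with three classes (illustrative labels, not paper data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
toy_miou = mean_iou(per_class_iou(pred, target, num_classes=3))

# Gain implied by the two AgML means quoted in the abstract.
agml_gain = round(59.24 - 25.27, 2)
print(toy_miou, agml_gain)
```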

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same prototype-selection and similarity-grounding steps could be attached to other open-vocabulary segmenters that accept point or box prompts.
Small curated reference sets may compensate for distributional gaps that remain after large-scale pretraining of vision encoders.
Performance is likely to scale with the visual diversity of the reference collection rather than its sheer size.
The approach invites direct tests on interactive or streaming segmentation scenarios where a few labeled exemplars become available on the fly.

Load-bearing premise

Prototypes kept by Intra-Class Cohesion Distillation will still locate coherent high-confidence regions via Topographic Similarity Grounding when the target images come from substantially different visual domains.

What would settle it

Run SegRAG on a new domain where the retained prototypes produce similarity maps whose connected components fall below the coherence threshold used in the paper, then measure whether mIoU remains equal to or below the text-only baseline.

Original abstract

Open-vocabulary segmentation models such as SAM3 perform well across broad categories via text prompting, yet degrade when target classes are visually underrepresented in pretraining or depart from canonical depictions, limitations that text prompts cannot resolve spatially. We present SegRAG, a training-free retrieval-augmented segmentation framework that grounds SAM3 with class-specific point prompts derived from a curated DINOv3 feature bank. Offline, dense patch-level descriptors are extracted from annotated references and filtered by Intra-Class Cohesion Distillation (ICCD), retaining only prototypes that reliably retrieve within-class foreground. At inference, Topographic Similarity Grounding (TSG) computes a cosine-similarity landscape against retrieved prototypes, identifies coherent high-confidence regions via connected-component analysis, and extracts peak locations through non-maximum suppression. The resulting point prompts are delivered jointly with class-name text in a single SAM3 forward pass. On four standard benchmarks, SegRAG consistently outperforms the text-only baseline, gaining up to +3.92 mIoU on LVIS. On AgML agricultural benchmarks under zero-shot domain transfer, it raises mean IoU from 25.27 to 59.24 (+33.97) and recovers individual classes from zero to over 95 mIoU. Ablations confirm that ICCD, TSG, and joint prompting each contribute independently and compound when combined. Code is available at https://github.com/boudiafA/SegRAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SegRAG adds ICCD prototype filtering and TSG point extraction to ground SAM3 with joint text-plus-point prompts, delivering clear benchmark gains but leaving the large AgML domain-transfer results open to questions about reference similarity.

The main thing to know is that this paper gives a training-free way to improve SAM3 segmentation by building a DINOv3 feature bank from reference images, filtering it with Intra-Class Cohesion Distillation, and then using Topographic Similarity Grounding at test time to pull out point prompts that are fed to SAM3 together with the class text. The result is better performance without any retraining or fine-tuning.

On the positive side, the method is simple to describe, and the ablations show that the filtering step, the topographic grounding, and the joint prompting each add something on their own. The reported lifts on standard sets like LVIS reach about 4 mIoU points over text-only prompting, and the code is released, which helps anyone who wants to test it. The bigger reported jump on the AgML agricultural benchmarks, from 25 to 59 mIoU with some classes moving from zero to over 95, is the part that would interest people working on narrow visual domains. That kind of gain without training is worth looking at if the numbers hold up.

The soft spot is exactly the one the stress-test note flags. The large AgML gains rest on the assumption that the prototypes kept by ICCD from the reference set will still produce coherent high-confidence regions on target images from a visibly different agricultural distribution. If the references share too many visual statistics with the targets, or if ICCD retains features that are not fully class-invariant, then the connected-component analysis and NMS points could be picking up overlap rather than true transfer. The abstract claims the components contribute independently, but without variance numbers, exact reference selection details, or more controls on domain shift, it is hard to judge how much of the per-class recovery is robust.

This paper is for researchers who work on open-vocabulary segmentation or quick adaptation to specialized tasks such as crop monitoring. A reader who wants concrete retrieval ideas that avoid retraining will find the specific ICCD-plus-TSG combination useful to try. It has enough new empirical work and a reproducible setup to deserve a serious referee, even though the domain-shift analysis would probably need strengthening in revision. I would send it for peer review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SegRAG, a training-free retrieval-augmented semantic segmentation framework. It extracts dense DINOv3 patch descriptors from annotated reference images, filters them via Intra-Class Cohesion Distillation (ICCD) to retain class-coherent prototypes, and at inference applies Topographic Similarity Grounding (TSG) to compute cosine-similarity maps, extract high-confidence regions via connected components, and derive NMS point prompts. These prompts are combined with text prompts in a single SAM3 pass. The paper reports consistent gains over text-only baselines on four standard benchmarks (up to +3.92 mIoU on LVIS) and large improvements on AgML agricultural benchmarks under zero-shot domain transfer (mean IoU from 25.27 to 59.24, with per-class recoveries from 0 to >95 mIoU). Ablations indicate independent contributions from ICCD, TSG, and joint prompting.

Significance. If the AgML domain-transfer results prove robust under controls for reference selection, the work would be significant for training-free open-vocabulary segmentation in visually shifted domains such as agriculture. The method leverages existing pretrained models (SAM3, DINOv3) without fine-tuning and provides code, supporting reproducibility. The approach addresses limitations of pure text prompting by adding spatially grounded point prompts derived from feature retrieval.

major comments (2)

[AgML experiments] AgML zero-shot domain transfer experiments: The +33.97 mIoU gain and per-class recoveries (0 to >95 mIoU) rest on the assumption that ICCD-filtered DINOv3 prototypes remain aligned with target-domain foreground under TSG. The manuscript must specify reference-image selection criteria and include controls (e.g., deliberately mismatched agricultural references) to demonstrate that gains are not artifacts of shared visual statistics between references and AgML targets.
[Results] Quantitative results and ablations: Reported mIoU values lack variance, standard deviations, or multiple-run statistics. Baseline implementation details and dataset statistics are also absent. These omissions prevent assessment of whether ablations truly isolate independent contributions from ICCD, TSG, and joint prompting.
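The kind of reporting requested in the second comment can be sketched in a few lines. The run values below are placeholders invented purely to exercise the code; they are not results from the paper. The helper names and the "gain must exceed two standard deviations" rule are likewise assumptions of this sketch, not a test the referee prescribes.

```python
import statistics


def summarize_runs(miou_runs):
    """Mean and sample standard deviation of mIoU across repeated runs."""
    mean = statistics.mean(miou_runs)
    std = statistics.stdev(miou_runs) if len(miou_runs) > 1 else 0.0
    return mean, std


def gain_exceeds_noise(baseline_runs, method_runs, num_std=2.0):
    """Crude check: is the mean gain larger than num_std times the larger spread?"""
    b_mean, b_std = summarize_runs(baseline_runs)
    m_mean, m_std = summarize_runs(method_runs)
    return (m_mean - b_mean) > num_std * max(b_std, m_std)


# Placeholder numbers only (NOT from the paper): three reference-set samplings each.
baseline_runs = [40.0, 41.0, 42.0]
method_runs = [50.0, 52.0, 54.0]
method_mean, method_std = summarize_runs(method_runs)
robust = gain_exceeds_noise(baseline_runs, method_runs)
print(f"{method_mean:.2f} +/- {method_std:.2f}, gain exceeds noise: {robust}")
```

A proper analysis would likely use a paired test over identical reference samplings; the threshold rule here is only meant to show why a single mIoU figure without spread is hard to interpret.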

minor comments (1)

[Abstract] The abstract refers to gains on 'four standard benchmarks' without naming them; this should be stated explicitly for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to incorporate the requested clarifications and additional controls.

Referee: AgML zero-shot domain transfer experiments: The +33.97 mIoU gain and per-class recoveries (0 to >95 mIoU) rest on the assumption that ICCD-filtered DINOv3 prototypes remain aligned with target-domain foreground under TSG. The manuscript must specify reference-image selection criteria and include controls (e.g., deliberately mismatched agricultural references) to demonstrate that gains are not artifacts of shared visual statistics between references and AgML targets.

Authors: We agree that explicit reference selection criteria and control experiments are necessary to substantiate the domain-transfer claims. In the revised manuscript we will add a dedicated subsection describing the reference-image selection process for AgML (including source datasets, annotation quality filters, and visual diversity criteria). We will also report new control experiments that deliberately use mismatched agricultural and non-agricultural references to quantify the contribution of visual alignment versus the SegRAG retrieval mechanism.

revision: yes

Referee: Quantitative results and ablations: Reported mIoU values lack variance, standard deviations, or multiple-run statistics. Baseline implementation details and dataset statistics are also absent. These omissions prevent assessment of whether ablations truly isolate independent contributions from ICCD, TSG, and joint prompting.

Authors: We acknowledge the value of statistical reporting and implementation transparency. The revised version will include standard deviations obtained from multiple reference-set samplings, expanded baseline implementation details (hyperparameters, prompt templates, and preprocessing), and dataset statistics (class frequencies, image counts). We will also clarify the ablation design to better isolate the independent effects of ICCD, TSG, and joint prompting.

revision: yes

Circularity Check

0 steps flagged

No circularity: SegRAG is an empirical training-free method grounded in external pretrained models

The paper describes a retrieval-augmented segmentation pipeline that extracts DINOv3 patch descriptors from annotated references, applies Intra-Class Cohesion Distillation (ICCD) offline to retain prototypes, and uses Topographic Similarity Grounding (TSG) at inference to produce point prompts for SAM3. All performance numbers (e.g., +33.97 mIoU on AgML zero-shot transfer) are obtained by direct evaluation on held-out benchmarks. No equations, fitted parameters, or self-referential definitions appear in the derivation; the method is explicitly training-free and relies on independent pretrained backbones. Ablations are reported as empirical measurements of component contributions rather than algebraic identities. The central claims therefore remain externally falsifiable and do not reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper introduces no new free parameters or invented physical entities. It relies on standard assumptions about feature similarity and connected-component behavior that are common in the computer-vision literature.

axioms (2)

domain assumption: DINOv3 patch-level descriptors capture class-discriminative information that transfers across images of the same class.
Invoked when building the offline feature bank and when computing cosine-similarity landscapes in TSG.
domain assumption: Connected-component analysis followed by non-maximum suppression reliably isolates peak locations from topographic similarity maps.
Used inside the TSG module to convert similarity landscapes into point prompts.
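The second assumption concerns turning a similarity map into a handful of well-separated points. A minimal greedy non-maximum suppression over a 2-D map is sketched below. The score floor, the Chebyshev distance rule, the cap on the number of points, and the 16-pixel patch size used to map grid cells to pixel coordinates are all assumptions for illustration; the paper's settings are not given on this page.

```python
def nms_peaks(sim_map, min_score=0.5, min_dist=2, max_points=5):
    """Greedy NMS: visit cells by descending score, keep those far from kept ones.

    Distance is Chebyshev distance in grid cells; returns (row, col) tuples.
    """
    candidates = [
        (score, r, c)
        for r, row in enumerate(sim_map)
        for c, score in enumerate(row)
        if score >= min_score
    ]
    candidates.sort(key=lambda item: -item[0])  # stable: ties keep scan order
    kept = []
    for _, r, c in candidates:
        if all(max(abs(r - kr), abs(c - kc)) >= min_dist for kr, kc in kept):
            kept.append((r, c))
        if len(kept) == max_points:
            break
    return kept


def to_pixel(cell, patch_size=16):
    """Centre of a grid cell in (x, y) pixel coordinates (patch size is assumed)."""
    r, c = cell
    return ((c + 0.5) * patch_size, (r + 0.5) * patch_size)


# Toy similarity map with two separate high-confidence areas.
sim_map = [
    [0.9, 0.8, 0.1, 0.1],
    [0.7, 0.6, 0.1, 0.7],
    [0.1, 0.1, 0.1, 0.95],
]
points = nms_peaks(sim_map)
pixel_points = [to_pixel(p) for p in points]
print(points, pixel_points)
```

Here the strong neighbours of each maximum are suppressed, so each area yields a single prompt location rather than a cluster of near-duplicate points.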

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Intra-Class Cohesion Distillation (ICCD) ... coherence score ρ(v) ... adaptive per-class threshold κc ... Topographic Similarity Grounding (TSG) computes a cosine-similarity landscape ... connected-component analysis ... non-maximum suppression

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: On AgML agricultural benchmarks under zero-shot domain transfer, SegRAG raises mean IoU from 25.27 to 59.24

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 21 internal anchors

[1]

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., Doll´ ar, P., Girshick, R.: Segment any- thing. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3992–4003. IEEE, ??? (2023). https://doi.org/10.1109/iccv51070.2023.00371 .http://dx.doi.org/10.1109/iccv51070.2023.00371

work page doi:10.1109/iccv51070.2023.00371 2023
[2]

Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., R¨ adle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.-Y., Girshick, R., Doll´ ar, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos (2024) arXiv:2408.00714 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Carion, N., Gustafson, L., Hu, Y.-T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., R¨ adle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.-H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H....

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K¨ uttler, H., Lewis, M., Yih, W.-t., Rockt¨ aschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive nlp tasks (2020) arXiv:2005.11401 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2020
[5]

Sim´ eoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., J´ egou, H., Labatut, P., Bojanowski, P.: Dinov3 (2025) arXiv:...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Fully Convolutional Networks for Semantic Segmentation

Long, J., Shelhamer, E., Darrell, T.: Fully Convolutional Networks for Semantic Segmentation (2015). https://arxiv.org/abs/1411.4038

work page internal anchor Pith review Pith/arXiv arXiv 2015
[7]

U-Net: Convolutional Networks for Biomedical Image Segmentation

Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation (2015). https://arxiv.org/abs/1505.04597

work page internal anchor Pith review Pith/arXiv arXiv 2015
[8]

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs (2015). https://arxiv.org/abs/1412.7062

work page internal anchor Pith review Pith/arXiv arXiv 2015
[9]

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolu- tion, and Fully Connected CRFs (2018). https://arxiv.org/abs/1606.00915

work page internal anchor Pith review Pith/arXiv arXiv 2018
[10]

https://arxiv.org/abs/1706

Chen, L.-C., Papandreou, G., Schroff, F., Adam, H.: Rethinking Atrous Con- volution for Semantic Image Segmentation (2017). https://arxiv.org/abs/1706. 05587

work page 2017
[11]

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (2018). https://arxiv.org/abs/1802.02611

work page internal anchor Pith review Pith/arXiv arXiv 2018
[12]

Mask R-CNN

He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask R-CNN (2017). https://arxiv. org/abs/1703.06870

work page internal anchor Pith review Pith/arXiv arXiv 2017
[13]

Panoptic Segmentation

Kirillov, A., He, K., Girshick, R., Rother, C., Doll´ ar, P.: Panoptic Segmentation (2019). https://arxiv.org/abs/1801.00868

work page internal anchor Pith review Pith/arXiv arXiv 2019
[14]

In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

Kirillov, A., Girshick, R., He, K., Doll´ ar, P.: Panoptic feature pyramid networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6392–6401. IEEE, ??? (2019). https://doi.org/10.1109/cvpr.2019. 00656 .http://dx.doi.org/10.1109/cvpr.2019.00656

work page doi:10.1109/cvpr.2019 2019
[15]

https://arxiv.org/abs/1911.10194 37

Cheng, B., Collins, M.D., Zhu, Y., Liu, T., Huang, T.S., Adam, H., Chen, L.-C.: Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation (2020). https://arxiv.org/abs/1911.10194 37

work page arXiv 2020
[16]

https://arxiv.org/abs/2012

Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking Semantic Segmentation from a Sequence- to-Sequence Perspective with Transformers (2021). https://arxiv.org/abs/2012. 15840

work page 2021
[17]

https://arxiv.org/abs/2105.05633

Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for Semantic Segmentation (2021). https://arxiv.org/abs/2105.05633

work page arXiv 2021
[18]

https://arxiv.org/abs/2105.15203

Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers (2021). https://arxiv.org/abs/2105.15203

work page arXiv 2021
[19]

In: Advances in Neural Information Processing Systems, vol

Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: Advances in Neural Information Processing Systems, vol. 34 (2021).https://arxiv.org/abs/2107.06278

work page arXiv 2021
[20]

https://arxiv.org/ abs/2112.01527

Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention Mask Transformer for Universal Image Segmentation (2022). https://arxiv.org/ abs/2112.01527

work page arXiv 2022
[21]

Emerging Properties in Self-Supervised Vision Transformers

Caron, M., Touvron, H., Misra, I., J´ egou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging Properties in Self-Supervised Vision Transformers (2021). https: //arxiv.org/abs/2104.14294

work page internal anchor Pith review Pith/arXiv arXiv 2021
[22]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual feat...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Liu, Y., Zhu, M., Li, H., Chen, H., Wang, X., Shen, C.: Matcher: Segment anything with one shot using all-purpose feature matching (2023) arXiv:2305.13310 [cs.CV]

work page arXiv 2023
[24]

Zakir, H.M., Ho, E.T.W.: Revealing the semantic selection gap in dinov3 through training-free few-shot segmentation (2026) arXiv:2602.07550 [cs.CV]

work page arXiv 2026
[25]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021) arXiv:2103.00020 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2021
[26]

In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11941–11952. IEEE, ??? (2023). https://doi.org/10.1109/iccv51070. 2023.01100 .http://dx.doi.org/10.1109/iccv51070.2023.01100

work page doi:10.1109/iccv51070 2023
[27]

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., 38 Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., H´ enaff, O., Harm- sen, J., Steiner, A., Zhai, X.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features (2025) arXiv:2502.14786 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Bolya, D., Huang, P.-Y., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H., Wang, J., Monteiro, M., Xu, H., Dong, S., Ravi, N., Li, D., Doll´ ar, P., Feichtenhofer, C.: Perception encoder: The best visual embeddings are not at the output of the network (2025) arXiv:2504.13181 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End- to-End Object Detection with Transformers, pp. 213–229. Springer, ??? (2020). https://doi.org/10.1007/978-3-030-58452-8 13 .http://dx.doi.org/10.1007/978-3- 030-58452-8 13

work page doi:10.1007/978-3-030-58452-8 2020
[30]

In: 2023 IEEE/CVF International Conference on Computer Vision Work- shops (ICCVW), pp

Chen, T., Zhu, L., Ding, C., Cao, R., Wang, Y., Zhang, S., Li, Z., Sun, L., Zang, Y., Mao, P.: Sam-adapter: Adapting segment anything in underperformed scenes. In: 2023 IEEE/CVF International Conference on Computer Vision Work- shops (ICCVW), pp. 3359–3367. IEEE, ??? (2023). https://doi.org/10.1109/ iccvw60793.2023.00361 .http://dx.doi.org/10.1109/iccvw60...

work page doi:10.1109/iccvw60793.2023.00361 2023
[31]

Nature Communications15(1) (2024) https://doi.org/10.1038/ s41467-024-44824-z

Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical images. Nature Communications15(1) (2024) https://doi.org/10.1038/ s41467-024-44824-z

work page 2024
[32]

IEEE Transactions on Geoscience and Remote Sensing62, 1–17 (2024) https://doi.org/10.1109/tgrs.2024.3356074

Chen, K., Liu, C., Chen, H., Zhang, H., Li, W., Zou, Z., Shi, Z.: Rsprompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Transactions on Geoscience and Remote Sensing62, 1–17 (2024) https://doi.org/10.1109/tgrs.2024.3356074

work page doi:10.1109/tgrs.2024.3356074 2024
[33]

Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation (2022) arXiv:2201.03546 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Ghiasi, G., Gu, X., Cui, Y., Lin, T.-Y.: Scaling Open-Vocabulary Image Segmen- tation with Image-Level Labels, pp. 540–557. Springer, ??? (2022). https://doi. org/10.1007/978-3-031-20059-5 31 .http://dx.doi.org/10.1007/978-3-031-20059- 5 31

work page doi:10.1007/978-3-031-20059-5 2022
[35]

In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7061–7070. IEEE, ??? (2023). https://doi.org/10.1109/cvpr52729. 2023.00682 .http://dx.doi.org/10.1109/cvpr52729.2023.00682

work page doi:10.1109/cvpr52729 2023
[36]

Yu, Q., He, J., Deng, X., Shen, X., Chen, L.-C.: Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip (2023) arXiv:2308.02487 [cs.CV] 39

work page arXiv 2023
[37]

Motiondiffuser: Controllable multi-agent motion prediction using diffusion

Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open- vocabulary panoptic segmentation with text-to-image diffusion models. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2955–2966. IEEE, ??? (2023). https://doi.org/10.1109/cvpr52729.2023.00289 .http://dx.doi.org/10.1109/cvpr52729.2023.00289

work page doi:10.1109/cvpr52729.2023.00289 2023
[38]

Wang, F., Mei, J., Yuille, A.: SCLIP: Rethinking Self-Attention for Dense Vision- Language Inference, pp. 315–332. Springer, ??? (2024). https://doi.org/10.1007/ 978-3-031-72664-4 18 .http://dx.doi.org/10.1007/978-3-031-72664-4 18

work page doi:10.1007/978-3-031-72664-4 2024
[39]

Hajimiri, S., Ayed, I.B., Dolz, J.: Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation (2024) arXiv:2404.08181 [cs.CV]

work page arXiv 2024
[41]

Shao, T., Tian, Z., Zhao, H., Su, J.: Explore the potential of clip for training-free open vocabulary semantic segmentation (2024) arXiv:2407.08268 [cs.CV]

work page arXiv 2024
[42]

https://arxiv.org/abs/2411.12044

Aydın, M.A., C ¸ ırpar, E.M., Abdinli, E., Unal, G., Sahin, Y.H.: ITACLIP: Boost- ing Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements (2024). https://arxiv.org/abs/2411.12044

work page arXiv 2024
[43]

Lan, M., Chen, C., Ke, Y., Wang, X., Feng, L., Zhang, W.: Proxy- CLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation, pp. 70–88. Springer, ??? (2024). https://doi.org/10.1007/978-3-031-73113-6 5 . http://dx.doi.org/10.1007/978-3-031-73113-6 5

work page doi:10.1007/978-3-031-73113-6 2024
[44]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

Kim, C., Ju, D., Han, W., Yang, M.-H., Hwang, S.J.: Distilling spectral graph for object-context aware open-vocabulary semantic segmentation. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

work page 2025
[45]

Rocket-1: Mastering open-world interaction with visual-temporal context prompting

Stojni´ c, V., Kalantidis, Y., Matas, J., Tolias, G.: Lposs: Label propagation over patches and pixels for open-vocabulary semantic segmentation. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9794–9803. IEEE, ??? (2025). https://doi.org/10.1109/cvpr52734.2025.00915 .http://dx.doi.org/10.1109/cvpr52734.2025.00915

work page doi:10.1109/cvpr52734.2025.00915 2025
[46]

In: Advances in Neural Information Processing Systems (2025)

Wang, X., Si, C., Yang, X., Zhao, Y., Wang, W., Yang, X., Shen, W.: Opmap- per: Enhancing open-vocabulary semantic segmentation with multi-guidance information. In: Advances in Neural Information Processing Systems (2025)

work page 2025
[47]

In: Advances in Neu- ral Information Processing Systems 36

Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao, J., 40 Lee, Y.J.: Segment everything everywhere all at once. In: Advances in Neu- ral Information Processing Systems 36. NeurIPS 2023, pp. 19769–19782. Neural Information Processing Systems Foundation, Inc. (NeurIPS), ??? (2023). https: //doi.org/10.52202/075280-0868 .http://dx.doi.org...

work page doi:10.52202/075280-0868 2023
[48]

Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., Zhang, L.: Grounded sam: Assembling open-world models for diverse visual tasks (2024) arXiv:2401.14159 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection, pp. 38–55. Springer, ??? (2024). https://doi.org/10.1007/978-3-031-72970-6 3 .http://dx.doi.org/10.1007/978-3- 031-72970-6 3

[50]

Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: LISA: Reasoning segmentation via large language model. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9579–9589. IEEE (2024). https://doi.org/10.1109/cvpr52733.2024.00915

[51]

Sun, Y., Chen, J., Zhang, S., Zhang, X., Chen, Q., Zhang, G., Ding, E., Wang, J., Li, Z.: VRP-SAM: SAM with visual reference prompt. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 23565–23574. IEEE (2024). https://doi.org/10.1109/cvpr52733.2024.02224

[52]

Zhang, R., Jiang, Z., Guo, Z., Yan, S., Pan, J., Ma, X., Dong, H., Gao, P., Li, H.: Personalize segment anything model with one shot (2023) arXiv:2305.03048 [cs.CV]

[53]

Tang, L., Jiang, P.-T., Xiao, H.-K., Li, B.: Towards training-free open-world segmentation via image prompt foundation models (2023) arXiv:2310.10912 [cs.CV]

[54]

Zhang, A., Gao, G., Jiao, J., Liu, C.H., Wei, Y.: Bridge the points: Graph-based few-shot segment anything semantically (2024) arXiv:2410.06964 [cs.CV]

[55]

Barsellotti, L., Amoroso, R., Baraldi, L., Cucchiara, R.: FOSSIL: Free open-vocabulary semantic segmentation through synthetic references retrieval. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1453–1462. IEEE (2024). https://doi.org/10.1109/wacv57701.2024.00149

[56]

Barsellotti, L., Amoroso, R., Cornia, M., Baraldi, L., Cucchiara, R.: Training-free open-vocabulary segmentation with offline diffusion-augmented prototype generation. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3689–3698. IEEE (2024). https://doi.org/10.1109/cvpr52733.2024.00354

[57]

Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., Lewis, M.: Generalization through memorization: Nearest neighbor language models (2020) arXiv:1911.00172 [cs.CL]

[58]

Sheynin, S., Ashual, O., Polyak, A., Singer, U., Gafni, O., Nachmani, E., Taigman, Y.: kNN-Diffusion: Image generation via large-scale retrieval (2022) arXiv:2204.02849 [cs.CV]

[59]

Chen, W., Hu, H., Saharia, C., Cohen, W.W.: Re-Imagen: Retrieval-augmented text-to-image generator (2022) arXiv:2209.14491 [cs.CV]

[60]

Blattmann, A., Müller, J., Oktay, K., Ommer, B., Rombach, R.: Retrieval-augmented diffusion models. In: Advances in Neural Information Processing Systems 35. NeurIPS 2022, pp. 15309–15324. Neural Information Processing Systems Foundation, Inc. (NeurIPS) (2022). https://doi.org/10.52202/068431-1114

[61]

Snell, J., Swersky, K., Zemel, R.S.: Prototypical networks for few-shot learning (2017) arXiv:1703.05175 [cs.LG]

[62]

Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H.S., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. IEEE (2018). https://doi.org/10.1109/cvpr.2018.00131

[63]

Lee, J., Sung, M., Kang, J., Chen, D.: Learning dense representations of phrases at scale (2021) arXiv:2012.12624 [cs.CL]

[64]

Albanie, S., Shin, G., Xie, W.: ReCo: Retrieve and co-segment for zero-shot transfer. In: Advances in Neural Information Processing Systems 35. NeurIPS 2022, pp. 33754–33767. Neural Information Processing Systems Foundation, Inc. (NeurIPS) (2022). https://doi.org/10.52202/068431-2446

[65]

Gui, Z., Sun, S., Li, R., Yuan, J., An, Z., Roth, K., Prabhu, A., Torr, P.: kNN-CLIP: Retrieval enables training-free segmentation on continually expanding large vocabularies (2024) arXiv:2404.09447 [cs.CV]

[66]

Zhao, L., Chen, X., Chen, E.Z., Liu, Y., Chen, T., Sun, S.: Retrieval-augmented few-shot medical image segmentation with foundation models (2024) arXiv:2408.08813 [cs.CV]

[67]

Espinosa, M., Yang, C., Ericsson, L., McDonagh, S., Crowley, E.J.: No time to train! Training-free reference-based instance segmentation (2025) arXiv:2507.02798 [cs.CV]

[68]

Lei, L., Yang, Q., Yang, L., Shen, T., Wang, R., Fu, C.: Deep learning implementation of image segmentation in agricultural applications: a comprehensive review. Artificial Intelligence Review 57(6) (2024). https://doi.org/10.1007/s10462-024-10775-6

[69]

Sa, I., Chen, Z., Popovic, M., Khanna, R., Liebisch, F., Nieto, J., Siegwart, R.: WeedNet: Dense semantic weed classification using multispectral images and MAV for smart farming. IEEE Robotics and Automation Letters 3(1), 588–595 (2018). https://doi.org/10.1109/lra.2017.2774979

[70]

Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation (2017). https://arxiv.org/abs/1511.00561

[71]

Joshi, A., Guevara, D., Earles, M.: Standardizing and centralizing datasets for efficient training of agricultural deep learning models. Plant Phenomics 5, 0084 (2023). https://doi.org/10.34133/plantphenomics.0084

[72]

Li, Y., Wang, D., Yuan, C., Li, H., Hu, J.: Enhancing agricultural image segmentation with an agricultural segment anything model adapter. Sensors 23(18), 7884 (2023). https://doi.org/10.3390/s23187884

[73]

Picon, A., Eguskiza, I., Mugica, D., Romero, J., Jimenez, C.J., White, E., Do-Lago-Junqueira, G., Klukas, C., Navarra-Mestre, R.: Mitigating domain drift in multi species segmentation with DINOv2: A cross-domain evaluation in herbicide research trials (2025) arXiv:2508.07514 [cs.CV]

[74]

Gupta, A., Dollár, P., Girshick, R.: LVIS: A dataset for large vocabulary instance segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5351–5359. IEEE (2019). https://doi.org/10.1109/cvpr.2019.00550

[75]

Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5122–5130. IEEE (2017). https://doi.org/10.1109/cvpr.2017.544

[76]

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. IEEE (2016). https://doi.org/10.1109/cvpr.2016.350

[77]

Mottaghi, R., Chen, X., Liu, X., Cho, N.-G., Lee, S.-W., Fidler, S., Urtasun, R., Yuille, A.: The role of context for object detection and semantic segmentation in the wild. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898. IEEE (2014). https://doi.org/10.1109/cvpr.2014.119

[78]

Tang, L., Jiang, P.-T., Xiao, H., Li, B.: Towards training-free open-world segmentation via image prompt foundation models. International Journal of Computer Vision 133, 1–15 (2025). https://doi.org/10.1007/s11263-024-02185-6

[79]

Bai, S., Liu, Y., Han, Y., Zhang, H., Tang, Y., Zhou, J., Lu, J.: Self-calibrated CLIP for training-free open-vocabulary segmentation. IEEE Transactions on Image Processing 34, 8271–8284 (2025). https://doi.org/10.1109/TIP.2025.3639996

[80]

Shi, Y., Dong, M., Xu, C.: Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025). https://arxiv.org/abs/2411.09219

[81]

Zhang, D., Liu, F., Tang, Q.: CorrCLIP: Reconstructing patch correlations in CLIP for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025). https://arxiv.org/abs/2411.10086

Fig. 2 Effect of ICCD filtering on two example classes (top: small clustered flowers; bottom: rice leaves). Blu...

[1]

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., Dollár, P., Girshick, R.: Segment anything. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3992–4003. IEEE (2023). https://doi.org/10.1109/iccv51070.2023.00371

[2]

Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.-Y., Girshick, R., Dollár, P., Feichtenhofer, C.: SAM 2: Segment anything in images and videos (2024) arXiv:2408.00714 [cs.CV]

[3]

Carion, N., Gustafson, L., Hu, Y.-T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.-H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H....

[4]

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks (2020) arXiv:2005.11401 [cs.CL]

[5]

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3 (2025) arXiv:...

[6]

Long, J., Shelhamer, E., Darrell, T.: Fully Convolutional Networks for Semantic Segmentation (2015). https://arxiv.org/abs/1411.4038

[7]

Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation (2015). https://arxiv.org/abs/1505.04597

[8]

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs (2015). https://arxiv.org/abs/1412.7062

[9]

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs (2018). https://arxiv.org/abs/1606.00915

[10]

Chen, L.-C., Papandreou, G., Schroff, F., Adam, H.: Rethinking Atrous Convolution for Semantic Image Segmentation (2017). https://arxiv.org/abs/1706.05587

[11]

Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (2018). https://arxiv.org/abs/1802.02611

[12]

He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN (2017). https://arxiv.org/abs/1703.06870

[13]

Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic Segmentation (2019). https://arxiv.org/abs/1801.00868

[14]

Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic feature pyramid networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6392–6401. IEEE (2019). https://doi.org/10.1109/cvpr.2019.00656

[15]

Cheng, B., Collins, M.D., Zhu, Y., Liu, T., Huang, T.S., Adam, H., Chen, L.-C.: Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation (2020). https://arxiv.org/abs/1911.10194

[16]

Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers (2021). https://arxiv.org/abs/2012.15840

[17]

Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for Semantic Segmentation (2021). https://arxiv.org/abs/2105.05633

[18]

Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers (2021). https://arxiv.org/abs/2105.15203

[19]

Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: Advances in Neural Information Processing Systems, vol. 34 (2021). https://arxiv.org/abs/2107.06278

[20]

Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention Mask Transformer for Universal Image Segmentation (2022). https://arxiv.org/abs/2112.01527

[21]

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging Properties in Self-Supervised Vision Transformers (2021). https://arxiv.org/abs/2104.14294

[22]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual feat...

[23]

Liu, Y., Zhu, M., Li, H., Chen, H., Wang, X., Shen, C.: Matcher: Segment anything with one shot using all-purpose feature matching (2023) arXiv:2305.13310 [cs.CV]

[24]

Zakir, H.M., Ho, E.T.W.: Revealing the semantic selection gap in DINOv3 through training-free few-shot segmentation (2026) arXiv:2602.07550 [cs.CV]

[25]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021) arXiv:2103.00020 [cs.CV]

[26]

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11941–11952. IEEE (2023). https://doi.org/10.1109/iccv51070.2023.01100

[27]

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O., Harmsen, J., Steiner, A., Zhai, X.: SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features (2025) arXiv:2502.14786 [cs.CV]

[28]

Bolya, D., Huang, P.-Y., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H., Wang, J., Monteiro, M., Xu, H., Dong, S., Ravi, N., Li, D., Dollár, P., Feichtenhofer, C.: Perception encoder: The best visual embeddings are not at the output of the network (2025) arXiv:2504.13181 [cs.CV]

[29]

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-End Object Detection with Transformers, pp. 213–229. Springer (2020). https://doi.org/10.1007/978-3-030-58452-8_13

[30]

Chen, T., Zhu, L., Ding, C., Cao, R., Wang, Y., Zhang, S., Li, Z., Sun, L., Zang, Y., Mao, P.: SAM-Adapter: Adapting segment anything in underperformed scenes. In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 3359–3367. IEEE (2023). https://doi.org/10.1109/iccvw60793.2023.00361

[31]

Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical images. Nature Communications 15(1) (2024). https://doi.org/10.1038/s41467-024-44824-z

[32]

Chen, K., Liu, C., Chen, H., Zhang, H., Li, W., Zou, Z., Shi, Z.: RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Transactions on Geoscience and Remote Sensing 62, 1–17 (2024). https://doi.org/10.1109/tgrs.2024.3356074

[33]

Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation (2022) arXiv:2201.03546 [cs.CV]

[34]

Ghiasi, G., Gu, X., Cui, Y., Lin, T.-Y.: Scaling Open-Vocabulary Image Segmentation with Image-Level Labels, pp. 540–557. Springer (2022). https://doi.org/10.1007/978-3-031-20059-5_31

[35]

Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted CLIP. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7061–7070. IEEE (2023). https://doi.org/10.1109/cvpr52729.2023.00682

[36]

Yu, Q., He, J., Deng, X., Shen, X., Chen, L.-C.: Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional CLIP (2023) arXiv:2308.02487 [cs.CV]

[37]

Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open-vocabulary panoptic segmentation with text-to-image diffusion models. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2955–2966. IEEE (2023). https://doi.org/10.1109/cvpr52729.2023.00289

[38]

Wang, F., Mei, J., Yuille, A.: SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference, pp. 315–332. Springer (2024). https://doi.org/10.1007/978-3-031-72664-4_18

[39]

Hajimiri, S., Ayed, I.B., Dolz, J.: Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation (2024) arXiv:2404.08181 [cs.CV]

[41]

Shao, T., Tian, Z., Zhao, H., Su, J.: Explore the potential of CLIP for training-free open vocabulary semantic segmentation (2024) arXiv:2407.08268 [cs.CV]

[42]

Aydın, M.A., Çırpar, E.M., Abdinli, E., Unal, G., Sahin, Y.H.: ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements (2024). https://arxiv.org/abs/2411.12044

[43]

Lan, M., Chen, C., Ke, Y., Wang, X., Feng, L., Zhang, W.: ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation, pp. 70–88. Springer (2024). https://doi.org/10.1007/978-3-031-73113-6_5

work page doi:10.1007/978-3-031-73113-6 2024

[43] [44]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

Kim, C., Ju, D., Han, W., Yang, M.-H., Hwang, S.J.: Distilling spectral graph for object-context aware open-vocabulary semantic segmentation. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

work page 2025

[44] [45]

Rocket-1: Mastering open-world interaction with visual-temporal context prompting

Stojni´ c, V., Kalantidis, Y., Matas, J., Tolias, G.: Lposs: Label propagation over patches and pixels for open-vocabulary semantic segmentation. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9794–9803. IEEE, ??? (2025). https://doi.org/10.1109/cvpr52734.2025.00915 .http://dx.doi.org/10.1109/cvpr52734.2025.00915

work page doi:10.1109/cvpr52734.2025.00915 2025

[45] [46]

In: Advances in Neural Information Processing Systems (2025)

Wang, X., Si, C., Yang, X., Zhao, Y., Wang, W., Yang, X., Shen, W.: Opmap- per: Enhancing open-vocabulary semantic segmentation with multi-guidance information. In: Advances in Neural Information Processing Systems (2025)

work page 2025

[46] [47]

In: Advances in Neu- ral Information Processing Systems 36

Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao, J., 40 Lee, Y.J.: Segment everything everywhere all at once. In: Advances in Neu- ral Information Processing Systems 36. NeurIPS 2023, pp. 19769–19782. Neural Information Processing Systems Foundation, Inc. (NeurIPS), ??? (2023). https: //doi.org/10.52202/075280-0868 .http://dx.doi.org...

work page doi:10.52202/075280-0868 2023

[47] [48]

Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., Zhang, L.: Grounded sam: Assembling open-world models for diverse visual tasks (2024) arXiv:2401.14159 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2024

[48] [49]

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection, pp. 38–55. Springer, ??? (2024). https://doi.org/10.1007/978-3-031-72970-6 3 .http://dx.doi.org/10.1007/978-3- 031-72970-6 3

work page doi:10.1007/978-3-031-72970-6 2024

[49] [50]

Mark Weber, Jun Xie, Maxwell D

Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9579–9589. IEEE, ??? (2024). https://doi.org/10.1109/cvpr52733.2024.00915 . http://dx.doi.org/10.1109/cvpr52733.2024.00915

work page doi:10.1109/cvpr52733.2024.00915 2024

[50] [51]

Mark Weber, Jun Xie, Maxwell D

Sun, Y., Chen, J., Zhang, S., Zhang, X., Chen, Q., Zhang, G., Ding, E., Wang, J., Li, Z.: Vrp-sam: Sam with visual reference prompt. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 23565–23574. IEEE, ??? (2024). https://doi.org/10.1109/cvpr52733.2024. 02224 .http://dx.doi.org/10.1109/cvpr52733.2024.02224

work page doi:10.1109/cvpr52733.2024 2024

[51] [52]

Zhang, R., Jiang, Z., Guo, Z., Yan, S., Pan, J., Ma, X., Dong, H., Gao, P., Li, H.: Personalize segment anything model with one shot (2023) arXiv:2305.03048 [cs.CV]

work page arXiv 2023

[52] [53]

Tang, L., Jiang, P.-T., Xiao, H.-K., Li, B.: Towards training-free open-world segmentation via image prompt foundation models (2023) arXiv:2310.10912 [cs.CV]

work page arXiv 2023

[53] [54]

Zhang, A., Gao, G., Jiao, J., Liu, C.H., Wei, Y.: Bridge the points: Graph-based few-shot segment anything semantically (2024) arXiv:2410.06964 [cs.CV]

work page arXiv 2024

[54] [55]

In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp

Barsellotti, L., Amoroso, R., Baraldi, L., Cucchiara, R.: Fossil: Free open- vocabulary semantic segmentation through synthetic references retrieval. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1453–1462. IEEE, ??? (2024). https://doi.org/10.1109/wacv57701.2024.00149 .http://dx.doi.org/10.1109/wacv57701.2024.00149

work page doi:10.1109/wacv57701.2024.00149 2024

[55] [56]

Mark Weber, Jun Xie, Maxwell D

Barsellotti, L., Amoroso, R., Cornia, M., Baraldi, L., Cucchiara, R.: Training- free open-vocabulary segmentation with offline diffusion-augmented prototype 41 generation. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3689–3698. IEEE, ??? (2024). https://doi.org/10.1109/ cvpr52733.2024.00354 .http://dx.doi.org/10.1109...

work page doi:10.1109/cvpr52733.2024.00354 2024

[56] [57]

Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., Lewis, M.: Gen- eralization through memorization: Nearest neighbor language models (2020) arXiv:1911.00172 [cs.CL]

work page arXiv 2020

[57] [58]

Sheynin, S., Ashual, O., Polyak, A., Singer, U., Gafni, O., Nachmani, E., Taigman, Y.: Knn-diffusion: Image generation via large-scale retrieval (2022) arXiv:2204.02849 [cs.CV]

work page arXiv 2022

[58] [59]

Chen, W., Hu, H., Saharia, C., Cohen, W.W.: Re-imagen: Retrieval-augmented text-to-image generator (2022) arXiv:2209.14491 [cs.CV]

work page arXiv 2022

[59] [60]

In: Advances in Neural Information Processing Sys- tems 35

Blattmann, A., M¨ uller, J., Oktay, K., Ommer, B., Rombach, R.: Retrieval- augmented diffusion models. In: Advances in Neural Information Processing Sys- tems 35. NeurIPS 2022, pp. 15309–15324. Neural Information Processing Systems Foundation, Inc. (NeurIPS), ??? (2022). https://doi.org/10.52202/068431-1114 . http://dx.doi.org/10.52202/068431-1114

work page doi:10.52202/068431-1114 2022

[60] [61]

Snell, J., Swersky, K., Zemel, R.S.: Prototypical networks for few-shot learning (2017) arXiv:1703.05175 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2017

[61] [62]

In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H.S., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. IEEE, ??? (2018). https://doi.org/10.1109/cvpr.2018.00131 . http://dx.doi.org/10.1109/cvpr.2018.00131

work page doi:10.1109/cvpr.2018.00131 2018

[62] [63]

Lee, J., Sung, M., Kang, J., Chen, D.: Learning dense representations of phrases at scale (2021) arXiv:2012.12624 [cs.CL]

work page arXiv 2021

[63] [64]

In: Advances in Neural Information Processing Systems 35

Albanie, S., Shin, G., Xie, W.: Reco: Retrieve and co-segment for zero- shot transfer. In: Advances in Neural Information Processing Systems 35. NeurIPS 2022, pp. 33754–33767. Neural Information Processing Systems Foun- dation, Inc. (NeurIPS), ??? (2022). https://doi.org/10.52202/068431-2446 . http://dx.doi.org/10.52202/068431-2446

work page doi:10.52202/068431-2446 2022

[64] [65]

Gui, Z., Sun, S., Li, R., Yuan, J., An, Z., Roth, K., Prabhu, A., Torr, P.: knn- clip: Retrieval enables training-free segmentation on continually expanding large vocabularies (2024) arXiv:2404.09447 [cs.CV]

work page arXiv 2024

[65] [66]

Zhao, L., Chen, X., Chen, E.Z., Liu, Y., Chen, T., Sun, S.: Retrieval- augmented few-shot medical image segmentation with foundation models (2024) arXiv:2408.08813 [cs.CV] 42

work page arXiv 2024

[66] [67]

Espinosa, M., Yang, C., Ericsson, L., McDonagh, S., Crowley, E.J.: No time to train! training-free reference-based instance segmentation (2025) arXiv:2507.02798 [cs.CV]

work page arXiv 2025

[67] [68]


Lei, L., Yang, Q., Yang, L., Shen, T., Wang, R., Fu, C.: Deep learning implementation of image segmentation in agricultural applications: a comprehensive review. Artificial Intelligence Review 57(6) (2024) https://doi.org/10.1007/s10462-024-10775-6


[68] [69]


Sa, I., Chen, Z., Popovic, M., Khanna, R., Liebisch, F., Nieto, J., Siegwart, R.: weednet: Dense semantic weed classification using multispectral images and mav for smart farming. IEEE Robotics and Automation Letters 3(1), 588–595 (2018) https://doi.org/10.1109/lra.2017.2774979


[69] [70]


Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation (2017). https://arxiv.org/abs/1511.00561


[70] [71]


Joshi, A., Guevara, D., Earles, M.: Standardizing and centralizing datasets for efficient training of agricultural deep learning models. Plant Phenomics 5, 0084 (2023) https://doi.org/10.34133/plantphenomics.0084


[71] [72]


Li, Y., Wang, D., Yuan, C., Li, H., Hu, J.: Enhancing agricultural image segmentation with an agricultural segment anything model adapter. Sensors 23(18), 7884 (2023) https://doi.org/10.3390/s23187884


[72] [73]

Picon, A., Eguskiza, I., Mugica, D., Romero, J., Jimenez, C.J., White, E., Do-Lago-Junqueira, G., Klukas, C., Navarra-Mestre, R.: Mitigating domain drift in multi species segmentation with dinov2: A cross-domain evaluation in herbicide research trials (2025) arXiv:2508.07514 [cs.CV]


[73] [74]


Gupta, A., Dollar, P., Girshick, R.: Lvis: A dataset for large vocabulary instance segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5351–5359. IEEE (2019). https://doi.org/10.1109/cvpr.2019.00550


[74] [75]


Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5122–5130. IEEE (2017). https://doi.org/10.1109/cvpr.2017.544


[75] [76]


Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. IEEE (2016). https://doi.org/10.1109/cvpr.2016.350


[76] [77]


Mottaghi, R., Chen, X., Liu, X., Cho, N.-G., Lee, S.-W., Fidler, S., Urtasun, R., Yuille, A.: The role of context for object detection and semantic segmentation in the wild. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898. IEEE (2014). https://doi.org/10.1109/cvpr.2014.119


[77] [78]


Tang, L., Jiang, P.-T., Xiao, H., Li, B.: Towards training-free open-world segmentation via image prompt foundation models. International Journal of Computer Vision 133, 1–15 (2025) https://doi.org/10.1007/s11263-024-02185-6


[78] [79]


Bai, S., Liu, Y., Han, Y., Zhang, H., Tang, Y., Zhou, J., Lu, J.: Self-calibrated clip for training-free open-vocabulary segmentation. IEEE Transactions on Image Processing 34, 8271–8284 (2025) https://doi.org/10.1109/TIP.2025.3639996


[79] [80]


Shi, Y., Dong, M., Xu, C.: Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025). https://arxiv.org/abs/2411.09219


[80] [81]


Zhang, D., Liu, F., Tang, Q.: Corrclip: Reconstructing patch correlations in clip for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025). https://arxiv.org/abs/2411.10086

Fig. 2 Effect of ICCD filtering on two example classes (top: small clustered flowers; bottom: rice leaves). Blu...
