SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation

Abderrahmene Boudiaf; Irfan Hussain; Sajid Javed

arxiv: 2605.17630 · v1 · pith:JGV5TOTTnew · submitted 2026-05-17 · 💻 cs.CV

Pith reviewed 2026-05-20 13:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-vocabulary semantic segmentation · retrieval-augmented segmentation · point prompting · SAM3 · DINO features · zero-shot domain transfer · agricultural image segmentation · training-free method

The pith

SegRAG supplies SAM3 with point prompts retrieved from a distilled DINOv3 feature bank to resolve classes that text prompts alone miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that open-vocabulary segmentation models lose accuracy when target classes appear rarely or look different from pretraining examples, because text prompts supply no spatial information to disambiguate. SegRAG addresses this by building an offline bank of patch features from a small set of annotated reference images using a frozen DINOv3 backbone, then applying Intra-Class Cohesion Distillation to keep only the prototypes that reliably match within-class foreground. At inference the method computes a similarity landscape on the query image, extracts spatially coherent peak locations, and passes those points together with the class name to SAM3 in one joint grounding step. This produces large accuracy lifts on standard open-vocabulary benchmarks and especially on agricultural images drawn from entirely new domains.
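The similarity landscape described above can be illustrated with a minimal sketch. It assumes each query patch is scored by its best cosine similarity to any retained prototype; that aggregation rule, the helper names `cosine` and `similarity_landscape`, and the toy 2-D vectors standing in for DINOv3 descriptors are all assumptions for illustration, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_landscape(query_grid, prototypes):
    """For each patch in an H x W grid of feature vectors, keep the best
    cosine similarity against any prototype (assumed aggregation rule)."""
    return [[max(cosine(patch, p) for p in prototypes) for patch in row]
            for row in query_grid]

# Toy 2 x 3 grid of 2-D vectors standing in for patch descriptors.
query_grid = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]],
]
prototypes = [[1.0, 0.0]]  # a single retained class prototype
landscape = similarity_landscape(query_grid, prototypes)
```

Patches aligned with the prototype score near 1, orthogonal patches score near 0, and opposite patches score near -1, which is what lets later steps treat the map as a surface with peaks.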

Core claim

By retaining only intra-class cohesive prototypes from DINOv3 features of reference images and locating their topographic matches on query images via connected-component analysis and non-maximum suppression, SegRAG generates class-specific point prompts that SAM3 can use alongside text to produce accurate masks without any task-specific training or synthetic data.

What carries the argument

Topographic Similarity Grounding (TSG) that turns cosine-similarity maps between query patches and retained prototypes into spatially coherent point prompts for the SAM3 mask decoder.
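As a rough illustration of the "spatially coherent" part, the sketch below thresholds a similarity map and groups the surviving cells into 4-connected regions, discarding regions below a minimum area. The function name `coherent_regions`, the threshold of 0.5, the area cutoff of 2 and the choice of 4-connectivity are hypothetical; the paper's actual settings are not given in this summary.

```python
from collections import deque

def coherent_regions(sim_map, threshold=0.5, min_area=2):
    """Return 4-connected regions of cells with similarity >= threshold,
    dropping regions smaller than min_area. Each region is a list of (row, col)."""
    h, w = len(sim_map), len(sim_map[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or sim_map[r][c] < threshold:
                continue
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:  # breadth-first flood fill from the seed cell
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and sim_map[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) >= min_area:
                regions.append(region)
    return regions

sim_map = [
    [0.9, 0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1, 0.6],
    [0.1, 0.1, 0.1, 0.1],
]
regions = coherent_regions(sim_map)
```

In this toy map the three high-scoring cells in the top-left corner form one region, while the isolated 0.6 cell is rejected by the area filter; an isolated spurious match would be dropped the same way before any point prompt is produced.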

If this is right

On LVIS the method improves mean IoU by up to 3.92 points over the SAM3 text-only baseline.
On AgML zero-shot domain-transfer benchmarks mean IoU rises from 25.27 to 59.24, with some classes recovering from 0 to above 95 IoU.
Ablation results show that Intra-Class Cohesion Distillation, Topographic Similarity Grounding, and joint text-plus-point prompting each contribute measurable gains that add when combined.
The approach works across four open-vocabulary benchmarks while requiring no additional training or data synthesis.
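For readers less familiar with the metric behind these numbers, here is a minimal sketch of mean IoU on flat label lists. It averages per-class intersection over union across classes present in either mask; the benchmarks' exact evaluation protocol (for example, accumulating counts over a whole dataset) is not stated in this summary, so this is only the textbook form, and `mean_iou` is an illustrative name.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth.
    pred and target are equal-length flat lists of integer class labels."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

pred   = [0, 0, 1, 1, 1, 0]
target = [0, 0, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)  # class IoUs are 3/4 and 2/3
```

Scores are usually reported on a 0 to 100 scale, so the AgML change quoted above is 59.24 - 25.27 = 33.97 points.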

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval-plus-point-prompt pattern could be applied to other promptable segmentation or detection models to improve their handling of rare visual concepts.
A small, carefully filtered reference set appears sufficient to bridge large domain gaps, suggesting that retrieval augmentation may reduce reliance on large-scale fine-tuning for adaptation.
Extending the feature bank with new reference images at test time could support continual adaptation without retraining the underlying vision backbone.

Load-bearing premise

Prototypes kept after Intra-Class Cohesion Distillation on reference images will still produce high-confidence, spatially coherent matches when the query images come from a visually distant new domain.

What would settle it

Run SegRAG on a held-out agricultural test set whose reference images are drawn from a different crop type and lighting condition than the query set, and observe that mean IoU remains at the text-only SAM3 baseline level.

Original abstract

Open-vocabulary segmentation models such as SAM3 achieve strong performance through concept-level text prompting, yet degrade when the target class is visually underrepresented in pretraining data or when its appearance departs from canonical depictions. Text prompts provide no spatial signal to resolve such ambiguity. We present SegRAG, a training-free retrieval-augmented segmentation framework that grounds SAM3 with spatially precise, class-specific point prompts derived from a curated DINOv3 feature bank. During an offline stage, patch-level descriptors are extracted from annotated reference images using a frozen DINOv3 ViT-L/16 backbone and filtered by Intra-Class Cohesion Distillation (ICCD), retaining only prototypes that reliably retrieve within-class foreground. At inference, Topographic Similarity Grounding (TSG) computes a cosine-similarity landscape between the query image and retrieved prototypes, identifies spatially coherent high-confidence regions via connected-component analysis, and extracts peak locations through non-maximum suppression. These point prompts are delivered to SAM3 alongside the class-name text in a single joint grounding pass, enabling the mask decoder to resolve semantic intent and spatial evidence together. SegRAG requires no task-specific training and no synthetic data. On four open-vocabulary benchmarks it achieves consistent gains over the SAM3 text-only baseline, with improvements of up to +3.92 mIoU on LVIS. On AgML agricultural benchmarks representing a zero-shot domain transfer setting, it raises mean IoU from 25.27 to 59.24 (+33.97) and recovers individual classes from zero to over 95 mIoU. Ablation studies confirm that ICCD, TSG, and joint prompting each contribute independently and compound when combined. Code is available at https://github.com/boudiafA/SegRAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SegRAG adds a retrieval step that pulls DINOv3 prototypes from references, filters them, and feeds point prompts to SAM3, with the biggest reported lift on agricultural zero-shot transfer.

SegRAG's core move is to build an offline bank of DINOv3 patch features from annotated references, keep only the ones that stick together inside each class, then at test time match a query image against that bank to locate coherent regions and turn the peaks into point prompts for SAM3. The joint text-plus-points pass is what lets the mask decoder resolve both semantics and location at once. That pipeline is the actual new piece; prior work has used DINO features or SAM prompting separately, but not this exact offline filtering plus topographic grounding combination.

The paper shows small steady gains on LVIS and similar sets, plus the much larger jump on AgML where mean IoU moves from 25 to 59 and some classes go from zero to over 95. Ablations indicate each component adds something when used alone and more when stacked. Code is released, which helps anyone who wants to test the claims directly.

The main soft spot is the domain-shift assumption. The method needs the cosine-similarity maps between query patches and the filtered source prototypes to stay peaked and connected after the appearance change. Without reported numbers on how much similarity drops, or examples of diffuse maps that still produce usable points, it is hard to know how far the approach travels beyond the tested cases. If the full paper has those diagnostics or failure-case breakdowns, the central claim strengthens; if not, the AgML numbers remain the part that most needs independent checking.

This is for vision researchers or practitioners who need open-vocabulary masks in narrow domains without fine-tuning or new labeled data. Someone working on robotics or environmental monitoring would get immediate value from trying the pipeline. The work shows clear thinking about how to combine existing frozen models and deserves a serious referee to verify the numbers and the robustness checks.

Referee Report

2 major / 3 minor

Summary. The paper presents SegRAG, a training-free retrieval-augmented framework for open-vocabulary semantic segmentation. It extracts patch-level descriptors from annotated reference images using a frozen DINOv3 ViT-L/16 backbone, filters them via Intra-Class Cohesion Distillation (ICCD) to retain only intra-class cohesive prototypes, and at inference applies Topographic Similarity Grounding (TSG) to compute cosine-similarity landscapes on query images, identify coherent high-confidence regions via connected-component analysis, and extract point prompts via non-maximum suppression. These prompts are supplied jointly with class-name text to SAM3. The manuscript reports consistent gains over the SAM3 text-only baseline on four open-vocabulary benchmarks (up to +3.92 mIoU on LVIS) and large improvements on AgML agricultural benchmarks under zero-shot domain transfer (+33.97 mIoU from 25.27 to 59.24, with individual classes recovering from 0 to >95 mIoU). Ablations indicate that ICCD, TSG, and joint prompting each contribute independently.

Significance. If the reported gains are reproducible, the work would offer a practical advance for handling visual domain shifts in open-vocabulary segmentation without task-specific training or synthetic data. The explicit component ablations and public code release strengthen the contribution. The large AgML improvements, if mechanistically verified, would highlight the value of curated reference prototypes for agricultural applications.

major comments (2)

[§4] §4 (AgML experiments): The central claim of a +33.97 mIoU gain (25.27 → 59.24) and per-class recovery to >95 mIoU under zero-shot domain transfer rests on the assumption that TSG yields spatially coherent high-confidence matches after domain shift. The manuscript provides no quantitative measure of prototype-query cosine-similarity drop, no failure-case analysis, and no ablation isolating the domain gap (e.g., same-domain versus shifted-domain references). This is load-bearing for the zero-shot transfer narrative.
[§3.2] §3.2 (ICCD description): The filtering criterion for retaining prototypes after Intra-Class Cohesion Distillation is stated at a high level but lacks the precise similarity threshold, number of retained prototypes per class, or validation metric used on the reference set. Without these details the reproducibility of the prototype bank—and therefore the downstream TSG performance—cannot be assessed.
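The similarity-drop measurement requested in the first major comment could be sketched roughly as follows: average each query patch's best prototype similarity, then compare a same-domain set with a shifted-domain set. The helper `mean_peak_similarity`, the statistic itself and the toy vectors are assumptions for illustration; the manuscript's own diagnostic, if any, may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_peak_similarity(query_patches, prototypes):
    """Average, over query patches, of the best cosine similarity to any prototype."""
    best = [max(cosine(q, p) for p in prototypes) for q in query_patches]
    return sum(best) / len(best)

prototypes = [[1.0, 0.0], [0.0, 1.0]]          # toy prototype bank
same_domain = [[1.0, 0.0], [0.0, 2.0]]         # patches aligned with the bank
shifted_domain = [[1.0, 1.0], [1.0, -1.0]]     # patches rotated away from it

drop = (mean_peak_similarity(same_domain, prototypes)
        - mean_peak_similarity(shifted_domain, prototypes))
```

A drop near zero would suggest the prototypes transfer cleanly; a large drop paired with high mIoU would suggest that the region and peak extraction steps, rather than raw similarity, are doing the work.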

minor comments (3)

[Abstract] Abstract: The four open-vocabulary benchmarks are not named; explicitly listing them (e.g., LVIS, ADE20K, etc.) would improve immediate clarity.
[§4] §4 and tables: No error bars, standard deviations, or number of runs are reported for the mIoU figures. Adding these would allow readers to gauge result stability.
[§3.3] §3.3 (TSG): The connected-component analysis and NMS parameters (e.g., area threshold, suppression radius) are not specified numerically, which hinders exact replication of the point-prompt extraction step.
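To make concrete which parameters the last minor comment is asking for, here is a minimal greedy non-maximum suppression over a 2-D score map. The suppression radius, minimum score, point cap, the Chebyshev distance and the name `nms_peaks` are all hypothetical placeholders, since the paper's values are precisely what the comment says are missing.

```python
def nms_peaks(sim_map, radius=1, min_score=0.5, max_points=5):
    """Greedy non-maximum suppression on a 2-D score map.
    Visit cells from highest to lowest score and keep a cell only if no
    already-kept peak lies within `radius` (Chebyshev distance) of it."""
    candidates = sorted(
        ((score, r, c) for r, row in enumerate(sim_map)
         for c, score in enumerate(row) if score >= min_score),
        reverse=True,
    )
    peaks = []
    for _score, r, c in candidates:
        if all(max(abs(r - pr), abs(c - pc)) > radius for pr, pc in peaks):
            peaks.append((r, c))
            if len(peaks) == max_points:
                break
    return peaks

sim_map = [
    [0.9, 0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1, 0.6],
    [0.1, 0.1, 0.1, 0.95],
]
peaks = nms_peaks(sim_map)  # (row, col) patch indices, strongest first
```

Here the 0.95 and 0.9 cells survive and their weaker neighbours are suppressed. Turning patch indices into pixel coordinates for a point prompt would additionally depend on the patch size (16 for a ViT-L/16 backbone) and the input resolution, neither of which is pinned down in this summary.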

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve reproducibility and strengthen the supporting evidence for our claims.

Referee: [§4] §4 (AgML experiments): The central claim of a +33.97 mIoU gain (25.27 → 59.24) and per-class recovery to >95 mIoU under zero-shot domain transfer rests on the assumption that TSG yields spatially coherent high-confidence matches after domain shift. The manuscript provides no quantitative measure of prototype-query cosine-similarity drop, no failure-case analysis, and no ablation isolating the domain gap (e.g., same-domain versus shifted-domain references). This is load-bearing for the zero-shot transfer narrative.

Authors: We agree that direct evidence of TSG behavior under domain shift would better support the zero-shot transfer narrative. In the revised manuscript we have added: (i) a quantitative comparison of mean prototype-query cosine similarity on same-domain versus shifted-domain references, (ii) selected failure-case visualizations showing when connected-component analysis yields fragmented regions, and (iii) an ablation that substitutes same-domain reference prototypes for the shifted-domain ones used in the main AgML experiments. These additions confirm a measurable similarity drop yet show that TSG still recovers sufficiently coherent regions to produce the reported gains. revision: yes
Referee: [§3.2] §3.2 (ICCD description): The filtering criterion for retaining prototypes after Intra-Class Cohesion Distillation is stated at a high level but lacks the precise similarity threshold, number of retained prototypes per class, or validation metric used on the reference set. Without these details the reproducibility of the prototype bank—and therefore the downstream TSG performance—cannot be assessed.

Authors: We acknowledge that the original description was insufficiently precise. Section 3.2 has been expanded to state that prototypes are retained when their intra-class cohesion score exceeds a cosine-similarity threshold of 0.75, that we keep the top 100 prototypes per class, and that the selection is validated by measuring average intra-class retrieval precision on a 10% held-out subset of the reference images. These concrete parameters and the validation procedure are now reported explicitly. revision: yes
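A filter with the two parameters quoted in this simulated response (threshold 0.75, top 100 per class) might look like the sketch below. The cohesion score is not defined in this summary, so the code substitutes a simple stand-in: cosine similarity of each candidate to the centroid of its class. That stand-in, the function name `filter_prototypes` and the toy vectors are assumptions and should not be read as the authors' ICCD criterion.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_prototypes(candidates, threshold=0.75, top_k=100):
    """Return indices of candidates whose cohesion score exceeds `threshold`,
    best first, capped at `top_k`. Cohesion is approximated here by cosine
    similarity to the class centroid (an assumed stand-in definition)."""
    dim = len(candidates[0])
    centroid = [sum(f[d] for f in candidates) / len(candidates) for d in range(dim)]
    scored = [(cosine(f, centroid), i) for i, f in enumerate(candidates)]
    kept = sorted((s, i) for s, i in scored if s > threshold)
    return [i for _s, i in reversed(kept)][:top_k]

# Three mutually similar foreground patches and one outlier from the same class.
class_patches = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
kept = filter_prototypes(class_patches)
```

In the toy data the outlier scores about 0.43 against the centroid and is dropped, while the three cohesive patches are kept. A retrieval-based score, closer to the "reliably retrieve within-class foreground" wording, would replace the centroid comparison without changing the threshold-then-top-k structure.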

Circularity Check

0 steps flagged

No circularity: method is self-contained via external frozen models and reference curation

The SegRAG framework extracts patch descriptors from annotated references using a frozen external DINOv3 backbone, applies heuristic ICCD filtering to retain intra-class prototypes, then at inference computes cosine-similarity landscapes with TSG and feeds NMS points to SAM3. No equations, fitted parameters, or predictions within the paper reduce to quantities defined by the method itself; all core components are independent of the target query domain and rely on pre-existing models plus curated references. Ablations and benchmark gains are empirical observations, not tautological derivations. This matches the default expectation of a non-circular empirical method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that DINOv3 patch descriptors after ICCD filtering generalize across domains; no free parameters, new entities, or additional axioms are stated in the abstract.

axioms (1)

domain assumption: DINOv3 ViT-L/16 patch descriptors, after Intra-Class Cohesion Distillation filtering, yield prototypes that reliably retrieve within-class foreground on unseen query images.
Invoked in the offline stage to retain prototypes and in the inference stage to compute cosine-similarity landscapes.

pith-pipeline@v0.9.0 · 5872 in / 1250 out tokens · 42541 ms · 2026-05-20T13:46:35.913868+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Topographic Similarity Grounding (TSG) computes a dense cosine-similarity landscape ... identifies spatially coherent high-confidence regions via connected-component analysis, and extracts representative peak locations through non-maximum suppression.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: Intra-Class Cohesion Distillation (ICCD) ... retaining only prototypes that reliably retrieve within-class foreground across held-out reference images.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 21 internal anchors

[1]

2023 , url =

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, 36 T., Whitehead, S., Berg, A.C., Lo, W.-Y., Doll´ ar, P., Girshick, R.: Segment any- thing. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3992–4003. IEEE, ??? (2023). https://doi.org/10.1109/iccv51070.2023.00371 .http://dx.doi.org/10.1109/iccv51070....

work page doi:10.1109/iccv51070.2023.00371 2023
[2]

Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., R¨ adle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.-Y., Girshick, R., Doll´ ar, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos (2024) arXiv:2408.00714 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Carion, N., Gustafson, L., Hu, Y.-T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., R¨ adle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.-H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H....

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K¨ uttler, H., Lewis, M., Yih, W.-t., Rockt¨ aschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive nlp tasks (2020) arXiv:2005.11401 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2020
[5]

Sim´ eoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., J´ egou, H., Labatut, P., Bojanowski, P.: Dinov3 (2025) arXiv:...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Zakir, H.M., Ho, E.T.W.: Revealing the semantic selection gap in dinov3 through training-free few-shot segmentation (2026) arXiv:2602.07550 [cs.CV]

work page arXiv 2026
[7]

Fully Convolutional Networks for Semantic Segmentation

Long, J., Shelhamer, E., Darrell, T.: Fully Convolutional Networks for Semantic Segmentation (2015). https://arxiv.org/abs/1411.4038

work page internal anchor Pith review Pith/arXiv arXiv 2015
[8]

U-Net: Convolutional Networks for Biomedical Image Segmentation

Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation (2015). https://arxiv.org/abs/1505.04597

work page internal anchor Pith review Pith/arXiv arXiv 2015
[9]

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs (2015). https://arxiv.org/abs/1412.7062

work page internal anchor Pith review Pith/arXiv arXiv 2015
[10]

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolu- tion, and Fully Connected CRFs (2018). https://arxiv.org/abs/1606.00915

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

https://arxiv.org/abs/1706

Chen, L.-C., Papandreou, G., Schroff, F., Adam, H.: Rethinking Atrous Con- volution for Semantic Image Segmentation (2017). https://arxiv.org/abs/1706. 37 05587

work page 2017
[12]

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (2018). https://arxiv.org/abs/1802.02611

work page internal anchor Pith review Pith/arXiv arXiv 2018
[13]

Mask R-CNN

He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask R-CNN (2017). https://arxiv. org/abs/1703.06870

work page internal anchor Pith review Pith/arXiv arXiv 2017
[14]

Panoptic Segmentation

Kirillov, A., He, K., Girshick, R., Rother, C., Doll´ ar, P.: Panoptic Segmentation (2019). https://arxiv.org/abs/1801.00868

work page internal anchor Pith review Pith/arXiv arXiv 2019
[15]

In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

Kirillov, A., Girshick, R., He, K., Doll´ ar, P.: Panoptic feature pyramid networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6392–6401. IEEE, ??? (2019). https://doi.org/10.1109/cvpr.2019. 00656 .http://dx.doi.org/10.1109/cvpr.2019.00656

work page doi:10.1109/cvpr.2019 2019
[16]

https://arxiv.org/abs/1911.10194

Cheng, B., Collins, M.D., Zhu, Y., Liu, T., Huang, T.S., Adam, H., Chen, L.-C.: Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation (2020). https://arxiv.org/abs/1911.10194

work page arXiv 2020
[17]

https://arxiv.org/abs/2012

Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking Semantic Segmentation from a Sequence- to-Sequence Perspective with Transformers (2021). https://arxiv.org/abs/2012. 15840

work page 2021
[18]

https://arxiv.org/abs/2105.05633

Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for Semantic Segmentation (2021). https://arxiv.org/abs/2105.05633

work page arXiv 2021
[19]

https://arxiv.org/abs/2105.15203

Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers (2021). https://arxiv.org/abs/2105.15203

work page arXiv 2021
[20]

In: Advances in Neural Information Processing Systems, vol

Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: Advances in Neural Information Processing Systems, vol. 34 (2021).https://arxiv.org/abs/2107.06278

work page arXiv 2021
[21]

https://arxiv.org/ abs/2112.01527

Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention Mask Transformer for Universal Image Segmentation (2022). https://arxiv.org/ abs/2112.01527

work page arXiv 2022
[22]

Emerging Properties in Self-Supervised Vision Transformers

Caron, M., Touvron, H., Misra, I., J´ egou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging Properties in Self-Supervised Vision Transformers (2021). https: //arxiv.org/abs/2104.14294

work page internal anchor Pith review Pith/arXiv arXiv 2021
[23]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, 38 N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual f...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Liu, Y., Zhu, M., Li, H., Chen, H., Wang, X., Shen, C.: Matcher: Segment anything with one shot using all-purpose feature matching (2023) arXiv:2305.13310 [cs.CV]

work page arXiv 2023
[25]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021) arXiv:2103.00020 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2021
[26]

In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11941–11952. IEEE, ??? (2023). https://doi.org/10.1109/iccv51070. 2023.01100 .http://dx.doi.org/10.1109/iccv51070.2023.01100

work page doi:10.1109/iccv51070 2023
[27]

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., H´ enaff, O., Harm- sen, J., Steiner, A., Zhai, X.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features (2025) arXiv:2502.14786 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Bolya, D., Huang, P.-Y., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H., Wang, J., Monteiro, M., Xu, H., Dong, S., Ravi, N., Li, D., Doll´ ar, P., Feichtenhofer, C.: Perception encoder: The best visual embeddings are not at the output of the network (2025) arXiv:2504.13181 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End- to-End Object Detection with Transformers, pp. 213–229. Springer, ??? (2020). https://doi.org/10.1007/978-3-030-58452-8 13 .http://dx.doi.org/10.1007/978-3- 030-58452-8 13

work page doi:10.1007/978-3-030-58452-8 2020
[30]

and Keuper, Janis , month = oct, year =

Chen, T., Zhu, L., Ding, C., Cao, R., Wang, Y., Zhang, S., Li, Z., Sun, L., Zang, Y., Mao, P.: Sam-adapter: Adapting segment anything in underperformed scenes. In: 2023 IEEE/CVF International Conference on Computer Vision Work- shops (ICCVW), pp. 3359–3367. IEEE, ??? (2023). https://doi.org/10.1109/ iccvw60793.2023.00361 .http://dx.doi.org/10.1109/iccvw60...

work page doi:10.1109/iccvw60793.2023.00361 2023
[31]

Nature Communications15(1) (2024) https://doi.org/10.1038/ s41467-024-44824-z

Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical images. Nature Communications15(1) (2024) https://doi.org/10.1038/ s41467-024-44824-z

work page 2024
[32]

IEEE Transactions on Geoscience and Remote Sensing62, 1–17 (2024) https://doi.org/10.1109/tgrs.2024.3356074 39

Chen, K., Liu, C., Chen, H., Zhang, H., Li, W., Zou, Z., Shi, Z.: Rsprompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Transactions on Geoscience and Remote Sensing62, 1–17 (2024) https://doi.org/10.1109/tgrs.2024.3356074 39

work page doi:10.1109/tgrs.2024.3356074 2024
[33]

Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation (2022) arXiv:2201.03546 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Ghiasi, G., Gu, X., Cui, Y., Lin, T.-Y.: Scaling Open-Vocabulary Image Segmen- tation with Image-Level Labels, pp. 540–557. Springer, ??? (2022). https://doi. org/10.1007/978-3-031-20059-5 31 .http://dx.doi.org/10.1007/978-3-031-20059- 5 31

work page doi:10.1007/978-3-031-20059-5 2022
[35]

In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7061–7070. IEEE, ??? (2023). https://doi.org/10.1109/cvpr52729. 2023.00682 .http://dx.doi.org/10.1109/cvpr52729.2023.00682

work page doi:10.1109/cvpr52729 2023
[36]

Yu, Q., He, J., Deng, X., Shen, X., Chen, L.-C.: Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip (2023) arXiv:2308.02487 [cs.CV]

work page arXiv 2023
[37]

& Chen, C

Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open- vocabulary panoptic segmentation with text-to-image diffusion models. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2955–2966. IEEE, ??? (2023). https://doi.org/10.1109/cvpr52729.2023.00289 .http://dx.doi.org/10.1109/cvpr52729.2023.00289

work page doi:10.1109/cvpr52729.2023.00289 2023
[38]

Wang, F., Mei, J., Yuille, A.: SCLIP: Rethinking Self-Attention for Dense Vision- Language Inference, pp. 315–332. Springer, ??? (2024). https://doi.org/10.1007/ 978-3-031-72664-4 18 .http://dx.doi.org/10.1007/978-3-031-72664-4 18

work page doi:10.1007/978-3-031-72664-4 2024
[39]

Hajimiri, S., Ayed, I.B., Dolz, J.: Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation (2024) arXiv:2404.08181 [cs.CV]

work page arXiv 2024
[41]

Shao, T., Tian, Z., Zhao, H., Su, J.: Explore the potential of clip for training-free open vocabulary semantic segmentation (2024) arXiv:2407.08268 [cs.CV]

work page arXiv 2024
[42]

https://arxiv.org/abs/2411.12044

Aydın, M.A., C ¸ ırpar, E.M., Abdinli, E., Unal, G., Sahin, Y.H.: ITACLIP: Boost- ing Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements (2024). https://arxiv.org/abs/2411.12044

work page arXiv 2024
[43]

Lan, M., Chen, C., Ke, Y., Wang, X., Feng, L., Zhang, W.: Proxy- CLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation, pp. 70–88. Springer, ??? (2024). https://doi.org/10.1007/978-3-031-73113-6 5 . http://dx.doi.org/10.1007/978-3-031-73113-6 5 40

work page doi:10.1007/978-3-031-73113-6 2024
[44]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

Kim, C., Ju, D., Han, W., Yang, M.-H., Hwang, S.J.: Distilling spectral graph for object-context aware open-vocabulary semantic segmentation. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

work page 2025
[45]

Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation , url=

Stojni´ c, V., Kalantidis, Y., Matas, J., Tolias, G.: Lposs: Label propagation over patches and pixels for open-vocabulary semantic segmentation. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9794–9803. IEEE, ??? (2025). https://doi.org/10.1109/cvpr52734.2025.00915 .http://dx.doi.org/10.1109/cvpr52734.2025.00915

work page doi:10.1109/cvpr52734.2025.00915 2025
[46]

In: Advances in Neural Information Processing Systems (2025)

Wang, X., Si, C., Yang, X., Zhao, Y., Wang, W., Yang, X., Shen, W.: Opmap- per: Enhancing open-vocabulary semantic segmentation with multi-guidance information. In: Advances in Neural Information Processing Systems (2025)

[47] Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao, J., Lee, Y.J.: Segment everything everywhere all at once. In: Advances in Neural Information Processing Systems 36. NeurIPS 2023, pp. 19769–19782. Neural Information Processing Systems Foundation, Inc. (NeurIPS) (2023). https://doi.org/10.52202/075280-0868

[48] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., Zhang, L.: Grounded sam: Assembling open-world models for diverse visual tasks (2024) arXiv:2401.14159 [cs.CV]

[49] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection, pp. 38–55. Springer (2024). https://doi.org/10.1007/978-3-031-72970-6_3

[50] Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9579–9589. IEEE (2024). https://doi.org/10.1109/cvpr52733.2024.00915

[51] Sun, Y., Chen, J., Zhang, S., Zhang, X., Chen, Q., Zhang, G., Ding, E., Wang, J., Li, Z.: Vrp-sam: Sam with visual reference prompt. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 23565–23574. IEEE (2024). https://doi.org/10.1109/cvpr52733.2024.02224

[52] Zhang, R., Jiang, Z., Guo, Z., Yan, S., Pan, J., Ma, X., Dong, H., Gao, P., Li, H.: Personalize segment anything model with one shot (2023) arXiv:2305.03048 [cs.CV]

[53] Tang, L., Jiang, P.-T., Xiao, H.-K., Li, B.: Towards training-free open-world segmentation via image prompt foundation models (2023) arXiv:2310.10912 [cs.CV]

[54] Zhang, A., Gao, G., Jiao, J., Liu, C.H., Wei, Y.: Bridge the points: Graph-based few-shot segment anything semantically (2024) arXiv:2410.06964 [cs.CV]

[55] Barsellotti, L., Amoroso, R., Baraldi, L., Cucchiara, R.: Fossil: Free open-vocabulary semantic segmentation through synthetic references retrieval. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1453–1462. IEEE (2024). https://doi.org/10.1109/wacv57701.2024.00149

[56] Barsellotti, L., Amoroso, R., Cornia, M., Baraldi, L., Cucchiara, R.: Training-free open-vocabulary segmentation with offline diffusion-augmented prototype generation. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3689–3698. IEEE (2024). https://doi.org/10.1109/cvpr52733.2024.00354

[57] Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., Lewis, M.: Generalization through memorization: Nearest neighbor language models (2020) arXiv:1911.00172 [cs.CL]

[58] Sheynin, S., Ashual, O., Polyak, A., Singer, U., Gafni, O., Nachmani, E., Taigman, Y.: Knn-diffusion: Image generation via large-scale retrieval (2022) arXiv:2204.02849 [cs.CV]

[59] Chen, W., Hu, H., Saharia, C., Cohen, W.W.: Re-imagen: Retrieval-augmented text-to-image generator (2022) arXiv:2209.14491 [cs.CV]

[60] Blattmann, A., Müller, J., Oktay, K., Ommer, B., Rombach, R.: Retrieval-augmented diffusion models. In: Advances in Neural Information Processing Systems 35. NeurIPS 2022, pp. 15309–15324. Neural Information Processing Systems Foundation, Inc. (NeurIPS) (2022). https://doi.org/10.52202/068431-1114

[61] Snell, J., Swersky, K., Zemel, R.S.: Prototypical networks for few-shot learning (2017) arXiv:1703.05175 [cs.LG]

[62] Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H.S., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. IEEE (2018). https://doi.org/10.1109/cvpr.2018.00131

[63] Lee, J., Sung, M., Kang, J., Chen, D.: Learning dense representations of phrases at scale (2021) arXiv:2012.12624 [cs.CL]

[64] Albanie, S., Shin, G., Xie, W.: Reco: Retrieve and co-segment for zero-shot transfer. In: Advances in Neural Information Processing Systems 35. NeurIPS 2022, pp. 33754–33767. Neural Information Processing Systems Foundation, Inc. (NeurIPS) (2022). https://doi.org/10.52202/068431-2446

[65] Gui, Z., Sun, S., Li, R., Yuan, J., An, Z., Roth, K., Prabhu, A., Torr, P.: knn-clip: Retrieval enables training-free segmentation on continually expanding large vocabularies (2024) arXiv:2404.09447 [cs.CV]

[66] Zhao, L., Chen, X., Chen, E.Z., Liu, Y., Chen, T., Sun, S.: Retrieval-augmented few-shot medical image segmentation with foundation models (2024) arXiv:2408.08813 [cs.CV]

[67] Espinosa, M., Yang, C., Ericsson, L., McDonagh, S., Crowley, E.J.: No time to train! training-free reference-based instance segmentation (2025) arXiv:2507.02798 [cs.CV]

[68] Lei, L., Yang, Q., Yang, L., Shen, T., Wang, R., Fu, C.: Deep learning implementation of image segmentation in agricultural applications: a comprehensive review. Artificial Intelligence Review 57(6) (2024). https://doi.org/10.1007/s10462-024-10775-6

[69] Sa, I., Chen, Z., Popovic, M., Khanna, R., Liebisch, F., Nieto, J., Siegwart, R.: weednet: Dense semantic weed classification using multispectral images and mav for smart farming. IEEE Robotics and Automation Letters 3(1), 588–595 (2018). https://doi.org/10.1109/lra.2017.2774979

[70] Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation (2017). https://arxiv.org/abs/1511.00561

[71] Joshi, A., Guevara, D., Earles, M.: Standardizing and centralizing datasets for efficient training of agricultural deep learning models. Plant Phenomics 5, 0084 (2023). https://doi.org/10.34133/plantphenomics.0084

[72] Li, Y., Wang, D., Yuan, C., Li, H., Hu, J.: Enhancing agricultural image segmentation with an agricultural segment anything model adapter. Sensors 23(18), 7884 (2023). https://doi.org/10.3390/s23187884

[73] Picon, A., Eguskiza, I., Mugica, D., Romero, J., Jimenez, C.J., White, E., Do-Lago-Junqueira, G., Klukas, C., Navarra-Mestre, R.: Mitigating domain drift in multi species segmentation with dinov2: A cross-domain evaluation in herbicide research trials (2025) arXiv:2508.07514 [cs.CV]

[74] Gupta, A., Dollar, P., Girshick, R.: Lvis: A dataset for large vocabulary instance segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5351–5359. IEEE (2019). https://doi.org/10.1109/cvpr.2019.00550

[75] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5122–5130. IEEE (2017). https://doi.org/10.1109/cvpr.2017.544

[76] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. IEEE (2016). https://doi.org/10.1109/cvpr.2016.350

[77] Mottaghi, R., Chen, X., Liu, X., Cho, N.-G., Lee, S.-W., Fidler, S., Urtasun, R., Yuille, A.: The role of context for object detection and semantic segmentation in the wild. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898. IEEE (2014). https://doi.org/10.1109/cvpr.2014.119

[78] Tang, L., Jiang, P.-T., Xiao, H., Li, B.: Towards training-free open-world segmentation via image prompt foundation models. International Journal of Computer Vision 133, 1–15 (2025). https://doi.org/10.1007/s11263-024-02185-6

[79] Bai, S., Liu, Y., Han, Y., Zhang, H., Tang, Y., Zhou, J., Lu, J.: Self-calibrated clip for training-free open-vocabulary segmentation. IEEE Transactions on Image Processing 34, 8271–8284 (2025) arXiv:2411.15869 [cs.CV]

[80] Shi, Y., Dong, M., Xu, C.: Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025). https://arxiv.org/abs/2411.09219

[81] Zhang, D., Liu, F., Tang, Q.: Corrclip: Reconstructing patch correlations in clip for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025). https://arxiv.org/abs/2411.10086

Fig. 2 Effect of ICCD filtering on two example classes (top: small clustered flowers; bottom: rice leaves). Blu...

[1] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., Dollár, P., Girshick, R.: Segment anything. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3992–4003. IEEE (2023). https://doi.org/10.1109/iccv51070.2023.00371

[2] Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.-Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos (2024) arXiv:2408.00714 [cs.CV]

[3] Carion, N., Gustafson, L., Hu, Y.-T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.-H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H....

[4] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive nlp tasks (2020) arXiv:2005.11401 [cs.CL]

[5] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: Dinov3 (2025) arXiv:...

[6] Zakir, H.M., Ho, E.T.W.: Revealing the semantic selection gap in dinov3 through training-free few-shot segmentation (2026) arXiv:2602.07550 [cs.CV]

[7] Long, J., Shelhamer, E., Darrell, T.: Fully Convolutional Networks for Semantic Segmentation (2015). https://arxiv.org/abs/1411.4038

[8] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation (2015). https://arxiv.org/abs/1505.04597

[9] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs (2015). https://arxiv.org/abs/1412.7062

[10] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs (2018). https://arxiv.org/abs/1606.00915

[11] Chen, L.-C., Papandreou, G., Schroff, F., Adam, H.: Rethinking Atrous Convolution for Semantic Image Segmentation (2017). https://arxiv.org/abs/1706.05587

[12] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (2018). https://arxiv.org/abs/1802.02611

[13] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN (2017). https://arxiv.org/abs/1703.06870

[14] Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic Segmentation (2019). https://arxiv.org/abs/1801.00868

[15] Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic feature pyramid networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6392–6401. IEEE (2019). https://doi.org/10.1109/cvpr.2019.00656

[16] Cheng, B., Collins, M.D., Zhu, Y., Liu, T., Huang, T.S., Adam, H., Chen, L.-C.: Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation (2020). https://arxiv.org/abs/1911.10194

[17] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers (2021). https://arxiv.org/abs/2012.15840

[18] Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for Semantic Segmentation (2021). https://arxiv.org/abs/2105.05633

[19] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers (2021). https://arxiv.org/abs/2105.15203

[20] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: Advances in Neural Information Processing Systems, vol. 34 (2021). https://arxiv.org/abs/2107.06278

[21] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention Mask Transformer for Universal Image Segmentation (2022). https://arxiv.org/abs/2112.01527

[22] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging Properties in Self-Supervised Vision Transformers (2021). https://arxiv.org/abs/2104.14294

[23] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual f...

[24] Liu, Y., Zhu, M., Li, H., Chen, H., Wang, X., Shen, C.: Matcher: Segment anything with one shot using all-purpose feature matching (2023) arXiv:2305.13310 [cs.CV]

[25] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021) arXiv:2103.00020 [cs.CV]

[26] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11941–11952. IEEE (2023). https://doi.org/10.1109/iccv51070.2023.01100

[27] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O., Harmsen, J., Steiner, A., Zhai, X.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features (2025) arXiv:2502.14786 [cs.CV]

[28] Bolya, D., Huang, P.-Y., Sun, P., Cho, J.H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H., Wang, J., Monteiro, M., Xu, H., Dong, S., Ravi, N., Li, D., Dollár, P., Feichtenhofer, C.: Perception encoder: The best visual embeddings are not at the output of the network (2025) arXiv:2504.13181 [cs.CV]

[29] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-End Object Detection with Transformers, pp. 213–229. Springer (2020). https://doi.org/10.1007/978-3-030-58452-8_13

[30] Chen, T., Zhu, L., Ding, C., Cao, R., Wang, Y., Zhang, S., Li, Z., Sun, L., Zang, Y., Mao, P.: Sam-adapter: Adapting segment anything in underperformed scenes. In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 3359–3367. IEEE (2023). https://doi.org/10.1109/iccvw60793.2023.00361

[31] Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical images. Nature Communications 15(1) (2024). https://doi.org/10.1038/s41467-024-44824-z

[32] Chen, K., Liu, C., Chen, H., Zhang, H., Li, W., Zou, Z., Shi, Z.: Rsprompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Transactions on Geoscience and Remote Sensing 62, 1–17 (2024). https://doi.org/10.1109/tgrs.2024.3356074

[33] Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation (2022) arXiv:2201.03546 [cs.CV]

[34] Ghiasi, G., Gu, X., Cui, Y., Lin, T.-Y.: Scaling Open-Vocabulary Image Segmentation with Image-Level Labels, pp. 540–557. Springer (2022). https://doi.org/10.1007/978-3-031-20059-5_31

[35] Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7061–7070. IEEE (2023). https://doi.org/10.1109/cvpr52729.2023.00682

[36] Yu, Q., He, J., Deng, X., Shen, X., Chen, L.-C.: Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip (2023) arXiv:2308.02487 [cs.CV]

[37] Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open-vocabulary panoptic segmentation with text-to-image diffusion models. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2955–2966. IEEE (2023). https://doi.org/10.1109/cvpr52729.2023.00289

[38] Wang, F., Mei, J., Yuille, A.: SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference, pp. 315–332. Springer (2024). https://doi.org/10.1007/978-3-031-72664-4_18

[39] Hajimiri, S., Ayed, I.B., Dolz, J.: Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation (2024) arXiv:2404.08181 [cs.CV]

[41] Shao, T., Tian, Z., Zhao, H., Su, J.: Explore the potential of clip for training-free open vocabulary semantic segmentation (2024) arXiv:2407.08268 [cs.CV]

[42] Aydın, M.A., Çırpar, E.M., Abdinli, E., Unal, G., Sahin, Y.H.: ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements (2024). https://arxiv.org/abs/2411.12044

work page arXiv 2024

[42] [43]

Lan, M., Chen, C., Ke, Y., Wang, X., Feng, L., Zhang, W.: Proxy- CLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation, pp. 70–88. Springer, ??? (2024). https://doi.org/10.1007/978-3-031-73113-6 5 . http://dx.doi.org/10.1007/978-3-031-73113-6 5 40

work page doi:10.1007/978-3-031-73113-6 2024

[43] [44]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

Kim, C., Ju, D., Han, W., Yang, M.-H., Hwang, S.J.: Distilling spectral graph for object-context aware open-vocabulary semantic segmentation. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

work page 2025

[44] [45]

Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation , url=

Stojni´ c, V., Kalantidis, Y., Matas, J., Tolias, G.: Lposs: Label propagation over patches and pixels for open-vocabulary semantic segmentation. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9794–9803. IEEE, ??? (2025). https://doi.org/10.1109/cvpr52734.2025.00915 .http://dx.doi.org/10.1109/cvpr52734.2025.00915

work page doi:10.1109/cvpr52734.2025.00915 2025

[45] [46]

In: Advances in Neural Information Processing Systems (2025)

Wang, X., Si, C., Yang, X., Zhao, Y., Wang, W., Yang, X., Shen, W.: Opmap- per: Enhancing open-vocabulary semantic segmentation with multi-guidance information. In: Advances in Neural Information Processing Systems (2025)

work page 2025

[46] [47]

In: Advances in Neu- ral Information Processing Systems 36

Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao, J., Lee, Y.J.: Segment everything everywhere all at once. In: Advances in Neu- ral Information Processing Systems 36. NeurIPS 2023, pp. 19769–19782. Neural Information Processing Systems Foundation, Inc. (NeurIPS), ??? (2023). https: //doi.org/10.52202/075280-0868 .http://dx.doi.org/10...

work page doi:10.52202/075280-0868 2023

[47] [48]

Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., Zhang, L.: Grounded sam: Assembling open-world models for diverse visual tasks (2024) arXiv:2401.14159 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2024

[48] [49]

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection, pp. 38–55. Springer, ??? (2024). https://doi.org/10.1007/978-3-031-72970-6 3 .http://dx.doi.org/10.1007/978-3- 031-72970-6 3

work page doi:10.1007/978-3-031-72970-6 2024

[49] [50]

2024 , url =

Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9579–9589. IEEE, ??? (2024). https://doi.org/10.1109/cvpr52733.2024.00915 . http://dx.doi.org/10.1109/cvpr52733.2024.00915

work page doi:10.1109/cvpr52733.2024.00915 2024

[50] [51]

2024 , url =

Sun, Y., Chen, J., Zhang, S., Zhang, X., Chen, Q., Zhang, G., Ding, E., Wang, J., Li, Z.: Vrp-sam: Sam with visual reference prompt. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 23565–23574. IEEE, ??? (2024). https://doi.org/10.1109/cvpr52733.2024. 02224 .http://dx.doi.org/10.1109/cvpr52733.2024.02224

work page doi:10.1109/cvpr52733.2024 2024

[51] [52]

Zhang, R., Jiang, Z., Guo, Z., Yan, S., Pan, J., Ma, X., Dong, H., Gao, P., Li, H.: Personalize segment anything model with one shot (2023) arXiv:2305.03048 [cs.CV] 41

work page arXiv 2023

[52] [53]

Tang, L., Jiang, P.-T., Xiao, H.-K., Li, B.: Towards training-free open-world segmentation via image prompt foundation models (2023) arXiv:2310.10912 [cs.CV]

work page arXiv 2023

[53] [54]

Zhang, A., Gao, G., Jiao, J., Liu, C.H., Wei, Y.: Bridge the points: Graph-based few-shot segment anything semantically (2024) arXiv:2410.06964 [cs.CV]

work page arXiv 2024

[54] [55]

In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2024)

Barsellotti, L., Amoroso, R., Baraldi, L., Cucchiara, R.: Fossil: Free open- vocabulary semantic segmentation through synthetic references retrieval. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1453–1462. IEEE, ??? (2024). https://doi.org/10.1109/wacv57701.2024.00149 .http://dx.doi.org/10.1109/wacv57701.2024.00149

work page doi:10.1109/wacv57701.2024.00149 2024

[55] [56]

2024 , url =

Barsellotti, L., Amoroso, R., Cornia, M., Baraldi, L., Cucchiara, R.: Training- free open-vocabulary segmentation with offline diffusion-augmented prototype generation. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3689–3698. IEEE, ??? (2024). https://doi.org/10.1109/ cvpr52733.2024.00354 .http://dx.doi.org/10.1109/cv...

work page doi:10.1109/cvpr52733.2024.00354 2024

[56] [57]

Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., Lewis, M.: Gen- eralization through memorization: Nearest neighbor language models (2020) arXiv:1911.00172 [cs.CL]

work page arXiv 2020

[57] [58]

Sheynin, S., Ashual, O., Polyak, A., Singer, U., Gafni, O., Nachmani, E., Taigman, Y.: Knn-diffusion: Image generation via large-scale retrieval (2022) arXiv:2204.02849 [cs.CV]

work page arXiv 2022

[58] [59]

Chen, W., Hu, H., Saharia, C., Cohen, W.W.: Re-imagen: Retrieval-augmented text-to-image generator (2022) arXiv:2209.14491 [cs.CV]

work page arXiv 2022

[59] [60]

In: Advances in Neural Information Processing Sys- tems 35

Blattmann, A., M¨ uller, J., Oktay, K., Ommer, B., Rombach, R.: Retrieval- augmented diffusion models. In: Advances in Neural Information Processing Sys- tems 35. NeurIPS 2022, pp. 15309–15324. Neural Information Processing Systems Foundation, Inc. (NeurIPS), ??? (2022). https://doi.org/10.52202/068431-1114 . http://dx.doi.org/10.52202/068431-1114

work page doi:10.52202/068431-1114 2022

[60] [61]

Snell, J., Swersky, K., Zemel, R.S.: Prototypical networks for few-shot learning (2017) arXiv:1703.05175 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2017

[61] [62]

In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H.S., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. IEEE, ??? (2018). https://doi.org/10.1109/cvpr.2018.00131 . http://dx.doi.org/10.1109/cvpr.2018.00131

work page doi:10.1109/cvpr.2018.00131 2018

[62] [63]

Lee, J., Sung, M., Kang, J., Chen, D.: Learning dense representations of phrases at scale (2021) arXiv:2012.12624 [cs.CL] 42

work page arXiv 2021

[63] [64]

In: Advances in Neural Information Processing Systems 35

Albanie, S., Shin, G., Xie, W.: Reco: Retrieve and co-segment for zero- shot transfer. In: Advances in Neural Information Processing Systems 35. NeurIPS 2022, pp. 33754–33767. Neural Information Processing Systems Foun- dation, Inc. (NeurIPS), ??? (2022). https://doi.org/10.52202/068431-2446 . http://dx.doi.org/10.52202/068431-2446

work page doi:10.52202/068431-2446 2022

[64] [65]

Gui, Z., Sun, S., Li, R., Yuan, J., An, Z., Roth, K., Prabhu, A., Torr, P.: knn- clip: Retrieval enables training-free segmentation on continually expanding large vocabularies (2024) arXiv:2404.09447 [cs.CV]

work page arXiv 2024

[65] [66]

Zhao, L., Chen, X., Chen, E.Z., Liu, Y., Chen, T., Sun, S.: Retrieval- augmented few-shot medical image segmentation with foundation models (2024) arXiv:2408.08813 [cs.CV]

work page arXiv 2024

[66] [67]

Espinosa, M., Yang, C., Ericsson, L., McDonagh, S., Crowley, E.J.: No time to train! training-free reference-based instance segmentation (2025) arXiv:2507.02798 [cs.CV]

[67] [68]

Lei, L., Yang, Q., Yang, L., Shen, T., Wang, R., Fu, C.: Deep learning implementation of image segmentation in agricultural applications: a comprehensive review. Artificial Intelligence Review 57(6) (2024) https://doi.org/10.1007/s10462-024-10775-6

[68] [69]

Sa, I., Chen, Z., Popovic, M., Khanna, R., Liebisch, F., Nieto, J., Siegwart, R.: weednet: Dense semantic weed classification using multispectral images and mav for smart farming. IEEE Robotics and Automation Letters 3(1), 588–595 (2018) https://doi.org/10.1109/lra.2017.2774979

[69] [70]

Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation (2017). https://arxiv.org/abs/1511.00561

[70] [71]

Joshi, A., Guevara, D., Earles, M.: Standardizing and centralizing datasets for efficient training of agricultural deep learning models. Plant Phenomics 5, 0084 (2023) https://doi.org/10.34133/plantphenomics.0084

[71] [72]

Li, Y., Wang, D., Yuan, C., Li, H., Hu, J.: Enhancing agricultural image segmentation with an agricultural segment anything model adapter. Sensors 23(18), 7884 (2023) https://doi.org/10.3390/s23187884

[72] [73]

Picon, A., Eguskiza, I., Mugica, D., Romero, J., Jimenez, C.J., White, E., Do-Lago-Junqueira, G., Klukas, C., Navarra-Mestre, R.: Mitigating domain drift in multi species segmentation with dinov2: A cross-domain evaluation in herbicide research trials (2025) arXiv:2508.07514 [cs.CV]

[73] [74]

Gupta, A., Dollar, P., Girshick, R.: Lvis: A dataset for large vocabulary instance segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5351–5359. IEEE (2019). https://doi.org/10.1109/cvpr.2019.00550

[74] [75]

Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5122–5130. IEEE (2017). https://doi.org/10.1109/cvpr.2017.544

[75] [76]

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. IEEE (2016). https://doi.org/10.1109/cvpr.2016.350

[76] [77]

Mottaghi, R., Chen, X., Liu, X., Cho, N.-G., Lee, S.-W., Fidler, S., Urtasun, R., Yuille, A.: The role of context for object detection and semantic segmentation in the wild. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898. IEEE (2014). https://doi.org/10.1109/cvpr.2014.119

[77] [78]

Tang, L., Jiang, P.-T., Xiao, H., Li, B.: Towards training-free open-world segmentation via image prompt foundation models. International Journal of Computer Vision 133, 1–15 (2025) https://doi.org/10.1007/s11263-024-02185-6

[78] [79]

Bai, S., Liu, Y., Han, Y., Zhang, H., Tang, Y., Zhou, J., Lu, J.: Self-calibrated clip for training-free open-vocabulary segmentation. IEEE Transactions on Image Processing 34, 8271–8284 (2025) arXiv:2411.15869 [cs.CV]

[79] [80]

Shi, Y., Dong, M., Xu, C.: Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025). https://arxiv.org/abs/2411.09219

[80] [81]

Zhang, D., Liu, F., Tang, Q.: Corrclip: Reconstructing patch correlations in clip for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025). https://arxiv.org/abs/2411.10086

Fig. 2 Effect of ICCD filtering on two example classes (top: small clustered flowers; bottom: rice leaves). Blu...
