Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt

Dongmei Yu; Joongwon Chae; Lian Zhang; Lihui Luo; Peiwu Qin; Xi Yuan; Zhenglin Chen

arxiv: 2510.15849 · v2 · pith:TLW74ZPNnew · submitted 2025-10-17 · 💻 cs.CV

Joongwon Chae, Lihui Luo, Xi Yuan, Dongmei Yu, Zhenglin Chen, Lian Zhang, Peiwu Qin

Pith reviewed 2026-05-18 06:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords: tongue segmentation · SAM2 · retrieval-based prompting · DINO features · medical image analysis · prompt-free segmentation · TCM imaging

The pith

Memory-SAM retrieves similar past tongue images to generate point prompts that guide SAM2 without any human clicks or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Memory-SAM as a training-free method for segmenting tongue images used in traditional Chinese medicine analysis. Instead of requiring large labeled datasets or manual prompts, the approach stores a small set of prior annotated cases and retrieves the most similar one for each new query image. Dense DINOv3 features and FAISS indexing locate the match, after which the exemplar's mask supplies foreground and background point prompts that direct SAM2. On 600 expert-annotated images mixing controlled and in-the-wild conditions, the method reaches 0.9863 mIoU and shows clear gains over both fully supervised models and detector-driven baselines, especially on real-world irregular boundaries. A sympathetic reader would care because the pipeline makes reliable segmentation feasible with minimal data and zero intervention at test time.

Core claim

Memory-SAM is a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. On the mixed test split of 600 expert-annotated images, Memory-SAM achieves mIoU 0.9863, surpassing FCN at 0.8188 and a detector-to-box SAM baseline at 0.1839, with particular advantages under real-world conditions where boundaries are irregular.
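As an illustration of the retrieval step, here is a minimal sketch. It assumes every image has already been summarized as one descriptor vector (how Memory-SAM pools its dense DINOv3 features is not stated here), and it uses a brute-force NumPy inner-product search as a stand-in for a FAISS index. The names `normalize` and `retrieve_exemplar` and the toy vectors are hypothetical.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize rows so that inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def retrieve_exemplar(query_vec: np.ndarray, memory_vecs: np.ndarray):
    """Return (index, similarity) of the most similar memory entry.

    Brute-force stand-in for an inner-product nearest-neighbour index.
    """
    q = normalize(query_vec[None, :])   # shape (1, d)
    m = normalize(memory_vecs)          # shape (n, d)
    sims = (m @ q.T).ravel()            # cosine similarity per memory entry
    best = int(np.argmax(sims))
    return best, float(sims[best])

# Toy memory of three descriptors (assumed data, not from the paper).
memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.6, 0.8, 0.0]])
query = np.array([0.5, 0.9, 0.0])
idx, sim = retrieve_exemplar(query, memory)
```

On L2-normalized vectors the inner product equals cosine similarity, which is why a flat inner-product search is a common choice for this kind of lookup.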

What carries the argument

The retrieval-to-prompt mechanism that locates an exemplar via dense DINOv3 features and FAISS, then converts mask-constrained feature correspondences into foreground and background point prompts for SAM2.
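One plausible reading of "mask-constrained correspondences" can be sketched as follows: each query patch is matched to its nearest exemplar patch, inherits that patch's mask label, and the strongest matches on each side become point prompts. This is an assumption-laden illustration, not the authors' implementation; the function name, the 16-pixel patch size, and the per-class top-k selection are all hypothetical choices.

```python
import numpy as np

def points_from_correspondences(query_feats, exemplar_feats, exemplar_mask,
                                grid_w, patch=16, k_fg=2, k_bg=2):
    """Turn patch-level matches into (x, y) point prompts with 1/0 labels.

    query_feats:    (M, d) L2-normalized query patch features
    exemplar_feats: (N, d) L2-normalized exemplar patch features
    exemplar_mask:  (N,) bool, True where the exemplar patch lies on the tongue
    grid_w:         number of patches per row in the query grid
    """
    sims = query_feats @ exemplar_feats.T   # (M, N) cosine similarities
    nearest = sims.argmax(axis=1)           # best exemplar patch per query patch
    score = sims.max(axis=1)
    is_fg = exemplar_mask[nearest]          # label inherited from the exemplar mask

    def top_k(candidates, k):
        # Keep the k candidates with the highest match score.
        return candidates[np.argsort(-score[candidates])][:k]

    fg_idx = top_k(np.flatnonzero(is_fg), k_fg)
    bg_idx = top_k(np.flatnonzero(~is_fg), k_bg)
    idx = np.concatenate([fg_idx, bg_idx])

    # Patch index -> pixel coordinates of the patch centre.
    xs = (idx % grid_w) * patch + patch / 2
    ys = (idx // grid_w) * patch + patch / 2
    coords = np.stack([xs, ys], axis=1).astype(np.float32)
    labels = np.concatenate(
        [np.ones(len(fg_idx)), np.zeros(len(bg_idx))]).astype(np.int32)
    return coords, labels

# Toy example: 2x2 query grid, two exemplar patches (tongue, background).
exemplar_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
exemplar_mask = np.array([True, False])
query_feats = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])
coords, labels = points_from_correspondences(
    query_feats, exemplar_feats, exemplar_mask, grid_w=2, k_fg=1, k_bg=1)
```

The output shape, an array of pixel coordinates paired with 1 (foreground) and 0 (background) labels, follows the point-prompt convention commonly used by SAM-family predictors.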

If this is right

Segmentation of irregular tongue boundaries becomes feasible with only a small collection of prior annotated cases rather than thousands of new labels.
The same retrieval-to-prompt process works on both controlled lab images and in-the-wild captures without any model retraining.
Point prompts derived from correspondences allow SAM2 to focus on foreground and background regions automatically.
The overall pipeline stays fully automatic at inference time, removing the need for human clicks or updates to the underlying model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If retrieval occasionally returns a poor match for rare tongue presentations, accuracy would drop, implying that expanding the memory bank with more diverse cases could further stabilize results.
The same retrieval-plus-prompt strategy might transfer to other medical imaging domains where SAM-family models are already deployed but manual prompting remains a bottleneck.
Allowing the memory bank to grow incrementally with newly segmented cases could create a self-improving system without requiring periodic full retraining.
In clinical settings with limited annotation staff, this approach lowers the barrier to deploying reliable tongue analysis tools.
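A rough sketch of how such a memory bank could grow over time and flag weak matches, under stated assumptions: each image is represented by a single descriptor vector, similarity is cosine, and the `min_sim` cutoff of 0.8 is an arbitrary illustrative value. The `MemoryBank` class is hypothetical and is not taken from the paper or its repository.

```python
import numpy as np

class MemoryBank:
    """Toy exemplar store: descriptors plus masks, searched by cosine similarity."""

    def __init__(self, dim: int):
        self.vecs = np.empty((0, dim), dtype=np.float32)
        self.masks = []

    def add(self, vec, mask):
        """Append one L2-normalized descriptor and its mask."""
        v = np.asarray(vec, dtype=np.float32)
        v = v / max(np.linalg.norm(v), 1e-12)
        self.vecs = np.vstack([self.vecs, v[None, :]])
        self.masks.append(mask)

    def query(self, vec, min_sim=0.8):
        """Return (index, similarity, trusted); trusted is False for weak matches."""
        if len(self.masks) == 0:
            raise ValueError("memory bank is empty")
        q = np.asarray(vec, dtype=np.float32)
        q = q / max(np.linalg.norm(q), 1e-12)
        sims = self.vecs @ q
        best = int(np.argmax(sims))
        return best, float(sims[best]), bool(sims[best] >= min_sim)

bank = MemoryBank(dim=2)
bank.add([1.0, 0.0], mask=None)       # masks omitted in this toy example
weak = bank.query([0.0, 1.0])         # no similar case stored yet
bank.add([0.0, 1.0], mask=None)       # memory grows with a new case
strong = bank.query([0.0, 1.0])
```

A query flagged as untrusted could be routed to manual review before its mask is added back to the bank, which would keep a self-growing memory from accumulating its own errors.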

Load-bearing premise

That dense DINOv3 features will consistently retrieve an exemplar whose existing mask yields accurate foreground and background point prompts for any new tongue image.

What would settle it

A controlled test on query images whose closest memory matches have tongue shapes or lighting that differ markedly from the query, followed by measurement of whether the distilled point prompts still produce high-accuracy SAM2 segmentations.
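Such a test could be summarized by grouping queries according to how similar their retrieved exemplar was and reporting mean IoU per group. The following sketch assumes binary masks and one retrieval similarity per query; the helper names, bin edges, and toy numbers are illustrative only.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)

def miou_by_similarity(sims, ious, edges):
    """Mean IoU of queries grouped into similarity bins [edges[i], edges[i+1]).

    The last bin also includes values at or above the final edge.
    """
    sims, ious = np.asarray(sims), np.asarray(ious)
    bins = np.digitize(sims, edges[1:-1])   # bin index for every query
    out = {}
    for b in range(len(edges) - 1):
        sel = bins == b
        out[(edges[b], edges[b + 1])] = float(ious[sel].mean()) if sel.any() else None
    return out

# Toy numbers (assumed, not measured): IoU falls as retrieval similarity falls.
half = iou(np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]]))
report = miou_by_similarity(
    sims=[0.95, 0.90, 0.60, 0.30],
    ious=[0.98, 0.96, 0.90, 0.50],
    edges=[0.0, 0.5, 0.8, 1.0])
```

If mean IoU stays high in the lowest-similarity bin, the premise holds up under mismatch; a sharp drop there would localize the failure to the retrieval step rather than to SAM2.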

Figures

Figures reproduced from arXiv: 2510.15849 by Dongmei Yu, Joongwon Chae, Lian Zhang, Lihui Luo, Peiwu Qin, Xi Yuan, Zhenglin Chen.

**Figure 1.** Conceptual comparison of different tongue segmentation paradigms. (a) Traditional supervised methods [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]

**Figure 2.** The overall framework of Memory-SAM. Given a query image, we retrieve the most similar case from a [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** Qualitative comparison in challenging in-the-wild scenarios. SAM-based results (Tongue-SAM, Memory [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]

**Figure 4.** Qualitative comparison on the simpler HIT-Tongue-Image dataset. Color coding is consistent with Fig. 3. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png]

Original abstract

Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Memory-SAM gets strong mIoU on tongue images by turning DINOv3 retrieval into SAM2 point prompts without training or clicks, but the in-the-wild reliability rests on unexamined retrieval quality.

The main thing here is that Memory-SAM pulls a similar past case from a small memory bank using dense DINOv3 features and FAISS, then converts the retrieved mask into foreground and background point prompts for SAM2. This produces 0.9863 mIoU on a 600-image mixed test set that includes 300 in-the-wild shots, beating the FCN baseline at 0.8188 and a detector-to-box SAM baseline at 0.1839. The gains appear larger on the variable real-world split, which fits the clinical goal of reducing manual work in TCM tongue imaging. The code release is helpful for anyone who wants to inspect the prompt generation step directly.

The core novelty is the specific mask-constrained correspondence step that turns retrieval into usable SAM2 prompts for this narrow task, rather than a general new architecture. The evaluation setup with separate controlled and in-the-wild splits is straightforward and shows where the method adds value. The numbers are reported clearly enough to see the claimed improvement under domain shift.

The soft spot is the missing diagnostics on the retrieval step itself. If the top match from the memory bank is not close in pose or lighting, the transferred points could be noisy, and SAM2 would fall back toward weaker performance. The paper does not appear to include similarity score histograms, retrieval rank statistics, or failure examples, so it is difficult to judge how often this happens on the in-the-wild set. Error bars and ablation tables on the prompt distillation choices are also absent from the summary, though they may exist in the full text.

This is a narrow but practical application paper. It is useful for readers working on data-efficient medical segmentation or on ways to drive SAM-family models without fine-tuning or human prompts. The central claim is internally consistent with the reported metrics and the motivation is clear.

I would send it to peer review so referees can ask for retrieval robustness checks and any additional ablations that are already in the manuscript.

Referee Report

1 major / 2 minor

Summary. The manuscript presents Memory-SAM, a training-free and human-prompt-free pipeline for tongue segmentation. Given a query image, it retrieves a similar exemplar from a small memory bank using dense DINOv3 features and FAISS, then distills mask-constrained correspondences into foreground/background point prompts that guide SAM2. Evaluated on 600 expert-annotated images (300 controlled, 300 in-the-wild), Memory-SAM reports mIoU 0.9863 on the mixed test split, outperforming FCN (0.8188) and a detector-to-box SAM baseline (0.1839), with noted gains under real-world conditions. The code is publicly released.

Significance. If the results hold, the work demonstrates a data-efficient way to adapt SAM2 for medical segmentation tasks with irregular boundaries and limited annotations, such as TCM tongue imaging. By avoiding fine-tuning and manual prompts, it addresses practical barriers in deploying foundation models. Public code availability is a clear strength that supports reproducibility and extension to other domains.

major comments (1)

[Evaluation] The central performance claim (mIoU 0.9863 on the mixed split, especially the 300 in-the-wild images) rests on the retrieval step consistently surfacing suitable exemplars whose masks yield reliable fg/bg point prompts after domain shift. However, the manuscript provides no per-image retrieval diagnostics, similarity-score histograms, retrieval success rates, or failure-case analysis to verify this assumption.

minor comments (2)

[Abstract] The abstract notes ceiling effects above 0.98 on controlled data due to annotation variability but does not report the per-split mIoU values, standard deviations, or inter-annotator agreement to contextualize the results.
[Evaluation] No error bars or multiple-run statistics accompany the reported mIoU figures, which would strengthen the quantitative comparison to the FCN and detector-to-box baselines.
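One common way to attach error bars to a mean IoU is a percentile bootstrap over per-image scores. The sketch below assumes per-image IoU values are available; the function name, the 2000 resamples, and the toy scores are illustrative choices, not values from the paper.

```python
import numpy as np

def bootstrap_miou_ci(per_image_iou, n_boot=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap over images."""
    rng = np.random.default_rng(seed)
    x = np.asarray(per_image_iou, dtype=float)
    # Resample image indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    means = x[idx].mean(axis=1)
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(x.mean()), float(lower), float(upper)

# Toy per-image scores (assumed data).
mean, lower, upper = bootstrap_miou_ci([0.90, 0.95, 0.99, 0.97, 0.92])
```

Reporting the interval alongside each method's mIoU would show whether differences near the 0.98 ceiling exceed sampling noise.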

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the major comment on evaluation below and will strengthen the manuscript accordingly.

Referee: [Evaluation] The central performance claim (mIoU 0.9863 on the mixed split, especially the 300 in-the-wild images) rests on the retrieval step consistently surfacing suitable exemplars whose masks yield reliable fg/bg point prompts after domain shift. However, the manuscript provides no per-image retrieval diagnostics, similarity-score histograms, retrieval success rates, or failure-case analysis to verify this assumption.

Authors: We agree that additional diagnostics on the retrieval step would improve transparency and allow readers to better assess robustness under domain shift. While the reported mIoU of 0.9863 on the mixed split (with clear gains on the 300 in-the-wild images) provides indirect evidence that retrieved exemplars yield effective prompts, we acknowledge the value of direct verification. In the revised manuscript we will add a new subsection containing: (i) a histogram of FAISS similarity scores across the test set, (ii) retrieval success rates (defined as the fraction of queries where the retrieved exemplar produces point prompts leading to mIoU > 0.90), and (iii) a qualitative failure-case analysis with example images where retrieval is suboptimal. These elements will be computed from the existing public code and data.

Revision: yes
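The two quantitative diagnostics described in this response could be computed roughly as follows. This is a sketch that assumes one segmentation score and one top-1 similarity per query, with similarities in [0, 1]; the function names and toy numbers are hypothetical.

```python
import numpy as np

def retrieval_success_rate(per_query_score, threshold=0.90):
    """Fraction of queries whose final segmentation score exceeds the threshold."""
    x = np.asarray(per_query_score, dtype=float)
    return float((x > threshold).mean())

def similarity_histogram(sims, n_bins=5):
    """Counts of top-1 similarities in equal-width bins over [0, 1].

    Values outside [0, 1] are ignored by np.histogram with this range.
    """
    counts, edges = np.histogram(sims, bins=n_bins, range=(0.0, 1.0))
    return counts, edges

# Toy values (assumed, not from the paper).
rate = retrieval_success_rate([0.99, 0.95, 0.85, 0.97])
counts, edges = similarity_histogram([0.95, 0.90, 0.55, 0.10])
```

Note that a success rate defined through the final score measures the whole pipeline, so a low-similarity retrieval that SAM2 happens to recover from would still count as a success.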

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an empirical, training-free pipeline that retrieves exemplars from a small memory bank using dense DINOv3 features and FAISS, then distills mask-constrained correspondences into point prompts for the external SAM2 model. The reported mIoU of 0.9863 is obtained via direct evaluation on a held-out test split of 600 expert-annotated images (300 controlled, 300 in-the-wild), rather than any closed-form derivation or fitted parameter that reduces to the method's own inputs by construction. No equations, self-definitional loops, or load-bearing self-citations appear in the provided description; the central performance claim rests on the independent behavior of pre-trained external models and retrieval evaluated against ground-truth annotations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that feature similarity implies transferable mask information for prompt generation. No free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: Dense DINOv3 features enable reliable mask-constrained correspondences between query and retrieved tongue images.
Invoked to distill foreground/background point prompts from retrieved exemplars.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

