Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt
Pith reviewed 2026-05-18 06:00 UTC · model grok-4.3
The pith
Memory-SAM retrieves similar past tongue images to generate point prompts that guide SAM2 without any human clicks or fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Memory-SAM is a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. On the mixed test split of 600 expert-annotated images, Memory-SAM achieves mIoU 0.9863, surpassing FCN at 0.8188 and a detector-to-box SAM baseline at 0.1839, with particular advantages under real-world conditions where boundaries are irregular.
What carries the argument
The retrieval-to-prompt mechanism that locates an exemplar via dense DINOv3 features and FAISS, then converts mask-constrained feature correspondences into foreground and background point prompts for SAM2.
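The mechanism above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: brute-force cosine similarity stands in for the FAISS search, tiny hand-written vectors stand in for DINOv3 features, and the matching rule (each query patch takes the label of its nearest exemplar patch, depending on whether that patch lies inside the exemplar mask) is only one plausible reading of "mask-constrained correspondences". The names `retrieve`, `points_from_correspondences`, `grid_w`, `patch_px`, and `k` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_desc, memory):
    """Return the memory entry whose global descriptor best matches the query.
    Brute-force search; an assumed stand-in for a FAISS index lookup."""
    return max(memory, key=lambda e: cosine(query_desc, e["desc"]))

def points_from_correspondences(query_patches, exemplar, grid_w, patch_px, k=2):
    """Turn patch correspondences into point prompts.
    Each query patch is matched to its most similar exemplar patch; the
    exemplar mask decides whether the query patch is a foreground or a
    background candidate. The k most confident candidates per class are kept."""
    fg, bg = [], []
    for qi, qf in enumerate(query_patches):
        sims = [cosine(qf, ef) for ef in exemplar["patches"]]
        best = max(range(len(sims)), key=sims.__getitem__)
        row, col = divmod(qi, grid_w)
        # Pixel coordinates (x, y) of the query patch centre.
        xy = (col * patch_px + patch_px // 2, row * patch_px + patch_px // 2)
        (fg if exemplar["mask"][best] else bg).append((sims[best], xy))
    fg_pts = [xy for _, xy in sorted(fg, reverse=True)[:k]]
    bg_pts = [xy for _, xy in sorted(bg, reverse=True)[:k]]
    return fg_pts + bg_pts, [1] * len(fg_pts) + [0] * len(bg_pts)

# Toy memory bank (made-up values): 2x2 patch grids, 2-D patch features.
memory = [
    {"id": "a", "desc": [1.0, 0.0, 0.0],
     "patches": [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]],
     "mask": [True, True, False, False]},
    {"id": "b", "desc": [0.0, 1.0, 0.0],
     "patches": [[0.0, 1.0]] * 4,
     "mask": [False] * 4},
]
query_desc = [0.9, 0.2, 0.0]
query_patches = [[0.0, 1.0], [0.9, 0.1], [1.0, 0.0], [0.2, 1.0]]

exemplar = retrieve(query_desc, memory)
points, labels = points_from_correspondences(
    query_patches, exemplar, grid_w=2, patch_px=16)
print(exemplar["id"], points, labels)
```

The returned coordinates and 1/0 labels follow the foreground/background point-prompt convention used by SAM-family predictors; the call into SAM2 itself is deliberately left out of the sketch.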
If this is right
- Segmentation of irregular tongue boundaries becomes feasible with only a small collection of prior annotated cases rather than thousands of new labels.
- The same retrieval-to-prompt process works on both controlled lab images and in-the-wild captures without any model retraining.
- Point prompts derived from correspondences allow SAM2 to focus on foreground and background regions automatically.
- The overall pipeline stays fully automatic at inference time, removing the need for human clicks or updates to the underlying model.
Where Pith is reading between the lines
- If retrieval occasionally returns a poor match for rare tongue presentations, accuracy would drop, implying that expanding the memory bank with more diverse cases could further stabilize results.
- The same retrieval-plus-prompt strategy might transfer to other medical imaging domains where SAM-family models are already deployed but manual prompting remains a bottleneck.
- Allowing the memory bank to grow incrementally with newly segmented cases could create a self-improving system without requiring periodic full retraining.
- In clinical settings with limited annotation staff, this approach lowers the barrier to deploying reliable tongue analysis tools.
Load-bearing premise
That dense DINOv3 features will consistently retrieve an exemplar whose existing mask yields accurate foreground and background point prompts for any new tongue image.
What would settle it
A controlled test on query images whose closest memory matches have tongue shapes or lighting that differ markedly from the query, followed by measurement of whether the distilled point prompts still produce high-accuracy SAM2 segmentations.
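One way to run such a test, sketched under assumptions: record a retrieval similarity and a per-image IoU for each query, then compare mean IoU across similarity bins. The helper `mean_iou_by_similarity`, the bin edges, and the records below are hypothetical placeholders, not values from the paper.

```python
def mean_iou_by_similarity(records, edges):
    """Group (similarity, iou) pairs into bins [edges[i], edges[i+1]) and
    return (low, high, count, mean_iou) per bin. The last bin also includes
    its upper edge. Empty bins report a mean of None."""
    n_bins = len(edges) - 1
    buckets = [[] for _ in range(n_bins)]
    for sim, iou in records:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            below_upper = sim < edges[i + 1] or (is_last and sim <= edges[i + 1])
            if edges[i] <= sim and below_upper:
                buckets[i].append(iou)
                break
    return [
        (edges[i], edges[i + 1], len(b), sum(b) / len(b) if b else None)
        for i, b in enumerate(buckets)
    ]

# Placeholder records: (retrieval similarity, per-image IoU).
records = [(0.95, 0.98), (0.90, 0.96), (0.60, 0.70), (0.55, 0.90), (0.30, 0.40)]
summary = mean_iou_by_similarity(records, edges=[0.0, 0.5, 0.8, 1.0])
for low, high, count, mean in summary:
    print(f"similarity [{low}, {high}): n={count}, mean IoU={mean}")
```

If mean IoU stays high in the low-similarity bins, the premise survives the test; a sharp drop there would indicate that accuracy depends on finding a close match.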
Original abstract
Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Memory-SAM, a training-free and human-prompt-free pipeline for tongue segmentation. Given a query image, it retrieves a similar exemplar from a small memory bank using dense DINOv3 features and FAISS, then distills mask-constrained correspondences into foreground/background point prompts that guide SAM2. Evaluated on 600 expert-annotated images (300 controlled, 300 in-the-wild), Memory-SAM reports mIoU 0.9863 on the mixed test split, outperforming FCN (0.8188) and a detector-to-box SAM baseline (0.1839), with noted gains under real-world conditions. The code is publicly released.
Significance. If the results hold, the work demonstrates a data-efficient way to adapt SAM2 for medical segmentation tasks with irregular boundaries and limited annotations, such as TCM tongue imaging. By avoiding fine-tuning and manual prompts, it addresses practical barriers in deploying foundation models. Public code availability is a clear strength that supports reproducibility and extension to other domains.
major comments (1)
- [Evaluation] The central performance claim (mIoU 0.9863 on the mixed split, especially the 300 in-the-wild images) rests on the retrieval step consistently surfacing suitable exemplars whose masks yield reliable fg/bg point prompts after domain shift. However, the manuscript provides no per-image retrieval diagnostics, similarity-score histograms, retrieval success rates, or failure-case analysis to verify this assumption.
minor comments (2)
- [Abstract] The abstract notes ceiling effects above 0.98 on controlled data due to annotation variability but does not report the per-split mIoU values, standard deviations, or inter-annotator agreement to contextualize the results.
- [Evaluation] No error bars or multiple-run statistics accompany the reported mIoU figures, which would strengthen the quantitative comparison to the FCN and detector-to-box baselines.
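The error bars requested in the second minor comment could be obtained with a percentile bootstrap over per-image IoU values. The sketch below assumes that mIoU is the mean of per-image IoU (the page does not state the paper's exact definition); `iou`, `bootstrap_ci`, and the sample values are hypothetical.

```python
import random

def iou(pred, gt):
    """IoU of two flat binary masks given as equal-length sequences."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0

def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample images with replacement, recompute the
    mean each time, and read the interval off the sorted resampled means."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(values) / n, lo, hi

# Placeholder per-image IoU values, not results from the paper.
per_image_iou = [0.99, 0.98, 0.97, 0.95, 0.90, 0.62]
mean, lo, hi = bootstrap_ci(per_image_iou)
print(f"mIoU = {mean:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

Resampling images captures variability across the test set only; annotation variability, which the first minor comment raises, would need inter-annotator data.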
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address the major comment on evaluation below and will strengthen the manuscript accordingly.
Point-by-point responses
- Referee: [Evaluation] The central performance claim (mIoU 0.9863 on the mixed split, especially the 300 in-the-wild images) rests on the retrieval step consistently surfacing suitable exemplars whose masks yield reliable fg/bg point prompts after domain shift. However, the manuscript provides no per-image retrieval diagnostics, similarity-score histograms, retrieval success rates, or failure-case analysis to verify this assumption.
  Authors: We agree that additional diagnostics on the retrieval step would improve transparency and allow readers to better assess robustness under domain shift. While the reported mIoU of 0.9863 on the mixed split (with clear gains on the 300 in-the-wild images) provides indirect evidence that retrieved exemplars yield effective prompts, we acknowledge the value of direct verification. In the revised manuscript we will add a new subsection containing: (i) a histogram of FAISS similarity scores across the test set, (ii) retrieval success rates (defined as the fraction of queries where the retrieved exemplar produces point prompts leading to mIoU > 0.90), and (iii) a qualitative failure-case analysis with example images where retrieval is suboptimal. These elements will be computed from the existing public code and data.
  Revision: yes
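Items (i) and (ii) of the rebuttal are simple to compute once per-query similarity scores and IoU values are logged. A minimal sketch, assuming scores lie in [0, 1] and using the 0.90 threshold from the rebuttal; the function names and sample values are hypothetical.

```python
def retrieval_success_rate(per_query_iou, threshold=0.90):
    """Fraction of queries whose resulting segmentation exceeds the threshold."""
    return sum(1 for v in per_query_iou if v > threshold) / len(per_query_iou)

def histogram(scores, n_bins=5, lo=0.0, hi=1.0):
    """Counts per equal-width bin; out-of-range scores are clamped to the
    first or last bin, and a score equal to hi falls in the last bin."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for s in scores:
        idx = int((s - lo) / width)
        counts[max(0, min(idx, n_bins - 1))] += 1
    return counts

# Placeholder values, not results from the paper.
per_query_iou = [0.99, 0.95, 0.91, 0.90, 0.42]
similarity_scores = [0.05, 0.45, 0.55, 0.95, 1.0]

success = retrieval_success_rate(per_query_iou)
counts = histogram(similarity_scores)
print(success, counts)
```

Note that the threshold comparison is strict, so a query at exactly 0.90 counts as a failure under this reading of the rebuttal's definition.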
Circularity Check
No significant circularity detected
The paper describes an empirical, training-free pipeline that retrieves exemplars from a small memory bank using dense DINOv3 features and FAISS, then distills mask-constrained correspondences into point prompts for the external SAM2 model. The reported mIoU of 0.9863 is obtained via direct evaluation on a held-out test split of 600 expert-annotated images (300 controlled, 300 in-the-wild), rather than any closed-form derivation or fitted parameter that reduces to the method's own inputs by construction. No equations, self-definitional loops, or load-bearing self-citations appear in the provided description; the central performance claim rests on the independent behavior of pre-trained external models and retrieval evaluated against ground-truth annotations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Dense DINOv3 features enable reliable mask-constrained correspondences between query and retrieved tongue images.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.