
arxiv: 2510.15849 · v2 · pith:TLW74ZPNnew · submitted 2025-10-17 · 💻 cs.CV

Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt

Pith reviewed 2026-05-18 06:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords tongue segmentation · SAM2 · retrieval-based prompting · DINO features · medical image analysis · prompt-free segmentation · TCM imaging

The pith

Memory-SAM retrieves similar past tongue images to generate point prompts that guide SAM2 without any human clicks or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Memory-SAM as a training-free method for segmenting tongue images used in traditional Chinese medicine analysis. Instead of requiring large labeled datasets or manual prompts, the approach stores a small set of prior annotated cases and retrieves the most similar one for each new query image. Dense DINOv3 features and FAISS indexing locate the match, after which the exemplar's mask supplies foreground and background point prompts that direct SAM2. On 600 expert-annotated images mixing controlled and in-the-wild conditions, the method reaches 0.9863 mIoU and shows clear gains over both fully supervised models and detector-driven baselines, especially on real-world irregular boundaries. A sympathetic reader would care because the pipeline makes reliable segmentation feasible with minimal data and zero intervention at test time.

Core claim

Memory-SAM is a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. The evaluation uses 600 expert-annotated images (300 controlled, 300 in-the-wild); on the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN at 0.8188 and a detector-to-box SAM baseline at 0.1839, with particular advantages under real-world conditions where boundaries are irregular.
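The retrieval step can be pictured with a small sketch. It is not the authors' code: it assumes each image is summarized by mean-pooling its patch features into a single descriptor, and it uses a brute-force cosine search where the paper uses a FAISS index over DINOv3 features. The names `mean_pool`, `retrieve`, and the toy two-dimensional vectors are hypothetical.

```python
import math

def mean_pool(patch_features):
    """Average a list of patch feature vectors into one global descriptor."""
    dim = len(patch_features[0])
    n = len(patch_features)
    return [sum(vec[d] for vec in patch_features) / n for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def retrieve(query_patches, memory):
    """Return (index, similarity) of the memory entry closest to the query.

    Brute-force stand-in for an approximate nearest-neighbour index.
    """
    q = mean_pool(query_patches)
    scores = [cosine(q, mean_pool(entry["patches"])) for entry in memory]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy memory bank: two annotated cases with made-up 2-D patch features.
memory = [
    {"name": "case_a", "patches": [[1.0, 0.0], [0.9, 0.1]]},
    {"name": "case_b", "patches": [[0.0, 1.0], [0.1, 0.9]]},
]
query = [[0.2, 0.8], [0.0, 1.0]]

best_index, best_score = retrieve(query, memory)
print(memory[best_index]["name"], round(best_score, 3))
```

In this toy setup the query's features point in nearly the same direction as `case_b`, so that entry is returned as the exemplar.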

What carries the argument

The retrieval-to-prompt mechanism that locates an exemplar via dense DINOv3 features and FAISS, then converts mask-constrained feature correspondences into foreground and background point prompts for SAM2.
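One way to read "mask-constrained correspondences" is sketched below, under stated assumptions. Each exemplar patch is matched to its most similar query patch; matches from patches inside the exemplar mask become foreground points and matches from patches outside it become background points. The function names, the dot-product matching, the first-match-wins rule, and the patch-centre coordinates are all illustrative choices, not details confirmed by the paper. The labels follow the SAM-family convention of 1 for foreground and 0 for background.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def best_match(vec, candidates):
    """Index of the candidate with the highest dot-product similarity to vec."""
    return max(range(len(candidates)), key=lambda i: dot(vec, candidates[i]))

def patch_center(index, grid_width, patch_size):
    """Pixel (x, y) at the centre of the patch with the given flat index."""
    row, col = divmod(index, grid_width)
    return (col * patch_size + patch_size // 2,
            row * patch_size + patch_size // 2)

def prompts_from_exemplar(query_patches, exemplar_patches, exemplar_mask,
                          grid_width, patch_size):
    """Turn exemplar-to-query patch matches into labelled point prompts.

    exemplar_mask holds one 0/1 flag per exemplar patch (1 = inside the mask).
    Each query patch is used at most once; the first match wins (a simplification).
    """
    points, labels, seen = [], [], set()
    for ex_vec, inside in zip(exemplar_patches, exemplar_mask):
        q_idx = best_match(ex_vec, query_patches)
        if q_idx in seen:
            continue
        seen.add(q_idx)
        points.append(patch_center(q_idx, grid_width, patch_size))
        labels.append(1 if inside else 0)
    return points, labels

# Toy 2x2 patch grid with made-up 2-D features and 16-pixel patches.
exemplar_patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
exemplar_mask = [1, 1, 0, 0]
query_patches = [[0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]

points, labels = prompts_from_exemplar(
    query_patches, exemplar_patches, exemplar_mask, grid_width=2, patch_size=16
)
print(points, labels)
```

Here the sketch yields one foreground point at (24, 24) and one background point at (8, 8). In a real pipeline these coordinate and label lists would be handed to the segmentation model's point-prompt interface.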

If this is right

  • Segmentation of irregular tongue boundaries becomes feasible with only a small collection of prior annotated cases rather than thousands of new labels.
  • The same retrieval-to-prompt process works on both controlled lab images and in-the-wild captures without any model retraining.
  • Point prompts derived from correspondences allow SAM2 to focus on foreground and background regions automatically.
  • The overall pipeline stays fully automatic at inference time, removing the need for human clicks or updates to the underlying model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If retrieval occasionally returns a poor match for rare tongue presentations, accuracy would drop, implying that expanding the memory bank with more diverse cases could further stabilize results.
  • The same retrieval-plus-prompt strategy might transfer to other medical imaging domains where SAM-family models are already deployed but manual prompting remains a bottleneck.
  • Allowing the memory bank to grow incrementally with newly segmented cases could create a self-improving system without requiring periodic full retraining.
  • In clinical settings with limited annotation staff, this approach lowers the barrier to deploying reliable tongue analysis tools.

Load-bearing premise

That dense DINOv3 features will consistently retrieve an exemplar whose existing mask yields accurate foreground and background point prompts for any new tongue image.

What would settle it

A controlled test on query images whose closest memory matches have tongue shapes or lighting that differ markedly from the query, followed by measurement of whether the distilled point prompts still produce high-accuracy SAM2 segmentations.
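A minimal sketch of how such a test could be scored: split the queries by how similar their retrieved exemplar was, then compare mean IoU between the two groups. The split value of 0.8, the function names, and every number below are hypothetical and are not taken from the paper.

```python
def mean(values):
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")

def iou_by_match_quality(similarities, ious, split=0.8):
    """Compare mean IoU for well-matched vs. poorly-matched queries.

    similarities[i] is the retrieval score of query i, ious[i] its segmentation IoU.
    """
    near = [v for s, v in zip(similarities, ious) if s >= split]
    far = [v for s, v in zip(similarities, ious) if s < split]
    return {"near": mean(near), "far": mean(far), "gap": mean(near) - mean(far)}

# Made-up per-query values, for illustration only.
similarities = [0.95, 0.88, 0.72, 0.91, 0.65]
ious = [0.99, 0.97, 0.85, 0.98, 0.60]

report = iou_by_match_quality(similarities, ious)
print(report)
```

A large positive `gap` would suggest that segmentation quality depends heavily on retrieval quality; a gap near zero would suggest the prompts stay useful even when the closest match is distant.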

Figures

Figures reproduced from arXiv: 2510.15849 by Dongmei Yu, Joongwon Chae, Lian Zhang, Lihui Luo, Peiwu Qin, Xi Yuan, Zhenglin Chen.

Figure 1: Conceptual comparison of different tongue segmentation paradigms. (a) Traditional supervised methods …
Figure 2: The overall framework of Memory-SAM. Given a query image, we retrieve the most similar case from a …
Figure 3: Qualitative comparison in challenging in-the-wild scenarios. SAM-based results (Tongue-SAM, Memory …
Figure 4: Qualitative comparison on the simpler HIT-Tongue-Image dataset. Color coding is consistent with Fig. 3.
Original abstract

Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
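For readers unfamiliar with the metric, the sketch below computes intersection over union for binary masks and averages it over images. Whether the paper averages per image or per class is not stated in the abstract, so the per-image averaging here is an assumption, as is the convention of scoring two empty masks as 1.0.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks given as nested 0/1 lists."""
    inter = union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention chosen here: two empty masks count as a perfect match.
    return inter / union if union else 1.0

def mean_iou(preds, truths):
    """Average IoU over a list of (prediction, ground truth) mask pairs."""
    scores = [iou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)

# Two tiny made-up masks: one half-correct prediction, one exact match.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
truths = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]

print(mean_iou(preds, truths))
```

The first pair overlaps on one of two labelled pixels (IoU 0.5) and the second matches exactly (IoU 1.0), so the mean is 0.75.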

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents Memory-SAM, a training-free and human-prompt-free pipeline for tongue segmentation. Given a query image, it retrieves a similar exemplar from a small memory bank using dense DINOv3 features and FAISS, then distills mask-constrained correspondences into foreground/background point prompts that guide SAM2. Evaluated on 600 expert-annotated images (300 controlled, 300 in-the-wild), Memory-SAM reports mIoU 0.9863 on the mixed test split, outperforming FCN (0.8188) and a detector-to-box SAM baseline (0.1839), with noted gains under real-world conditions. The code is publicly released.

Significance. If the results hold, the work demonstrates a data-efficient way to adapt SAM2 for medical segmentation tasks with irregular boundaries and limited annotations, such as TCM tongue imaging. By avoiding fine-tuning and manual prompts, it addresses practical barriers in deploying foundation models. Public code availability is a clear strength that supports reproducibility and extension to other domains.

major comments (1)
  1. [Evaluation] The central performance claim (mIoU 0.9863 on the mixed split, especially the 300 in-the-wild images) rests on the retrieval step consistently surfacing suitable exemplars whose masks yield reliable fg/bg point prompts after domain shift. However, the manuscript provides no per-image retrieval diagnostics, similarity-score histograms, retrieval success rates, or failure-case analysis to verify this assumption.
minor comments (2)
  1. [Abstract] The abstract notes ceiling effects above 0.98 on controlled data due to annotation variability but does not report the per-split mIoU values, standard deviations, or inter-annotator agreement to contextualize the results.
  2. [Evaluation] No error bars or multiple-run statistics accompany the reported mIoU figures, which would strengthen the quantitative comparison to the FCN and detector-to-box baselines.
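Because the pipeline is training-free, run-to-run variance from random initialization may not apply, but uncertainty over the test images still does. One common way to attach error bars is a percentile bootstrap over per-image scores, sketched below. The function name, the resample count, and the score values are hypothetical; nothing here is taken from the paper's results.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample the images with replacement and record the mean score.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up per-image IoU values, for illustration only.
per_image_iou = [0.99, 0.98, 0.97, 0.99, 0.95, 0.98, 0.90, 0.99]

lower, upper = bootstrap_ci(per_image_iou)
sample_mean = sum(per_image_iou) / len(per_image_iou)
print(f"mean IoU {sample_mean:.4f}, 95% CI [{lower:.4f}, {upper:.4f}]")
```

Reporting such an interval for each method and each split would make it easier to judge whether differences near the 0.98 ceiling are meaningful.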

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the major comment on evaluation below and will strengthen the manuscript accordingly.

point-by-point responses (1)
  1. Referee: [Evaluation] The central performance claim (mIoU 0.9863 on the mixed split, especially the 300 in-the-wild images) rests on the retrieval step consistently surfacing suitable exemplars whose masks yield reliable fg/bg point prompts after domain shift. However, the manuscript provides no per-image retrieval diagnostics, similarity-score histograms, retrieval success rates, or failure-case analysis to verify this assumption.

    Authors: We agree that additional diagnostics on the retrieval step would improve transparency and allow readers to better assess robustness under domain shift. While the reported mIoU of 0.9863 on the mixed split (with clear gains on the 300 in-the-wild images) provides indirect evidence that retrieved exemplars yield effective prompts, we acknowledge the value of direct verification. In the revised manuscript we will add a new subsection containing: (i) a histogram of FAISS similarity scores across the test set, (ii) retrieval success rates (defined as the fraction of queries where the retrieved exemplar produces point prompts leading to mIoU > 0.90), and (iii) a qualitative failure-case analysis with example images where retrieval is suboptimal. These elements will be computed from the existing public code and data. revision: yes
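The two quantitative diagnostics promised in this response could be computed along the following lines. This is a sketch under assumptions: the 0.90 threshold comes from the response above, while the function names, the equal-width binning over [0, 1], and all sample values are hypothetical.

```python
def success_rate(scores, threshold=0.90):
    """Fraction of queries whose segmentation score exceeds the threshold."""
    return sum(1 for v in scores if v > threshold) / len(scores)

def histogram(values, n_bins=5, lo=0.0, hi=1.0):
    """Counts of values in n_bins equal-width bins covering [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        # Clamp so that v == hi lands in the last bin and v < lo in the first.
        counts[max(0, min(idx, n_bins - 1))] += 1
    return counts

# Made-up per-query values, for illustration only.
similarity_scores = [0.95, 0.88, 0.72, 0.91, 0.65]
query_scores = [0.99, 0.97, 0.85, 0.98, 0.60]

print("success rate:", success_rate(query_scores))
print("similarity histogram:", histogram(similarity_scores))
```

With these toy values, three of five queries clear the threshold, and the similarity scores fall into the two highest bins.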

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical, training-free pipeline that retrieves exemplars from a small memory bank using dense DINOv3 features and FAISS, then distills mask-constrained correspondences into point prompts for the external SAM2 model. The reported mIoU of 0.9863 is obtained via direct evaluation on a held-out test split of 600 expert-annotated images (300 controlled, 300 in-the-wild), rather than any closed-form derivation or fitted parameter that reduces to the method's own inputs by construction. No equations, self-definitional loops, or load-bearing self-citations appear in the provided description; the central performance claim rests on the independent behavior of pre-trained external models and retrieval evaluated against ground-truth annotations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that feature similarity implies transferable mask information for prompt generation. No free parameters or new entities are introduced in the abstract.

axioms (1)
  • Domain assumption: Dense DINOv3 features enable reliable mask-constrained correspondences between query and retrieved tongue images.
    Invoked to distill foreground/background point prompts from retrieved exemplars.

pith-pipeline@v0.9.0 · 5749 in / 1347 out tokens · 34542 ms · 2026-05-18T06:00:24.821775+00:00 · methodology


