Flemme: A Flexible and Modular Learning Platform for Medical Images
Pith reviewed 2026-05-23 21:50 UTC · model grok-4.3
The pith
Separating encoders from architectures in a modular platform yields average gains of 5.6% Dice and 5.57% PSNR on medical image tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Flemme separates encoders from model architectures so that different models are built via combinations of supported encoders and derived encoder-decoder styles. Encoders process 2D and 3D patches using convolution, transformer, and SSM blocks. A general hierarchical architecture adds a pyramid loss to optimize and fuse vertical features. This construction produces average improvements of 5.60% Dice and 7.81% mIoU for segmentation models together with 5.57% PSNR and 8.22% SSIM for reconstruction models.
What carries the argument
Modular separation of encoders (built from convolution, transformer, and SSM blocks) from base encoder-decoder architectures, plus a pyramid loss for hierarchical vertical feature fusion.
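To make the separation concrete, here is a minimal pure-Python sketch of the idea, assuming a simple registry pattern. Every name in it (`ENCODERS`, `ARCHITECTURES`, `register`, `build_model`, the toy encoders and heads) is hypothetical and chosen for illustration; none of it is taken from Flemme's code base, and the "encoders" are stand-in list operations rather than real convolution, transformer, or SSM blocks.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical registries: illustrative names, not Flemme's actual API.
ENCODERS = {}
ARCHITECTURES = {}


def register(registry, name):
    """Return a decorator that stores a callable in `registry` under `name`."""
    def decorator(fn):
        registry[name] = fn
        return fn
    return decorator


@register(ENCODERS, "conv")
def conv_encoder(x):
    # Stand-in for a convolutional block: a 2-tap moving average.
    return [(a + b) / 2 for a, b in zip(x, x[1:])]


@register(ENCODERS, "identity")
def identity_encoder(x):
    # Stand-in for any other encoder family.
    return list(x)


@dataclass
class Model:
    encoder: Callable
    head: Callable

    def __call__(self, x):
        return self.head(self.encoder(x))


@register(ARCHITECTURES, "seg")
def seg_architecture(encoder):
    # Toy segmentation head: threshold the features at 0.5.
    return Model(encoder, lambda feats: [1 if v > 0.5 else 0 for v in feats])


@register(ARCHITECTURES, "recon")
def recon_architecture(encoder):
    # Toy reconstruction head: pass the features through unchanged.
    return Model(encoder, lambda feats: list(feats))


def build_model(encoder_name, architecture_name):
    """Combine any registered encoder with any registered architecture."""
    return ARCHITECTURES[architecture_name](ENCODERS[encoder_name])


model = build_model("conv", "seg")
prediction = model([0.0, 1.0, 1.0])
```

With two encoders and two architectures registered, four models are available without writing any of them by hand; adding a third encoder would add two more. That combinatorial reuse is the practical benefit the paper attributes to the separation.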
If this is right
- Segmentation models reach higher Dice and mIoU scores across datasets.
- Reconstruction models achieve higher PSNR and SSIM values.
- Different encoder families can be swapped and ranked for effectiveness and efficiency on the same task.
- Practitioners avoid repeated manual construction of combined backbones and heads.
Where Pith is reading between the lines
- The separation principle could shorten the time needed to prototype new medical imaging pipelines.
- New encoder designs could be inserted and benchmarked without rewriting the downstream task heads.
- The same modular pattern might apply to classification or detection tasks if the encoder blocks remain compatible.
Load-bearing premise
The measured performance gains arise mainly from the encoder-architecture separation and pyramid loss rather than from dataset tuning or baseline selection.
What would settle it
Re-run the reported experiments with a single integrated model that uses the identical best encoder-architecture pair but removes the explicit modular interface, and check whether the reported metric improvements disappear.
Original abstract
With the rapid development of computer vision and the emergence of powerful network backbones and architectures, the application of deep learning in medical imaging has become increasingly significant. Unlike natural images, medical images lack huge volumes of data but feature more modalities, making it difficult to train a general model that performs satisfactorily across various datasets. In practice, practitioners often have to manually create and test models that combine independent backbones and architectures, which is a laborious and time-consuming process. We propose Flemme, a FLExible and Modular learning platform for MEdical images. Our platform separates encoders from the model architectures so that different models can be constructed via various combinations of supported encoders and architectures. We construct encoders using building blocks based on convolution, transformer, and state-space model (SSM) to process both 2D and 3D image patches. A base architecture is implemented following an encoder-decoder style, with several derived architectures for image segmentation, reconstruction, and generation tasks. In addition, we propose a general hierarchical architecture incorporating a pyramid loss to optimize and fuse vertical features. Experiments demonstrate that this simple design leads to an average improvement of 5.60% in Dice score and 7.81% in mean intersection over union (mIoU) for segmentation models, as well as an enhancement of 5.57% in peak signal-to-noise ratio (PSNR) and 8.22% in structural similarity (SSIM) for reconstruction models. We further utilize Flemme as an analytical tool to assess the effectiveness and efficiency of various encoders across different tasks. Code is available at https://github.com/wlsdzyzl/flemme.
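For readers less familiar with the metrics quoted in the abstract, the following sketch shows the standard definitions of Dice, IoU, and PSNR on flattened inputs. It is a plain-Python illustration, not the evaluation code used in the paper; the function names and the convention of returning 1.0 for two empty masks are assumptions made here. mIoU is the mean of per-class IoU values, and SSIM is omitted because it requires windowed local statistics.

```python
import math


def dice(pred, target):
    """Dice coefficient for two flat binary masks (lists of 0/1)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2 * inter / total if total else 1.0  # assumed: empty vs. empty counts as perfect


def iou(pred, target):
    """Intersection over union for two flat binary masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - inter
    return inter / union if union else 1.0


def psnr(recon, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two flat intensity lists."""
    mse = sum((a - b) ** 2 for a, b in zip(recon, ref)) / len(ref)
    return math.inf if mse == 0 else 10 * math.log10(max_val ** 2 / mse)


# Toy inputs, invented for illustration only.
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
d = dice(pred_mask, true_mask)
j = iou(pred_mask, true_mask)
p = psnr([0.0, 0.0], [0.1, 0.1])
```

For a single binary mask the two overlap metrics are linked by Dice = 2·IoU / (1 + IoU), so the Dice and mIoU gains reported above are not independent pieces of evidence.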
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Flemme, a flexible modular platform for medical images that decouples encoders (constructed from convolution, transformer, and state-space model blocks supporting 2D/3D patches) from task-specific architectures (base encoder-decoder with derived variants for segmentation, reconstruction, and generation). It adds a hierarchical pyramid loss for vertical feature optimization and reports average empirical gains of 5.60% Dice / 7.81% mIoU on segmentation tasks and 5.57% PSNR / 8.22% SSIM on reconstruction tasks, while positioning the platform as an analytical tool for encoder evaluation. Code is released at a GitHub repository.
Significance. If the reported gains can be attributed to the modular encoder-architecture separation and pyramid loss under controlled conditions, the platform would address a genuine practical bottleneck in medical imaging by reducing manual model assembly effort and enabling systematic encoder comparisons across modalities. The open-source release strengthens reproducibility. However, the current evidence does not yet establish this attribution, limiting the immediate impact.
major comments (3)
- [Abstract and Experiments] Abstract and Experiments section: The headline improvements (5.60% Dice, 7.81% mIoU, 5.57% PSNR, 8.22% SSIM) are stated as resulting from the modular design, yet no information is given on whether baseline models received identical optimizer schedules, augmentation pipelines, early-stopping criteria, or encoder pre-training. Without these controls the attribution to encoder-architecture decoupling plus pyramid loss cannot be verified.
- [Experiments] Experiments section: No ablation studies are described that isolate the contribution of the encoder-architecture separation from the pyramid loss, or from possible differences in hyperparameter search effort between the proposed models and the baselines.
- [Experiments] Experiments section: The manuscript provides no details on the number of datasets or models averaged to obtain the reported percentages, nor any statistical significance tests, standard deviations across runs, or exclusion criteria for the quantitative results.
minor comments (2)
- [Method] The description of the pyramid loss could be clarified with a diagram or explicit equation showing how vertical features are fused across scales.
- [Platform description] A summary table listing all supported encoders and derived architectures would improve readability and help readers quickly assess the platform's scope.
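For orientation only, the kind of explicit equation requested in the first minor comment could take the following form. This is a generic multi-scale (deep supervision) loss written under stated assumptions; it is not taken from the manuscript, and every symbol in it (the number of levels L, the weights w_l, the per-level predictions, the downsampling operator D_l, and the base loss ℓ) is introduced here for illustration.

```latex
% Hypothetical pyramid-style loss; all symbols are illustrative assumptions.
%   L          number of decoder levels (l = 1 is the finest resolution)
%   \hat{y}_l  prediction produced at level l
%   D_l        operator that resamples the target y to the resolution of level l
%   \ell       base loss (e.g. Dice loss for segmentation, MSE for reconstruction)
%   w_l        non-negative weight of level l
\mathcal{L}_{\text{pyramid}}
  = \sum_{l=1}^{L} w_l \, \ell\bigl(\hat{y}_l,\; D_l(y)\bigr),
\qquad w_l \ge 0 .
```

Whether the paper's loss matches this form, and how the per-level features are fused, is exactly what the requested diagram or equation would need to settle.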
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for clearer experimental controls and statistical reporting. We address each major comment below and will incorporate revisions to strengthen the manuscript.
- Referee: [Abstract and Experiments] Abstract and Experiments section: The headline improvements (5.60% Dice, 7.81% mIoU, 5.57% PSNR, 8.22% SSIM) are stated as resulting from the modular design, yet no information is given on whether baseline models received identical optimizer schedules, augmentation pipelines, early-stopping criteria, or encoder pre-training. Without these controls the attribution to encoder-architecture decoupling plus pyramid loss cannot be verified.
  Authors: We agree that the current manuscript does not explicitly document the training configurations applied to the baseline models. In the revised version, we will add a dedicated experimental setup subsection (and associated table) that details identical optimizer schedules, augmentation pipelines, early-stopping criteria, and any encoder pre-training used across all compared models. This will enable direct verification that performance differences are attributable to the modular encoder-architecture separation and pyramid loss.
  revision: yes
- Referee: [Experiments] Experiments section: No ablation studies are described that isolate the contribution of the encoder-architecture separation from the pyramid loss, or from possible differences in hyperparameter search effort between the proposed models and the baselines.
  Authors: We acknowledge the lack of targeted ablations. The original experiments were designed to demonstrate the platform's overall utility rather than isolate individual components. In the revision we will include new ablation studies that (i) compare the full platform against versions without the pyramid loss and (ii) report results under matched hyperparameter search budgets to separate the effects of the modular design from tuning effort.
  revision: yes
- Referee: [Experiments] Experiments section: The manuscript provides no details on the number of datasets or models averaged to obtain the reported percentages, nor any statistical significance tests, standard deviations across runs, or exclusion criteria for the quantitative results.
  Authors: The manuscript indeed omits these quantitative details. We will revise the Experiments section to state the exact number of datasets and models included in the averages, report standard deviations across multiple random seeds, include statistical significance tests (e.g., paired t-tests), and explicitly list any exclusion criteria applied to the reported results.
  revision: yes
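The statistical reporting promised in the last response can be sketched as follows. This is a minimal standard-library example of a paired t-statistic over matched runs; the Dice scores are invented purely for illustration and are not results from the paper, and `paired_t` is a hypothetical helper name.

```python
import math
import statistics


def paired_t(baseline, proposed):
    """Return (mean difference, std of differences, t statistic, degrees of freedom)."""
    diffs = [p - b for b, p in zip(baseline, proposed)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t_stat = mean_d / (sd_d / math.sqrt(n))
    return mean_d, sd_d, t_stat, n - 1


# Hypothetical Dice scores from five matched seeds (NOT results from the paper).
baseline_dice = [0.80, 0.82, 0.78, 0.81, 0.79]
proposed_dice = [0.84, 0.85, 0.83, 0.84, 0.82]

mean_d, sd_d, t_stat, dof = paired_t(baseline_dice, proposed_dice)
```

The t statistic would then be compared against the critical value for the given degrees of freedom (2.776 for a two-sided test at the 0.05 level with 4 degrees of freedom). Pairing matters here: the baseline and proposed scores in each pair should share the same seed and data split, so that the test measures the effect of the model change rather than run-to-run variance.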
Circularity Check
No circularity; claims rest on direct empirical measurements, not derivations that reduce to inputs
The paper introduces a modular platform separating encoders from architectures and reports measured performance gains (Dice, mIoU, PSNR, SSIM) from experiments on segmentation and reconstruction tasks. No equations, fitted parameters, or predictions are presented that reduce the reported improvements to quantities defined by the platform itself or by self-citations. The central results are empirical comparisons, not self-definitional or fitted-input predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text. This is a standard non-circular empirical engineering paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard deep-learning optimization, loss functions, and encoder-decoder structures transfer to medical image tasks.