arxiv: 2408.09369 · v3 · submitted 2024-08-18 · 📡 eess.IV · cs.CV

Flemme: A Flexible and Modular Learning Platform for Medical Images

Pith reviewed 2026-05-23 21:50 UTC · model grok-4.3

classification 📡 eess.IV cs.CV
keywords medical imaging · image segmentation · image reconstruction · modular architecture · encoder-decoder · pyramid loss · deep learning platform

The pith

Separating encoders from architectures in a modular platform yields reported average gains of 5.60% Dice and 5.57% PSNR on medical image tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Flemme as a platform that decouples encoders from task architectures to let practitioners combine different building blocks without rebuilding models from scratch. Encoders are assembled from convolution, transformer, and state-space model blocks that handle both 2D and 3D patches. A base encoder-decoder style is extended to segmentation, reconstruction, and generation, with an added hierarchical variant that applies pyramid loss across scales. Experiments report consistent metric lifts when models are assembled this way. The same platform is used to compare encoder types for speed and accuracy on the same tasks.

Core claim

Flemme separates encoders from model architectures so that different models are built via combinations of supported encoders and derived encoder-decoder styles. Encoders process 2D and 3D patches using convolution, transformer, and SSM blocks. A general hierarchical architecture adds a pyramid loss to optimize and fuse vertical features. This construction produces average improvements of 5.60% Dice and 7.81% mIoU for segmentation models together with 5.57% PSNR and 8.22% SSIM for reconstruction models.
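
The separation described above can be illustrated with a toy registry. This is a minimal sketch, not Flemme's actual API: the names `ENCODERS`, `ARCHITECTURES`, and `build_model` are hypothetical, the encoders are plain list transforms standing in for neural blocks, and the two heads only borrow the labels SeM and AE from the paper's figure captions.

```python
from itertools import product
from typing import Callable, Dict, List

Features = List[float]
Encoder = Callable[[Features], Features]

# Toy stand-ins for the three encoder families (assumption: real blocks are neural layers).
def conv_encoder(x: Features) -> Features:
    # Average each pair of neighbours, loosely mimicking a local convolution.
    return [(a + b) / 2 for a, b in zip(x, x[1:])]

def transformer_encoder(x: Features) -> Features:
    # Mix every position with the global mean, loosely mimicking attention.
    mean = sum(x) / len(x)
    return [(v + mean) / 2 for v in x]

def ssm_encoder(x: Features) -> Features:
    # Carry a running state from left to right, loosely mimicking a state-space scan.
    state, out = 0.0, []
    for v in x:
        state = 0.5 * state + 0.5 * v
        out.append(state)
    return out

# Hypothetical registries: encoders and task heads are registered independently.
ENCODERS: Dict[str, Encoder] = {
    "conv": conv_encoder,
    "transformer": transformer_encoder,
    "ssm": ssm_encoder,
}

def segmentation_head(z: Features) -> List[int]:
    # Threshold features into a binary mask.
    return [1 if v > 0.5 else 0 for v in z]

def reconstruction_head(z: Features) -> Features:
    # Identity decoder: return the features as the reconstruction.
    return list(z)

ARCHITECTURES: Dict[str, Callable] = {
    "SeM": segmentation_head,
    "AE": reconstruction_head,
}

def build_model(encoder_name: str, architecture_name: str) -> Callable:
    """Compose any registered encoder with any registered task head."""
    encoder = ENCODERS[encoder_name]
    head = ARCHITECTURES[architecture_name]
    return lambda x: head(encoder(x))

# Every encoder/architecture pair is available without writing a new model class.
models = {(e, a): build_model(e, a) for e, a in product(ENCODERS, ARCHITECTURES)}
print(sorted(models))
```

The point of the design is that adding a fourth encoder to the registry makes it available to every architecture at once, which is what allows the like-for-like encoder comparisons the paper describes.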

What carries the argument

Modular separation of encoders (built from convolution, transformer, and SSM blocks) from base encoder-decoder architectures, plus a pyramid loss for hierarchical vertical feature fusion.
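
The summary does not give the pyramid loss in closed form, so the following is only one plausible reading: a weighted sum of per-scale losses, where each decoder level's prediction is compared with the target downsampled to that level's resolution. The function names, the 1D signals, the mean-squared-error term, and the weights are all assumptions made for illustration.

```python
from typing import List, Sequence

def downsample(signal: Sequence[float], factor: int) -> List[float]:
    """Average-pool a 1D signal by an integer factor (length must be divisible by it)."""
    return [sum(signal[i:i + factor]) / factor for i in range(0, len(signal), factor)]

def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def pyramid_loss(
    preds_by_scale: Sequence[Sequence[float]],
    target: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Assumed form: sum over levels k of weights[k] * loss(pred_k, target at 1/2**k resolution)."""
    total = 0.0
    for k, (pred, w) in enumerate(zip(preds_by_scale, weights)):
        total += w * mse(pred, downsample(target, 2 ** k))
    return total

# Hypothetical example: a 4-sample target and predictions at full, half, and quarter resolution.
target = [0.0, 1.0, 1.0, 0.0]
preds = [[0.0, 0.0, 0.0, 0.0], [0.5, 0.5], [0.5]]
print(pyramid_loss(preds, target, weights=[1.0, 0.5, 0.25]))
```

In this example only the full-resolution prediction is wrong, so the coarse levels contribute nothing and the loss is driven entirely by the finest scale.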

If this is right

  • Segmentation models reach higher Dice and mIoU scores across datasets.
  • Reconstruction models achieve higher PSNR and SSIM values.
  • Different encoder families can be swapped and ranked for effectiveness and efficiency on the same task.
  • Practitioners avoid repeated manual construction of combined backbones and heads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation principle could shorten the time needed to prototype new medical imaging pipelines.
  • New encoder designs could be inserted and benchmarked without rewriting the downstream task heads.
  • The same modular pattern might apply to classification or detection tasks if the encoder blocks remain compatible.

Load-bearing premise

The measured performance gains arise mainly from the encoder-architecture separation and pyramid loss rather than from dataset tuning or baseline selection.

What would settle it

Re-run the reported experiments with a single integrated model that uses the identical best encoder-architecture pair but removes the explicit modular interface. The premise holds only if the reported metric improvements then disappear.

Figures

Figures reproduced from arXiv: 2408.09369 by Guoqing Zhang, Jingyun Yang, Yang Li.

Figure 1: A semantic overview of Flemme. The left box gives 3 examples of building blocks based on convolution, transformer, and SSM. Encoders and
Figure 2: Pipelines of encoder and decoder. Components enclosed in dotted
Figure 3: Illustration of supported architectures: (a) SeM, (b) AE, (d) DDPM. The dashed lines indicate optional paths.
Figure 4: A segmentation model constructed with Hierarchical SeM (H-SeM)
Figure 5: Quantitative results of segmentation models. The top four rows show segmentation results for 2D image datasets: CVC-ClinicDB, Echonet, ISIC, and
Figure 6: Quantitative results of reconstruction and generation models for FastMRI dataset.
Original abstract

With the rapid development of computer vision and the emergence of powerful network backbones and architectures, the application of deep learning in medical imaging has become increasingly significant. Unlike natural images, medical images lack huge volumes of data but feature more modalities, making it difficult to train a general model that performs satisfactorily across various datasets. In practice, practitioners often have to manually create and test models that combine independent backbones and architectures, which is a laborious and time-consuming process. We propose Flemme, a FLExible and Modular learning platform for MEdical images. Our platform separates encoders from the model architectures so that different models can be constructed via various combinations of supported encoders and architectures. We construct encoders using building blocks based on convolution, transformer, and state-space model (SSM) to process both 2D and 3D image patches. A base architecture is implemented following an encoder-decoder style, with several derived architectures for image segmentation, reconstruction, and generation tasks. In addition, we propose a general hierarchical architecture incorporating a pyramid loss to optimize and fuse vertical features. Experiments demonstrate that this simple design leads to an average improvement of 5.60% in Dice score and 7.81% in mean intersection over union (mIoU) for segmentation models, as well as an enhancement of 5.57% in peak signal-to-noise ratio (PSNR) and 8.22% in structural similarity (SSIM) for reconstruction models. We further utilize Flemme as an analytical tool to assess the effectiveness and efficiency of various encoders across different tasks. Code is available at https://github.com/wlsdzyzl/flemme.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Flemme, a flexible modular platform for medical images that decouples encoders (constructed from convolution, transformer, and state-space model blocks supporting 2D/3D patches) from task-specific architectures (base encoder-decoder with derived variants for segmentation, reconstruction, and generation). It adds a hierarchical pyramid loss for vertical feature optimization and reports average empirical gains of 5.60% Dice / 7.81% mIoU on segmentation tasks and 5.57% PSNR / 8.22% SSIM on reconstruction tasks, while positioning the platform as an analytical tool for encoder evaluation. Code is released at a GitHub repository.

Significance. If the reported gains can be attributed to the modular encoder-architecture separation and pyramid loss under controlled conditions, the platform would address a genuine practical bottleneck in medical imaging by reducing manual model assembly effort and enabling systematic encoder comparisons across modalities. The open-source release strengthens reproducibility. However, the current evidence does not yet establish this attribution, limiting the immediate impact.

major comments (3)
  1. [Abstract and Experiments] Abstract and Experiments section: The headline improvements (5.60% Dice, 7.81% mIoU, 5.57% PSNR, 8.22% SSIM) are stated as resulting from the modular design, yet no information is given on whether baseline models received identical optimizer schedules, augmentation pipelines, early-stopping criteria, or encoder pre-training. Without these controls the attribution to encoder-architecture decoupling plus pyramid loss cannot be verified.
  2. [Experiments] Experiments section: No ablation studies are described that isolate the contribution of the encoder-architecture separation from the pyramid loss, or from possible differences in hyperparameter search effort between the proposed models and the baselines.
  3. [Experiments] Experiments section: The manuscript provides no details on the number of datasets or models averaged to obtain the reported percentages, nor any statistical significance tests, standard deviations across runs, or exclusion criteria for the quantitative results.
minor comments (2)
  1. [Method] The description of the pyramid loss could be clarified with a diagram or explicit equation showing how vertical features are fused across scales.
  2. [Platform description] A summary table listing all supported encoders and derived architectures would improve readability and help readers quickly assess the platform's scope.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for clearer experimental controls and statistical reporting. We address each major comment below and will incorporate revisions to strengthen the manuscript.

  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The headline improvements (5.60% Dice, 7.81% mIoU, 5.57% PSNR, 8.22% SSIM) are stated as resulting from the modular design, yet no information is given on whether baseline models received identical optimizer schedules, augmentation pipelines, early-stopping criteria, or encoder pre-training. Without these controls the attribution to encoder-architecture decoupling plus pyramid loss cannot be verified.

    Authors: We agree that the current manuscript does not explicitly document the training configurations applied to the baseline models. In the revised version, we will add a dedicated experimental setup subsection (and associated table) that details identical optimizer schedules, augmentation pipelines, early-stopping criteria, and any encoder pre-training used across all compared models. This will enable direct verification that performance differences are attributable to the modular encoder-architecture separation and pyramid loss. revision: yes

  2. Referee: [Experiments] Experiments section: No ablation studies are described that isolate the contribution of the encoder-architecture separation from the pyramid loss, or from possible differences in hyperparameter search effort between the proposed models and the baselines.

    Authors: We acknowledge the lack of targeted ablations. The original experiments were designed to demonstrate the platform's overall utility rather than isolate individual components. In the revision we will include new ablation studies that (i) compare the full platform against versions without the pyramid loss and (ii) report results under matched hyperparameter search budgets to separate the effects of the modular design from tuning effort. revision: yes

  3. Referee: [Experiments] Experiments section: The manuscript provides no details on the number of datasets or models averaged to obtain the reported percentages, nor any statistical significance tests, standard deviations across runs, or exclusion criteria for the quantitative results.

    Authors: The manuscript indeed omits these quantitative details. We will revise the Experiments section to state the exact number of datasets and models included in the averages, report standard deviations across multiple random seeds, include statistical significance tests (e.g., paired t-tests), and explicitly list any exclusion criteria applied to the reported results. revision: yes
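
The statistical reporting promised in the responses above could look roughly like the sketch below: a mean and standard deviation across seeds, and a paired t statistic on per-seed differences. The scores are made-up placeholders, not results from the paper, and the function name is illustrative. Only the t statistic and degrees of freedom are computed; a p-value would need a t-distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics
from typing import Sequence, Tuple

def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples a and b.

    t = mean(d) / (stdev(d) / sqrt(n)) with d_i = a_i - b_i and n - 1 degrees of freedom.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1

# Hypothetical Dice scores for the same three seeds (placeholders, not from the paper).
proposed = [0.82, 0.84, 0.83]
baseline = [0.80, 0.81, 0.82]

print(f"proposed: {statistics.mean(proposed):.3f} ± {statistics.stdev(proposed):.3f}")
print(f"baseline: {statistics.mean(baseline):.3f} ± {statistics.stdev(baseline):.3f}")
t, dof = paired_t_statistic(proposed, baseline)
print(f"paired t = {t:.3f} with {dof} degrees of freedom")
```

Pairing by seed matters here because both models share the same data splits and initialisation randomness, so the per-seed difference removes variance that an unpaired comparison would count against the result.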

Circularity Check

0 steps flagged

No circularity; claims rest on direct empirical measurements, not derivations that reduce to inputs

The paper introduces a modular platform separating encoders from architectures and reports measured performance gains (Dice, mIoU, PSNR, SSIM) from experiments on segmentation and reconstruction tasks. No equations, fitted parameters, or predictions are presented that reduce the reported improvements to quantities defined by the platform itself or by self-citations. The central results are empirical comparisons, not self-definitional or fitted-input predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text. This is a standard non-circular empirical engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an engineering platform paper whose contribution is software modularity and empirical benchmarking rather than new theoretical constructs. It relies on a single standard deep-learning domain assumption and introduces no free parameters or invented entities.

axioms (1)
  • domain assumption: Standard deep-learning optimization, loss functions, and encoder-decoder structures transfer to medical image tasks.
    The platform is built directly on existing DL practices without stating or proving new background results.


