
arxiv: 1907.01734 · v1 · pith:U4DJISVPnew · submitted 2019-07-03 · 💻 cs.LG · stat.ML

AMI-Net+: A Novel Multi-Instance Neural Network for Medical Diagnosis from Incomplete and Imbalanced Data

Pith reviewed 2026-05-25 10:10 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords multi-instance learning · medical diagnosis · imbalanced data · incomplete data · focal loss · attention mechanism · neural network

The pith

AMI-Net+ improves diagnosis from incomplete and imbalanced medical data by replacing cross-entropy loss with focal loss and adding self-adaptive instance-level pooling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AMI-Net+ as an extension to AMI-Net for medical diagnosis tasks where patient records are fragmentary and class distributions are extremely skewed. It retains embedding and multi-head attention to model symptom relations, swaps in focal loss to emphasize hard examples, and introduces a new self-adaptive multi-instance pooling step that produces bag-level representations directly from instance features. The authors test the resulting network on two real-world datasets drawn from separate medical domains and report that it exceeds the performance of AMI-Net and other baselines by a considerable margin.

Core claim

AMI-Net+ captures relations among symptoms and between symptoms and the target disease through embedding, multi-head attention, and gated attention-based multi-instance pooling; it substitutes focal loss for cross-entropy loss and adds a novel self-adaptive multi-instance pooling operator that works at the instance level to form each bag representation.
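
For orientation, the gated attention pooling that the abstract lists among the network's components can be sketched as below. This follows the formulation of Ilse et al. [13] and is not the paper's new self-adaptive operator, whose definition is not given on this page. The function name, argument names, and matrix shapes are illustrative assumptions.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gated_attention_pool(instances, V, U, w):
    """Pool a variable-size bag of instance vectors into one bag vector.

    Score for instance h: w . (tanh(V h) * sigmoid(U h)), as in Ilse et al.
    Scores are softmax-normalised over the bag and used as mixing weights.
    V, U, w are learned parameters in a real model; here they are passed in.
    """
    scores = []
    for h in instances:
        tanh_part = [math.tanh(x) for x in matvec(V, h)]
        gate_part = [1.0 / (1.0 + math.exp(-x)) for x in matvec(U, h)]
        scores.append(sum(wi * t * g for wi, t, g in zip(w, tanh_part, gate_part)))

    # Numerically stable softmax over the instances in the bag.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(instances[0])
    bag = [sum(a * h[j] for a, h in zip(weights, instances)) for j in range(dim)]
    return bag, weights
```

Because the weights are normalised over whatever instances are present, the same function handles bags of different sizes, which is the property that makes attention pooling a natural fit for records with varying numbers of observed symptoms.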

What carries the argument

Self-adaptive multi-instance pooling operator that computes bag representations from instance-level features, paired with focal loss to address extreme class imbalance.
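
To make the loss substitution concrete, here is a minimal sketch of binary focal loss as defined by Lin et al. [24], next to plain cross-entropy. The defaults gamma=2.0 and alpha=0.25 are the values commonly quoted from that reference; the settings used for AMI-Net+ are not stated on this page, so treat them as assumptions.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one example; p is P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(min(max(p_t, eps), 1.0))


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p_t is the probability assigned to the true class, so (1 - p_t)**gamma
    shrinks the loss of well-classified examples and leaves hard ones
    comparatively large. gamma and alpha are assumed values, not the paper's.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma set to 0 the modulating factor disappears and the expression reduces to cross-entropy scaled by alpha_t, which is a quick sanity check when swapping one loss for the other.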

If this is right

  • The network produces more reliable diagnoses when trained on fragmentary patient records.
  • Focal loss combined with adaptive pooling mitigates the effect of extreme class imbalance in medical classification.
  • Symptom-disease relations are better modeled by the joint use of multi-head attention and gated attention pooling.
  • Performance gains hold across two distinct medical domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loss-plus-pooling changes could be tested on non-medical multi-instance tasks that also suffer from missing features.
  • An ablation that isolates the contribution of the self-adaptive pooling step would clarify whether the gain comes mainly from the pooling or from focal loss.
  • If the method scales to larger numbers of instances per bag, it might apply to longitudinal electronic health records.

Load-bearing premise

The combination of focal loss and the new self-adaptive pooling will improve results on incomplete data without creating new overfitting or selection problems.

What would settle it

On either of the two real-world datasets, measure accuracy, F1, or AUC for AMI-Net+ and find that at least one metric is no higher than the corresponding metric for AMI-Net or the other baselines.
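
The check described above can be written down as a small comparison routine. This is only a sketch of the bookkeeping: the function name and the dictionary layout are assumptions, and the numbers in the usage lines are placeholders, not results from the paper.

```python
def metrics_not_improved(candidate, baselines):
    """Return {metric: [baseline names]} where the candidate is not strictly higher.

    candidate: {"accuracy": float, "f1": float, "auc": float}
    baselines: {baseline_name: {metric: float}}
    An empty result means the candidate beat every baseline on every shared metric.
    """
    failures = {}
    for name, scores in baselines.items():
        for metric, value in scores.items():
            if metric in candidate and candidate[metric] <= value:
                failures.setdefault(metric, []).append(name)
    return failures


# Placeholder numbers for illustration only; not taken from the paper.
candidate = {"accuracy": 0.80, "f1": 0.50, "auc": 0.70}
baselines = {"baseline_a": {"accuracy": 0.78, "f1": 0.50, "auc": 0.65}}
print(metrics_not_improved(candidate, baselines))
```

A tie counts against the candidate here, matching the "no higher than" wording of the criterion.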

Figures

Figures reproduced from arXiv: 1907.01734 by Josiah Poon, Simon Poon, Zeyuan Wang.

Figure 1: The architecture of AMI-Net+
2.3 Self-Attention Mechanism. Self-attention was first proposed by Vaswani et al. [23] in the transformer architecture to capture the correlations of words from the source and target sentences for the machine translation task. Their work demonstrates the validity of self-attention to reveal the syntactic and semantic information in text. In recent years, it has been applied in …
Figure 2: The architecture of multi-head attention
Scaled Dot-Product Attention. It takes the query, keys with d_k dimensions, and values with d_v dimensions as input and computes the similarities, i.e., dot products, between the given query and all keys, divided by a scaling factor √d_k. The scaling factor keeps the gradient in backpropagation from vanishing or becoming extremely small. Then a softmax funct…
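
The scaled dot-product step described in this caption can be sketched for a single query as follows. It implements the standard formulation from Vaswani et al. [23], softmax(q·k / √d_k) applied to the values; the function names and the list-of-lists representation are assumptions made for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query, keys, values):
    """Attend from one query vector to a set of key/value pairs.

    query: list of d_k floats; keys: list of d_k-vectors;
    values: list of d_v-vectors, one per key.
    Returns (output vector of length d_v, attention weights).
    """
    d_k = len(query)
    # Dot product of the query with each key, divided by sqrt(d_k).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
    ]
    weights = softmax(scores)
    d_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return output, weights
```

Multi-head attention repeats this computation on several learned linear projections of the queries, keys, and values and concatenates the per-head outputs.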
Figure 3: Comparison of different numbers of heads in multi-head attention
Original abstract

In medical real-world study (RWS), how to fully utilize the fragmentary and scarce information in model training to generate the solid diagnosis results is a challenging task. In this work, we introduce a novel multi-instance neural network, AMI-Net+, to train and predict from the incomplete and extremely imbalanced data. It is more effective than the state-of-art method, AMI-Net. First, we also implement embedding, multi-head attention and gated attention-based multi-instance pooling to capture the relations of symptoms themselves and with the given disease. Besides, we propose var-ious improvements to AMI-Net, that the cross-entropy loss is replaced by focal loss and we propose a novel self-adaptive multi-instance pooling method on instance-level to obtain the bag representation. We validate the performance of AMI-Net+ on two real-world datasets, from two different medical domains. Results show that our approach outperforms other base-line models by a considerable margin.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces AMI-Net+, an extension of AMI-Net for medical diagnosis from incomplete and imbalanced data. It retains embedding, multi-head attention, and gated attention pooling while adding focal loss (replacing cross-entropy) and a novel self-adaptive multi-instance pooling method at the instance level; the authors claim this yields superior performance over baselines on two real-world datasets from different medical domains.

Significance. If the performance claims are substantiated with quantitative results and the handling of missing data is explicitly described, the work could offer a practical advance for multi-instance learning on fragmentary medical records. The combination of focal loss for class imbalance and adaptive pooling is a reasonable direction, but the current lack of metrics, ablations, and missing-value mechanisms prevents assessment of whether these additions deliver the claimed gains.

major comments (3)
  1. [Abstract] The central claim that AMI-Net+ trains effectively from incomplete data is unsupported because no mechanism (masking, imputation, missingness indicators, or bag-level handling of absent instances) is described in the architecture or training procedure. Without this, performance gains on the two datasets cannot be attributed to the proposed changes rather than unstated preprocessing.
  2. [Abstract] The assertion that the approach 'outperforms other baseline models by a considerable margin' supplies no numerical results, error bars, statistical tests, or ablation studies, rendering the performance claim impossible to evaluate.
  3. [Abstract] The assumption that focal loss plus self-adaptive instance-level pooling will reliably improve results on incomplete data is presented without any quantitative support or analysis of potential overfitting or selection artifacts introduced by these components.
minor comments (2)
  1. [Abstract] 'var-ious' is a typographical error for 'various'.
  2. [Abstract] 'base-line' should be written as the single word 'baseline'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback focused on the abstract. We will revise the abstract to improve clarity on data handling and to include quantitative support for the performance claims. Point-by-point responses follow.

  1. Referee: [Abstract] Abstract: the central claim that AMI-Net+ trains effectively from incomplete data is unsupported because no mechanism (masking, imputation, missingness indicators, or bag-level handling of absent instances) is described in the architecture or training procedure. Without this, performance gains on the two datasets cannot be attributed to the proposed changes rather than unstated preprocessing.

    Authors: We agree the abstract should explicitly reference the handling of incomplete data. AMI-Net+ inherits the multi-instance bag representation from AMI-Net, in which absent instances are simply omitted from the bag; the embedding, multi-head attention, and pooling layers operate only on the observed instances without imputation or masking. We will revise the abstract to state this mechanism concisely so that the claim is supported. revision: yes

  2. Referee: [Abstract] Abstract: the assertion that the approach 'outperforms other baseline models by a considerable margin' supplies no numerical results, error bars, statistical tests, or ablation studies, rendering the performance claim impossible to evaluate.

    Authors: We accept that the abstract would be stronger with concrete numbers. The full manuscript contains tables reporting AUC, F1, and accuracy on both datasets together with comparisons to baselines. We will add the key quantitative improvements (with the reported margins) to the abstract while respecting length limits. revision: yes

  3. Referee: [Abstract] Abstract: the assumption that focal loss plus self-adaptive instance-level pooling will reliably improve results on incomplete data is presented without any quantitative support or analysis of potential overfitting or selection artifacts introduced by these components.

    Authors: The experimental section of the manuscript already demonstrates the gains from replacing cross-entropy with focal loss and from the new self-adaptive pooling on the two imbalanced medical datasets. Space constraints in the abstract prevent detailed ablation or overfitting analysis, but we will strengthen the abstract sentence to reference the supporting results and will consider adding a short discussion of these components if the revision allows. revision: partial
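
The first response above states that absent instances are simply left out of the bag, with no imputation or masking. A minimal sketch of that idea follows. The field names, the "field=value" token scheme, and the vocabulary are hypothetical; the paper's actual encoding is not described on this page.

```python
def record_to_bag(record, vocabulary):
    """Turn one patient record into a bag of instance indices.

    record: {field_name: value or None}; None marks an entry that was
    not recorded. Missing entries create no instance at all, so bags
    differ in length from patient to patient.
    vocabulary: {"field=value": index} (hypothetical token scheme).
    Tokens absent from the vocabulary are skipped in this sketch.
    """
    bag = []
    for field, value in record.items():
        if value is None:
            continue  # missing entry: omitted, not imputed
        token = f"{field}={value}"
        if token in vocabulary:
            bag.append(vocabulary[token])
    return bag


# Hypothetical example record and vocabulary.
vocabulary = {"fever=yes": 0, "cough=no": 1, "fatigue=yes": 2}
record = {"fever": "yes", "cough": None, "fatigue": "yes"}
print(record_to_bag(record, vocabulary))  # [0, 2]
```

The resulting index list would then feed an embedding layer, and a pooling step that normalises over the instances present can consume bags of any length.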

Circularity Check

0 steps flagged

No derivation chain; empirical architecture proposal with no self-referential reductions


The paper describes an empirical neural network extension (embedding + multi-head attention + gated pooling, focal loss, self-adaptive instance pooling) and reports performance on two real-world datasets. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or prior self-citations. The central claim is an architectural and loss-function change whose validity is assessed externally via held-out data performance, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard neural-network assumptions plus the domain assumption that symptom bags can be treated as multi-instance examples for disease prediction; no new entities are postulated.

free parameters (1)
  • network hyperparameters and attention weights
    All neural-network weights and the self-adaptive pooling parameters are fitted to the training data; exact count and values not reported.
axioms (1)
  • domain assumption: Multi-instance learning framework is appropriate for representing incomplete patient records as bags of symptoms
    Invoked when the authors frame the medical diagnosis task as a multi-instance problem.

pith-pipeline@v0.9.0 · 5695 in / 1271 out tokens · 36861 ms · 2026-05-25T10:10:41.369595+00:00 · methodology



Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 4 internal anchors

  1. [1]

    Real world evidence: experience and lessons from China

    Sun X, Tan J, Tang L, Guo JJ, Li X. Real world evidence: experience and lessons from China. BMJ. 2018 Feb 5;360:j5262

  2. [2]

    Analysis of incomplete multivariate data

    Schafer JL. Analysis of incomplete multivariate data. Chapman and Hall/CRC; 1997 Aug 1

  3. [3]

    A Study of K-Nearest Neighbour as an Imputation Method

    Batista GE, Monard MC. A Study of K-Nearest Neighbour as an Imputation Method. HIS. 2002 Dec 30;87(251-260):48

  4. [4]

    Supervised learning from incomplete data via an EM approach

    Ghahramani Z, Jordan MI. Supervised learning from incomplete data via an EM approach. In Advances in neural information processing systems 1994 (pp. 120-127)

  5. [5]

    Feature set embedding for incomplete data

    Grangier D, Melvin I. Feature set embedding for incomplete data. In Advances in Neural Information Processing Systems 2010 (pp. 793-801)

  6. [6]

    A brief introduction to weakly supervised learning

    Zhou ZH. A brief introduction to weakly supervised learning. National Science Review. 2017 Aug 25;5(1):44-53

  7. [8]

    EM-DD: An improved multiple-instance learning technique

    Zhang Q, Goldman SA. EM-DD: An improved multiple-instance learning technique. In Advances in neural information processing systems 2002 (pp. 1073-1080)

  8. [9]

    Support vector machines for multiple-instance learning

    Andrews S, Tsochantaridis I, Hofmann T. Support vector machines for multiple-instance learning. In Advances in neural information processing systems 2003 (pp. 577-584)

  9. [10]

    Multi-instance learning by treating instances as non-iid samples

    Zhou ZH, Sun YY, Li YF. Multi-instance learning by treating instances as non-iid samples. In Proceedings of the 26th annual international conference on machine learning 2009 Jun 14 (pp. 1249-1256). ACM

  10. [11]

    Scalable multi-instance learning

    Wei, X.S., Wu, J. and Zhou, Z.H., 2014, December. Scalable multi-instance learning. In 2014 IEEE International Conference on Data Mining (pp. 1037-1042). IEEE

  11. [12]

    Neural networks for multi-instance learning

    Zhou ZH, Zhang ML. Neural networks for multi-instance learning. In Proceedings of the International Conference on Intelligent Information Technology, Beijing, China 2002 Aug (pp. 455-459)

  12. [13]

    Attention-based Deep Multiple Instance Learning

    Ilse M, Tomczak JM, Welling M. Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712. 2018 Feb 13

  13. [14]

    Revisiting multiple instance neural networks

    Wang, X., Yan, Y., Tang, P., Bai, X. and Liu, W., 2018. Revisiting multiple instance neural networks. Pattern Recognition, 74, pp.15-24

  14. [15]

    Deep Multi-instance Learning with Dynamic Pooling

    Yan, Y., Wang, X., Guo, X., Fang, J., Liu, W. and Huang, J., 2018, November. Deep Multi-instance Learning with Dynamic Pooling. In Asian Conference on Machine Learning (pp. 662-677)

  15. [16]

    Deep multiple instance learning for image classification and auto-annotation

    Wu J, Yu Y, Huang C, Yu K. Deep multiple instance learning for image classification and auto-annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015 (pp. 3460-3469)

  16. [17]

    Ensemble multi-instance multi-label learning approach for video annotation task

    Xu XS, Xue X, Zhou ZH. Ensemble multi-instance multi-label learning approach for video annotation task. In Proceedings of the 19th ACM international conference on Multimedia 2011 Nov 28 (pp. 1153-1156). ACM

  17. [18]

    Multi-instance multi-label learning for relation extraction

    Surdeanu M, Tibshirani J, Nallapati R, Manning CD. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning 2012 Jul 12 (pp. 455-465). Association for Computational Linguistics

  18. [19]

    Residual attention network for image classification

    Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017 (pp. 3156-3164)

  19. [20]

    Hierarchical attention networks for document classification

    Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2016 (pp. 1480-1489)

  20. [21]

    Deep MIML network

    Feng J, Zhou ZH. Deep MIML network. In Thirty-First AAAI Conference on Artificial Intelligence 2017 Feb 13

  21. [22]

    Attention-based Multi-instance Neural Network for Medical Diagnosis from Incomplete and Low Quality Data

    Wang Z, Poon J, Sun S, Poon S. Attention-based Multi-instance Neural Network for Medical Diagnosis from Incomplete and Low Quality Data. arXiv preprint arXiv:1904.04460. 2019 Apr 9

  22. [23]

    Attention is all you need

    Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In Advances in neural information processing systems 2017 (pp. 5998-6008)

  23. [24]

    Focal loss for dense object detection

    Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision 2017 (pp. 2980-2988)

  24. [25]

    Multi-instance multi-label learning

    Zhou ZH, Zhang ML, Huang SJ, Li YF. Multi-instance multi-label learning. Artificial Intelligence. 2012 Jan 1;176(1):2291-320

  25. [26]

    Solving the multiple instance problem with axis-parallel rectangles

    Dietterich TG, Lathrop RH, Lozano-Pérez T. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence. 1997 Jan 1;89(1-2):31-71

  26. [27]

    A framework for multiple-instance learning

    Maron O, Lozano-Pérez T. A framework for multiple-instance learning. In Advances in neural information processing systems 1998 (pp. 570-576)

  27. [28]

    Multi instance neural networks

    Ramon J, De Raedt L. Multi instance neural networks. In Proceedings of the ICML-2000 workshop on attribute-value and relational learning 2000 (pp. 53-60)

  28. [29]

    Handwritten digit recognition with a back-propagation network

    LeCun Y, Boser BE, Denker JS, Henderson D, Howard RE, Hubbard WE, Jackel LD. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems 1990 (pp. 396-404)

  29. [30]

    Multiple instance learning: A survey of problem characteristics and applications

    Carbonneau MA, Cheplygina V, Granger E, Gagnon G. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition. 2018 May 1;77:329-53

  30. [31]

    Deep semantic role labeling with self-attention

    Tan Z, Wang M, Xie J, Chen Y, Shi X. Deep semantic role labeling with self-attention. In Thirty-Second AAAI Conference on Artificial Intelligence 2018 Apr 26

  31. [32]

    Simultaneously Self-Attending to All Mentions for Full-Abstract Biological Relation Extraction

    Verga P, Strubell E, McCallum A. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. arXiv preprint arXiv:1802.10569. 2018 Feb 28

  32. [33]

    Layer Normalization

    Lei Ba J, Kiros JR, Hinton GE. Layer normalization. arXiv preprint arXiv:1607.06450. 2016 Jul

  33. [34]

    Deep residual learning for image recognition

    He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition 2016 (pp. 770-778)

  34. [35]

    Language modeling with gated convolutional networks

    Dauphin YN, Fan A, Auli M, Grangier D. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 2017 Aug 6 (pp. 933-941). JMLR.org

  35. [36]

    Applied logistic regression

    Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression. John Wiley & Sons; 2013

  36. [37]

    Support vector machines: theory and applications

    Wang L, editor. Support vector machines: theory and applications. Springer Science & Business Media; 2005 Jun 21

  37. [38]

    Random decision forests

    Ho TK. Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition 1995 Aug 14 (Vol. 1, pp. 278-282). IEEE

  38. [39]

    Xgboost: A scalable tree boosting system

    Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining 2016 Aug 13 (pp. 785-794). ACM