Learning predictive models for combinations of heterogeneous proteomic data sources

Michal Valko; Miloš Hauskrecht; Richard Pelikan

arxiv: 2605.08958 · v1 · submitted 2026-05-09 · 💻 cs.LG

Pith reviewed 2026-05-12 01:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords heterogeneous data fusion · proteomic classification · model combination · pancreatic cancer detection · mass spectrometry · protein arrays · predictive models · machine learning

The pith

Classification models successful on single proteomic sources fail on their heterogeneous combination, but model fusion can recover the benefits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper looks at combining two types of data that measure proteins in the body for detecting pancreatic cancer. One type is whole-sample mass spectrometry profiling and the other is multiplexed protein arrays. Models that classify well using just one type of data do not work as well when both types are used together. The authors introduce fusion methods that take these differences into account to make better use of both data sources at once.

Core claim

We show that for the combination of these two (heterogeneous) datasets, classification models that work well on one of them individually fail on the combination of the two datasets. We study and propose a class of model fusion methods that acknowledge the differences and try to reap most of the benefits from their combination.

What carries the argument

A class of model fusion methods that acknowledge differences between heterogeneous proteomic data sources.
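The abstract does not say which fusion methods the authors use, so the following is only a generic illustration of the distinction at stake: naive combination joins the two feature sets into one vector per sample (early fusion), while model fusion trains one model per source and combines their outputs (late fusion). The function names, the `ms_weight` parameter, and the weighted-average rule are assumptions made for this sketch, not the paper's method.

```python
from typing import List, Sequence


def early_fusion(ms_features: Sequence[Sequence[float]],
                 array_features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Naive combination: concatenate both sources into one vector per sample."""
    if len(ms_features) != len(array_features):
        raise ValueError("both sources must cover the same samples, in the same order")
    return [list(a) + list(b) for a, b in zip(ms_features, array_features)]


def late_fusion(ms_scores: Sequence[float],
                array_scores: Sequence[float],
                ms_weight: float = 0.5) -> List[float]:
    """Model-level combination: weighted average of per-source classifier scores.

    ms_weight is a hypothetical tuning knob; 0.5 weights both sources equally.
    """
    if len(ms_scores) != len(array_scores):
        raise ValueError("both score lists must cover the same samples")
    if not 0.0 <= ms_weight <= 1.0:
        raise ValueError("ms_weight must lie in [0, 1]")
    return [ms_weight * a + (1.0 - ms_weight) * b
            for a, b in zip(ms_scores, array_scores)]
```

With late fusion each per-source model only ever sees features from its own platform, so differences in scale and dimensionality between the sources do not enter a single learner directly.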

Load-bearing premise

That the poor performance of individual models on the combined data stems mainly from the heterogeneity of the sources rather than other issues like data quality or quantity.

What would settle it

Measuring the classification performance of the proposed fusion methods versus standard models on the pancreatic cancer datasets from both MS profiling and protein arrays.
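Such a comparison is naturally scored by AUC, the metric that appears with a standard deviation in the figure legends. As a minimal sketch, assuming binary labels (1 = case, 0 = control) and one real-valued score per sample, AUC can be computed as the fraction of case/control pairs that the scores rank correctly, with ties counted as half. How the paper obtains its standard deviations is not stated in the abstract; `summarize` below simply assumes a list of AUC values from repeated train/test splits.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize(aucs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of AUCs from repeated splits."""
    return mean(aucs), stdev(aucs)
```

Running `auc` on the same held-out splits for each single-source model, the naive combination, and a fused model gives directly comparable numbers.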

Figures

Figures reproduced from arXiv: 2605.08958 by Michal Valko, Miloš Hauskrecht, Richard Pelikan.

**Figure 1.** ROC for linear SVM. Axes: 1 - specificity (x), sensitivity (y), both from 0 to 1. Legend as extracted: luminex :: RF, AUC 0.98 (sd 0.02); seldi peaks + luminex :: RF, AUC 0.88 (sd 0.06); seldi peak :: RF, AUC 0.78 (sd 0.06). [PITH_FULL_IMAGE:figures/full_fig_p004_1.png]

**Figure 2.** ROC for Random Forest [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Comparison of the ROC curves for the best [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

Original abstract

Multiple technologies that measure expression levels of protein mixtures in the human body offer a potential for detection and understanding the disease. The recent increase of these technologies prompts researchers to evaluate the individual and combined utility of data generated by the technologies. In this work, we study two data sources to measure the expression of protein mixtures in the human body: whole-sample MS profiling and multiplexed protein arrays. We investigate the individual and combined utility of these technologies by learning and testing a variety of classification models on the data from a pancreatic cancer study. We show that for the combination of these two (heterogeneous) datasets, classification models that work well on one of them individually fail on the combination of the two datasets. We study and propose a class of model fusion methods that acknowledge the differences and try to reap most of the benefits from their combination.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows standard classifiers fail on naive concatenation of MS profiling and protein array data from one pancreatic cancer cohort and offers fusion methods that try to handle the source differences.

The core observation is that models trained on either proteomic source alone lose performance when the two are simply joined, and the authors explore fusion strategies that keep the sources somewhat separate during learning. This is a straightforward empirical extension of existing fusion ideas to a new pair of measurement technologies in a real disease dataset. It does a decent job of highlighting a practical issue that comes up whenever people try to pool heterogeneous omics measurements for classification. The pancreatic cancer context gives it some grounding that pure synthetic experiments lack. The fusion class they study is not a brand-new algorithm, but applying it here and showing the individual-model failure is useful for people who actually work with these platforms.

The main soft spot is that the abstract (and the stress-test note) leaves open whether the performance drop is truly from the measurement differences or from mundane factors like fewer overlapping samples, how missing values are handled, or feature-space inflation after alignment. If the full paper does not include controls that hold sample size and feature overlap fixed, the causal claim about heterogeneity weakens. No obvious circularity or invented entities, and the work stays within standard classification and fusion territory.

This is the kind of paper that belongs in a bioinformatics or clinical ML venue rather than a methods conference. A reader building multi-platform diagnostic models would get practical value from the fusion discussion and the specific data sources. It is solid enough on its own terms to deserve a serious referee, even if the experiments need tightening on the data-joining details.

Referee Report

1 major / 1 minor

Summary. The manuscript examines the combination of two heterogeneous proteomic datasets from a pancreatic cancer study: whole-sample MS profiling and multiplexed protein arrays. It shows that classification models effective on individual datasets fail on the combined data and proposes a class of model fusion methods that account for the differences between the sources to maximize benefits from their integration.

Significance. If the fusion methods demonstrably improve performance while addressing the heterogeneity, this could advance multi-source data integration in proteomics and related biomedical applications. The work highlights a practical challenge in combining measurement technologies, but its significance hinges on rigorous controls for confounds and clear performance gains over baselines.

major comments (1)

[Abstract] The central claim that standard classifiers fail on the combined data specifically because of heterogeneity between MS profiling and multiplexed arrays is not supported by any reported sample counts per source, overlap size, feature alignment procedure, or missing-value handling. Without these details it is impossible to rule out that degradation arises from reduced effective sample size or dimensionality inflation rather than measurement differences, undermining the motivation for the proposed fusion methods.

minor comments (1)

The abstract would be strengthened by including basic dataset statistics (number of samples and proteins per source) to allow readers to assess the scale of the combination problem.
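The two confounds raised in the comments above, reduced effective sample size and dimensionality inflation, could each be probed with a simple control. The sketch below is hypothetical and not taken from the paper: `shared_sample_ids` fixes the cohort to samples measured on both platforms, so every model is trained on the same patients, and `pad_with_noise` inflates a single source with uninformative Gaussian features, so a drop in performance can be compared against the drop seen under real concatenation. All names and the choice of standard-normal noise are assumptions.

```python
import random
from typing import Iterable, List, Sequence


def shared_sample_ids(ms_ids: Iterable[str], array_ids: Iterable[str]) -> List[str]:
    """Samples measured by both platforms, in a stable sorted order."""
    return sorted(set(ms_ids) & set(array_ids))


def pad_with_noise(features: Sequence[Sequence[float]],
                   n_extra: int,
                   seed: int = 0) -> List[List[float]]:
    """Append n_extra standard-normal noise columns to every sample.

    The noise carries no class information, so any loss in accuracy after
    padding reflects added dimensionality alone, not source heterogeneity.
    """
    if n_extra < 0:
        raise ValueError("n_extra must be non-negative")
    rng = random.Random(seed)  # seeded for reproducible controls
    return [list(row) + [rng.gauss(0.0, 1.0) for _ in range(n_extra)]
            for row in features]
```

If a classifier degrades about as much on a noise-padded single source as on the real combination, dimensionality rather than heterogeneity would be the simpler explanation.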

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the single major comment point by point below and have incorporated revisions to strengthen the presentation of our results.

Referee: [Abstract] The central claim that standard classifiers fail on the combined data specifically because of heterogeneity between MS profiling and multiplexed arrays is not supported by any reported sample counts per source, overlap size, feature alignment procedure, or missing-value handling. Without these details it is impossible to rule out that degradation arises from reduced effective sample size or dimensionality inflation rather than measurement differences, undermining the motivation for the proposed fusion methods.

Authors: We agree that the abstract, as currently written, is too concise and does not include the supporting details needed to fully substantiate the claim or to exclude alternative explanations such as reduced sample size or increased dimensionality. The manuscript body provides a description of the two data sources and the experimental protocol, but these specifics were not highlighted in the abstract. To address the concern directly, we will revise the abstract to briefly report the sample counts per source, the overlap between the datasets, the feature alignment procedure used to combine them, and the missing-value handling approach. These additions will make the motivation for the fusion methods clearer and help demonstrate that performance degradation is attributable to measurement heterogeneity rather than confounds.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ML evaluation on proteomic datasets

The paper reports an empirical study: standard classifiers are trained and tested on individual vs. combined MS-profiling and multiplexed-array data from a pancreatic cancer cohort, observed to degrade on the joint matrix, and a class of fusion methods is proposed and evaluated. No derivation chain, equations, or first-principles claims exist that could reduce to self-definition, fitted inputs renamed as predictions, or self-citation load-bearing steps. All performance statements rest on held-out experimental results rather than tautological constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, preventing identification of specific free parameters, axioms, or invented entities from the full manuscript.

pith-pipeline@v0.9.0 · 5438 in / 1041 out tokens · 56047 ms · 2026-05-12T01:46:50.740557+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] Adam BL, Vlahou A, Semmes OJ, Wright GL Jr. Proteomic approaches to biomarker discovery in prostate and bladder cancers. Proteomics. 1:1264-70, 2001.

[2] Wright GW Jr, Cazares LH, Leung SM, Nasim S, Adam BL, Yip TT, Schellhammer PF, Gong L, Vlahou A. Proteinchip(R) surface enhanced laser desorption/ionization (SELDI) mass spectrometry: a novel protein biochip technology for detection of prostate cancer biomarkers in complex protein mixtures. Prostate Cancer Prostatic Dis. 2(5/6):264-276, 1999.

[3] Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA. Use of proteomic patterns in serum to identify ovarian cancer. Lancet. 359:572-7, 2002.

[4] Petricoin E, Ornstein DK. Serum Proteomic Patterns for Detection of Prostate Cancer. Journal of the National Cancer Institute, Vol. 94, No. 20, 2002.

[5] V. Saulot, O. Vittecoq, R. Charlionet, P. Fardelone, C. Lange, L. Marvin, N. Machour, X. Le Loet, D. Gilbert, and F. Tron. Presence of autoantibodies to the glycolytic enzyme alpha-enolase in sera from patients with early rheumatoid arthritis. Arthritis Rheum, 46(5):1196-1201, May 2002.

[6] A. Sickmann, W. Dormeyer, S. Wortelkamp, D. Woitalla, W. Kuhn, and H. E. Meyer. Towards a high resolution separation of human cerebrospinal fluid. J Chromatogr B Analyt Technol Biomed Life Sci, 771(1-2):167-196, May 2002.

[7] M. Hauskrecht, R. Pelikan, W.L. Bigbee, D. Malehorn, M.T. Lotze, H.J. Zeh, D.C. Whitcomb, and J. Lyons-Weiler. Feature Selection for Classification of SELDI-TOF-MS Proteomic Profiles. Applied Bioinformatics, 4:4, 2005.

[8] Vapnik VN. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

[9] Burges CJC. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167, 1998.

[10] Scholkopf B, Smola A. Learning with Kernels. MIT Press, 2002.

[11] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

[12] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Springer, 2001.

[13] Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.

[14] Efron B, Tibshirani RJ. An introduction to the bootstrap. Chapman & Hall, 1993.

[15] Claude Nadeau, Yoshua Bengio. Inference for the Generalization Error. Machine Learning, 52(3):239-281, 2003.

[16] R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes. Ensemble Selection from Libraries of Models. Intl. Conf. of Machine Learning, 2004.

[17] Matheny ME, Resnic FS, Arora N, Ohno-Machado L. Effects of SVM parameter optimization on discrimination and calibration for post-procedural PCI mortality. J Biomed Inform. 2007 Dec;40(6):688-97.
