pith. sign in

arxiv: 2605.15383 · v1 · pith:WDFD5QSBnew · submitted 2026-05-14 · 💻 cs.CV

MorphoHELM: A Comprehensive Benchmark for Evaluating Representations for Microscopy-Based Morphology Assays

Pith reviewed 2026-05-19 15:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords Cell Paintingmorphological profilingrepresentation learningbatch effectsmicroscopyfeature extractionbenchmarkcomputer vision
0
0 comments X p. Extension
pith:WDFD5QSB Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{WDFD5QSB}

Prints a linked pith:WDFD5QSB badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

No existing model outperforms classic computer vision analytic strategies across all settings for microscopy morphology representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MorphoHELM as a unified benchmark that standardizes evaluation of feature extraction methods on Cell Painting microscopy images used to study cell responses to perturbations. It tests methods at multiple simulated levels of batch effects to measure how well each detects biological signals as technical noise rises. Results reveal clear trade-offs, with some approaches strong on particular signals but weaker on others. Overall, classic computer vision strategies emerge as the most reliable for general use cases rather than any deep learning model dominating every condition.

Core claim

MorphoHELM consolidates tasks and metrics for Cell Painting assays, extends them for robustness, and evaluates the widest range of methods to date while quantifying performance degradation under increasing batch effects; this shows that no model outperforms classic computer vision analytic strategies across all settings, which remain the strongest general use-case representations.

What carries the argument

MorphoHELM benchmark, which applies each representation task at controlled degrees of batch effects to isolate how well methods extract biological signal amid technical noise.

If this is right

  • Researchers gain a standardized way to compare new methods against established baselines under realistic noise conditions.
  • Method selection should depend on the specific biological signal of interest because of observed performance trade-offs.
  • Classic computer vision approaches serve as the default strong choice for general morphological profiling applications.
  • Public datasets and code enable consistent future benchmarking and correction of prior fragmented evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be applied to real multi-lab batch variation data to test whether simulated noise levels match actual experimental variability.
  • Hybrid methods that combine classic analytic features with deep learning components might close the gap in specific high-noise regimes.
  • Similar evaluation structures could help standardize representation testing in other high-content biological imaging domains.

Load-bearing premise

The chosen tasks, metrics, and simulated batch effect levels fully and without bias capture the ability of representations to detect true biological signals.

What would settle it

A new model that outperforms classic computer vision methods on every task at every batch effect level within the MorphoHELM datasets would falsify the central finding.

Figures

Figures reproduced from arXiv: 2605.15383 by Alex X. Lu, Emre Hayir, Lorin Crawford.

Figure 1
Figure 1. Figure 1: MoA enrichment. Restrictions: NR = No Restriction, NSB = Not Same Batch, NSS = Not [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Pairwise Jaccard similarity between least-restrictive MoA significant-compound discovery [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Gene Pathway Enrichment. performance advantage in the Not Same Batch setting). Conversely, ResNet is the worst performing model on our Geometric Mean OR metric (except for our untrained ResNet baseline), despite being relatively strong for MoA enrichment. This indicates that different models may have trade-offs across biological tasks. Intriguingly, models may exhibit generalization among tasks that doesn’… view at source ↗
Figure 4
Figure 4. Figure 4: kNN replicate retrieval. matching. For example, the expected random Recall@1 for cpg-target2 is 0.33, and the weakest models show about 10× better performance even in the most challenging Not in Same Layout setting, contrary to MoA enrichment, suggesting for replicate retrieval consistency, all methods detect some reproducible biological signature. CellProfiler features are superior to any representation l… view at source ↗
read the original abstract

Microscopy images contain rich information about how cells respond to perturbations, making them essential to applications like drug screening. To quantify images, researchers often use representation extraction methods, and recent years have seen a proliferation of deep learning methods. While measuring the quality of these representations is essential, evaluation remains fragmented, with each proposed model evaluated on different tasks and datasets, using custom pipelines and metrics, making it difficult to fairly compare models. Here, we introduce MorphoHELM, a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting, the most widely-used morphological profiling assay. MorphoHELM consolidates evaluation standards in the field, extends and corrects them to be more robust, and evaluates on the widest range of methods to date. A defining feature of the benchmark is that each task is evaluated at different degrees of batch effects (or technical noise), directly quantifying how the ability of methods to detect biological signal degrades as noise increases. Together, these properties enable MorphoHELM to detect trade-offs between methods, and we demonstrate that models that excel at certain kinds of biological signal are weaker at others. We show that no existing model outperforms classic computer vision analytic strategies across all settings, which remain the strongest general use-case representations. All datasets, code, and evaluation tools are publicly available at https://github.com/microsoft/MorphoHELM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces MorphoHELM, a comprehensive open benchmark for evaluating feature extraction methods on Cell Painting microscopy images. It consolidates and extends existing evaluation standards, tests a wide range of methods (deep learning and classic computer vision) at multiple levels of simulated batch effects/technical noise, and concludes that no existing model outperforms classic analytic strategies across all settings, positioning the latter as the strongest general-use representations for detecting biological signals.

Significance. If the central findings hold, the work is significant as a standardization effort in morphological profiling for applications like drug screening. The public release of all datasets, code, and evaluation tools is a clear strength that supports reproducibility and community use. The benchmark's design for quantifying degradation under increasing noise levels usefully reveals method trade-offs, though its impact depends on the fidelity of the noise model to real assay artifacts.

major comments (1)
  1. [Methods (batch effect simulation and task evaluation)] The batch-effect simulation procedure (described in the methods section on task evaluation under noise) is load-bearing for the headline claim that classic CV strategies remain strongest across settings. The paper does not provide direct validation (e.g., comparison of simulated vs. real multi-batch Cell Painting artifact distributions or spatial correlation statistics) that the chosen noise model (additive Gaussian, global shifts, plate effects) faithfully reproduces the relative difficulty of detecting true morphological perturbations; without this, analytic pipelines tuned to the simulation could appear artificially robust while learned representations are unfairly penalized.
minor comments (2)
  1. [Results figures] Figure captions and axis labels for the degradation curves under increasing batch effects could be expanded to explicitly state the exact noise parameters and number of replicates per level.
  2. [Evaluation protocol] The manuscript would benefit from an explicit table listing all evaluated methods, their training regimes (pretrained vs. fine-tuned), and the precise metrics (e.g., mAP, correlation) used for each task.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript introducing MorphoHELM. We address the major comment on the batch effect simulation below and outline planned revisions to strengthen the work.

read point-by-point responses
  1. Referee: [Methods (batch effect simulation and task evaluation)] The batch-effect simulation procedure (described in the methods section on task evaluation under noise) is load-bearing for the headline claim that classic CV strategies remain strongest across settings. The paper does not provide direct validation (e.g., comparison of simulated vs. real multi-batch Cell Painting artifact distributions or spatial correlation statistics) that the chosen noise model (additive Gaussian, global shifts, plate effects) faithfully reproduces the relative difficulty of detecting true morphological perturbations; without this, analytic pipelines tuned to the simulation could appear artificially robust while learned representations are unfairly penalized.

    Authors: We appreciate the referee's emphasis on the need for validation of the batch effect simulation, as it underpins our comparative analysis. The noise models employed—additive Gaussian noise, global shifts, and plate effects—were selected to emulate commonly observed technical artifacts in Cell Painting experiments, drawing from prior literature on batch correction and noise modeling in high-content screening. This controlled simulation enables us to isolate the impact of increasing technical noise on representation quality without confounding factors from real multi-batch data collection. We acknowledge that a direct empirical comparison to real artifact distributions, such as through spatial correlation statistics or distribution matching, is not included in the current manuscript. To address this, we will revise the methods and discussion sections to provide additional justification for the noise model choices with supporting references, explicitly discuss the assumptions and potential limitations of the simulation, and suggest directions for future validation using real multi-batch datasets. This will clarify the scope of our claims regarding method robustness. revision: partial

Circularity Check

0 steps flagged

No circularity detected in MorphoHELM benchmark

full rationale

The paper introduces MorphoHELM as an open benchmark consolidating evaluation standards for Cell Painting assays on public datasets. It evaluates representation methods across tasks at varying simulated batch effect levels and concludes that classic computer vision analytic strategies remain the strongest general-use representations based on comparative performance. No derivation step reduces to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations; the chain is grounded in external data, metrics, and consolidated standards, making the analysis self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

As a benchmark paper the work rests on domain assumptions about what constitutes valid evaluation for morphology assays rather than new derivations; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Standard computer vision and machine learning evaluation practices for image representations apply directly to Cell Painting microscopy data.
    The benchmark builds on existing tasks and metrics in the field without re-deriving them.

pith-pipeline@v0.9.0 · 5777 in / 1395 out tokens · 62219 ms · 2026-05-19T15:42:44.492555+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 1 internal anchor

  1. [1]

    Cell painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes.Nature Protocols, 11(9):1757–1774, August 2016

    Mark-Anthony Bray, Shantanu Singh, Han Han, Chadwick T Davis, Blake Borgeson, Cathy Hartland, Maria Kost-Alimova, Sigrun M Gustafsdottir, Christopher C Gibson, and Anne E Carpenter. Cell painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes.Nature Protocols, 11(9):1757–1774, August 2016. URL: http: //dx....

  2. [2]

    Carpenter

    Srijit Seal, Maria-Anna Trapotsi, Ola Spjuth, Shantanu Singh, Jordi Carreras-Puigvert, Nigel Greene, Andreas Bender, and Anne E. Carpenter. Cell painting: a decade of discovery and innovation in cellular imaging.Nature Methods, 22(2):254–268, December 2024. URL: http://dx.doi.org/10.1038/s41592-024-02528-8, doi:10.1038/s41592-024-02528-8

  3. [3]

    Data-analysis strategies for image-based cell profiling

    Juan C Caicedo, Sam Cooper, Florian Heigwer, Scott Warchal, Peng Qiu, Csaba Molnar, Aliaksei S Vasilevich, Joseph D Barry, Harmanjit Singh Bansal, Oren Kraus, Mathias Wawer, Lassi Paavolainen, Markus D Herrmann, Mohammad Rohban, Jane Hung, Holger Hennig, John Concannon, Ian Smith, Paul A Clemons, Shantanu Singh, Paul Rees, Peter Horvath, Roger G Linington...

  4. [4]

    Machine learning and image-based profiling in drug discovery.Current Opinion in Systems Biology, 10:43–52, 2018

    Christian Scheeder, Florian Heigwer, and Michael Boutros. Machine learning and image-based profiling in drug discovery.Current Opinion in Systems Biology, 10:43–52, 2018. URL: http://dx.doi.org/10.1016/j.coisb.2018.05.004, doi:10.1016/j.coisb.2018.05.004

  5. [5]

    Image-based profiling for drug discovery: due for a machine-learning upgrade?Nat

    Srinivas Niranj Chandrasekaran, Hugo Ceulemans, Justin D Boyd, and Anne E Carpenter. Image-based profiling for drug discovery: due for a machine-learning upgrade?Nat. Rev. Drug Discov., 20(2):145–159, February 2021

  6. [6]

    CellProfiler: image analysis software for identifying and quantifying cell phenotypes.Genome Biol., 7(10):R100, October 2006

    Anne E Carpenter, Thouis R Jones, Michael R Lamprecht, Colin Clarke, In Han Kang, Ola Friman, David A Guertin, Joo Han Chang, Robert A Lindquist, Jason Moffat, Polina Golland, and David M Sabatini. CellProfiler: image analysis software for identifying and quantifying cell phenotypes.Genome Biol., 7(10):R100, October 2006

  7. [7]

    Stirling, Madison J

    David R. Stirling, Madison J. Swain-Bowden, Alice M. Lucas, Anne E. Carpenter, Beth A. Cimini, and Allen Goodman. Cellprofiler 4: improvements in speed, utility and usability. BMC Bioinformatics, 2021. URL: http://dx.doi.org/10.1186/s12859-021-04344-9 , doi:10.1186/s12859-021-04344-9

  8. [8]

    Evaluating batch correction methods for image-based cell profiling.Nat

    John Arevalo, Ellen Su, Jessica D Ewald, Robert van Dijk, Anne E Carpenter, and Shantanu Singh. Evaluating batch correction methods for image-based cell profiling.Nat. Commun., 15(1):6516, August 2024

  9. [9]

    Morphological profiling for drug discovery in the era of deep learning.Briefings in Bioinformatics, 25(4):bbae284, 07 2024

    Qiaosi Tang, Ranjala Ratnayake, Gustavo Seabra, Zhe Jiang, Ruogu Fang, Lina Cui, Yousong Ding, Tamer Kahveci, Jiang Bian, Chenglong Li, Hendrik Luesch, and Yanjun Li. Morphological profiling for drug discovery in the era of deep learning.Briefings in Bioinformatics, 25(4):bbae284, 07 2024. URL: https://doi.org/10.1093/bib/bbae284, arXiv:https://academic.o...

  10. [10]

    Plummer, and Juan C Caicedo

    Zitong Chen, Chau Pham, Siqi Wang, Michael Doron, Nikita Moshkov, Bryan A. Plummer, and Juan C Caicedo. CHAMMI: a benchmark for channel-adaptive models in microscopy imaging. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Bench- marks Track. NeurIPS, 2023. URL:https://openreview.net/forum?id=Luc1bZLeMY

  11. [11]

    Rxrx1: a dataset for evaluating experimental batch correction methods

    Maciej Sypetkowski, Morteza Rezanejad, Saber Saberian, Oren Kraus, John Urbanik, James Taylor, Ben Mabey, Mason Victors, Jason Yosinski, Alborz Rezazadeh Sereshkeh, Imran Haque, and Berton Earnshaw. Rxrx1: a dataset for evaluating experimental batch correction methods

  12. [12]

    URL:https://arxiv.org/abs/2301.05768, arXiv:2301.05768

  13. [13]

    Learning representations for image-based profiling of perturbations.Nat

    Nikita Moshkov, Michael Bornholdt, Santiago Benoit, Matthew Smith, Claire McQuin, Allen Goodman, Rebecca A Senft, Yu Han, Mehrtash Babadi, Peter Horvath, Beth A Cimini, Anne E Carpenter, Shantanu Singh, and Juan C Caicedo. Learning representations for image-based profiling of perturbations.Nat. Commun., 15(1):1594, February 2024. 10

  14. [14]

    Lazar, Rahul Mohan, Conor Tillinghast, Tommaso Biancalani, Marta M

    Safiye Celik, Jan-Christian Hütter, Sandra Melo Carlos, Nathan H. Lazar, Rahul Mohan, Conor Tillinghast, Tommaso Biancalani, Marta M. Fay, Berton A. Earnshaw, and Imran S. Haque. Building, benchmarking, and exploring perturbative maps of transcriptional and morphological data.PLOS Computational Biology, 20(10):1–24, 10 2024. URL: https://doi.org/10. 1371/...

  15. [15]

    Caie, Rebecca E

    Peter D. Caie, Rebecca E. Walls, Alexandra Ingleston-Orme, Sandeep Daya, Tom Houslay, Rob Eagle, Mark E. Roberts, and Neil O. Carragher. High-content phenotypic profiling of drug response signatures across distinct cancer cells.Molecular Cancer Therapeutics, 9(6):1913–1926, 06 2010. URL: https://doi.org/10.1158/1535-7163.MCT-09-1148 , arXiv:https://aacrjo...

  16. [16]

    Michael Ando, Cory Y

    D. Michael Ando, Cory Y . McLean, and Marc Berndl. Improving phe- notypic measurements in high-content imaging screens.bioRxiv, 2017. URL: https://www.biorxiv.org/content/early/2017/07/10/161422, arXiv:https://www.biorxiv.org/content/early/2017/07/10/161422.full.pdf, doi:10.1101/161422

  17. [17]

    WILDS: A Benchmark of in-the-Wild Distribution Shifts

    Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran Haque, Sara M Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A ...

  18. [18]

    Zero- cell corrections in random-effects meta-analyses.Research Synthesis Methods, 11(6):913–919, 2020

    Frank Weber, Guido Knapp, Katja Ickstadt, Günther Kundt, and Änne Glass. Zero- cell corrections in random-effects meta-analyses.Research Synthesis Methods, 11(6):913–919, 2020. URL: https://onlinelibrary.wiley.com/doi/abs/10. 1002/jrsm.1460, arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/jrsm.1460, doi:https://doi.org/10.1002/jrsm.1460

  19. [19]

    On estimating the relation between blood group and disease.Ann

    B Woolf. On estimating the relation between blood group and disease.Ann. Hum. Genet., 19(4):251–253, June 1955

  20. [20]

    Integrating chemical structures as treatments improves representations of microscopy images for morphological profiling

    Yemin Yu, Neil Tenenholtz, Lester Mackey, Ying Wei, David Alvarez-Melis, Ava P. Amini, and Alex X. Lu. Causal integration of chemical structures improves representations of microscopy images for morphological profiling. 2025. URL: https://arxiv.org/abs/2504.09544, arXiv:2504.09544

  21. [21]

    Corum: the comprehensive resource of mam- malian protein complexes–2022.Nucleic Acids Research, 51(D1):D539–D545, 01 2023

    George Tsitsiridis, Ralph Steinkamp, Madalina Giurgiu, Barbara Brauner, Gisela Fobo, Goar Frishman, Corinna Montrone, and Andreas Ruepp. Corum: the comprehensive resource of mam- malian protein complexes–2022.Nucleic Acids Research, 51(D1):D539–D545, 01 2023. URL: https://doi.org/10.1093/nar/gkac1015, arXiv:https://academic.oup.com/nar/article- pdf/51/D1/...

  22. [22]

    hu.MAP 2.0: integration of over 15,000 proteomic experiments builds a global compendium of human multiprotein assemblies

    Kevin Drew, John B Wallingford, and Edward M Marcotte. hu.MAP 2.0: integration of over 15,000 proteomic experiments builds a global compendium of human multiprotein assemblies. Mol. Syst. Biol., 17(5):e10016, May 2021

  23. [23]

    The string database in 2025: protein networks with directional- ity of regulation.Nucleic Acids Research, 53(D1):D730–D737, 01 2025

    Damian Szklarczyk, Katerina Nastou, Mikaela Koutrouli, Rebecca Kirsch, Farrokh Mehryary, Radja Hachilif, Dewei Hu, Matteo E Peluso, Qingyao Huang, Tao Fang, Nadezhda T Doncheva, Sampo Pyysalo, Peer Bork, Lars J Jensen, and Christian von Mering. The string database in 2025: protein networks with directional- ity of regulation.Nucleic Acids Research, 53(D1)...

  24. [24]

    Signor 2.0, the signaling network open resource 2.0: 2019 update.Nucleic Acids Research, 48(D1):D504–D510, 01 2020

    Luana Licata, Prisca Lo Surdo, Marta Iannuccelli, Alessandro Palma, Elisa Micarelli, Livia Perfetto, Daniele Peluso, Alberto Calderone, Luisa Castagnoli, and Gianni Ce- sareni. Signor 2.0, the signaling network open resource 2.0: 2019 update.Nucleic Acids Research, 48(D1):D504–D510, 01 2020. URL: https://doi.org/10.1093/nar/ 11 gkz949, arXiv:https://acade...

  25. [25]

    Reactome: a database of reactions, pathways and biological processes.Nucleic Acids Res., 39(Database issue):D691–7, January 2011

    David Croft, Gavin O’Kelly, Guanming Wu, Robin Haw, Marc Gillespie, Lisa Matthews, Michael Caudy, Phani Garapati, Gopal Gopinath, Bijay Jassal, Steven Jupe, Irina Kalatskaya, Shahana Mahajan, Bruce May, Nelson Ndegwa, Esther Schmidt, Veronica Shamovsky, Christina Yung, Ewan Birney, Henning Hermjakob, Peter D’Eustachio, and Lincoln Stein. Reactome: a datab...

  26. [26]

    The drug repurposing hub: a next-generation drug library and information resource.Nat

    Steven M Corsello, Joshua A Bittker, Zihan Liu, Joshua Gould, Patrick McCarren, Jodi E Hirschman, Stephen E Johnston, Anita Vrcic, Bang Wong, Mariya Khan, Jacob Asiedu, Rajiv Narayan, Christopher C Mader, Aravind Subramanian, and Todd R Golub. The drug repurposing hub: a next-generation drug library and information resource.Nat. Med., 23(4):405–408, April 2017

  27. [27]

    2016, in IEEE Conf

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume, 770–778. 2016. doi:10.1109/CVPR.2016.90

  28. [28]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...

  29. [29]

    Masked autoencoders for microscopy are scalable learners of cellular biology

    Oren Kraus, Kian Kenyon-Dean, Saber Saberian, Maryam Fallah, Peter McLean, Jess Le- ung, Vasudev Sharma, Ayla Khan, Jia Balakrishnan, Safiye Celik, Dominique Beaini, Maciej Sypetkowski, Chi Vicky Cheng, Kristen Morse, Maureen Makes, Ben Mabey, and Berton Earnshaw. Masked autoencoders for microscopy are scalable learners of cellular biology. In Proceedings...

  30. [30]

    Hansen, Mohini K

    Ankit Gupta, Zoe Wefers, Konstantin Kahnert, Jan N. Hansen, Mohini K. Misra, Will Leineweber, Anthony Cesnik, Dan Lu, Ulrika Axelsson, Frederic Ballllosera, Russ B. Altman, Theofanis Karaletsos, and Emma Lundberg. Subcell: proteome- aware vision foundation models for microscopy capture single-cell biology.bioRxiv,

  31. [31]

    627299, arXiv:https://www.biorxiv.org/content/early/2025/10/30/2024.12.06.627299.full.pdf, doi:10.1101/2024.12.06.627299

    URL: https://www.biorxiv.org/content/early/2025/10/30/2024.12.06. 627299, arXiv:https://www.biorxiv.org/content/early/2025/10/30/2024.12.06.627299.full.pdf, doi:10.1101/2024.12.06.627299

  32. [32]

    CLOOME: contrastive learning unlocks bioimaging databases for queries with chemical struc- tures.Nat

    Ana Sanchez-Fernandez, Elisabeth Rumetshofer, Sepp Hochreiter, and Günter Klambauer. CLOOME: contrastive learning unlocks bioimaging databases for queries with chemical struc- tures.Nat. Commun., 14(1):7339, November 2023

  33. [33]

    Michael Ando, John Arevalo, Melissa Bennion, Nicolas Boisseau, Adriana Borowa, Justin D

    Srinivas Niranj Chandrasekaran, Jeanelle Ackerman, Eric Alix, D. Michael Ando, John Arevalo, Melissa Bennion, Nicolas Boisseau, Adriana Borowa, Justin D. Boyd, Laurent Brino, Patrick J. Byrne, Hugo Ceulemans, Carolyn Ch’ng, Beth A. Cimini, Djork-Arne Clevert, Nicole Deflaux, John G Doench, Thierry Dorval, Regis Doyonnas, Vincenza Dragone, Ola Engkvist, Pa...

  34. [34]

    534023, arXiv:https://www.biorxiv.org/content/early/2023/03/27/2023.03.23.534023.full.pdf, doi:10.1101/2023.03.23.534023

    URL: https://www.biorxiv.org/content/early/2023/03/27/2023.03.23. 534023, arXiv:https://www.biorxiv.org/content/early/2023/03/27/2023.03.23.534023.full.pdf, doi:10.1101/2023.03.23.534023

  35. [35]

    A dataset of images and morphological profiles of 30 000 small-molecule treatments using the cell painting assay.GigaScience, 6(12):giw014, 12 2017

    Mark-Anthony Bray, Sigrun M Gustafsdottir, Mohammad H Rohban, Shantanu Singh, Veb- jorn Ljosa, Katherine L Sokolnicki, Joshua A Bittker, Nicole E Bodycombe, Vlado Dan ˇcík, Thomas P Hasaka, Cindy S Hon, Melissa M Kemp, Kejie Li, Deepika Walpita, Mathias J Wawer, Todd R Golub, Stuart L Schreiber, Paul A Clemons, Alykhan F Shamji, and Anne E Carpenter. A da...

  36. [36]

    Not Same Batch (NSB): Candidates from the same experimental batch are excluded, testing robustness to day-to-day variation between batches

  37. [37]

    Not Same Source (NSS): Candidates from the same institution are excluded, testing robust- ness to domain shifts such as microscopy hardware differences

  38. [38]

    For the enrichment tasks (MoA and gene pathway enrichment), stringency is implemented by constructing separate consensus profiles from non-overlapping subsets of replicates

    Not Same Layout (NSL): Candidates sharing the same plate position as the query are excluded, controlling for position-induced batch effects [5]. For the enrichment tasks (MoA and gene pathway enrichment), stringency is implemented by constructing separate consensus profiles from non-overlapping subsets of replicates. At the No Restriction level, all repli...

  39. [39]

    This step both standardizes dimensionality and denoises the feature space

    Platewise Center Scaling and PCA.All features are scaled to a standard range, then reduced via Principal Component Analysis applied across the entire dataset. This step both standardizes dimensionality and denoises the feature space. We use 64 components for the PCA analysis

  40. [40]

    Robustize

    Platewise Robust Standardization.A Median Absolute Deviation (MAD) "Robustize" transformation is applied per plate to align distributions across experimental runs while remaining robust to outliers. This pipeline is applied identically to all methods evaluated in the benchmark. 20 Appendix D: Supplementary Analyses D.1 Effect of Modified Haldane-Anscombe ...