SubQuad: Near-Quadratic-Free Structure Inference with Distribution-Balanced Objectives in Adaptive Receptor framework
Pith reviewed 2026-05-15 21:09 UTC · model grok-4.3
The pith
SubQuad pairs MinHash prefiltering with fairness calibration to analyze large immune repertoires without paying the full quadratic cost of all-pairs comparison.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SubQuad is an end-to-end system that performs antigen-aware near-subquadratic retrieval, GPU-accelerated affinity kernels, learned multimodal fusion through per-pair differentiable gating, and fairness-constrained clustering, delivering measured improvements in throughput and peak memory on large repertoires while preserving or improving recall@k, cluster purity, and subgroup equity.
What carries the argument
Compact MinHash prefiltering combined with a differentiable gating module for adaptive weighting of alignment and embedding channels, plus an automated routine that enforces proportional representation of rare subgroups.
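To make the prefiltering idea concrete, here is a minimal Python sketch of MinHash with banded bucketing. It is an illustration under stated assumptions, not SubQuad's implementation: the 3-mer shingling, the `blake2b`-based hash family, the band layout, and the names `kmers`, `minhash_signature`, and `candidate_pairs` are all hypothetical choices that the summary does not specify.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers (the whole sequence if it is shorter than k)."""
    if len(seq) < k:
        return {seq}
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, num_hashes=32, k=3):
    """One minimum per seeded hash function; similar k-mer sets share many minima."""
    grams = kmers(seq, k)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(), "big")
            for g in grams))
    return tuple(signature)


def candidate_pairs(seqs, num_hashes=32, bands=8, k=3):
    """Bucket sequences by signature band; only bucket-mates become candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for idx, seq in enumerate(seqs):
        sig = minhash_signature(seq, num_hashes, k)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))  # indices were appended in increasing order
    return pairs


# Toy CDR3-like strings (made up for illustration): two identical, one unrelated.
seqs = ["CASSLGQAYEQYF", "CASSLGQAYEQYF", "CAWSVGGNTEAFF"]
pairs = candidate_pairs(seqs)
```

Only the pairs returned by `candidate_pairs` would be passed on to the expensive affinity computation; all other pairs are skipped, which is where the saving over an all-pairs scan would come from.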
If this is right
- Vaccine target prioritization can run on larger patient cohorts without proportional increases in compute or memory.
- Biomarker discovery pipelines gain the ability to surface signals from underrepresented antigen subgroups.
- Clustering results become more stable across dataset imbalance ratios without post-hoc reweighting.
- Downstream translational tasks such as subgroup-specific response prediction become feasible on standard GPU hardware.
Where Pith is reading between the lines
- The same prefilter-plus-gating design could transfer to other large-scale sequence clustering domains where quadratic costs currently limit scale.
- If the fairness calibration proves robust, it may reduce the need for separate rebalancing stages in related single-cell or metagenomic pipelines.
- A natural next test would measure how the method behaves when the underlying embeddings are replaced by newer protein language models.
Load-bearing premise
The MinHash prefilter and gating module will not discard clinically relevant minority clonotypes, and the fairness calibration will not distort the underlying biological signals.
What would settle it
A benchmark set of repertoires containing known rare clonotypes in which SubQuad reports materially lower recall for those minorities or lower equity scores than an exhaustive pairwise baseline.
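One way to run such a check is to compute recall@k separately for each subgroup, treating the exhaustive pairwise baseline as ground truth. The sketch below assumes the neighbor lists are already available as plain dictionaries; the function name, argument names, and toy data are hypothetical.

```python
from collections import defaultdict


def subgroup_recall_at_k(exact_neighbors, approx_neighbors, subgroup_of, k):
    """Recall@k per subgroup, with the exhaustive neighbor lists as ground truth.

    exact_neighbors / approx_neighbors: dict mapping query id -> ranked neighbor ids.
    subgroup_of: dict mapping query id -> subgroup label.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for query, exact in exact_neighbors.items():
        truth = set(exact[:k])
        found = set(approx_neighbors.get(query, [])[:k])
        group = subgroup_of[query]
        hits[group] += len(truth & found)
        totals[group] += len(truth)
    return {g: hits[g] / totals[g] for g in totals if totals[g] > 0}


# Toy data: the rare-subgroup query loses one of its two true neighbors.
exact = {"q1": ["a", "b"], "q2": ["c", "d"]}
approx = {"q1": ["a", "b"], "q2": ["c", "x"]}
groups = {"q1": "common", "q2": "rare"}
recalls = subgroup_recall_at_k(exact, approx, groups, k=2)
```

A gap between the rare and common entries of `recalls` is the kind of signal that would count against the load-bearing premise.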
Original abstract
Comparative analysis of adaptive immune repertoires at population scale is hampered by two practical bottlenecks: the near-quadratic cost of pairwise affinity evaluations and dataset imbalances that obscure clinically important minority clonotypes. We introduce SubQuad, an end-to-end pipeline that addresses these challenges by combining antigen-aware, near-subquadratic retrieval with GPU-accelerated affinity kernels, learned multimodal fusion, and fairness-constrained clustering. The system employs compact MinHash prefiltering to sharply reduce candidate comparisons, a differentiable gating module that adaptively weights complementary alignment and embedding channels on a per-pair basis, and an automated calibration routine that enforces proportional representation of rare antigen-specific subgroups. On large viral and tumor repertoires SubQuad achieves measured gains in throughput and peak memory usage while preserving or improving recall@k, cluster purity, and subgroup equity. By co-designing indexing, similarity fusion, and equity-aware objectives, SubQuad offers a scalable, bias-aware platform for repertoire mining and downstream translational tasks such as vaccine target prioritization and biomarker discovery.
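The per-pair gating described in the abstract can be pictured with a small Python sketch. The abstract does not give the functional form, so the sigmoid gate over hand-picked pair features, and the names `gated_affinity`, `features`, `weights`, and `bias`, are assumptions made purely for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_affinity(align_score, embed_score, features, weights, bias=0.0):
    """Blend two similarity channels with a gate computed for this specific pair.

    The gate is a smooth function of the weights, so the weights could in
    principle be fit by gradient descent (the assumed meaning of "differentiable").
    """
    gate = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
    fused = gate * align_score + (1.0 - gate) * embed_score
    return fused, gate


# With zero weights and zero bias the gate is 0.5, so the channels are averaged.
fused_even, gate_even = gated_affinity(0.8, 0.4, features=[1.0, 2.0], weights=[0.0, 0.0])

# A strongly positive gate input makes the fused score follow the alignment channel.
fused_align, gate_align = gated_affinity(0.8, 0.4, features=[1.0, 2.0],
                                         weights=[0.0, 0.0], bias=50.0)
```

The design point is that the mixing weight is not a single global constant: two pairs with different features can lean on different channels.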
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SubQuad, an end-to-end pipeline for population-scale analysis of adaptive immune repertoires. It combines compact MinHash prefiltering for near-subquadratic candidate retrieval, a differentiable gating module that adaptively fuses alignment and embedding channels on a per-pair basis, GPU-accelerated affinity kernels, and an automated fairness calibration routine that enforces proportional representation of rare antigen-specific subgroups. The central claim is that this co-design yields measured gains in throughput and peak memory usage on large viral and tumor repertoires while preserving or improving recall@k, cluster purity, and subgroup equity.
Significance. If the performance and recall guarantees are rigorously validated, SubQuad would offer a practical, bias-aware platform for repertoire mining with direct relevance to vaccine target prioritization and biomarker discovery. The explicit integration of distribution-balanced objectives with subquadratic indexing is a constructive contribution to scalable, fairness-aware methods in computational immunology.
major comments (3)
- [MinHash prefiltering] MinHash prefiltering component: no analytic recall bound (e.g., via Jaccard-to-affinity mapping) or empirical recall@k curves on minority subgroups in imbalanced repertoires are supplied. This is load-bearing for the claim that downstream fairness calibration operates on the true distribution rather than a filtered subset.
- [Evaluation and results] Evaluation section: the abstract states 'measured gains in throughput and peak memory usage' but supplies no numerical values, error bars, baseline comparisons, or ablation results. Without these data the central performance assertions cannot be verified.
- [Differentiable gating module] Differentiable gating and calibration: the description of the learned gating module and automated calibration does not specify whether parameters are fit on held-out data or the same evaluation set, raising a circularity risk for the reported recall@k and equity metrics.
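For orientation on the first major comment, the standard banded-MinHash candidate probability is the natural starting point for an analytic recall bound. This is the textbook result for locality-sensitive hashing with b bands of r rows each; whether SubQuad uses this banding scheme is an assumption, and the symbols s, r, b, and s_min are introduced here only for illustration.

```latex
% Per-hash collision probability equals the Jaccard similarity of the two k-mer sets A and B:
s = J(A, B) = \frac{|A \cap B|}{|A \cup B|}

% Probability that a pair shares at least one of b bands of r rows each:
\Pr[\text{candidate} \mid s] = 1 - \left(1 - s^{r}\right)^{b}

% If every true neighbour pair satisfies s \ge s_{\min}, the expected prefilter recall obeys:
\mathbb{E}[\text{recall}] \;\ge\; 1 - \left(1 - s_{\min}^{\,r}\right)^{b}
```

As a worked example with r = 4 and b = 8, a pair with s = 0.8 is retained with probability about 0.985, while a pair with s = 0.5 is retained with probability about 0.40. The hard part flagged by the referee is the step this sketch leaves out: relating the Jaccard similarity of k-mer sets to the learned affinity that defines a true neighbor.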
minor comments (1)
- [Abstract] Abstract: inclusion of at least one concrete quantitative result (e.g., 'X-fold throughput improvement at Y% recall') would make the claims more immediately assessable.
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review of our manuscript. We address each major comment below and commit to revisions that strengthen the clarity and rigor of the SubQuad pipeline description.
Point-by-point responses
- Referee: [MinHash prefiltering] MinHash prefiltering component: no analytic recall bound (e.g., via Jaccard-to-affinity mapping) or empirical recall@k curves on minority subgroups in imbalanced repertoires are supplied. This is load-bearing for the claim that downstream fairness calibration operates on the true distribution rather than a filtered subset.
  Authors: We agree that an analytic recall bound would strengthen the theoretical claims. Deriving a tight closed-form Jaccard-to-affinity mapping under our learned multimodal fusion is non-trivial, but we will add extensive empirical recall@k curves with explicit breakdowns for minority antigen-specific subgroups across imbalanced viral and tumor repertoires. These results will be placed in a dedicated subsection of the evaluation to demonstrate that prefiltering preserves the underlying distribution for fairness calibration.
  Revision: yes
- Referee: [Evaluation and results] Evaluation section: the abstract states 'measured gains in throughput and peak memory usage' but supplies no numerical values, error bars, baseline comparisons, or ablation results. Without these data the central performance assertions cannot be verified.
  Authors: We acknowledge that the abstract and evaluation section require explicit numerical support. We will revise the abstract to report concrete throughput and memory gains with error bars from repeated runs. The evaluation section will be expanded to include full baseline comparisons (standard MinHash, embedding-only, alignment-only, and fairness-unaware clustering) together with ablation studies on each component, reporting all metrics (recall@k, cluster purity, subgroup equity) with standard deviations.
  Revision: yes
- Referee: [Differentiable gating module] Differentiable gating and calibration: the description of the learned gating module and automated calibration does not specify whether parameters are fit on held-out data or the same evaluation set, raising a circularity risk for the reported recall@k and equity metrics.
  Authors: We thank the referee for identifying this ambiguity. The gating module parameters and fairness calibration routine are fit exclusively on held-out validation sets; final recall@k and equity metrics are computed on completely disjoint test sets. We will add an explicit description of the train/validation/test splits and training protocol in the methods section to remove any risk of circularity.
  Revision: yes
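The split protocol promised in the last response could look like the following Python sketch. It is only an illustration: the rebuttal does not say at what level the data are split, so splitting by donor identifier, the fractions, and the name `split_ids` are assumptions.

```python
import random


def split_ids(ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Partition unique ids into disjoint train / validation / test sets."""
    unique = sorted(set(ids))            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(unique)
    n_test = int(len(unique) * test_frac)
    n_val = int(len(unique) * val_frac)
    test = set(unique[:n_test])
    val = set(unique[n_test:n_test + n_val])
    train = set(unique[n_test + n_val:])
    return train, val, test


# Hypothetical donor identifiers; gate and calibration parameters would be fit
# on `val` only, and recall@k and equity metrics reported on `test` only.
ids = [f"donor_{i:02d}" for i in range(20)]
train, val, test = split_ids(ids)
```

Splitting by donor rather than by individual sequence is the more conservative choice, since sequences from one donor can be near-duplicates of each other.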
Circularity Check
No significant circularity; pipeline claims rest on empirical measurements rather than self-referential definitions
full rationale
The abstract and available description present SubQuad as an end-to-end pipeline combining MinHash prefiltering, a differentiable gating module, multimodal fusion, and fairness-constrained clustering. No equations, derivation steps, or self-citations are exhibited that reduce any claimed prediction or uniqueness result to a fitted parameter or prior author result by construction. The reported gains in throughput, memory, recall@k, purity, and equity are framed as measured outcomes on viral and tumor repertoires, with no indication that any core quantity is defined in terms of itself or renamed from a known result. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: MinHash prefiltering combined with learned gating preserves recall for antigen-specific minority clonotypes
- domain assumption: Automated calibration can enforce proportional subgroup representation without introducing new bias
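The second assumption can at least be audited after clustering. The sketch below measures how far each cluster's subgroup mix departs from the global mix; this ratio is one common way to quantify proportional representation and is an assumption here, since the page does not define SubQuad's equity metric. All names are hypothetical.

```python
from collections import Counter


def representation_ratios(cluster_of, subgroup_of):
    """For each (cluster, subgroup): share within the cluster divided by global share.

    A ratio of 1.0 means the subgroup is represented proportionally in that cluster;
    0.0 means it is absent, and values above 1.0 mean it is over-represented.
    """
    n = len(cluster_of)
    global_counts = Counter(subgroup_of[i] for i in cluster_of)
    cluster_sizes = Counter(cluster_of.values())
    joint = Counter((cluster_of[i], subgroup_of[i]) for i in cluster_of)
    ratios = {}
    for cluster, size in cluster_sizes.items():
        for group, group_total in global_counts.items():
            ratios[(cluster, group)] = (joint[(cluster, group)] / size) / (group_total / n)
    return ratios


# Eight toy items in two clusters of four; two items belong to the rare subgroup.
clusters = {i: ("A" if i < 4 else "B") for i in range(8)}
balanced = {i: ("rare" if i in (0, 4) else "common") for i in range(8)}
skewed = {i: ("rare" if i in (0, 1) else "common") for i in range(8)}

balanced_ratios = representation_ratios(clusters, balanced)
skewed_ratios = representation_ratios(clusters, skewed)
```

Reporting the minimum ratio over all clusters and subgroups gives a single worst-case number that can be tracked before and after calibration.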