SubQuad: Near-Quadratic-Free Structure Inference with Distribution-Balanced Objectives in Adaptive Receptor framework

Jiekai Wu; Kun Liu; Rong Fu; Simon Fong; Xianda Li; Zijian Zhang

arxiv: 2602.17330 · v4 · submitted 2026-02-19 · 💻 cs.LG · cs.AI

Rong Fu, Zijian Zhang, Kun Liu, Jiekai Wu, Xianda Li, Simon Fong

Pith reviewed 2026-05-15 21:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords immune repertoire analysis · MinHash prefiltering · differentiable gating · fairness-constrained clustering · subquadratic retrieval · clonotype grouping · antigen-specific subgroups

The pith

SubQuad pairs MinHash prefiltering with fairness calibration to analyze large immune repertoires without the near-quadratic cost of exhaustive pairwise comparison.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SubQuad as a pipeline for population-scale immune repertoire comparison that tackles two bottlenecks: the near-quadratic expense of pairwise sequence affinity checks and the tendency of imbalanced datasets to bury clinically relevant minority clonotypes. It combines compact MinHash prefiltering to limit candidate pairs, a differentiable gating module that learns to weight alignment and embedding signals per pair, and an automated calibration step that enforces proportional representation of rare antigen-specific groups. A sympathetic reader would care because these steps together promise to let standard hardware handle much larger viral and tumor datasets while keeping or raising recall, cluster purity, and subgroup equity metrics.
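The prefiltering idea can be illustrated with a minimal sketch. It is not the paper's implementation: the k-mer length, the number of hash functions, the band count, and the function names `kmers`, `minhash_signature`, and `candidate_pairs` are all assumptions chosen for illustration. Sequences are shingled into k-mers, summarized by a MinHash signature, and bucketed with LSH banding, so that only sequences sharing at least one band are passed on for full comparison.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers (shingles) of a sequence string."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(seq, num_hashes=32, k=3):
    """MinHash signature: for each seed, the minimum hash over the k-mer set."""
    # Fall back to the whole sequence when it is shorter than k.
    shingles = kmers(seq, k) or {seq}
    return tuple(min(_hash(seed, s) for s in shingles) for seed in range(num_hashes))


def candidate_pairs(seqs, num_hashes=32, bands=16, k=3):
    """LSH banding: sequences that agree on any full band become candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for idx, seq in enumerate(seqs):
        sig = minhash_signature(seq, num_hashes, k)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

Calling `candidate_pairs` on a list of CDR3 strings returns index pairs `(i, j)` with `i < j`. Only those pairs would go on to the more expensive affinity computation, which is where the saving over all-pairs comparison comes from.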

Core claim

SubQuad is an end-to-end system that performs antigen-aware near-subquadratic retrieval, GPU-accelerated affinity kernels, learned multimodal fusion through per-pair differentiable gating, and fairness-constrained clustering, delivering measured improvements in throughput and peak memory on large repertoires while preserving or improving recall@k, cluster purity, and subgroup equity.

What carries the argument

Compact MinHash prefiltering combined with a differentiable gating module for adaptive weighting of alignment and embedding channels, plus an automated routine that enforces proportional representation of rare subgroups.
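A per-pair gate of this kind can be sketched in a few lines. The form below is an assumption, not the paper's architecture: a sigmoid gate computed from hypothetical pair features blends an alignment score and an embedding score, and a single hand-derived gradient step on squared error shows why the blend is trainable end to end. The names `fused_score` and `sgd_step`, the feature vector, and the loss are all illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fused_score(align, embed, features, weights, bias):
    """Blend two similarity channels with a per-pair gate g in (0, 1)."""
    gate = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
    return gate * align + (1.0 - gate) * embed, gate


def sgd_step(pair, weights, bias, lr=0.5):
    """One gradient step on squared error for a single labelled pair."""
    align, embed, features, label = pair
    score, gate = fused_score(align, embed, features, weights, bias)
    # Chain rule:
    #   d(loss)/d(score) = 2 * (score - label)
    #   d(score)/d(gate) = align - embed
    #   d(gate)/d(logit) = gate * (1 - gate)
    grad_logit = 2.0 * (score - label) * (align - embed) * gate * (1.0 - gate)
    new_weights = [w - lr * grad_logit * x for w, x in zip(weights, features)]
    return new_weights, bias - lr * grad_logit
```

For a positive pair whose alignment score is high and embedding score is low, a step moves the gate toward the alignment channel. Because the gate depends on the pair's own features, different pairs can end up weighting the two channels differently.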

If this is right

Vaccine target prioritization can run on larger patient cohorts without proportional increases in compute or memory.
Biomarker discovery pipelines gain the ability to surface signals from underrepresented antigen subgroups.
Clustering results become more stable across dataset imbalance ratios without post-hoc reweighting.
Downstream translational tasks such as subgroup-specific response prediction become feasible on standard GPU hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prefilter-plus-gating design could transfer to other large-scale sequence clustering domains where quadratic costs currently limit scale.
If the fairness calibration proves robust, it may reduce the need for separate rebalancing stages in related single-cell or metagenomic pipelines.
A natural next test would measure how the method behaves when the underlying embeddings are replaced by newer protein language models.

Load-bearing premise

The MinHash prefilter and gating module will not discard clinically relevant minority clonotypes and the fairness calibration will not distort the underlying biological signals.

What would settle it

A benchmark set of repertoires containing known rare clonotypes in which SubQuad reports materially lower recall for those minorities or lower equity scores than an exhaustive pairwise baseline.
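Such a check could be scored with a per-subgroup recall@k against the exhaustive baseline. The sketch below assumes neighbor lists are already available from both systems; the function name, the dictionary layout, and the subgroup labels are hypothetical.

```python
from collections import defaultdict


def recall_at_k_by_group(exact_neighbors, approx_neighbors, group_of, k):
    """Mean recall@k per subgroup.

    exact_neighbors / approx_neighbors: dict query_id -> ranked list of neighbor ids.
    group_of: dict query_id -> subgroup label (for example, antigen specificity).
    """
    per_group = defaultdict(list)
    for query, exact in exact_neighbors.items():
        truth = set(exact[:k])
        if not truth:
            continue  # no ground-truth neighbors to recover for this query
        found = set(approx_neighbors.get(query, [])[:k])
        per_group[group_of[query]].append(len(truth & found) / len(truth))
    return {group: sum(vals) / len(vals) for group, vals in per_group.items()}
```

Reporting the rare-subgroup value alongside the common-subgroup value, rather than a single pooled recall, is what would expose a prefilter that quietly drops minority clonotypes.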

Figures

Figures reproduced from arXiv: 2602.17330 by Jiekai Wu, Kun Liu, Rong Fu, Simon Fong, Xianda Li, Zijian Zhang.

**Figure 1.** Overview of the SubQuad framework for near-quadratic-free, equity-aware repertoire inference. Scalable Preprocessing: Raw sequences S are processed via MinHash-based Indexing to generate a sparse candidate list CAN D and optimized using hardware-aware batching B. Representation Learning: A Dual-Phase Meta-Encoder utilizes ImmunoBERT-style pretraining followed by MetaNet fine-tuning. The Meta-Controller dyn…

**Figure 2.** Community structure in immune receptor networks. Vertices denote unique CDR3…

**Figure 3.** Empirical median and p98 latencies for a 10^7-sequence index under the efConstruction=200 and M=16 configuration used in our experiments.

**Figure 4.** UMAP projection of ImmunoBERT embeddings showing conserved antigen clusters.

**Figure 5.** F1 Score Heatmap for MinHash Parameter Selection.

**Figure 6.** Parameter optimization landscape for MinHash configurations.

**Figure 7.** Performance enhancement across computational domains.

**Figure 8.** Topological community organization in immune receptor network. Node size indicates TCR frequency, …

**Figure 9.** Feature distributions of immune receptor sequences across different antigens. Each subplot compares two…

**Figure 10.** Scale sensitivity of fairness metrics. Normalized disparity (…

**Figure 11.** Clinical decision support dashboard with human-AI collaboration.

Original abstract

Comparative analysis of adaptive immune repertoires at population scale is hampered by two practical bottlenecks: the near-quadratic cost of pairwise affinity evaluations and dataset imbalances that obscure clinically important minority clonotypes. We introduce SubQuad, an end-to-end pipeline that addresses these challenges by combining antigen-aware, near-subquadratic retrieval with GPU-accelerated affinity kernels, learned multimodal fusion, and fairness-constrained clustering. The system employs compact MinHash prefiltering to sharply reduce candidate comparisons, a differentiable gating module that adaptively weights complementary alignment and embedding channels on a per-pair basis, and an automated calibration routine that enforces proportional representation of rare antigen-specific subgroups. On large viral and tumor repertoires SubQuad achieves measured gains in throughput and peak memory usage while preserving or improving recall@k, cluster purity, and subgroup equity. By co-designing indexing, similarity fusion, and equity-aware objectives, SubQuad offers a scalable, bias-aware platform for repertoire mining and downstream translational tasks such as vaccine target prioritization and biomarker discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

SubQuad packages MinHash prefiltering, per-pair gating, and fairness calibration into one pipeline for large immune repertoires, but the prefilter's effect on rare clonotypes is not bounded or ablated.

The paper's core move is to cut the quadratic pairwise step in repertoire comparison by running compact MinHash sketches first, then feeding survivors into a differentiable gate that blends alignment scores with embeddings, and finally applying an automated reweighting step so minority antigen groups stay visible in the clusters. That combination is presented as new for this data type, and the motivation is clear: real datasets from viral or tumor samples are both enormous and heavily skewed toward a few dominant clonotypes. If the measured throughput and memory wins hold without dropping signal, the work would be directly usable for vaccine target work or biomarker screens. The authors do cite the relevant approximate-nearest-neighbor and fairness literature, and they ship an end-to-end system rather than isolated tricks, which is a plus. The experiments are described as showing gains in recall@k, purity, and equity, which is the right set of metrics.

The main weakness is exactly the one the stress-test flags. MinHash sketch size trades speed for recall, and on imbalanced repertoires the probability of missing a rare pair rises fast. The paper supplies no separate recall curves for minority subgroups, no analytic Jaccard bound mapped to the affinity threshold, and no ablation that isolates the prefilter's contribution. Without those, the fairness calibration is operating on whatever survived the first cut, so any claim that equity is preserved rests on an untested assumption. The abstract's performance claims also come without numbers, error bars, or baseline tables in the summary view, which makes it hard to judge effect size.

This is the kind of paper that belongs in a methods journal or a computational immunology venue. Readers who build pipelines for sequence data will want to see the full methods and the missing recall plots before adopting the code. It is worth sending to referees so they can check whether the empirical claims survive once the prefilter behavior is quantified.
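The speed-versus-recall trade-off mentioned in the letter can be made concrete with the textbook analysis of MinHash with LSH banding. This is standard background rather than a bound taken from the paper: with `b` bands of `r` rows, a pair with Jaccard similarity `s` becomes a candidate with probability `1 - (1 - s^r)^b`, and the MinHash estimate of `s` from `k` hashes has standard deviation `sqrt(s(1 - s)/k)`. The function names below are illustrative, and mapping Jaccard similarity to the paper's affinity threshold remains the open step.

```python
import math


def candidate_probability(jaccard, bands, rows):
    """Probability that a pair shares at least one full band: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - jaccard ** rows) ** bands


def miss_probability(jaccard, bands, rows):
    """Probability that the prefilter never proposes the pair."""
    return 1.0 - candidate_probability(jaccard, bands, rows)


def estimator_std(jaccard, num_hashes):
    """Standard deviation of the MinHash Jaccard estimate with num_hashes hashes."""
    return math.sqrt(jaccard * (1.0 - jaccard) / num_hashes)
```

Evaluating `miss_probability` over a grid of similarities for a given `(bands, rows)` setting gives the kind of curve the letter asks for. A rare clonotype pair with modest k-mer overlap sits on the steep part of that curve, where small changes in sketch configuration change the miss rate substantially.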

Referee Report

3 major / 1 minor

Summary. The paper introduces SubQuad, an end-to-end pipeline for population-scale analysis of adaptive immune repertoires. It combines compact MinHash prefiltering for near-subquadratic candidate retrieval, a differentiable gating module that adaptively fuses alignment and embedding channels on a per-pair basis, GPU-accelerated affinity kernels, and an automated fairness calibration routine that enforces proportional representation of rare antigen-specific subgroups. The central claim is that this co-design yields measured gains in throughput and peak memory usage on large viral and tumor repertoires while preserving or improving recall@k, cluster purity, and subgroup equity.

Significance. If the performance and recall guarantees are rigorously validated, SubQuad would offer a practical, bias-aware platform for repertoire mining with direct relevance to vaccine target prioritization and biomarker discovery. The explicit integration of distribution-balanced objectives with subquadratic indexing is a constructive contribution to scalable, fairness-aware methods in computational immunology.

major comments (3)

[MinHash prefiltering] MinHash prefiltering component: no analytic recall bound (e.g., via Jaccard-to-affinity mapping) or empirical recall@k curves on minority subgroups in imbalanced repertoires are supplied. This is load-bearing for the claim that downstream fairness calibration operates on the true distribution rather than a filtered subset.
[Evaluation and results] Evaluation section: the abstract states 'measured gains in throughput and peak memory usage' but supplies no numerical values, error bars, baseline comparisons, or ablation results. Without these data the central performance assertions cannot be verified.
[Differentiable gating module] Differentiable gating and calibration: the description of the learned gating module and automated calibration does not specify whether parameters are fit on held-out data or the same evaluation set, raising a circularity risk for the reported recall@k and equity metrics.

minor comments (1)

[Abstract] Abstract: inclusion of at least one concrete quantitative result (e.g., 'X-fold throughput improvement at Y% recall') would make the claims more immediately assessable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. We address each major comment below and commit to revisions that strengthen the clarity and rigor of the SubQuad pipeline description.

Referee: [MinHash prefiltering] MinHash prefiltering component: no analytic recall bound (e.g., via Jaccard-to-affinity mapping) or empirical recall@k curves on minority subgroups in imbalanced repertoires are supplied. This is load-bearing for the claim that downstream fairness calibration operates on the true distribution rather than a filtered subset.

Authors: We agree that an analytic recall bound would strengthen the theoretical claims. Deriving a tight closed-form Jaccard-to-affinity mapping under our learned multimodal fusion is non-trivial, but we will add extensive empirical recall@k curves with explicit breakdowns for minority antigen-specific subgroups across imbalanced viral and tumor repertoires. These results will be placed in a dedicated subsection of the evaluation to demonstrate that prefiltering preserves the underlying distribution for fairness calibration.

revision: yes

Referee: [Evaluation and results] Evaluation section: the abstract states 'measured gains in throughput and peak memory usage' but supplies no numerical values, error bars, baseline comparisons, or ablation results. Without these data the central performance assertions cannot be verified.

Authors: We acknowledge that the abstract and evaluation section require explicit numerical support. We will revise the abstract to report concrete throughput and memory gains with error bars from repeated runs. The evaluation section will be expanded to include full baseline comparisons (standard MinHash, embedding-only, alignment-only, and fairness-unaware clustering) together with ablation studies on each component, reporting all metrics (recall@k, cluster purity, subgroup equity) with standard deviations.

revision: yes

Referee: [Differentiable gating module] Differentiable gating and calibration: the description of the learned gating module and automated calibration does not specify whether parameters are fit on held-out data or the same evaluation set, raising a circularity risk for the reported recall@k and equity metrics.

Authors: We thank the referee for identifying this ambiguity. The gating module parameters and fairness calibration routine are fit exclusively on held-out validation sets; final recall@k and equity metrics are computed on completely disjoint test sets. We will add an explicit description of the train/validation/test splits and training protocol in the methods section to remove any risk of circularity.

revision: yes
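One way to make a disjoint-split protocol checkable is to assign whole groups (for example, donors or clonotype clusters) to a split with a deterministic hash, so related sequences cannot leak across splits. The sketch below is an assumption about how such a protocol might look, not the authors' procedure. The grouping key, the split fractions, and the function names are hypothetical.

```python
import hashlib


def assign_split(group_key, val_frac=0.1, test_frac=0.1):
    """Map a group key to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(group_key.encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def split_records(records, key, val_frac=0.1, test_frac=0.1):
    """Partition records so that every group lands in exactly one split."""
    splits = {"train": [], "val": [], "test": []}
    for record in records:
        splits[assign_split(key(record), val_frac, test_frac)].append(record)
    return splits
```

Under this scheme the gate and the calibration routine would be fit on one split and scored on another, and the disjointness of group keys across splits can be asserted directly.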

Circularity Check

0 steps flagged

No significant circularity; pipeline claims rest on empirical measurements rather than self-referential definitions

The abstract and available description present SubQuad as an end-to-end pipeline combining MinHash prefiltering, a differentiable gating module, multimodal fusion, and fairness-constrained clustering. No equations, derivation steps, or self-citations are exhibited that reduce any claimed prediction or uniqueness result to a fitted parameter or prior author result by construction. The reported gains in throughput, memory, recall@k, purity, and equity are framed as measured outcomes on viral and tumor repertoires, with no indication that any core quantity is defined in terms of itself or renamed from a known result. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on domain assumptions about approximate retrieval preserving recall and fairness objectives not distorting biology; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: MinHash prefiltering combined with learned gating preserves recall for antigen-specific minority clonotypes.
Invoked to justify near-subquadratic cost without loss of clinically relevant signals.
domain assumption: Automated calibration can enforce proportional subgroup representation without introducing new bias.
Used to claim equity gains while preserving cluster purity.
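The second assumption can be probed with a simple reweighting check. The sketch below is one plausible reading of "proportional representation", not the paper's calibration routine: each subgroup receives a weight equal to its target share divided by its observed share, and the disparity between weighted and target shares is then measured. The target shares, the function names, and the disparity definition are assumptions.

```python
from collections import Counter


def calibration_weights(labels, target_shares):
    """Per-subgroup weight = target share / observed share."""
    counts = Counter(labels)
    total = len(labels)
    return {g: target_shares[g] / (counts[g] / total) for g in counts}


def weighted_shares(labels, weights):
    """Share of total weight carried by each subgroup after reweighting."""
    totals = Counter()
    for group in labels:
        totals[group] += weights[group]
    norm = sum(totals.values())
    return {g: t / norm for g, t in totals.items()}


def max_disparity(shares, target_shares):
    """Largest absolute gap between achieved and target shares."""
    return max(abs(shares[g] - target_shares[g]) for g in target_shares)
```

A check of this kind only confirms that the weights hit their targets. Whether the reweighted clusters still reflect the underlying biology, which is what the assumption actually claims, needs the subgroup-level recall and purity comparisons discussed in the referee report.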
