arxiv: 1912.08735 · v5 · submitted 2019-12-18 · 🧬 q-bio.GN · cs.CE

AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

Pith reviewed 2026-05-24 14:55 UTC · model grok-4.3

classification 🧬 q-bio.GN cs.CE
keywords read remapping · reference genomes · alignments · variant calling · SNP · INDEL · genome analysis · bioinformatics

The pith

AirLift remaps read sets to new reference genomes up to 27.4 times faster than complete remapping while preserving variant accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AirLift as a technique to remap alignments of reads from one reference genome to a similar newer one. This approach avoids the need for full re-alignment from scratch each time a reference is updated. A sympathetic reader would care because frequent reference updates make re-analysis costly in time and compute. If the method works, researchers can update their variant calls and other analyses on the latest genome versions much more quickly. The validation shows it maintains accuracy for identifying SNPs and INDELs.

Core claim

AirLift is the first read remapping tool that enables users to quickly and comprehensively map a read set, that had been previously mapped to one reference genome, to another similar reference. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces the overall execution time to remap read sets between two reference genome versions by up to 27.4x. We validate our remapping results with GATK and find that AirLift provides high accuracy in identifying ground truth SNP/INDEL variants.

What carries the argument

The remapping technique that adjusts alignments only in regions where the two reference genomes differ, rather than re-aligning all reads from scratch.
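
To make the idea concrete, here is a minimal Python sketch of the partitioning step implied above: reads whose old alignment touches a differing region are set aside for re-alignment, and all others are kept. This is an illustration under stated assumptions, not AirLift's actual implementation. The function names (`merge_intervals`, `partition_reads`), the `flank` parameter, the half-open interval convention, and the single-chromosome input are all hypothetical choices made for this example.

```python
from bisect import bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]   # half-open [start, end) on the old reference (assumed convention)
Read = Tuple[str, int, int]  # (read name, alignment start, alignment end), hypothetical layout


def merge_intervals(intervals: List[Interval], flank: int = 0) -> List[Interval]:
    """Pad each differing interval by `flank` bases and merge overlapping or touching ones."""
    padded = sorted((max(0, s - flank), e + flank) for s, e in intervals)
    merged: List[Interval] = []
    for s, e in padded:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def partition_reads(reads: List[Read], diff_intervals: List[Interval],
                    flank: int = 0) -> Tuple[List[str], List[str]]:
    """Split reads into (keep, remap) by overlap with the differing intervals."""
    merged = merge_intervals(diff_intervals, flank)
    starts = [s for s, _ in merged]
    keep: List[str] = []
    remap: List[str] = []
    for name, start, end in reads:
        # Last merged interval that begins at or before the read's final base.
        i = bisect_right(starts, end - 1) - 1
        if i >= 0 and merged[i][1] > start:
            remap.append(name)   # touches a differing region: re-align
        else:
            keep.append(name)    # untouched: alignment can be carried over
    return keep, remap
```

Because the merged intervals are sorted and disjoint, one binary search per read is enough, so the partitioning cost grows with the number of reads rather than with the product of reads and intervals.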

If this is right

  • Users can quickly run downstream analysis of read sets for each latest reference release.
  • Remapping execution time is reduced by up to 27.4x compared to full mapping.
  • High accuracy is maintained in identifying ground truth SNP/INDEL variants as validated by GATK.
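
The 27.4x figure can be put in context with a back-of-the-envelope cost model. The sketch below is an assumption of this note, not a model from the paper: it supposes that every read either pays the full alignment cost or a cheaper adjustment cost, and ignores I/O and index construction. The names `estimated_speedup`, `max_remap_fraction`, `remap_fraction`, and `lift_cost_ratio` are hypothetical.

```python
def estimated_speedup(remap_fraction: float, lift_cost_ratio: float) -> float:
    """Speedup over full mapping under a simple two-cost model (an assumption).

    remap_fraction: share of reads that must be fully re-aligned, in [0, 1].
    lift_cost_ratio: per-read cost of adjusting a kept alignment, as a
        fraction of the per-read cost of a full alignment.
    """
    if not 0.0 <= remap_fraction <= 1.0:
        raise ValueError("remap_fraction must be in [0, 1]")
    cost = remap_fraction + (1.0 - remap_fraction) * lift_cost_ratio
    if cost <= 0.0:
        raise ValueError("total relative cost must be positive")
    return 1.0 / cost


def max_remap_fraction(target_speedup: float, lift_cost_ratio: float) -> float:
    """Largest re-aligned share of reads compatible with a target speedup."""
    if target_speedup < 1.0 or not 0.0 <= lift_cost_ratio < 1.0:
        raise ValueError("need target_speedup >= 1 and 0 <= lift_cost_ratio < 1")
    fraction = (1.0 / target_speedup - lift_cost_ratio) / (1.0 - lift_cost_ratio)
    return max(0.0, fraction)


if __name__ == "__main__":
    # With a free adjustment step, a 27.4x speedup leaves room for
    # 1 / 27.4 of the reads (about 3.65%) to be fully re-aligned.
    print(max_remap_fraction(27.4, 0.0))
```

Under this toy model the reported upper bound is only reachable when a small share of reads needs full re-alignment, which is why the similarity of the two references matters for the speedup.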

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Analyses on large genomic datasets could become more iterative, allowing frequent incorporation of updated references without prohibitive costs.
  • Similar adjustment strategies might extend to remapping in other sequencing technologies or between assemblies if similarity holds.
  • Laboratories with limited compute resources could perform more variant calling studies on updated genomes.

Load-bearing premise

The two reference genomes must be similar enough that most alignments can be adjusted by handling only the differing regions without missing or incorrectly remapping a substantial fraction of reads.

What would settle it

Running AirLift and full remapping on read sets between two dissimilar reference genomes and observing a large drop in variant calling accuracy or many unmapped reads in the AirLift output.
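
A drop in variant calling accuracy of the kind described above could be measured by comparing a call set against a ground-truth set. The following sketch assumes variants have already been called (for example with GATK) and parsed into `(chromosome, position, ref, alt)` tuples. It does not invoke GATK, and the exact-match comparison is a simplification that ignores differences in how indels are represented. The `concordance` function and the `Variant` type are hypothetical names.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)


def concordance(truth: Set[Variant], called: Set[Variant]) -> Dict[str, float]:
    """Precision, recall, and F1 of a call set against ground-truth variants."""
    tp = len(truth & called)   # called and in the truth set
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Running this once on calls derived from the remapped alignments and once on calls derived from a full mapping, against the same truth set, gives two comparable scores. A large gap between them on dissimilar references would count against the load-bearing premise.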

Figures

Figures reproduced from arXiv: 1912.08735 by Can Alkan, Can Firtina, Damla Senol Cali, Jeremie S. Kim, Meryem Banu Cavlak, Mohammed Alser, Nastaran Hajinazar, Onur Mutlu.

Figure 1: Limitations of Existing Remapping Tools. Existing remapping tools …
Figure 2: An example pair of reference genomes (old and new) with regions …
Figure 3: AirLift uses eight key steps to identify and label regions in the old and …
Figure 4: Using AirLift to remap a read set. AirLift remaps each read differently …
Figure 5: AirLift execution time results. We show the execution time (log-scale y-axis) of running three remapping tools, CrossMap (blue), AirLift (orange), and LiftOver (green) on a read set to a new reference genome against the baseline (red) of fully mapping a read set to the new reference genome. We plot the execution times of each tool for various pairs of reference genomes (x-axis; where the old reference is …
Figure 6: AirLift memory usage results. Peak memory usage results for each of the remapping tools during remapping
Original abstract

AirLift is the first read remapping tool that enables users to quickly and comprehensively map a read set, that had been previously mapped to one reference genome, to another similar reference. Users can then quickly run a downstream analysis of read sets for each latest reference release. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces the overall execution time to remap read sets between two reference genome versions by up to 27.4x. We validate our remapping results with GATK and find that AirLift provides high accuracy in identifying ground truth SNP/INDEL variants. AirLift source code and readme describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents AirLift, a read remapping tool that adjusts existing alignments between two similar reference genomes by focusing on differing intervals rather than performing full re-alignment. It claims up to 27.4x reduction in wall-clock time versus full mapping on human references (hg19↔hg38) and high accuracy in recovering ground-truth SNP/INDEL calls when downstream analysis is performed with GATK.

Significance. If the performance and accuracy results hold, the work would be useful for genomics pipelines that must periodically re-analyze large read sets against updated references; the open-source release and reproduction instructions are a concrete strength that supports reproducibility.

major comments (2)
  1. [Abstract and method description] The central speedup (up to 27.4×) and GATK concordance claims rest on the unquantified assumption that reference differences are localized and small enough that the fraction of reads requiring de-novo placement or crossing unhandled structural events remains negligible. The manuscript reports results only on hg19↔hg38 pairs whose differences are mostly small indels/SNVs but supplies no bound on tolerable inversion size or structural variation fraction; this directly affects both the reported execution-time reduction and variant-calling fidelity.
  2. [Results / validation section] Validation experiments cite GATK concordance but do not report the precise read-set sizes, coverage depths, or the handling of reads whose correct placement spans difference boundaries; without these details the claim that accuracy remains “high” cannot be assessed for generalizability beyond the tested human pairs.
minor comments (2)
  1. [Introduction] Define the term “similar reference” quantitatively (e.g., maximum allowed structural-event size) in the introduction.
  2. [Methods] Add a short algorithmic outline or pseudocode for the interval-adjustment procedure to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point-by-point to the major comments below.

point-by-point responses
  1. Referee: [Abstract and method description] The central speedup (up to 27.4×) and GATK concordance claims rest on the unquantified assumption that reference differences are localized and small enough that the fraction of reads requiring de-novo placement or crossing unhandled structural events remains negligible. The manuscript reports results only on hg19↔hg38 pairs whose differences are mostly small indels/SNVs but supplies no bound on tolerable inversion size or structural variation fraction; this directly affects both the reported execution-time reduction and variant-calling fidelity.

    Authors: AirLift targets similar reference genomes whose differences are localized (primarily small indels and SNVs), as exemplified by the hg19–hg38 pair. The algorithm identifies differing intervals and only remaps reads overlapping those intervals or their immediate vicinity; reads outside differing intervals retain their original placements. We do not provide a quantitative bound on inversion size or SV fraction because the manuscript evaluates the specific case of human reference updates. We will add an explicit limitations paragraph stating the assumption of localized differences and noting that large structural events would require separate handling or full re-mapping, thereby clarifying the scope of the reported speedup and accuracy. revision: partial

  2. Referee: [Results / validation section] Validation experiments cite GATK concordance but do not report the precise read-set sizes, coverage depths, or the handling of reads whose correct placement spans difference boundaries; without these details the claim that accuracy remains “high” cannot be assessed for generalizability beyond the tested human pairs.

    Authors: We will revise the results and methods sections to report the exact read-set sizes, sequencing coverage depths, and the precise rule used when a read’s correct placement spans a difference boundary (such reads are extracted and re-mapped de novo by the underlying aligner). These additions will make the experimental conditions fully reproducible and allow readers to judge generalizability. revision: yes
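
The boundary rule in the second response, together with the referee's request for an algorithmic outline, can be illustrated with a short sketch. It assumes the two references have been reduced to a sorted list of non-overlapping aligned blocks, each recording where a stretch of the old reference lands in the new one. That block format, the function name `lift_read`, and the half-open coordinates are hypothetical and are not taken from the manuscript.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (old_start, old_end, new_start): old-reference span [old_start, old_end)
# that appears unchanged in the new reference starting at new_start.
Block = Tuple[int, int, int]


def lift_read(start: int, end: int, blocks: List[Block]) -> Optional[Tuple[int, int]]:
    """Translate a read's [start, end) span to new-reference coordinates.

    `blocks` must be sorted by old_start and non-overlapping. Returns None
    when the read is not fully contained in one block, which in this sketch
    means it should be handed to the aligner for de novo mapping.
    """
    old_starts = [b[0] for b in blocks]
    i = bisect_right(old_starts, start) - 1   # block starting at or before the read
    if i < 0:
        return None                           # read begins before any block
    old_start, old_end, new_start = blocks[i]
    if end > old_end:
        return None                           # read crosses a difference boundary
    shift = new_start - old_start
    return (start + shift, end + shift)
```

For example, with blocks `[(0, 100, 0), (100, 200, 103)]`, a read at `(120, 170)` lies entirely inside the second block and is shifted by three bases, while a read at `(90, 140)` spans the boundary at position 100 and is returned as `None` for re-alignment.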

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on direct measurements, not self-referential derivations

full rationale

The paper presents an engineering tool whose central claims (up to 27.4× wall-clock reduction versus full mapping, high GATK concordance) are obtained by running the implemented remapper on real read sets and comparing outputs to a baseline full-mapping run on identical hardware. No equations, fitted parameters, or predictions derived from the same data appear; the method description relies on explicit region detection and adjustment rather than any self-definitional or fitted-input construction. External validation via GATK supplies an independent benchmark. No self-citations are invoked as load-bearing uniqueness theorems. The similarity assumption noted by the skeptic is a scope limitation on applicability, not a circular reduction of the reported results to their inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the premise that reference genomes differ only locally and that these differences can be used to adjust alignments comprehensively.

axioms (1)
  • domain assumption: Reference genomes are sufficiently similar that remapping alignments is feasible without full re-alignment for the majority of reads.
    This premise is required for the speedup claim to hold and is implicit in the design of a remapping tool rather than a full-mapping tool.

