AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

Can Alkan; Can Firtina; Damla Senol Cali; Jeremie S. Kim; Meryem Banu Cavlak; Mohammed Alser; Nastaran Hajinazar; Onur Mutlu

arxiv: 1912.08735 · v5 · submitted 2019-12-18 · 🧬 q-bio.GN · cs.CE

Jeremie S. Kim, Can Firtina, Meryem Banu Cavlak, Damla Senol Cali, Mohammed Alser, Nastaran Hajinazar, Can Alkan, Onur Mutlu

Pith reviewed 2026-05-24 14:55 UTC · model grok-4.3

keywords: read remapping · reference genomes · alignments · variant calling · SNP · INDEL · genome analysis · bioinformatics

The pith

AirLift remaps read sets to new reference genomes up to 27.4 times faster than complete remapping while preserving variant accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AirLift as a technique to remap alignments of reads from one reference genome to a similar newer one. This approach avoids the need for full re-alignment from scratch each time a reference is updated. A sympathetic reader would care because frequent reference updates make re-analysis costly in time and compute. If the method works, researchers can update their variant calls and other analyses on the latest genome versions much more quickly. The validation shows it maintains accuracy for identifying SNPs and INDELs.

Core claim

AirLift is the first read remapping tool that enables users to quickly and comprehensively map a read set, that had been previously mapped to one reference genome, to another similar reference. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces the overall execution time to remap read sets between two reference genome versions by up to 27.4x. We validate our remapping results with GATK and find that AirLift provides high accuracy in identifying ground truth SNP/INDEL variants.

What carries the argument

The remapping technique that adjusts alignments only in regions where the two reference genomes differ, rather than re-aligning all reads from scratch.
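The idea of touching only the differing regions can be sketched as an interval-overlap test. The snippet below is a minimal illustration, not AirLift's actual procedure: the function name `split_reads`, the half-open coordinate convention, and the assumption that differing intervals are already known, sorted, and non-overlapping are all hypothetical choices made for the example.

```python
from bisect import bisect_right


def split_reads(reads, diff_intervals):
    """Partition reads by whether they touch a differing region.

    reads: iterable of (name, start, end) on the OLD reference, half-open.
    diff_intervals: sorted, non-overlapping (start, end) half-open intervals
        where the old and new references differ (assumed to be given).
    Returns (unchanged_names, needs_realign_names).
    """
    starts = [s for s, _ in diff_intervals]
    unchanged, needs_realign = [], []
    for name, start, end in reads:
        # Index of the last differing interval that begins before the read ends.
        i = bisect_right(starts, end - 1) - 1
        # Because intervals are sorted and disjoint, only that one can overlap.
        overlaps = i >= 0 and diff_intervals[i][1] > start
        (needs_realign if overlaps else unchanged).append(name)
    return unchanged, needs_realign
```

Under these assumptions, only the reads in the second list would be handed back to an aligner; the rest would keep their existing alignments, possibly with shifted coordinates.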

If this is right

Users can quickly run downstream analysis of read sets for each latest reference release.
Remapping execution time is reduced by up to 27.4x compared to full mapping.
High accuracy is maintained in identifying ground truth SNP/INDEL variants as validated by GATK.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Analyses on large genomic datasets could become more iterative, allowing frequent incorporation of updated references without prohibitive costs.
Similar adjustment strategies might extend to remapping in other sequencing technologies or between assemblies if similarity holds.
Laboratories with limited compute resources could perform more variant calling studies on updated genomes.

Load-bearing premise

The two reference genomes must be similar enough that most alignments can be adjusted by handling only the differing regions without missing or incorrectly remapping a substantial fraction of reads.
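Why the premise matters for the speedup can be illustrated with a simple two-cost model. This model is an assumption introduced here for illustration and does not appear in the paper: it supposes that a fraction of reads must be fully realigned, while the remaining reads are adjusted at some cheaper per-read cost. The names `modeled_speedup`, `realign_fraction`, and `lift_cost_ratio` are hypothetical.

```python
def modeled_speedup(realign_fraction, lift_cost_ratio):
    """Speedup over full mapping under an assumed two-cost model.

    realign_fraction: fraction of reads that must be realigned (0..1).
    lift_cost_ratio: per-read cost of adjusting an existing alignment,
        relative to the per-read cost of a full alignment (>= 0).
    """
    if not 0.0 <= realign_fraction <= 1.0:
        raise ValueError("realign_fraction must be in [0, 1]")
    if lift_cost_ratio < 0.0:
        raise ValueError("lift_cost_ratio must be non-negative")
    # Time relative to full mapping (full mapping costs 1.0 per read).
    relative_time = realign_fraction + (1.0 - realign_fraction) * lift_cost_ratio
    return float("inf") if relative_time == 0.0 else 1.0 / relative_time
```

In this toy model the speedup shrinks toward 1 as the realigned fraction approaches 1, which is the sense in which dissimilar references would erode the reported gains.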

What would settle it

Running AirLift and full remapping on read sets between two dissimilar reference genomes and observing a large drop in variant calling accuracy or many unmapped reads in the AirLift output.
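One way such a comparison could be scored is by precision and recall of called variants against a truth set. The sketch below assumes variants are represented as `(chrom, pos, ref, alt)` tuples and matched exactly; the function name `concordance` and this matching rule are illustrative assumptions, not the paper's evaluation procedure.

```python
def concordance(truth, called):
    """Precision and recall of called variants against a truth set.

    truth, called: iterables of hashable variant keys, here assumed to be
        (chrom, pos, ref, alt) tuples compared by exact equality.
    Returns (precision, recall); an empty denominator yields 0.0.
    """
    truth, called = set(truth), set(called)
    true_positives = len(truth & called)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

Running this once on calls from the remapped alignments and once on calls from a full mapping would show whether accuracy drops for a given reference pair.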

Figures

Figures reproduced from arXiv: 1912.08735 by Can Alkan, Can Firtina, Damla Senol Cali, Jeremie S. Kim, Meryem Banu Cavlak, Mohammed Alser, Nastaran Hajinazar, Onur Mutlu.

**Figure 2.** An example pair of reference genomes (old and new) with regions …

**Figure 3.** AirLift uses eight key steps to identify and label regions in the old and …

**Figure 4.** Using AirLift to remap a read set. AirLift remaps each read differently …

**Figure 5.** AirLift execution time results. We show the execution time (log-scale y-axis) of running three remapping tools, CrossMap (blue), AirLift (orange), and LiftOver (green) on a read set to a new reference genome against the baseline (red) of fully mapping a read set to the new reference genome. We plot the execution times of each tool for various pairs of reference genomes (x-axis; where the old reference is …

**Figure 6.** AirLift memory usage results. Peak memory usage results for each of the remapping tools during remapping …

Original abstract

AirLift is the first read remapping tool that enables users to quickly and comprehensively map a read set, that had been previously mapped to one reference genome, to another similar reference. Users can then quickly run a downstream analysis of read sets for each latest reference release. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces the overall execution time to remap read sets between two reference genome versions by up to 27.4x. We validate our remapping results with GATK and find that AirLift provides high accuracy in identifying ground truth SNP/INDEL variants. AirLift source code and readme describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AirLift targets remapping by re-aligning only in differing intervals between references, delivering reported speedups of up to 27.4x on close human pairs with GATK validation, but the gains hinge on differences staying small and localized.

The main takeaway is that this paper introduces a remapping tool that avoids full re-alignment by detecting changed intervals between two references and only adjusting alignments inside those intervals. On hg19 to hg38 it reports up to 27.4x wall-clock reduction versus starting over, with variant calls that match GATK ground truth at high rates. The code is on GitHub, which helps with checking the claims directly.

Referee Report

2 major / 2 minor

Summary. The manuscript presents AirLift, a read remapping tool that adjusts existing alignments between two similar reference genomes by focusing on differing intervals rather than performing full re-alignment. It claims up to 27.4x reduction in wall-clock time versus full mapping on human references (hg19↔hg38) and high accuracy in recovering ground-truth SNP/INDEL calls when downstream analysis is performed with GATK.

Significance. If the performance and accuracy results hold, the work would be useful for genomics pipelines that must periodically re-analyze large read sets against updated references; the open-source release and reproduction instructions are a concrete strength that supports reproducibility.

major comments (2)

[Abstract and method description] The central speedup (up to 27.4×) and GATK concordance claims rest on the unquantified assumption that reference differences are localized and small enough that the fraction of reads requiring de-novo placement or crossing unhandled structural events remains negligible. The manuscript reports results only on hg19↔hg38 pairs whose differences are mostly small indels/SNVs but supplies no bound on tolerable inversion size or structural variation fraction; this directly affects both the reported execution-time reduction and variant-calling fidelity.
[Results / validation section] Validation experiments cite GATK concordance but do not report the precise read-set sizes, coverage depths, or the handling of reads whose correct placement spans difference boundaries; without these details the claim that accuracy remains “high” cannot be assessed for generalizability beyond the tested human pairs.

minor comments (2)

[Introduction] Define the term “similar reference” quantitatively (e.g., maximum allowed structural-event size) in the introduction.
[Methods] Add a short algorithmic outline or pseudocode for the interval-adjustment procedure to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point-by-point to the major comments below.

Referee: [Abstract and method description] The central speedup (up to 27.4×) and GATK concordance claims rest on the unquantified assumption that reference differences are localized and small enough that the fraction of reads requiring de-novo placement or crossing unhandled structural events remains negligible. The manuscript reports results only on hg19↔hg38 pairs whose differences are mostly small indels/SNVs but supplies no bound on tolerable inversion size or structural variation fraction; this directly affects both the reported execution-time reduction and variant-calling fidelity.

Authors: AirLift targets similar reference genomes whose differences are localized (primarily small indels and SNVs), as exemplified by the hg19–hg38 pair. The algorithm identifies differing intervals and only remaps reads overlapping those intervals or their immediate vicinity; reads outside differing intervals retain their original placements. We do not provide a quantitative bound on inversion size or SV fraction because the manuscript evaluates the specific case of human reference updates. We will add an explicit limitations paragraph stating the assumption of localized differences and noting that large structural events would require separate handling or full re-mapping, thereby clarifying the scope of the reported speedup and accuracy.

Revision: partial

Referee: [Results / validation section] Validation experiments cite GATK concordance but do not report the precise read-set sizes, coverage depths, or the handling of reads whose correct placement spans difference boundaries; without these details the claim that accuracy remains “high” cannot be assessed for generalizability beyond the tested human pairs.

Authors: We will revise the results and methods sections to report the exact read-set sizes, sequencing coverage depths, and the precise rule used when a read’s correct placement spans a difference boundary (such reads are extracted and re-mapped de novo by the underlying aligner). These additions will make the experimental conditions fully reproducible and allow readers to judge generalizability.

Revision: yes
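The boundary rule described in this response can be sketched as a coordinate lookup over blocks that are identical in both references. This is an illustrative sketch only: the block table format `(old_start, old_end, new_start)`, the half-open coordinates, and the names `make_lifter` and `lift` are assumptions for the example, not AirLift's data structures.

```python
from bisect import bisect_right


def make_lifter(blocks):
    """Build a function that shifts read coordinates from old to new reference.

    blocks: sorted, non-overlapping (old_start, old_end, new_start) tuples,
        half-open on the old reference, each describing a stretch that is
        identical in both references (assumed to be precomputed).
    """
    starts = [b[0] for b in blocks]

    def lift(start, end):
        """Return (new_start, new_end), or None if the read is not fully
        contained in one identical block and so needs de novo remapping."""
        i = bisect_right(starts, start) - 1
        if i < 0:
            return None
        old_start, old_end, new_start = blocks[i]
        if end > old_end:
            # Read spans a difference boundary (or lies outside the block).
            return None
        offset = new_start - old_start
        return start + offset, end + offset

    return lift
```

Reads for which `lift` returns `None` correspond to the boundary-spanning case above and would be passed to the underlying aligner.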

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on direct measurements, not self-referential derivations

The paper presents an engineering tool whose central claims (up to 27.4× wall-clock reduction versus full mapping, high GATK concordance) are obtained by running the implemented remapper on real read sets and comparing outputs to a baseline full-mapping run on identical hardware. No equations, fitted parameters, or predictions derived from the same data appear; the method description relies on explicit region detection and adjustment rather than any self-definitional or fitted-input construction. External validation via GATK supplies an independent benchmark. No self-citations are invoked as load-bearing uniqueness theorems. The similarity assumption noted by the referee is a scope limitation on applicability, not a circular reduction of the reported results to their inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the premise that reference genomes differ only locally and that these differences can be used to adjust alignments comprehensively.

axioms (1)

Domain assumption: Reference genomes are sufficiently similar that remapping alignments is feasible without full re-alignment for the majority of reads.
This premise is required for the speedup claim to hold and is implicit in the design of a tool that remaps existing alignments rather than mapping every read from scratch.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

[1] Mallick, S., Li, H., Lipson, M., Mathieson, I., Gymrek, M., Racimo, F., Zhao, M., Chennagiri, N., Nordenfelt, S., Tandon, A., et al.: The Simons Genome Diversity Project: 300 Genomes from 142 Diverse Populations. Nature 538(7624), 201 (2016)

[2] Sherman, R.M., Forman, J., Antonescu, V., Puiu, D., Daya, M., Rafaels, N., Boorgula, M.P., Chavan, S., Vergara, C., Ortega, V.E., et al.: Assembly of a Pan-genome from Deep Sequencing of 910 Humans of African Descent. Nature Genetics 51(1), 30 (2019)

[3] Ma, X., Shao, Y., Tian, L., Flasch, D.A., Mulder, H.L., Edmonson, M.N., Liu, Y., Chen, X., Newman, S., Nakitandwe, J., et al.: Analysis of Error Profiles in Deep Next-Generation Sequencing Data. Genome Biology 20(1), 50 (2019)

[4] Alkan, C., Sajjadian, S., Eichler, E.E.: Limitations of Next-Generation Genome Sequence Assembly. Nature Methods 8(1), 61 (2011)

[5] Steinberg, K.M., Schneider, V.A., Alkan, C., Montague, M.J., Warren, W.C., Church, D.M., Wilson, R.K.: Building and Improving Reference Genome Assemblies. Proceedings of the IEEE 105(3), 422–435 (2017)

[6] RefSeq Curation and Annotation of the Human Reference Genome. https://www.ncbi.nlm.nih.gov/refseq/about/human/

[7] Genome Reference Consortium Introduction to Patches. https://www.ncbi.nlm.nih.gov/grc/help/patches/#frequency

[8] Miga, K.H., Koren, S., Rhie, A., Vollger, M.R., Gershman, A., Bzikadze, A., Brooks, S., Howe, E., Porubsky, D., Logsdon, G.A., et al.: Telomere-to-Telomere Assembly of a Complete Human X Chromosome. Nature (2020)

[9] Guo, Y., Dai, Y., Yu, H., Zhao, S., Samuels, D.C., Shyr, Y.: Improvements and Impacts of GRCh38 Human Reference on High Throughput Sequencing Data Analysis. Genomics 109(2), 83–90 (2017)

[10] 1000 Genomes Project Consortium: A Global Reference for Human Genetic Variation. Nature 526(7571), 68 (2015)

[11] Zheng-Bradley, X., Streeter, I., Fairley, S., Richardson, D., Clarke, L., Flicek, P., Consortium, .G.P.: Alignment of 1000 Genomes Project Reads to Reference Assembly GRCh38. GigaScience 6(7), 1–8 (2017)

[12] Ruffalo, M., LaFramboise, T., Koyuturk, M.: Comparative Analysis of Algorithms for Next-Generation Sequencing Read Alignment. Bioinformatics 27(20), 2790–2796 (2011). doi:10.1093/bioinformatics/btr477

[13] Canzar, S., Salzberg, S.L.: Short Read Mapping: An Algorithmic Tour. Proceedings of the IEEE 105(3), 436–458 (2015)

[14] Alser, M., Rotman, J., Taraszka, K., Shi, H., Baykal, P.I., Yang, H.T., Xue, V., Knyazev, S., Singer, B.D., Balliu, B., et al.: Technology Dictates Algorithms: Recent Developments in Read Alignment. arXiv preprint arXiv:2003.00110 (2020)

[15] Alser, M., Bingöl, Z., Cali, D.S., Kim, J., Ghose, S., Alkan, C., Mutlu, O.: Accelerating Genome Analysis: A Primer on an Ongoing Journey. IEEE Micro (2020)

[16] Broad Communications: Broad Institute Sequences Its 100,000th Whole Human Genome on National DNA Day. https://www.broadinstitute.org/news/broad-institute-sequences-its-100000th-whole-human-genome-national-dna-day

[17] Ulrich, T.: Harnessing the Flood: Scaling up Data Science in the Big Genomics Era. https://www.broadinstitute.org/blog/harnessing-flood-scaling-data-science-big-genomics-era

[18] Senol Cali, D., Kim, J.S., Ghose, S., Alkan, C., Mutlu, O.: Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions. Briefings in Bioinformatics 20(4), 1542–1559 (2019)

[19] Al-Mssallem, I.S., Hu, S., Zhang, X., Lin, Q., Liu, W., Tan, J., Yu, X., Liu, J., Pan, L., Zhang, T., et al.: Genome Sequence of the Date Palm Phoenix dactylifera L. Nature Communications 4, 2274 (2013)

[20] Xu, P., Zhang, X., Wang, X., Li, J., Liu, G., Kuang, Y., Xu, J., Zheng, X., Ren, L., Wang, G., et al.: Genome Sequence and Genetic Diversity of the Common Carp, Cyprinus carpio. Nature Genetics 46(11), 1212 (2014)

[21] Ahn, S.-M., Kim, T.-H., Lee, S., Kim, D., Ghang, H., Kim, D.-S., Kim, B.-C., Kim, S.-Y., Kim, W.-Y., Kim, C., et al.: The First Korean Genome Sequence and Analysis: Full Genome Sequencing for a Socio-ethnic Group. Genome Research 19(9), 1622–1629 (2009)

[22] Wang, J., Wang, W., Li, R., Li, Y., Tian, G., Goodman, L., Fan, W., Zhang, J., Li, J., Zhang, J., et al.: The Diploid Genome Sequence of an Asian Individual. Nature 456(7218), 60 (2008)

[23] Schuster, S.C., Miller, W., Ratan, A., Tomsho, L.P., Giardine, B., Kasson, L.R., Harris, R.S., Petersen, D.C., Zhao, F., Qi, J., et al.: Complete Khoisan and Bantu Genomes from Southern Africa. Nature 463(7283), 943 (2010)

[24] Huang, T., Shu, Y., Cai, Y.-D.: Genetic Differences among Ethnic Groups. BMC Genomics 16(1), 1093 (2015)

[25] Shukla, H.G., Bawa, P.S., Srinivasan, S.: hg19KIndel: Ethnicity Normalized Human Reference Genome. BMC Genomics 20(1), 459 (2019)

[26] UCSC: UCSC LiftOver: Lift Genome Annotations. https://genome.ucsc.edu/cgi-bin/hgLiftOver

[27] Zhao, H., Sun, Z., Wang, J., Huang, H., Kocher, J.-P., Wang, L.: CrossMap: Convert Genome Coordinates Between Assemblies. http://crossmap.sourceforge.net/#use-pip-to-install-crossmap

[28] Gao, B.: Segment Liftover. https://pypi.org/project/segment-liftover/

[29] Gao, B., Huang, Q., Baudis, M.: Segment Liftover: A Python Tool to Convert Segments Between Genome Assemblies. F1000Research 7 (2018)

[30] Zhao, H., Sun, Z., Wang, J., Huang, H., Kocher, J.-P., Wang, L.: CrossMap: A Versatile Tool for Coordinate Conversion Between Genome Assemblies. Bioinformatics 30(7), 1006–1007 (2013)

[31] NCBI: NCBI Genome Remapping Service. https://www.ncbi.nlm.nih.gov/genome/tools/remap

[32] The Galaxy Team: Galaxy. https://www.usegalaxy.org

[33] Tretyakov, K.: PyLiftover. https://pypi.org/project/pyliftover/

[34] SAM/BAM and related specifications. http://samtools.github.io/hts-specs/

[35] Li, H.: Aligning Sequence Reads, Clone Sequences and Assembly Contigs with BWA-MEM. arXiv:1303.3997 (2013)

[36] McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., DePristo, M.A.: The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20(9), 1297–1303 (2010). doi:10.1101/gr.107524.110

[37] UCSC: Blat Suite Program Specifications and User Guide. https://genome.ucsc.edu/goldenPath/help/blatSpec.html

[38] Auwera, G.A., Carneiro, M.O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., Jordan, T., Shakir, K., Roazen, D., Thibault, J., Banks, E., Garimella, K.V., Altshuler, D., Gabriel, S., DePristo, M.A.: From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Current Protocols in Bioinformatics 43(1) (2013). doi:10.1002/0471250953.bi1110s43

[39] Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., et al.: The variant call format and vcftools. Bioinformatics 27(15), 2156–2158 (2011)

[40] Eberle, M.A., Fritzilas, E., Krusche, P., Källberg, M., Moore, B.L., Bekritsky, M.A., Iqbal, Z., Chuang, H.-Y., Humphray, S.J., Halpern, A.L., Kruglyak, S., Margulies, E.H., McVean, G., Bentley, D.R.: A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Research 27(1), 157–164 (2017). doi:10.1101/gr.210500.116

[41] Zook, J.M., Chapman, B., Wang, J., Mittelman, D., Hofmann, O., Hide, W., Salit, M.: Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nature Biotechnology (2014). doi:10.1038/nbt.2835

[42] Firtina, C., Alkan, C.: On genomic repeats and reproducibility. Bioinformatics 32(15), 2243–2247 (2016). doi:10.1093/bioinformatics/btw139
