arxiv: 2112.07953 · v2 · submitted 2021-12-15 · 🧬 q-bio.GN

Learning the statistics and landscape of somatic mutation-induced insertions and deletions in antibodies

Pith reviewed 2026-05-24 12:26 UTC · model grok-4.3

classification 🧬 q-bio.GN
keywords somatic hypermutation · indels · antibody repertoire · geometric distribution · affinity maturation · probabilistic inference · immunoglobulin heavy chain

The pith

The lengths of insertions and deletions during antibody affinity maturation follow a geometric distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a probabilistic inference method to extract statistics of insertions and deletions from antibody repertoire sequencing data while accounting for variable mutational loads across sequences. This approach avoids biases in standard annotation tools. Applied to a large set of human immunoglobulin heavy chains, the model identifies distinct hotspots for insertions versus deletions. The central result is that indel lengths follow a geometric distribution, which directly limits the possible mechanisms that can generate these mutations during somatic hypermutation.

Core claim

We present a probabilistic inference tool that learns the statistics of indels from repertoire sequencing data, which overcomes the pitfalls and biases of standard annotation methods. The model includes antibody-specific maturation ages to account for variable mutational loads in the repertoire. After validation on synthetic data, application to human immunoglobulin heavy chains reveals distinct insertion and deletion hotspots and shows that the distribution of lengths of indels follows a geometric distribution.
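To make the geometric claim concrete, here is a minimal sketch of what fitting a geometric length distribution looks like. It assumes lengths start at 1 and uses a single parameter `p` (the probability of stopping after each base); this parameterization and the function names are illustrative assumptions, not the paper's model, which also handles alignment uncertainty and maturation ages.

```python
import math
import random

def geometric_pmf(length, p):
    """P(L = length) for L in {1, 2, ...} with stopping probability p."""
    return p * (1.0 - p) ** (length - 1)

def fit_geometric(lengths):
    """Maximum-likelihood estimate of p: the reciprocal of the mean length."""
    return len(lengths) / sum(lengths)

def sample_geometric(p, rng):
    """Draw one length by inverse-transform sampling."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return 1 + int(math.log(u) / math.log(1.0 - p))

# Hypothetical usage: simulate lengths, then recover p from them.
rng = random.Random(0)
simulated = [sample_geometric(0.3, rng) for _ in range(20000)]
p_hat = fit_geometric(simulated)
```

Under this assumption a single number summarizes the whole length distribution, which is why the result constrains mechanistic models so tightly.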

What carries the argument

A probabilistic inference tool that incorporates antibody-specific maturation ages to infer indel statistics directly from sequencing data.

If this is right

  • Mechanistic models of somatic hypermutation must produce geometric length distributions for indels.
  • Insertion and deletion events occur at distinct sequence hotspots in heavy chains.
  • Universal statistical features of indels exist across human heavy chain repertoires.
  • The inferred model can be used to annotate indels in new sequencing datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The geometric length distribution may point to a memoryless process in how DNA segments are added or removed during hypermutation.
  • The same inference approach could be tested on light chains or on data from other species to check for conserved features.
  • If the geometric property holds, it simplifies simulation of antibody sequence diversity in computational immunology.
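The memoryless reading in the first bullet rests on a standard property of the geometric distribution, worked out below. The symbols $L$ (indel length, starting at 1) and $p$ (probability of stopping after each base) are notation assumed here for illustration, not taken from the paper.

```latex
\begin{align}
P(L = \ell) &= p\,(1-p)^{\ell-1}, \qquad \ell = 1, 2, \dots \\
P(L > n) &= \sum_{\ell = n+1}^{\infty} p\,(1-p)^{\ell-1} = (1-p)^{n} \\
P(L > m + n \mid L > m) &= \frac{(1-p)^{m+n}}{(1-p)^{m}} = (1-p)^{n} = P(L > n)
\end{align}
```

In words: having already extended by $m$ bases tells you nothing about how many more will follow. Whether the underlying biology actually works this way is the editorial speculation, not something the derivation establishes.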

Load-bearing premise

The probabilistic inference tool overcomes the pitfalls and biases of standard annotation methods without introducing comparable new biases of its own.

What would settle it

New repertoire sequencing data in which the length histogram of indels deviates significantly from a geometric distribution after the same inference procedure.
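One way such a deviation could be quantified is a Pearson chi-square comparison between the observed length histogram and a fitted geometric. The sketch below is an assumption about how a reader might run that check, not the paper's procedure; the function name, the `max_bin` pooling threshold, and the length-starts-at-1 convention are all hypothetical.

```python
from collections import Counter

def chi_square_vs_geometric(lengths, max_bin=10):
    """Pearson chi-square statistic of observed lengths against a fitted geometric.

    Lengths >= max_bin are pooled into one tail bin. Returns (statistic, dof),
    with dof = number of bins - 1 - 1 fitted parameter. Assumes the lengths are
    not all equal to 1 (otherwise some expected counts would be zero).
    """
    n = len(lengths)
    p = n / sum(lengths)  # maximum-likelihood fit, lengths start at 1
    counts = Counter(min(length, max_bin) for length in lengths)
    stat = 0.0
    for k in range(1, max_bin + 1):
        if k < max_bin:
            expected = n * p * (1.0 - p) ** (k - 1)
        else:
            expected = n * (1.0 - p) ** (max_bin - 1)  # P(L >= max_bin)
        stat += (counts.get(k, 0) - expected) ** 2 / expected
    return stat, max_bin - 2
```

The statistic would then be compared with a chi-square critical value at the returned degrees of freedom. Because `p` is fitted from the unpooled lengths, that reference distribution is only approximate, and a real analysis would also need to propagate the uncertainty of the indel inference itself.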

Figures

Figures reproduced from arXiv: 2112.07953 by Aleksandra M. Walczak, Cosimo Lupo, Natanael Spisak, Thierry Mora.

[Figures 1 to 4: images not reproduced here; see source.]

Original abstract

Affinity maturation is crucial for improving the binding affinity of antibodies to antigens. This process is mainly driven by point substitutions caused by somatic hypermutations of the immunoglobulin gene. It also includes deletions and insertions of genomic material known as indels. While the landscape of point substitutions has been extensively studied, a detailed statistical description of indels is still lacking. Here we present a probabilistic inference tool to learn the statistics of indels from repertoire sequencing data, which overcomes the pitfalls and biases of standard annotation methods. The model includes antibody-specific maturation ages to account for variable mutational loads in the repertoire. After validation on synthetic data, we applied our tool to a large dataset of human immunoglobulin heavy chains. The inferred model allows us to identify universal statistical features of indels in heavy chains. We report distinct insertion and deletion hotspots, and show that the distribution of lengths of indels follows a geometric distribution, which puts constraints on future mechanistic models of the hypermutation process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents a probabilistic inference tool to extract statistics of somatic hypermutation-induced insertions and deletions (indels) from antibody repertoire sequencing data. The model incorporates antibody-specific maturation ages to account for variable mutational loads. After validation on synthetic data, the tool is applied to a large set of human immunoglobulin heavy chain sequences, identifying distinct insertion and deletion hotspots and reporting that indel length distributions follow a geometric distribution, which constrains mechanistic models of hypermutation.

Significance. If the inference tool recovers true indel length statistics without introducing new length-dependent biases, the geometric distribution result would supply an important empirical constraint on hypermutation mechanisms, addressing a gap relative to the extensively characterized point-mutation landscape.

major comments (1)
  1. [Abstract] Validation on synthetic data is stated without quantitative performance metrics, error analysis, or comparisons to baselines. This is load-bearing for the central claim that indel lengths follow a geometric distribution, because the result depends on the tool correctly extracting length statistics from real data after overcoming annotation biases.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and will revise the manuscript to improve the presentation of our synthetic validation results.

Point-by-point responses
  1. Referee: [Abstract] Validation on synthetic data is stated without quantitative performance metrics, error analysis, or comparisons to baselines. This is load-bearing for the central claim that indel lengths follow a geometric distribution, because the result depends on the tool correctly extracting length statistics from real data after overcoming annotation biases.

    Authors: We agree that the abstract would be strengthened by including quantitative performance metrics, error analysis, and baseline comparisons from the synthetic validation, as these details support the reliability of the inferred geometric length distributions. The full manuscript provides these in the Methods (model validation procedure) and Results (recovery accuracy, length-dependent bias quantification, and comparisons to standard annotation pipelines) sections, including metrics such as precision/recall on simulated indels of varying lengths and error bars across replicate simulations. However, the abstract currently summarizes this only qualitatively. We will revise the abstract to incorporate key quantitative results (e.g., overall recovery rate of indel lengths and reduction in annotation bias relative to baselines) while remaining within length limits. This addresses the load-bearing concern without altering the central claims.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces a probabilistic inference tool that incorporates antibody-specific maturation ages, validates it on synthetic data, and applies the tool to real human IgH repertoire data to extract indel statistics. The reported geometric length distribution is presented as an output of this inference on the empirical data rather than an input assumption or a quantity fitted by construction. No equations, self-citations, or ansatzes are shown in the provided text that reduce the central claim to a renaming, a prior, or a self-referential definition. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that a new probabilistic model can be learned from repertoire data while correctly handling annotation biases and variable mutational loads; no free parameters, axioms, or invented entities are specified in the abstract.

pith-pipeline@v0.9.0 · 5705 in / 1039 out tokens · 22079 ms · 2026-05-24T12:26:51.396884+00:00 · methodology
