Agglomerative Attention

Matthew Spellings

arXiv: 1907.06607 · v1 · pith:RZRKV4KSnew · submitted 2019-07-15 · cs.LG · stat.ML

Pith reviewed 2026-05-24 21:28 UTC · model grok-4.3

keywords: attention mechanism, transformer, linear scaling, language modeling, sequence modeling, neural network architecture

The pith

Agglomerative attention reduces memory and computation to linear scaling while reaching performance comparable to full attention on language modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an attention mechanism for sequence modeling networks that requires only linear memory and computation time, rather than the quadratic cost of computing attention over all pairs of elements. It shows that networks built with this mechanism reach performance comparable to standard full-attention transformers on language modeling tasks. The result matters because quadratic scaling has limited the size of trainable transformer models, so a linear alternative could support longer sequences or larger models within the same resources. The work centers on showing empirically that the simplified structure still passes sufficient contextual information among sequence positions.

Core claim

The paper introduces agglomerative attention, an attention layer that operates with linear requirements in both memory and computation time. Despite the simpler structure, neural networks that use this layer attain performance comparable to networks that employ full pairwise attention when trained and evaluated on language modeling tasks.

What carries the argument

Agglomerative attention, a linear-time attention structure that aggregates contextual information across sequence elements without exhaustive pairwise comparisons.
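
The abstract does not spell out the agglomeration rule, so the following is only a minimal sketch of how an aggregation-based attention can avoid pairwise comparisons, not the paper's formulation. It assumes each element is softly assigned to a small fixed number `k` of classes, that one summary vector is built per class, and that each query then attends over the `k` summaries instead of over all `n` elements. The function name `aggregated_attention`, the `class_logits` input, and the use of the summaries as both keys and values are hypothetical choices made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def aggregated_attention(queries, values, class_logits):
    """Illustrative linear-cost attention (an assumption, not the paper's rule).

    queries:      n lists of length d
    values:       n lists of length d
    class_logits: n lists of length k (hypothetical per-element class scores)
    Returns n output vectors of length d.
    """
    n, d = len(values), len(values[0])
    k = len(class_logits[0])

    # Step 1: softly assign every element to the k classes. Cost: O(n * k).
    assign = [softmax(row) for row in class_logits]

    # Step 2: build one summary per class as a weighted mean of the values.
    # Cost: O(n * k * d). No element is ever compared with another element.
    summaries = [[0.0] * d for _ in range(k)]
    mass = [0.0] * k
    for i in range(n):
        for c in range(k):
            w = assign[i][c]
            mass[c] += w
            for j in range(d):
                summaries[c][j] += w * values[i][j]
    summaries = [[s / mass[c] for s in summaries[c]] for c in range(k)]

    # Step 3: each query attends over the k summaries only. Cost: O(n * k * d).
    scale = math.sqrt(d)
    outputs = []
    for q in queries:
        attn = softmax([dot(q, s) / scale for s in summaries])
        outputs.append(
            [sum(attn[c] * summaries[c][j] for c in range(k)) for j in range(d)]
        )
    return outputs
```

With `k` held fixed, the work in every step grows in proportion to the sequence length `n`. The sketch is not causal: a language model would also have to restrict each summary to earlier positions, which is omitted here.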

If this is right

Sequence lengths can increase without a quadratic explosion in memory or compute.
Transformer-style models can be trained at larger scale under fixed hardware budgets.
The same linear mechanism can be substituted into existing attention-based architectures for language tasks.
Contextual exchange remains sufficient to support next-token prediction at a quality comparable to full attention.
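
As a rough illustration of the first point, the snippet below counts score evaluations under two assumptions: full attention compares every pair of positions, and a hypothetical linear scheme lets each position interact with a fixed number `k` of summary slots (once to write into them, once to read from them). Both function names and the value `k=8` are illustrative and not taken from the paper.

```python
def full_attention_scores(n):
    """Score evaluations when every position attends to every position."""
    return n * n

def aggregated_attention_scores(n, k):
    """Score evaluations for the hypothetical linear scheme:
    n * k to assign positions to slots, plus n * k to read the slots back."""
    return 2 * n * k

# Quadrupling the sequence length multiplies the full-attention count by 16
# but the aggregated count only by 4.
for n in (512, 2048, 8192):
    print(n, full_attention_scores(n), aggregated_attention_scores(n, k=8))
```

The absolute numbers depend on constants this sketch ignores (feature dimension, number of heads); only the growth rates are meant to carry over.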

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may transfer to other sequence domains such as audio or time-series data where quadratic attention is also a bottleneck.
If the linear structure preserves long-range dependencies, it could support deeper stacks of layers within the same compute envelope.
Direct measurement of information flow across distant positions would test whether the aggregation step loses critical signals that full attention retains.

Load-bearing premise

The simplified linear attention structure still exchanges enough contextual information to match the modeling power of full pairwise attention on the target tasks.

What would settle it

A controlled language-modeling experiment in which the agglomerative-attention network produces perplexity or accuracy more than a few percent worse than an otherwise identical full-attention baseline.
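
A minimal sketch of how such a comparison could be scored, assuming per-token negative log-likelihoods (in nats) are available for both models on the same test set. The 3% tolerance stands in for "a few percent" and is an arbitrary illustrative choice; the helper names are hypothetical.

```python
import math

def perplexity(token_nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_gap(candidate_ppl, baseline_ppl):
    """Fractional amount by which the candidate is worse than the baseline.
    Positive means the candidate has higher (worse) perplexity."""
    return (candidate_ppl - baseline_ppl) / baseline_ppl

def is_comparable(candidate_ppl, baseline_ppl, tolerance=0.03):
    """True if the candidate is at most `tolerance` worse than the baseline.
    The default tolerance is an assumption, not a threshold from the paper."""
    return relative_gap(candidate_ppl, baseline_ppl) <= tolerance
```

A fair test would also hold model size, data, and training budget fixed across the two networks, and average over several training runs before applying the threshold.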

Figures

Figures reproduced from arXiv: 1907.06607 by Matthew Spellings.

**Figure 2.** Average training (solid lines) and vali…

**Figure 3.** Average training (solid) and validation (dashed) set loss and perplexity of word-level language…

Original abstract

Neural networks using transformer-based architectures have recently demonstrated great power and flexibility in modeling sequences of many types. One of the core components of transformer networks is the attention layer, which allows contextual information to be exchanged among sequence elements. While many of the prevalent network structures thus far have utilized full attention -- which operates on all pairs of sequence elements -- the quadratic scaling of this attention mechanism significantly constrains the size of models that can be trained. In this work, we present an attention model that has only linear requirements in memory and computation time. We show that, despite the simpler attention model, networks using this attention mechanism can attain comparable performance to full attention networks on language modeling tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims a linear-complexity attention variant called agglomerative attention matches full attention on language modeling, but the abstract gives no experiments, baselines, or method details to support it.

The main point is that this work proposes agglomerative attention as a linear-time and linear-memory replacement for standard pairwise attention in transformers, with the assertion that it still delivers comparable results on language modeling tasks. It correctly flags the quadratic scaling limit that constrains model size and context length. The framing of the problem is straightforward and on target for anyone dealing with long sequences.

What is new is the specific named mechanism, though the abstract does not spell out the agglomeration rule or show how it differs from other linear attention ideas already cited in the literature. The paper does a clean job of stating the scaling issue without overclaiming theoretical novelty.

The clear soft spot is the complete absence of supporting evidence. No results, no error bars, no implementation sketch, and no comparison to full attention or to existing linear alternatives appear in the text. This leaves the central empirical claim unevaluated and makes it impossible to check whether the simplified structure really passes enough context. The assumption that linear attention preserves modeling power is stated but not tested here. The citation pattern cannot be assessed from the abstract alone.

This kind of paper would interest researchers focused on efficient transformers and scaling sequence models, but only if the full version contains reproducible experiments on standard benchmarks. As presented, the lack of any data means it does not supply enough substance for a reading group or for serious referee time.

Referee Report

1 major / 0 minor

Summary. The paper proposes an 'agglomerative attention' mechanism with linear memory and computation requirements as an alternative to the quadratic full attention in transformer networks, and claims that networks using this mechanism attain comparable performance to full-attention networks on language modeling tasks.

Significance. If the empirical claim is substantiated with rigorous experiments, the work would address a key scalability bottleneck in attention-based models, potentially enabling longer sequences or larger models in sequence modeling.

major comments (1)

[Abstract] The central claim that agglomerative attention attains 'comparable performance' on language modeling is asserted without any quantitative results, baselines, error bars, dataset details, or implementation information, rendering the claim impossible to evaluate from the manuscript text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the single major comment below and will revise the manuscript accordingly.

Referee: [Abstract] The central claim that agglomerative attention attains 'comparable performance' on language modeling is asserted without any quantitative results, baselines, error bars, dataset details, or implementation information, rendering the claim impossible to evaluate from the manuscript text.

Authors: We agree that the abstract would be strengthened by including quantitative support for the central claim. Although the body of the manuscript reports specific experimental results (perplexity on standard language modeling benchmarks, direct comparisons to full-attention baselines, and implementation details), these are not summarized in the abstract. In the revised manuscript we will update the abstract to include key quantitative metrics, dataset names, and baseline comparisons so that the performance claim can be evaluated directly from the abstract.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper introduces an agglomerative (linear) attention mechanism and reports an empirical result: networks using it attain comparable language-modeling performance to full pairwise attention. No derivation chain, first-principles prediction, or fitted parameter is presented whose output is shown to reduce to its inputs by construction. The central claim is strictly experimental and externally falsifiable via benchmark comparisons; no self-definitional equations, self-citation load-bearing steps, or renamed known results appear in the provided abstract or claim structure. The reader's assessment of circularity score 0.0 is therefore confirmed.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no free parameters, background axioms, or additional invented entities are specified beyond the core proposal of the attention model itself.

invented entities (1)

Agglomerative attention (no independent evidence)
purpose: Provide linear memory and computation attention for sequence modeling
Introduced in the abstract to solve the quadratic scaling of full attention.

