Attention sinks emerge in language models from the inter-token dependence that softmax normalization induces among attention scores; they do not appear when using sigmoid attention without normalization, in models up to 1B parameters.
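A minimal sketch of the mechanism behind this claim (illustrative only, not code from the paper): softmax normalization couples every token's attention weight to all the other scores, whereas elementwise sigmoid scoring leaves each weight independent.

```python
import math

scores = [2.0, 1.0, 0.5]  # toy attention scores for three tokens

def softmax(xs):
    # Normalized weights: each weight depends on ALL scores via the shared sum.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid_attn(xs):
    # Unnormalized weights: each weight depends only on its own score.
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]

# Boost one token's score and observe the effect on an untouched token.
bumped = [scores[0] + 3.0] + scores[1:]

# Softmax couples weights: boosting token 0 shrinks token 1's weight.
assert softmax(bumped)[1] < softmax(scores)[1]
# Sigmoid weights are per-token: token 1's weight is unchanged.
assert sigmoid_attn(bumped)[1] == sigmoid_attn(scores)[1]
```

This coupling under softmax is the "token dependence on attention scores" the summary refers to; removing the normalization removes the dependence.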
What language model architecture and pretraining objective works best for zero-shot generalization? In International Conference on Machine Learning, pp. 22964–22984.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL
Year: 2024
Verdict: ACCEPT
Representative citing paper:
When Attention Sink Emerges in Language Models: An Empirical View