Linformer: Self-Attention with Linear Complexity
Pith reviewed 2026-05-12 00:29 UTC · model grok-4.3
The pith
Self-attention in transformers can be approximated by a low-rank matrix to reduce complexity to linear in sequence length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n²) to O(n) in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.
What carries the argument
Low-rank projection matrices applied to the key and value vectors before attention, which replace the full n-by-n matrix with a much smaller n-by-k matrix where k is fixed and far smaller than n.
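The shape bookkeeping behind this can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the random inputs, and the sizes n = 16, d = 4, k = 3 are all assumptions chosen for the example, and the per-head weight matrices of a real Transformer layer are omitted.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(scores):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def projected_attention(Q, K, V, E, F):
    """Single-head attention with K and V compressed along the sequence axis.

    Q, K, V are n x d; E and F are k x n projection matrices (hypothetical
    values here, where a trained model would supply its own).
    """
    d = len(Q[0])
    K_proj = matmul(E, K)                      # k x d
    V_proj = matmul(F, V)                      # k x d
    scores = matmul(Q, transpose(K_proj))      # n x k instead of n x n
    scale = 1.0 / math.sqrt(d)
    scores = [[s * scale for s in row] for row in scores]
    weights = softmax_rows(scores)             # n x k, each row sums to 1
    return matmul(weights, V_proj), weights    # output is n x d

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d, k = 16, 4, 3  # illustrative sizes only
Q, K, V = (random_matrix(n, d, rng) for _ in range(3))
E, F = (random_matrix(k, n, rng) for _ in range(2))

out, weights = projected_attention(Q, K, V, E, F)
print(len(weights), len(weights[0]))  # 16 3
print(len(out), len(out[0]))          # 16 4
```

The output keeps the same n-by-d shape as ordinary attention, so the projected layer can stand in for the original one; only the intermediate score matrix shrinks from n-by-n to n-by-k.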
Load-bearing premise
The low-rank projections, whether learned or fixed, retain enough information from the original attention scores for the model to succeed on the tasks and sequence lengths it will see.
What would settle it
If the Linformer shows a clear accuracy gap compared with the standard transformer on a task that uses sequences several times longer than those seen during training, the low-rank approximation would be shown insufficient.
Original abstract
Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses O(n²) time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n²) to O(n) in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that self-attention in Transformers can be approximated via low-rank projections on the key and value matrices (using fixed or learned E, F matrices of size k x n with k << n), reducing attention complexity from O(n²) to O(n) in time and space. The resulting Linformer model is shown to achieve competitive performance with standard Transformers on GLUE, WikiText-103, and machine translation benchmarks while being more memory- and time-efficient.
Significance. If the empirical claims hold under broader validation, this is a significant contribution to efficient sequence modeling. It offers a practical architectural change that preserves the core attention mechanism while delivering linear scaling, which is valuable for long-context applications. The algebraic correctness of the low-rank rewriting and the competitive numbers on public NLP benchmarks are strengths; the work provides a clear efficiency gain without requiring entirely new attention formulations.
major comments (3)
- [§3] §3 (Method), around the definition of the projected attention: the low-rank approximation is presented without any error bound or analysis showing how the approximation error depends on sequence length n, rank k, or the effective rank of the attention matrix. This is load-bearing for the central claim of retained performance, as the paper's own skeptic note and experiments are confined to fixed training lengths.
- [§4] §4 (Experiments), Tables 1-3 and associated text: no standard deviations or results across multiple random seeds are reported for the GLUE or MT scores, and there are no ablations on the choice of projection dimension k as a function of n or task. This makes the 'on par' claim difficult to assess rigorously and directly tests the weakest assumption about projection sufficiency.
- [§4.2] §4.2 and §5: all reported experiments use fixed sequence lengths matching the training regime; no results are provided for substantially longer sequences or domain shifts. This leaves untested whether the learned low-rank projections preserve the necessary subspace when the effective rank of attention grows with n.
minor comments (2)
- [Figure 1] Figure 1 and the surrounding text could include a small diagram explicitly showing the shapes of E and F and how they are applied to K and V.
- [§3.2] The complexity analysis in §3.2 would benefit from an explicit step-by-step derivation of the O(n) claim including the cost of the projections themselves.
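One way to make the step-by-step accounting requested in the last comment concrete is a small operation counter. The sketch below is an assumption-laden illustration rather than the paper's own derivation: it counts multiply-adds for the matrix products of a single head, includes the cost of applying E and F, ignores the softmax (whose work is proportional to the number of score entries), and uses hypothetical sizes d = 64 and k = 128.

```python
def standard_attention_cost(n, d):
    """Multiply-adds and score-matrix entries for one head of full attention."""
    mult_adds = n * n * d       # Q K^T : (n x d)(d x n)
    mult_adds += n * n * d      # P V   : (n x n)(n x d)
    return {"mult_adds": mult_adds, "score_entries": n * n}

def projected_attention_cost(n, d, k):
    """Same count when K and V are first projected by k x n matrices E and F."""
    mult_adds = 2 * k * n * d   # E K and F V : (k x n)(n x d), done twice
    mult_adds += n * k * d      # Q (E K)^T   : (n x d)(d x k)
    mult_adds += n * k * d      # P (F V)     : (n x k)(k x d)
    return {"mult_adds": mult_adds, "score_entries": n * k}

d, k = 64, 128  # hypothetical head dimension and projection dimension
for n in (512, 1024, 2048):
    full = standard_attention_cost(n, d)
    proj = projected_attention_cost(n, d, k)
    print(n, full["mult_adds"], proj["mult_adds"],
          full["score_entries"], proj["score_entries"])
```

Under this count the projected variant costs 4nkd multiply-adds against 2n²d for full attention, so with k and d held fixed, doubling n doubles the projected cost and quadruples the full one. The linear scaling depends on k not growing with n, which is the assumption the major comments question.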
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We provide point-by-point responses to the major comments below, indicating the revisions we intend to make.
-
Referee: [§3] §3 (Method), around the definition of the projected attention: the low-rank approximation is presented without any error bound or analysis showing how the approximation error depends on sequence length n, rank k, or the effective rank of the attention matrix. This is load-bearing for the central claim of retained performance, as the paper's own skeptic note and experiments are confined to fixed training lengths.
Authors: We appreciate this observation. While the manuscript does not include a formal error bound, we provide empirical analysis demonstrating that attention matrices exhibit low effective rank, justifying the projection (see the singular value plots in the paper). The performance retention is validated across multiple tasks. In revision, we will add further discussion on how the approximation error scales with k and n based on these observations, though a complete theoretical bound remains an open question for future work. revision: partial
-
Referee: [§4] §4 (Experiments), Tables 1-3 and associated text: no standard deviations or results across multiple random seeds are reported for the GLUE or MT scores, and there are no ablations on the choice of projection dimension k as a function of n or task. This makes the 'on par' claim difficult to assess rigorously and directly tests the weakest assumption about projection sufficiency.
Authors: We agree that multiple random seeds and ablations would enhance the rigor. Our reported results follow the single-run convention common for such large-scale experiments due to resource constraints. We will rerun key experiments with multiple seeds to report means and standard deviations, and include ablations on the projection dimension k for different n and tasks in the revised manuscript. revision: yes
-
Referee: [§4.2] §4.2 and §5: all reported experiments use fixed sequence lengths matching the training regime; no results are provided for substantially longer sequences or domain shifts. This leaves untested whether the learned low-rank projections preserve the necessary subspace when the effective rank of attention grows with n.
Authors: This point highlights an important aspect of generalization. The current experiments adhere to the standard fixed-length settings of the benchmarks. We will extend the evaluation in the revision to include tests with longer sequences and some domain shifts to verify that the learned projections maintain effectiveness when the attention rank increases with n. revision: yes
- Providing a formal error bound or complete theoretical analysis of the approximation error's dependence on n, k, and effective rank
Circularity Check
No significant circularity; architectural proposal with independent empirical validation
full rationale
The Linformer derivation proposes an explicit architectural change—projecting the key and value matrices via learned low-rank matrices E and F of size k x n (k << n)—to approximate the O(n^2) attention matrix with O(n) complexity. This is not obtained by fitting parameters to a target quantity and then renaming the fit as a prediction, nor by self-referential definitions or load-bearing self-citations. The low-rank property is motivated by empirical observation of attention matrices but the method itself is a constructive proposal whose performance is measured on held-out public benchmarks (e.g., GLUE, SQuAD) with standard training protocols. No equation reduces to its own input by construction, and the central claim retains independent content beyond any cited prior work.
Axiom & Free-Parameter Ledger
free parameters (1)
- projection dimension k
axioms (1)
- domain assumption The attention matrix admits a useful low-rank approximation for the tasks considered.
Forward citations
Cited by 60 Pith papers
-
Convergent Stochastic Training of Attention and Understanding LoRA
Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
-
Nearly Optimal Attention Coresets
ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
-
VMamba: Visual State Space Model
VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.
-
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
-
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
Derives a blockwise resolvent-style attention operator that exploits structured sparsity for subquadratic O(n^{4/3}d) entity tracking while matching dense accuracy.
-
Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging
PHAT-JeT combines geometric message-passing with hierarchical patch attention to reach state-of-the-art accuracy and background rejection among resource-constrained jet tagging models on four benchmarks.
-
ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection
ASAP amortizes Sinkhorn-based doubly-stochastic attention by learning a parametric map from 1D potentials to the Sinkhorn dual and reconstructing the plan via two-sided entropic c-transform, delivering 5.3x faster inf...
-
VORT: Adaptive Power-Law Memory for NLP Transformers
VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
-
Projection-Free Transformers via Gaussian Kernel Attention
Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.
-
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
-
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
-
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
-
Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis
HKT is a multi-scale attention architecture that bounds computation at 1.31x standard attention, proves kernel and decomposition properties, and reports accuracy gains on ListOps, sequential CIFAR-10, and character-le...
-
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.
-
Collapse-Free Prototype Readout Layer for Transformer Encoders
DDCL-Attention introduces a collapse-free prototype readout for transformers that decomposes the training loss exactly into reconstruction and diversity terms while providing stability guarantees via singular perturba...
-
The Volterra signature
The Volterra signature is a kernel-weighted tensor feature map for paths that is injective, universally approximating, and computable via linear ODEs or a two-parameter integral equation.
-
Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights
MiTA makes attention scalable by gathering query-aware top-k key-value pairs through landmarks as deformable routed experts and compressing the N-width fast-weight MLP into a shared narrower expert.
-
CoFrGeNet: Continued Fraction Architectures for Language Generation
CoFrGeNets implement a continued-fraction function class as plug-in replacements for transformer blocks, delivering competitive or superior downstream performance on GPT2-xl and Llama3-scale models with one-half to tw...
-
Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers
One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.
-
RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
RACE Attention is a strictly linear-time attention mechanism that approximates softmax attention outputs using Gaussian projections and soft LSH to enable training on contexts up to 12 million tokens.
-
Transformer Neural Processes - Kernel Regression
TNP-KR adds a kernel regression transformer block, kernel attention bias, scan attention for translation invariance, and deep kernel attention to achieve lower complexity and state-of-the-art results on meta-regressio...
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
-
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
-
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Perceiver IO is a general architecture that processes arbitrary structured inputs and outputs with linear scaling and achieves strong results on GLUE, Sintel optical flow, multi-task reasoning, and StarCraft II withou...
-
Rethinking Attention with Performers
Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and prote...
-
ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
ThriftAttention recovers 89.1% of the FP16 quality gap versus pure FP4 attention by running only 5% of query-key blocks in FP16 on long-context benchmarks.
-
Towards Understanding Self-Pretraining for Sequence Classification
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
-
RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers
RoPeSLR combines 3D RoPE-guided sparse attention with head-wise low-rank parameterization to achieve sub-quadratic complexity in DiTs while preserving distance awareness for efficient ultra-long video synthesis.
-
COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space
COAgents introduces a cooperative multi-agent system with a partial search graph to guide intensification and diversification in vehicle routing problems, achieving new state-of-the-art results among learning-based me...
-
Spectral Progressive Diffusion for Efficient Image and Video Generation
Spectral Progressive Diffusion accelerates image and video generation in pretrained diffusion models by progressively growing resolution along the denoising trajectory using spectral noise expansion and a power spectr...
-
ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices
ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency p...
-
SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
Nectar: Neural Estimation of Cached-Token Attention via Regression
Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.
-
Practical Wi-Fi-based Motion Recognition Under Variable Traffic Patterns
A sampling-rate-versatile transformer network with dynamic augmentation achieves stable high accuracy for Wi-Fi-based motion and gesture recognition across variable sampling rates and traffic patterns.
-
Gated Subspace Inference for Transformer Acceleration
Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.
-
Stochastic Sparse Attention for Memory-Bound Inference
SANTA sparsifies post-softmax value aggregation via stratified sampling of S << n_k indices to produce an unbiased estimator, delivering 1.5x decode attention speedup on RTX 6000 Ada at 32k contexts while matching bas...
-
Linear-Time Global Visual Modeling without Explicit Attention
Dynamic parameterization of standard layers can replace explicit attention for linear-time global visual modeling.
-
GateMOT: Q-Gated Attention for Dense Object Tracking
GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.
-
ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers
ELSA casts online softmax attention as a prefix scan over monoid (m,S,W) to deliver exact FP32 semantics, O(n) memory, O(log n) depth, and Tensor-Core independence as a drop-in kernel.
-
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
-
DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion
DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with...
-
RankUp: Towards High-rank Representations for Large Scale Advertising Recommender Systems
RankUp raises effective rank of representations in deep MetaFormer recommenders via randomized splitting and multi-embeddings, delivering 2-5% GMV gains in production deployments at Weixin.
-
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
-
Tracing the Chain: Deep Learning for Stepping-Stone Intrusion Detection
ESPRESSO achieves over 0.99 true positive rate at 10^{-3} false positive rate for stepping-stone intrusion detection on synthetic data for SSH, SOCAT, ICMP, DNS and mixed protocols, outperforming DeepCoFFEA while also...
-
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
-
Why Attend to Everything? Focus is the Key
Focus learns a few centroids to gate long-range token attention, producing sparse attention that matches or beats full attention quality with up to 8.6x speedup at million-token lengths.
-
CoFrGeNet: Continued Fraction Architectures for Language Generation
CoFrGeNet uses continued-fraction function classes to build transformer replacements that match or beat GPT-2 and Llama performance with half to two-thirds the parameters.
-
When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models
AMOR uses output entropy to gate attention in recurrent hybrids, matching full attention performance at roughly 22% attention invocations across 180M-1.5B models.
-
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.
-
SURF: Signature-Retained Fast Video Generation
SURF accelerates high-resolution video generation up to 12.5x by using noise reshifting for low-res previews from pretrained models and a shifting-window Refiner for efficient upscaling that retains original signatures.
-
Cambrian-S: Towards Spatial Supersensing in Video
Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise o...
-
Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live
Continuum applies a time-to-live mechanism to KV cache retention during tool calls in multi-turn LLM agents, reporting over 8x faster average job completion times on benchmarks including SWE-Bench with models up to 35...
-
Higher-order Linear Attention
Higher-order Linear Attention realizes second-order and higher interactions in linear-time causal attention via constant-size state and associative scans.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
SpikingBrain: Spiking Brain-inspired Large Models
SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.
-
Lizard: An Efficient Linearization Framework for Large Language Models
Lizard linearizes Transformer LLMs via subquadratic attention and adaptive learnable modules, recovering near-original performance while outperforming prior linearization methods on MMLU and associative recall.