Longformer: The Long-Document Transformer

Iz Beltagy; Matthew E. Peters; Arman Cohan

arxiv: 2004.05150 · v2 · submitted 2020-04-10 · 💻 cs.CL

Iz Beltagy, Matthew E. Peters, Arman Cohan

Pith reviewed 2026-05-10 13:24 UTC · model grok-4.3

classification 💻 cs.CL

keywords: transformer · long documents · attention mechanism · linear scaling · pretraining · question answering · summarization · language modeling

The pith

Longformer's attention mechanism is a drop-in replacement for standard self-attention that scales linearly with sequence length; after pretraining, Longformer outperforms RoBERTa on long-document tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the limitation of standard transformers that cannot handle long sequences because self-attention scales quadratically. It introduces Longformer, whose attention mechanism uses local windows around each position combined with global attention on a few tokens to achieve linear scaling. This design is pretrained and then applied to downstream tasks where it beats the RoBERTa baseline on long documents and establishes new state-of-the-art scores on the WikiHop and TriviaQA datasets. The work also presents the Longformer-Encoder-Decoder variant that supports generative tasks on long inputs, with results on arXiv summarization. A reader should care because this makes advanced language models usable on full-length articles, books, or reports instead of short snippets.

Core claim

Longformer's attention mechanism is a drop-in replacement for standard self-attention that scales linearly with sequence length and combines a local windowed attention with a task motivated global attention. After pretraining, it consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. The Longformer-Encoder-Decoder variant demonstrates effectiveness on long document generative sequence-to-sequence tasks such as arXiv summarization.

What carries the argument

The attention mechanism itself: fixed-size local windowed attention combined with a small number of global attention tokens.
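
As a rough illustration of that pattern, the sketch below builds a boolean attention mask from a sliding window plus a few global positions. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names `longformer_style_mask` and `render` are hypothetical, `window` is taken to be the total window size (half on each side of a token), and global attention is taken to be symmetric (global tokens see everything and everything sees them).

```python
def longformer_style_mask(n, window, global_positions):
    """Return an n x n boolean mask; mask[i][j] is True if query i may attend to key j.

    Assumptions (hypothetical helper, not the authors' code):
    - `window` is the total window size; each token sees window // 2
      neighbours on each side, plus itself.
    - Tokens at `global_positions` attend to every position, and every
      position attends to them.
    """
    half = window // 2
    globals_ = set(global_positions)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # Local band around the diagonal, clipped at the sequence boundaries.
        for j in range(max(0, i - half), min(n, i + half + 1)):
            mask[i][j] = True
        # Symmetric global attention: fill column g and row g.
        for g in globals_:
            mask[i][g] = True
            mask[g][i] = True
    return mask


def render(mask):
    """Draw the mask as text: '#' for an allowed pair, '.' for a masked one."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)


mask = longformer_style_mask(n=8, window=2, global_positions=[0])
print(render(mask))
```

Printing the mask for a short sequence shows a diagonal band plus one filled row and column for the global token. A dense mask like this is only for inspection; it costs quadratic memory itself, so an efficient implementation would never materialize it.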

Load-bearing premise

The specific combination of fixed-size local windows plus a small number of global attention tokens is sufficient to capture the long-range dependencies required by downstream tasks.

What would settle it

A controlled experiment on a long-document task where removing the global attention tokens causes performance to drop to the level of a purely local window model or where full quadratic attention still shows measurable gains over the proposed pattern.

Original abstract

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Longformer gives a practical local-window-plus-global attention fix that scales linearly and beats RoBERTa after pretraining on long-document tasks, with the main limitation being that global token selection stays task-specific.

The paper's main advance is a drop-in attention pattern that uses sliding local windows of fixed size plus a small number of global tokens. This keeps complexity linear while letting distant tokens interact through the global tokens. The authors pretrain the model and show it outperforms RoBERTa on several long-document benchmarks, with new state-of-the-art numbers on WikiHop and TriviaQA. The LED encoder-decoder version is also shown to be effective on arXiv summarization. Ablations on the attention pattern and standard reporting of metrics make the results easy to check.
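
One way to see why global tokens help distant positions interact is to treat the attention pattern as a graph and count hops. The sketch below is illustrative only: the function `hops` is hypothetical, and it assumes that each attention layer moves information by one hop in that graph. The window size of 512 follows the typical value this review mentions; the sequence length and positions are arbitrary.

```python
from collections import deque


def hops(n, window, global_positions, src, dst):
    """Shortest number of attention hops from position src to position dst.

    Assumptions (illustrative, not from the paper's code):
    - each token attends to window // 2 neighbours on each side;
    - tokens at `global_positions` attend to, and are attended by, all tokens;
    - one layer of attention corresponds to one hop.
    """
    half = window // 2
    globals_ = set(global_positions)

    def neighbours(i):
        if i in globals_:
            return range(n)
        local = range(max(0, i - half), min(n, i + half + 1))
        return list(local) + list(globals_)

    # Breadth-first search over the (symmetric) attention graph.
    dist = {src: 0}
    queue = deque([src])
    while queue:
        i = queue.popleft()
        if i == dst:
            return dist[i]
        for j in neighbours(i):
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return None


local_only = hops(4096, 512, [], src=10, dst=4000)
with_global = hops(4096, 512, [0], src=10, dst=4000)
print(local_only, with_global)  # 16 2
```

Under these assumptions, a purely local pattern needs many stacked layers to connect the two positions, while a single global token connects them in two hops. This is a statement about connectivity only; whether the connection carries enough information for a given task is an empirical question.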

Referee Report

0 major / 3 minor

Summary. The manuscript introduces Longformer, a Transformer variant whose self-attention scales linearly with sequence length by combining fixed-size local windowed attention with a small number of task-motivated global attention tokens. The authors pretrain the model on long documents, demonstrate state-of-the-art results on character-level language modeling (text8, enwik8), and show consistent gains over RoBERTa on downstream long-document tasks with new SOTAs on WikiHop and TriviaQA. They further present the Longformer-Encoder-Decoder (LED) variant and evaluate it on arXiv summarization.

Significance. If the results hold, this provides a practical, drop-in replacement for standard self-attention that enables efficient processing of documents with thousands of tokens while preserving or improving performance. The pretraining experiments, ablation studies on attention patterns, and consistent benchmark gains are explicit strengths that support the central claim. The work also introduces LED for generative seq2seq tasks, broadening its applicability.

minor comments (3)

§3.1: The description of the global attention implementation would benefit from an explicit statement of how the global tokens are chosen for each downstream task (e.g., the exact positions used for WikiHop versus TriviaQA) to improve reproducibility.
Table 2: The reported bits-per-character (BPC) numbers on enwik8 and text8 are given without standard deviations across multiple runs; adding these would strengthen the SOTA claim.
Figure 4: The attention visualization would be clearer if the local window boundaries and global token indices were annotated directly on the plot.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, accurate summary of our contributions, and recommendation to accept. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The Longformer paper defines its attention mechanism explicitly as a combination of fixed-size local windows plus a small set of task-motivated global tokens, then states that this construction yields linear scaling with sequence length. That scaling property follows directly from the definition, O(window_size * n + global_tokens * n) with both the window size and the number of global tokens held constant as n grows, rather than from any derived prediction or fitted parameter. Downstream claims of outperformance over RoBERTa and new SOTA on WikiHop/TriviaQA rest on separate pretraining and fine-tuning runs whose metrics are reported independently of the architectural equations. No load-bearing self-citation, uniqueness theorem, or ansatz is invoked to justify the core design; the mechanism is presented as an engineering choice validated empirically. The derivation chain is therefore self-contained and does not reduce to its inputs by construction.
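
The scaling argument can be checked with a small counting sketch. The functions `sparse_pairs` and `dense_pairs` are hypothetical names introduced here; the sparse count is an upper bound that ignores boundary clipping and overlap between local and global entries, and the window size of 512 with one global token is an illustrative setting rather than a reported configuration.

```python
def sparse_pairs(n, window, num_global):
    """Upper bound on query-key pairs under a windowed-plus-global pattern.

    Each of the n queries sees at most window + 1 local keys (itself plus
    window // 2 on each side), and each global token adds one full row and
    one full column, i.e. about 2 * n pairs. Overlaps are not subtracted.
    """
    return n * (window + 1) + 2 * num_global * n


def dense_pairs(n):
    """Query-key pairs under full self-attention."""
    return n * n


for n in (1024, 2048, 4096):
    print(n, sparse_pairs(n, window=512, num_global=1), dense_pairs(n))
```

With the window and the number of global tokens fixed, doubling `n` doubles the sparse count and quadruples the dense count, which is all the linear-scaling claim says.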

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of a hand-designed sparse attention mask rather than a derivation from first principles; the only free parameters are standard hyperparameters (window size, number of global tokens) chosen by validation performance.

free parameters (2)

attention window size
Fixed hyperparameter (typically 512) chosen to balance local context and compute; affects all reported results.
number and placement of global attention tokens
Task-dependent choice (e.g., [CLS] token or question tokens) that is set by hand for each downstream task.
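
To make the second free parameter concrete, the sketch below shows one way a hand-set, task-dependent choice of global positions could look. Everything here is an assumption for illustration: the function `choose_global_positions`, the task labels, and the token layout (`[CLS]`, question tokens, `[SEP]`, then document tokens) are hypothetical and are not taken from the paper's code.

```python
def choose_global_positions(tokens, task):
    """Pick global-attention positions by hand for a given task.

    Hypothetical layout: ["[CLS]", <question tokens>, "[SEP]", <document tokens>].
    - "classification": only the leading [CLS] token is global.
    - "qa": the question tokens (between [CLS] and the first [SEP]) are global.
    """
    if task == "classification":
        return [0]
    if task == "qa":
        first_sep = tokens.index("[SEP]")
        return list(range(1, first_sep))
    raise ValueError(f"no global-attention rule defined for task {task!r}")


tokens = ["[CLS]", "who", "wrote", "it", "[SEP]", "the", "book", "was", "written"]
print(choose_global_positions(tokens, "classification"))  # [0]
print(choose_global_positions(tokens, "qa"))              # [1, 2, 3]
```

The point of the sketch is that the rule is written per task rather than learned, which is why the ledger counts it as a free parameter.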

axioms (1)

domain assumption: Standard Transformer layer norms, feed-forward networks, and positional embeddings remain unchanged and sufficient when attention is sparsified.
Invoked throughout the architecture description without additional justification.

pith-pipeline@v0.9.0 · 5483 in / 1336 out tokens · 19976 ms · 2026-05-10T13:24:20.707423+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention.

IndisputableMonolith.Foundation.DimensionForcing · dimension_forced · unclear
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Preisach Attention: A Hysteretic Model of Sequential Memory
cs.LG 2026-05 unverdicted novelty 8.0

PAL uses the classical Preisach hysteresis operator with learned thresholds and an extrema stack to model sequences, proving O(1)-depth Turing completeness via two-stack PDA simulation and incomparability with standar...
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
cs.CL 2026-05 unverdicted novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
Convergent Stochastic Training of Attention and Understanding LoRA
cs.LG 2026-05 unverdicted novelty 8.0

Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
Nearly Optimal Attention Coresets
cs.DS 2026-05 unverdicted novelty 8.0

ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
Characterizing the Expressivity of Local Attention in Transformers
cs.CL 2026-05 unverdicted novelty 8.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models
cs.LG 2026-03 conditional novelty 8.0

Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.
Sparse Attention as Compact Kernel Regression
cs.LG 2026-01 unverdicted novelty 8.0

Sparse attention arises from compact kernel regression, with Epanechnikov and similar kernels mapping to normalized ReLU, sparsemax, and alpha-entmax attention.
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
cs.CL 2023-08 unverdicted novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
cs.LG 2022-07 conditional novelty 8.0

TabPFN is a Prior-Data Fitted Network that approximates Bayesian inference for small tabular classification by training a Transformer once on synthetic data drawn from a causal prior, then solves new tasks in a single...
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
cs.LG 2026-05 unverdicted novelty 7.0

Derives a blockwise resolvent-style attention operator that exploits structured sparsity for subquadratic O(n^{4/3}d) entity tracking while matching dense accuracy.
Tensor Cache: Eviction-conditioned Associative Memory for Transformers
cs.LG 2026-05 unverdicted novelty 7.0

Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.
Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging
hep-ex 2026-05 unverdicted novelty 7.0

PHAT-JeT combines geometric message-passing with hierarchical patch attention to reach state-of-the-art accuracy and background rejection among resource-constrained jet tagging models on four benchmarks.
SSV: Sparse Speculative Verification for Efficient LLM Inference
cs.OS 2026-05 unverdicted novelty 7.0

SpecSA is a sparse speculative-verification framework that integrates speculative decoding and dynamic sparse attention to achieve up to 3.49x end-to-end throughput and 6.86x kernel speedups on H100 GPUs for long-cont...
Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding
cs.LG 2026-05 unverdicted novelty 7.0

Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and ...
WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer
cs.GR 2026-05 unverdicted novelty 7.0

A transformer with prediction-correction and hierarchical super-token merging unifies simulation of six physical dynamics categories on Lagrangian particles and generalizes to unseen conditions.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
cs.CL 2026-05 unverdicted novelty 7.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.
EndPrompt: Efficient Long-Context Extension via Terminal Anchoring
cs.CL 2026-05 conditional novelty 7.0

EndPrompt induces reliable long-context generalization in LLaMA models from sparse positional supervision via a two-segment short-sequence construction with terminal anchoring.
End-to-End Population Inference from Gravitational-Wave Strain using Transformers
gr-qc 2026-05 unverdicted novelty 7.0

Dingo-Pop uses a transformer to perform amortized, end-to-end population inference from GW strain data in seconds, bypassing per-event Monte Carlo sampling.
RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings
cs.LG 2026-05 unverdicted novelty 7.0

RelFlexformers enable flexible integrable 3D RPE in attention via NU-FFT, generalizing prior methods to heterogeneous token positions with O(L log L) complexity.
Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases
cs.LG 2026-05 unverdicted novelty 7.0

ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
VORT: Adaptive Power-Law Memory for NLP Transformers
cs.LG 2026-05 unverdicted novelty 7.0

VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
cs.DC 2026-05 unverdicted novelty 7.0

Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
cs.LG 2026-05 conditional novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
cs.LG 2026-05 unverdicted novelty 7.0

Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
cs.DC 2026-05 unverdicted novelty 7.0

Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
cs.MA 2026-05 unverdicted novelty 7.0

MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
cs.CV 2026-05 unverdicted novelty 7.0

SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
cs.DC 2026-05 unverdicted novelty 7.0

SplitZip is a new GPU-friendly lossless compressor for KV cache tensors that exploits exponent redundancy to achieve over 600 GB/s compression throughput and up to 1.32x faster transfers in disaggregated LLM serving.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
cs.CV 2026-05 unverdicted novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
Characterizing the Expressivity of Local Attention in Transformers
cs.CL 2026-05 conditional novelty 7.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by introducing a second temporal operator in LTL, with global and local attention being expressively complementary.
RouteProfile: Elucidating the Design Space of LLM Profiles for Routing
cs.NI 2026-04 unverdicted novelty 7.0

RouteProfile organizes LLM profile design into organizational form, representation type, aggregation depth, and learning configuration, with evaluations showing structured profiles outperform flat ones and aid general...
Adaptive Head Budgeting for Efficient Multi-Head Attention
cs.LG 2026-04 unverdicted novelty 7.0

BudgetFormer adaptively budgets the number and selection of attention heads per input in Transformers, reducing FLOPs and memory on text classification while matching or exceeding standard multi-head performance.
How English Print Media Frames Human-Elephant Conflicts in India
cs.AI 2026-04 unverdicted novelty 7.0

English print media in India frames human-elephant conflicts with predominantly fear-inducing and aggression-related language.
How English Print Media Frames Human-Elephant Conflicts in India
cs.AI 2026-04 unverdicted novelty 7.0

English print media coverage of human-elephant conflicts in India is dominated by fear-inducing and aggression-related language.
Subject-level Inference for Realistic Text Anonymization Evaluation
cs.CL 2026-04 unverdicted novelty 7.0

SPIA benchmark reveals that subject-level inference protection falls to as low as 33% even after masking over 90% of PII spans, with non-target subjects remaining highly exposed under target-focused anonymization.
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
cs.LG 2026-04 unverdicted novelty 7.0

Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures
cs.DC 2026-04 unverdicted novelty 7.0

AsyncSparse presents BCSR and WCSR kernels that use TMA and warp specialization to accelerate SpMM, outperforming prior libraries by 1.47-6.24x on SuiteSparse and achieving 2.66x end-to-end speedup on Qwen2.5-7B at 90...
Improving Sparse Autoencoder with Dynamic Attention
cs.LG 2026-04 unverdicted novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
TokenFormer: Unify the Multi-Field and Sequential Recommendation Worlds
cs.IR 2026-04 unverdicted novelty 7.0

TokenFormer unifies multi-field and sequential recommendation modeling via bottom-full-top-sliding attention and non-linear interaction representations to avoid sequential collapse and deliver state-of-the-art performance.
Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis
cs.LG 2026-04 unverdicted novelty 7.0

HKT is a multi-scale attention architecture that bounds computation at 1.31x standard attention, proves kernel and decomposition properties, and reports accuracy gains on ListOps, sequential CIFAR-10, and character-le...
BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs
cs.CL 2026-04 unverdicted novelty 7.0

BOSCH decomposes attention-head selection for short-context hybridization into layer probing, adaptive ratio assignment, and grouped binary optimization, yielding better efficiency-performance tradeoffs than static or...
Anchored Cyclic Generation: A Novel Paradigm for Long-Sequence Symbolic Music Generation
cs.SD 2026-04 unverdicted novelty 7.0

Anchored Cyclic Generation uses anchor features from known music to mitigate error accumulation in autoregressive models, with the Hi-ACG framework delivering better long-sequence symbolic music and music completion p...
Fast Cross-Operator Optimization of Attention Dataflow
cs.AR 2026-04 unverdicted novelty 7.0

MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.
NativeTernary: A Self-Delimiting Binary Encoding with Unary Run-Length Hierarchy Markers for Ternary Neural Network Weights, Structured Data, and General Computing Infrastructure
cs.LG 2026-04 unverdicted novelty 7.0

NativeTernary encodes ternary weights at exactly 2 bits each with 460x lower overhead than GGUF for BitNet-style models.
Efficient Remote KV Cache Reuse with GPU-native Video Codec
cs.DC 2026-02 conditional novelty 7.0

KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.
Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights
cs.LG 2026-02 unverdicted novelty 7.0

MiTA makes attention scalable by gathering query-aware top-k key-value pairs through landmarks as deformable routed experts and compressing the N-width fast-weight MLP into a shared narrower expert.
NEST: Nested Event Stream Transformer for Sequences of Multisets
cs.LG 2026-01 unverdicted novelty 7.0

NEST is a nested transformer for sequences of multisets that uses masked set modeling to learn improved set-level representations from hierarchical event streams like EHRs.
Measuring Investor Learning in Private Markets: A Sequential LLM-Bayesian Analysis of Expert Network Calls
cs.CE 2025-12 unverdicted novelty 7.0

A new LLM-Bayesian framework extracts investor beliefs from expert calls, showing these calls raise investment probability by 7-9 points and improve simulated portfolio returns by 15%.
BERT-APC: A Reference-free Framework for Automatic Pitch Correction via Musical Context Inference
eess.AS 2025-11 unverdicted novelty 7.0

BERT-APC is a reference-free automatic pitch correction system that uses a repurposed music language model to infer intended pitches from detuned vocals and applies note-level corrections while preserving expressive d...
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
cs.AI 2025-11 unverdicted novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy ...
DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
cs.CL 2025-10 conditional novelty 7.0

DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and ach...
When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs
cs.CL 2025-10 unverdicted novelty 7.0

Thought templates derived from training traces and refined via natural-language feedback improve multi-hop reasoning performance in long-context LMs across benchmarks and can be distilled into smaller models.
RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
cs.LG 2025-10 unverdicted novelty 7.0

RACE Attention is a strictly linear-time attention mechanism that approximates softmax attention outputs using Gaussian projections and soft LSH to enable training on contexts up to 12 million tokens.
Explaining Sources of Uncertainty in Automated Fact-Checking
cs.CL 2025-05 unverdicted novelty 7.0

CLUE generates natural language explanations of model uncertainty in fact-checking by unsupervised identification of claim-evidence and inter-evidence conflicts and agreements, followed by prompting and attention steering.
IAFormer: Interaction-Aware Transformer network for collider data analysis
hep-ph 2025-05 unverdicted novelty 7.0

IAFormer uses boost-invariant pairwise quantities and differential attention to create a sparse Transformer that achieves state-of-the-art classification on top-quark and quark-gluon jet datasets while using over an o...
Transformer Neural Processes - Kernel Regression
cs.LG 2024-11 unverdicted novelty 7.0

TNP-KR adds a kernel regression transformer block, kernel attention bias, scan attention for translation invariance, and deep kernel attention to achieve lower complexity and state-of-the-art results on meta-regressio...
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
cs.CL 2024-10 conditional novelty 7.0

DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
cs.CL 2024-10 unverdicted novelty 7.0

LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
cs.LG 2024-07 accept novelty 7.0

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
cs.CL 2024-04 conditional novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

Reference graph

Works this paper leans on

131 extracted references · 131 canonical work pages · cited by 240 Pith papers · 4 internal anchors

[1]

NAACL-HLT 2018 , year =

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents , author =. NAACL-HLT 2018 , year =

work page 2018
[2]

A Simple yet Strong Pipeline for

Dirk Groeneveld and Tushar Khot and Mausam and Ashish Sabhwaral , journal=. A Simple yet Strong Pipeline for

work page
[3]

Proceedings of NAACL-HLT 2019: Demonstrations , year =

fairseq: A Fast, Extensible Toolkit for Sequence Modeling , author =. Proceedings of NAACL-HLT 2019: Demonstrations , year =

work page 2019
[4]

arXiv preprint , year=

Is Graph Structure Necessary for Multi-hop Reasoning? , author=. arXiv preprint , year=

work page
[5]

arXiv preprint , year=

Span Selection Pre-training for Question Answering , author=. arXiv preprint , year=

work page
[6]

arXiv preprint , year=

Unsupervised Data Augmentation for Consistency Training , author=. arXiv preprint , year=

work page
[7]

Chen, Tianqi and Moreau, Thierry and Jiang, Ziheng and Zheng, Lianmin and Yan, Eddie and Shen, Haichen and Cowan, Meghan and Wang, Leyuan and Hu, Yuwei and Ceze, Luis and others , booktitle=

work page
[8]

arXiv preprint , year=

Coreference Resolution as Query-based Span Prediction , author=. arXiv preprint , year=

work page
[9]

Anonymous title , author=

work page
[10]

ACL , year=

Sentiment Classification Using Document Embeddings Trained with Cosine Similarity , author=. ACL , year=

work page
[11]

Adam Fisch and Alon Talmor and Robin Jia and Minjoon Seo and Eunsol Choi and Danqi Chen , booktitle=

work page
[12]

Carbonell and Quoc V

Zihang Dai and Zhilin Yang and Yiming Yang and Jaime G. Carbonell and Quoc V. Le and Ruslan Salakhutdinov , booktitle=. Transformer-

work page
[13]

ACL , year=

Adaptive Attention Span in Transformers , author=. ACL , year=

[14]

Compressive Transformers for Long-Range Sequence Modelling. ICLR.

[15]

Reformer: The Efficient Transformer. ICLR.

[16]

Generating Long Sequences with Sparse Transformers. arXiv preprint.

[17]

Character-Level Language Modeling with Deeper Self-Attention. AAAI.

[18]

On Layer Normalization in the Transformer Architecture. arXiv preprint.

[19]

Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova.

[20]

Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov. 2019.

[21]

Simple and Effective Multi-Paragraph Reading Comprehension. ACL.

[22]

Mandar Joshi and Danqi Chen and Yinhan Liu and Daniel S. Weld and Luke Zettlemoyer and Omer Levy. 2019.

[23]

Blockwise Self-Attention for Long Document Understanding. arXiv preprint.

[24]

Zihao Ye and Qipeng Guo and Quan Gan and Xipeng Qiu and Zheng Zhang. 2019.

[25]

Adaptively Sparse Transformers. EMNLP/IJCNLP.

[26]

Sparse Sinkhorn Attention. arXiv preprint.

[27]

Large Text Compression Benchmark.

[28]

Training Deep Nets with Sublinear Memory Cost. arXiv preprint.

[29]

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer. 1981. doi:10.1145/322234.322243

[30]

Andrew, Galen and Gao, Jianfeng. Scalable training of

[31]

Dan Gusfield. 1997.

[32]

Benjamin Borschinger and Mark Johnson. A Particle Filter algorithm for

[33]

Ando, Rie Kubota and Zhang, Tong. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, December.

[34]

Goodman, James and Vlachos, Andreas and Naradowsky, Jason. Noise reduction and targeted exploration in imitation learning for Abstract Meaning Representation parsing. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1001

[35]

Harper, Mary. Learning from 26 Languages: Program Management and Science in the Babel Program. Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. 2014.

[36]

Attention is All you Need. NIPS.

[37]

Language Models are Unsupervised Multitask Learners.

[38]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint.

[39]

Universal Language Model Fine-tuning for Text Classification. ACL.

[40]

GPU Kernels for Block-Sparse Weights.

[41]

Pay Less Attention with Lightweight and Dynamic Convolutions. arXiv preprint.

[42]

WaveNet: A Generative Model for Raw Audio. SSW.

[43]

Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. ICCV.

[44]

A Simple Method for Commonsense Reasoning. arXiv preprint.

[45]

Defending Against Neural Fake News. NeurIPS.

[46]

Revealing the Dark Secrets of BERT. EMNLP/IJCNLP.

[47]

Zhilin Yang and Peng Qi and Saizheng Zhang and Yoshua Bengio and William W. Cohen and Ruslan Salakhutdinov and Christopher D. Manning.

[48]

Constructing Datasets for Multi-hop Reading Comprehension Across Documents. TACL.

[49]

Mandar Joshi and Eunsol Choi and Daniel S. Weld and Luke Zettlemoyer.

[51]

Fang, Yuwei and Sun, Siqi and Gan, Zhe and Pillai, Rohit and Wang, Shuohang and Liu, Jingjing. Hierarchical Graph Network for Multi-hop Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.

[52]

Deep contextualized word representations. NAACL.

[53]

Semi-supervised Sequence Learning. NeurIPS.

[54]

Improving Language Understanding by Generative Pre-Training.

[55]

Joshi, Mandar and Levy, Omer and Zettlemoyer, Luke and Weld, Daniel. BERT for Coreference Resolution: Baselines and Analysis. EMNLP-IJCNLP. 2019.

[56]

Lee, Kenton and He, Luheng and Zettlemoyer, Luke. Higher-Order Coreference Resolution with Coarse-to-Fine Inference. NAACL. 2018.

[57]

Pradhan, Sameer and Moschitti, Alessandro and Xue, Nianwen and Uryupina, Olga and Zhang, Yuchen. CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes. Joint Conference on EMNLP and CoNLL - Shared Task. 2012.

[58]

Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011.

[59]

What Does BERT Look At? An Analysis of BERT's Attention. arXiv preprint.

[60]

Multi-hop Question Answering via Reasoning Chains. arXiv preprint.

[61]

Graph Sequential Network for Reasoning over Sequences. NeurIPS Graph Representation Learning workshop.

[62]

Select, Answer and Explain: Interpretable Multi-hop Reading Comprehension over Multiple Documents. arXiv preprint.

[63]

Reading Wikipedia to Answer Open-Domain Questions. ACL.

[64]

Semi-supervised classification with graph convolutional networks. ICLR.

[65]

Efficient Content-Based Sparse Attention with Routing Transformers. 2020.

[67]

A Divide-and-Conquer Approach to the Summarization of Academic Articles. ArXiv.

[68]

On Extractive and Abstractive Neural Document Summarization with Transformer Language Models. EMNLP.

[69]

Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. ICML.

[70]

Ainslie, Joshua and Ontanon, Santiago and Alberti, Chris and Cvicek, Vaclav and Fisher, Zachary and Pham, Philip and Ravula, Anirudh and Sanghai, Sumit and Wang, Qifan and Yang, Li. ETC: Encoding Long and Structured Inputs in Transformers. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.

[71]

Big Bird: Transformers for Longer Sequences. ArXiv.

[72]

GMAT: Global Memory Augmentation for Transformers. ArXiv.

[73]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res.

[74]

Sequence to Sequence Learning with Neural Networks. NIPS.

[75]

Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing ... https://www.aclweb.org/anthology/2020.emnlp-main.19

[76]

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level language modeling with deeper self-attention. In AAAI.

[77]

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In ACL.

[78]

Jifan Chen, Shih-Ting Lin, and Greg Durrett. 2019. Multi-hop question answering via reasoning chains. arXiv preprint, abs/1910.02610.

[79]

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI.

[80]

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint, abs/1604.06174.

[81]

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint, abs/1904.10509.

[82]

Christopher Clark and Matt Gardner. 2017. Simple and effective multi-paragraph reading comprehension. In ACL.

