Language Modeling with Gated Convolutional Networks

Yann N. Dauphin; Angela Fan; Michael Auli; David Grangier

arxiv: 1612.08083 · v3 · pith:YNBBOI34new · submitted 2016-12-23 · 💻 cs.CL


classification 💻 cs.CL

keywords approach, language, recurrent, benchmark, competitive, context, modeling, networks


Abstract

The predominant approach to language modeling to date is based on recurrent neural networks. Their success on this task is often linked to their ability to capture unbounded context. In this paper, we develop a finite context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens. We propose a novel simplified gating mechanism that outperforms Oord et al. (2016) and investigate the impact of key architectural decisions. The proposed approach achieves state-of-the-art results on the WikiText-103 benchmark, even though that benchmark features long-term dependencies, as well as competitive results on the Google Billion Words benchmark. Our model reduces the latency to score a sentence by an order of magnitude compared to a recurrent baseline. To our knowledge, this is the first time a non-recurrent approach is competitive with strong recurrent models on these large-scale language tasks.
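
The abstract does not spell out the gating mechanism. As a rough illustration only, the sketch below assumes the gated linear unit form usually associated with this paper, h(X) = (X * W + b) ⊗ σ(X * V + c), applied on top of a causal 1-D convolution so that each output depends only on the current token and the k - 1 tokens before it (the finite context). The function names, the single-channel scalar setting, and the toy weights are illustrative assumptions, not the authors' implementation.

```python
import math


def sigmoid(z):
    """Logistic function, used here as the gate."""
    return 1.0 / (1.0 + math.exp(-z))


def causal_conv1d(xs, kernel, bias=0.0):
    """Causal 1-D convolution (cross-correlation) over a scalar sequence.

    Output t sees only xs[t-k+1 .. t]; positions before the start are
    treated as zeros (left padding), so no future token leaks in.
    kernel[-1] is the weight applied to the current token.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(xs)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k)) + bias
        for t in range(len(xs))
    ]


def gated_conv_layer(xs, w, v, b=0.0, c=0.0):
    """Hypothetical single-channel gated layer: (xs*w + b) * sigmoid(xs*v + c)."""
    linear = causal_conv1d(xs, w, b)  # content path
    gate = causal_conv1d(xs, v, c)    # gate path, squashed to (0, 1)
    return [a * sigmoid(g) for a, g in zip(linear, gate)]


if __name__ == "__main__":
    seq = [1.0, 2.0, 3.0, 4.0]
    # With a zero gate kernel, sigmoid(0) = 0.5 halves the linear path.
    print(gated_conv_layer(seq, w=[0.5, 0.5], v=[0.0, 0.0]))
    # → [0.25, 0.75, 1.25, 1.75]
```

Because every output position is computed from a fixed window of inputs rather than from a hidden state carried over from the previous step, all positions in this sketch could be evaluated independently, which is the parallelization over sequential tokens that the abstract refers to.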



Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
cs.LG · 2026-05 · unverdicted · novelty 7.0

Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.

Searching for Activation Functions
cs.NE · 2017-10 · conditional · novelty 7.0

Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.

Hardware-Software Co-Design of Scalable, Energy-Efficient Analog Recurrent Computations
cs.AR · 2026-05 · unverdicted · novelty 6.0

BMRUs enable a direct one-to-one mapping from learned parameters to current-mode analog circuit elements, with discrete hysteretic outputs suppressing noise by at least 20x and supporting sub-microwatt RNN inference i...

The Falcon Series of Open Language Models
cs.CL · 2023-11 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Compressive Transformers for Long-Range Sequence Modelling
cs.LG · 2019-11 · unverdicted · novelty 6.0

Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.

ASTRAFier: A Novel and Scalable Transformer-based Stellar Variability Classifier
astro-ph.IM · 2026-04 · unverdicted · novelty 5.0

ASTRAFier is a Transformer-BiLSTM-CNN model that classifies stellar variability from light curves, reporting 94.26% accuracy on Kepler data and 88.22% on TESS, then applied to 2.8 million TESS curves to release a catalog.

CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
cs.CV · 2026-01 · unverdicted · novelty 5.0

CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-f...

Resource-Efficient CSI Prediction: A Gated Fusion and Factorized Projection Approach
eess.SP · 2026-05 · unverdicted · novelty 4.0

A gated-fusion CSI predictor using GRU, attention, and DSLH reaches -13.84 dB NMSE with 26% fewer parameters and 2.3x higher throughput than a LinFormer baseline on 3GPP channels.

Data-Driven Reduction of Fault Location Errors in Onshore Wind Farm Collectors
eess.SY · 2025-11 · unverdicted · novelty 4.0

A Gated Residual Network correction model reduces fault location error by 76% in simulated onshore wind farm collector networks compared to state-of-the-art methods.

GLU Variants Improve Transformer
cs.LG · 2020-02 · unverdicted · novelty 4.0

Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.

Fake News Detection as Natural Language Inference
cs.CL · 2019-07 · unverdicted · novelty 4.0

Framing fake news classification as natural language inference and ensembling NLI models with BERT, plus transitivity rules, achieves 88.063% test accuracy in the WSDM 2019 challenge.