Language Modeling with Gated Convolutional Networks

Yann N Dauphin, Angela Fan, Michael Auli, David Grangier · 2016 · cs.CL · arXiv 1612.08083

11 Pith papers cite this work. Polarity classification is still indexing.

abstract

The predominant approach to language modeling to date is based on recurrent neural networks. Their success on this task is often linked to their ability to capture unbounded context. In this paper we develop a finite context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens. We propose a novel simplified gating mechanism that outperforms Oord et al. (2016) and investigate the impact of key architectural decisions. The proposed approach achieves state-of-the-art results on the WikiText-103 benchmark, even though that benchmark features long-term dependencies, as well as competitive results on the Google Billion Words benchmark. Our model reduces the latency to score a sentence by an order of magnitude compared to a recurrent baseline. To our knowledge, this is the first time a non-recurrent approach is competitive with strong recurrent models on these large scale language tasks.
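The abstract names the gating mechanism without defining it. In the paper it is the gated linear unit (GLU), h(X) = (X*W + b) ⊗ σ(X*V + c), applied over convolutions that are shifted so that no position sees future tokens. The following is a minimal NumPy sketch of one such layer, not the authors' implementation: the function names, tensor layout, and the explicit loop are assumptions chosen for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def causal_conv1d(x, w, b):
    """1D convolution that only looks at current and past positions.

    x: (T, d_in) sequence of token vectors
    w: (k, d_in, d_out) kernel of width k
    b: (d_out,) bias
    Returns an array of shape (T, d_out).
    """
    k = w.shape[0]
    # Left-pad with k-1 zero vectors so output t depends on x[t-k+1 .. t].
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    out = np.empty((x.shape[0], w.shape[2]))
    for t in range(x.shape[0]):
        window = padded[t:t + k]  # (k, d_in)
        out[t] = np.einsum("ki,kio->o", window, w) + b
    return out

def glu_layer(x, w, b, v, c):
    """Gated linear unit: a linear path multiplied by a sigmoid gate."""
    linear = causal_conv1d(x, w, b)          # X*W + b
    gate = sigmoid(causal_conv1d(x, v, c))   # sigma(X*V + c)
    return linear * gate

# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))                  # 6 tokens, 4 features each
w = rng.normal(size=(3, 4, 5))
v = rng.normal(size=(3, 4, 5))
h = glu_layer(x, w, np.zeros(5), v, np.zeros(5))
print(h.shape)
```

Because every output position is computed from a fixed window of the input, all positions can be evaluated independently, which is the parallelization over sequential tokens that the abstract refers to; the loop here is written out only to make the window explicit.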

representative citing papers

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.

Searching for Activation Functions

cs.NE · 2017-10-16 · conditional · novelty 7.0

Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Compressive Transformers for Long-Range Sequence Modelling

cs.LG · 2019-11-13 · unverdicted · novelty 6.0

Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.

CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models

cs.CV · 2026-01-29 · unverdicted · novelty 5.0

CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.

ASTRAFier: A Novel and Scalable Transformer-based Stellar Variability Classifier

astro-ph.IM · 2026-04-08 · unverdicted · novelty 5.0

ASTRAFier is a Transformer-BiLSTM-CNN model that classifies stellar variability from light curves, reporting 94.26% accuracy on Kepler data and 88.22% on TESS, then applied to 2.8 million TESS curves to release a catalog.

Data-Driven Reduction of Fault Location Errors in Onshore Wind Farm Collectors

eess.SY · 2025-11-26 · unverdicted · novelty 4.0

A Gated Residual Network correction model reduces fault location error by 76% in simulated onshore wind farm collector networks compared to state-of-the-art methods.

Fake News Detection as Natural Language Inference

cs.CL · 2019-07-17 · unverdicted · novelty 4.0

Framing fake news classification as natural language inference and ensembling NLI models with BERT, plus transitivity rules, achieves 88.063% test accuracy in the WSDM 2019 challenge.

Resource-Efficient CSI Prediction: A Gated Fusion and Factorized Projection Approach

eess.SP · 2026-05-07 · unverdicted · novelty 4.0

A gated-fusion CSI predictor using GRU, attention, and DSLH reaches -13.84 dB NMSE with 26% fewer parameters and 2.3x higher throughput than a LinFormer baseline on 3GPP channels.

GLU Variants Improve Transformer

cs.LG · 2020-02-12 · unverdicted · novelty 4.0

Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.

Hardware-Software Co-Design of Scalable, Energy-Efficient Analog Recurrent Computations

cs.AR · 2026-05-12 · unreviewed

citing papers explorer

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
cs.LG · 2026-05-14 · unverdicted · none · ref 9 · internal anchor

Searching for Activation Functions
cs.NE · 2017-10-16 · conditional · none · ref 4

The Falcon Series of Open Language Models
cs.CL · 2023-11-28 · conditional · none · ref 44 · internal anchor

Compressive Transformers for Long-Range Sequence Modelling
cs.LG · 2019-11-13 · unverdicted · none · ref 72 · internal anchor

CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
cs.CV · 2026-01-29 · unverdicted · none · ref 56 · internal anchor

ASTRAFier: A Novel and Scalable Transformer-based Stellar Variability Classifier
astro-ph.IM · 2026-04-08 · unverdicted · none · ref 21

Data-Driven Reduction of Fault Location Errors in Onshore Wind Farm Collectors
eess.SY · 2025-11-26 · unverdicted · none · ref 20 · internal anchor

Fake News Detection as Natural Language Inference
cs.CL · 2019-07-17 · unverdicted · none · ref 4 · internal anchor

Resource-Efficient CSI Prediction: A Gated Fusion and Factorized Projection Approach
eess.SP · 2026-05-07 · unverdicted · none · ref 17

GLU Variants Improve Transformer
cs.LG · 2020-02-12 · unverdicted · none · ref 1

Hardware-Software Co-Design of Scalable, Energy-Efficient Analog Recurrent Computations
cs.AR · 2026-05-12 · unreviewed · ref 106 · internal anchor