Language Modeling with Gated Convolutional Networks

11 Pith papers cite this work. Polarity classification is still indexing.

abstract

The predominant approach to language modeling to date is based on recurrent neural networks. Their success on this task is often linked to their ability to capture unbounded context. In this paper we develop a finite context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens. We propose a novel simplified gating mechanism that outperforms Oord et al. (2016) and investigate the impact of key architectural decisions. The proposed approach achieves state-of-the-art on the WikiText-103 benchmark, even though it features long-term dependencies, as well as competitive results on the Google Billion Words benchmark. Our model reduces the latency to score a sentence by an order of magnitude compared to a recurrent baseline. To our knowledge, this is the first time a non-recurrent approach is competitive with strong recurrent models on these large-scale language tasks.

representative citing papers

Searching for Activation Functions

cs.NE · 2017-10-16 · conditional · novelty 7.0

Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.
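
The Swish formula quoted above, f(x) = x * sigmoid(βx), can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the cited paper: the function names `sigmoid`, `swish`, and `relu` are chosen here, scalar inputs are assumed, and β is treated as a fixed constant rather than a trainable parameter.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish activation f(x) = x * sigmoid(beta * x).

    beta is a plain constant here (an assumption of this sketch).
    """
    return x * sigmoid(beta * x)


def relu(x: float) -> float:
    """ReLU, included only for side-by-side comparison."""
    return max(0.0, x)


if __name__ == "__main__":
    # Print both activations on a few sample points.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  swish={swish(x):+.4f}  relu={relu(x):+.4f}")
```

With β = 0 the function reduces to x / 2, and as β grows it approaches ReLU, which is one way to read it as a smooth interpolation between a linear function and ReLU.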

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Fake News Detection as Natural Language Inference

cs.CL · 2019-07-17 · unverdicted · novelty 4.0

Framing fake news classification as natural language inference and ensembling NLI models with BERT, plus transitivity rules, achieves 88.063% test accuracy in the WSDM 2019 challenge.

GLU Variants Improve Transformer

cs.LG · 2020-02-12 · unverdicted · novelty 4.0

Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.
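
To make the idea of swapping the gate nonlinearity concrete, here is a minimal pure-Python sketch of a gated linear unit with a pluggable activation. It assumes the common form act(xW) ⊗ (xV) with biases omitted for brevity; the helper names (`matvec`, `gated_unit`) and the toy weights are hypothetical and are not taken from the cited papers.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]  # shape: d_in rows by d_out columns


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gelu(z: float) -> float:
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def swish(z: float) -> float:
    """Swish with beta fixed to 1."""
    return z * sigmoid(z)


def matvec(x: Vector, w: Matrix) -> Vector:
    """Compute x @ w for a row vector x and a d_in x d_out matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def gated_unit(
    x: Vector,
    w_gate: Matrix,
    w_value: Matrix,
    activation: Callable[[float], float] = sigmoid,
) -> Vector:
    """Gated unit: activation(x @ w_gate) multiplied elementwise by (x @ w_value).

    activation=sigmoid gives the sigmoid-gated form; passing gelu or swish
    gives a non-sigmoid variant of the same structure.
    """
    gate = [activation(g) for g in matvec(x, w_gate)]
    value = matvec(x, w_value)
    return [g * v for g, v in zip(gate, value)]


if __name__ == "__main__":
    x = [1.0, 2.0]
    w_gate = [[0.5, -0.5], [0.25, 0.75]]   # toy weights, for illustration only
    w_value = [[1.0, 0.0], [0.0, 1.0]]     # identity, so the value path is x itself
    for name, act in (("sigmoid", sigmoid), ("gelu", gelu), ("swish", swish)):
        print(name, gated_unit(x, w_gate, w_value, act))
```

The only difference between the variants in this sketch is the function applied on the gate path; the value path and the elementwise product stay the same.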
