pith. machine review for the scientific record. sign in

arxiv: 1712.05382 · v2 · submitted 2017-12-14 · 💻 cs.CL · stat.ML

Recognition: unknown

Monotonic Chunkwise Attention

Authors on Pith no claims yet
classification 💻 cs.CL stat.ML
keywords attentionmonotonicsoftappliedchunkwisedecodingmochamodel
0
0 comments X
read the original abstract

Sequence-to-sequence models with soft attention have been successfully applied to a wide variety of problems, but their decoding process incurs a quadratic time and space cost and is inapplicable to real-time sequence transduction. To address these issues, we propose Monotonic Chunkwise Attention (MoChA), which adaptively splits the input sequence into small chunks over which soft attention is computed. We show that models utilizing MoChA can be trained efficiently with standard backpropagation while allowing online and linear-time decoding at test time. When applied to online speech recognition, we obtain state-of-the-art results and match the performance of a model using an offline soft attention mechanism. In document summarization experiments where we do not expect monotonic alignments, we show significantly improved performance compared to a baseline monotonic attention-based model.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  2. Generating Long Sequences with Sparse Transformers

    cs.LG 2019-04 unverdicted novelty 7.0

    Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.