pith. machine review for the scientific record.

arxiv: 1807.03819 · v3 · submitted 2018-07-10 · 💻 cs.CL · cs.LG · stat.ML

Recognition: 2 theorem links · Lean Theorem

Universal Transformers

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 08:00 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · stat.ML
keywords universal transformer · recurrent sequence model · self-attention · turing completeness · sequence modeling · machine translation · language modeling · dynamic halting

The pith

The Universal Transformer adds recurrence and per-position halting to the standard Transformer, gaining Turing completeness under certain assumptions and better generalization to sequences longer than those seen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Universal Transformer as a parallel-in-time self-attentive recurrent sequence model that generalizes the Transformer by repeating its layers across steps while keeping full parallelism. This design aims to fix the standard Transformer's inability to handle tasks like string copying or logical inference when input lengths exceed those seen in training. The model retains the Transformer's global receptive field and speed advantages but incorporates the inductive bias of RNNs through recurrence, plus a dynamic halting mechanism that lets positions stop computing at different times. If the approach works as described, sequence models could process variable-length data more reliably without the sequential slowdown of traditional RNNs.

Core claim

The Universal Transformer generalizes the Transformer by applying the same self-attention and feed-forward sub-layers recurrently across a variable number of steps in parallel, combined with a dynamic per-position halting mechanism that allows adaptive computation depth per input position. Under certain assumptions this construction is Turing-complete, unlike the fixed-depth standard Transformer, and yields higher accuracy on algorithmic tasks that require generalization beyond training lengths as well as on language understanding benchmarks.
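For concreteness, one step of the recurrence can be written compactly. This is a reconstruction from the abstract and the appendix schema (self-attention, transition function, dropout, layer normalization, position and timestep embeddings); the exact residual and normalization placement is an assumption, and the notation is ours.

```latex
% One Universal Transformer step (notation ours; H^0 is the embedded input,
% and the same parameters are reused at every step t):
\begin{align*}
  A^t &= \mathrm{LayerNorm}\big(H^{t-1} + P^t + \mathrm{SelfAttention}(H^{t-1} + P^t)\big) \\
  H^t &= \mathrm{LayerNorm}\big(A^t + \mathrm{Transition}(A^t)\big)
\end{align*}
% P^t adds position and timestep embeddings; Transition is a feed-forward
% block applied identically at every position and step.
```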

What carries the argument

Recurrent self-attention with dynamic per-position halting, which repeats the core Transformer layers across steps while allowing each position to decide independently when to stop.
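A minimal sketch of that mechanism, following the appendix's description of the TensorFlow halting loop (halting probabilities, remainders, update counts, previous state, and a scalar threshold). `step_fn`, the array shapes, and all names here are illustrative assumptions, not the paper's code.

```python
import numpy as np

def run_with_dynamic_halting(x, step_fn, threshold=0.99, max_steps=8):
    """ACT-style per-position dynamic halting over recurrent steps.

    Sketch after the appendix description: given halting probabilities,
    remainders, update counts, and the previous state (all initialized to
    zeros), plus a scalar threshold, compute the new state per position
    and the new per-position halting quantities.

    x: (seq_len, d_model) per-position states; step_fn(state, t) stands in
    for the shared self-attention + transition block plus a halting unit,
    returning (new_state, p) with p in (0, 1) of shape (seq_len,).
    """
    state = np.array(x, dtype=float)
    seq_len = state.shape[0]
    halting_prob = np.zeros(seq_len)      # accumulated halting score
    remainders = np.zeros(seq_len)        # mass spent at each position's final step
    n_updates = np.zeros(seq_len)         # steps actually run per position
    weighted_state = np.zeros_like(state) # halting-weighted average of states

    for t in range(max_steps):
        still_running = halting_prob < threshold
        if not still_running.any():
            break
        new_state, p = step_fn(state, t)
        p = np.where(still_running, p, 0.0)
        # A position halts when its accumulated score would cross the
        # threshold (or the step budget runs out) and spends its remaining
        # probability mass on this step's state.
        newly_halted = still_running & ((halting_prob + p >= threshold)
                                        | (t == max_steps - 1))
        weights = np.where(newly_halted, 1.0 - halting_prob, p)
        remainders = np.where(newly_halted, weights, remainders)
        halting_prob += p
        n_updates += still_running
        weighted_state += weights[:, None] * new_state
        state = new_state
    return weighted_state, n_updates, remainders
```

In the paper's setup, `step_fn` would reuse the same attention and transition weights at every step, which is what makes the depth adaptive rather than the parameter count.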

Load-bearing premise

Adding recurrence through repeated layer application plus dynamic halting supplies enough inductive bias for reliable generalization to sequence lengths and structures not seen in training.

What would settle it

A direct test showing that the Universal Transformer achieves no better accuracy than a standard Transformer on a string-copying task with test strings longer than any seen during training would falsify the generalization benefit.
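A sketch of that probe under assumed length cut-offs (the vocabulary and the specific lengths below are illustrative, not the paper's):

```python
import random

VOCAB = list("abcdefgh")

def copy_batch(n, min_len, max_len, rng=random.Random(0)):
    """(input, target) pairs for the copying probe; the target is the input."""
    batch = []
    for _ in range(n):
        s = "".join(rng.choices(VOCAB, k=rng.randint(min_len, max_len)))
        batch.append((s, s))
    return batch

# Train only on short strings; test only on strictly longer ones, so any
# above-chance test accuracy must come from length generalization.
train = copy_batch(10_000, min_len=1,  max_len=20)
test  = copy_batch(1_000,  min_len=40, max_len=400)
```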

read the original abstract

Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times. Despite these successes, however, popular feed-forward sequence models like the Transformer fail to generalize in many simple tasks that recurrent models handle with ease, e.g. copying strings or even simple logical inference when the string or formula lengths exceed those observed at training time. We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks. In contrast to the standard Transformer, under certain assumptions, UTs can be shown to be Turing-complete. Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging LAMBADA language modeling task where UTs achieve a new state of the art, and machine translation where UTs achieve a 0.9 BLEU improvement over Transformers on the WMT14 En-De dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Universal Transformer (UT), a self-attentive recurrent sequence model that generalizes the standard Transformer by adding recurrence across time steps and a dynamic per-position halting mechanism. It claims that UTs combine the parallelizability of Transformers with the inductive bias of RNNs, are Turing-complete under certain (unspecified) assumptions, and empirically outperform Transformers on algorithmic tasks, achieve a new state-of-the-art on LAMBADA, and yield a 0.9 BLEU improvement on WMT14 En-De translation.

Significance. If the Turing-completeness result can be made rigorous and the reported gains prove robust to controls for compute and hyperparameters, the work would supply a concrete architecture that addresses the length-extrapolation failures of pure feed-forward sequence models while retaining their training efficiency, with direct implications for algorithmic reasoning and long-context language modeling.

major comments (2)
  1. [Abstract and §3] The central claim that 'under certain assumptions, UTs can be shown to be Turing-complete' (abstract) is load-bearing for the argument that recurrence supplies useful inductive bias for generalization beyond training lengths. The assumptions (e.g., whether halting scores are required to be exactly binary, whether the number of recurrent steps may be unbounded without divergence, and the required numerical precision of the per-position state) are never stated explicitly, so the formal claim cannot be verified from the manuscript.
  2. [§4] Performance claims on algorithmic tasks and LAMBADA are presented without error bars, the number of random seeds, or an ablation isolating the contribution of recurrence from that of the halting mechanism, so it is impossible to assess whether the reported gains are statistically reliable or merely the result of increased capacity.
minor comments (2)
  1. [§3] Notation for the halting probability and the recurrent state update is introduced without a compact equation reference, making it hard to follow the precise difference from a standard Transformer layer; a candidate compact form is sketched after this list.
  2. [Abstract] The abstract states a 'new state of the art' on LAMBADA but does not cite the previous best score or the exact evaluation protocol, which should be supplied for reproducibility.
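On minor comment 1: a candidate compact form of the per-position halting update, in the style of Graves' adaptive computation time, which the paper builds on. The notation is ours and would need to be checked against §3.

```latex
% Per-position halting (notation ours). A halting unit reads the state
% h_i^t of position i at step t; the position halts at the first step N_i
% where the accumulated score would cross 1 - epsilon.
\begin{align*}
  p_i^t &= \sigma\big(w^\top h_i^t + b\big) \\
  N_i   &= \min\Big\{\, n \;:\; \sum_{t=1}^{n} p_i^t \ge 1 - \epsilon \Big\} \\
  r_i   &= 1 - \sum_{t=1}^{N_i - 1} p_i^t \qquad \text{(remainder, spent at step } N_i\text{)} \\
  \tilde{h}_i &= \sum_{t=1}^{N_i - 1} p_i^t\, h_i^t \;+\; r_i\, h_i^{N_i}
\end{align*}
```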

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will incorporate revisions to improve clarity and rigor in the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] The central claim that 'under certain assumptions, UTs can be shown to be Turing-complete' (abstract) is load-bearing for the argument that recurrence supplies useful inductive bias for generalization beyond training lengths. The assumptions (e.g., whether halting scores are required to be exactly binary, whether the number of recurrent steps may be unbounded without divergence, and the required numerical precision of the per-position state) are never stated explicitly, so the formal claim cannot be verified from the manuscript.

    Authors: We agree that the assumptions underlying the Turing-completeness claim must be stated explicitly for the result to be verifiable. In the revised manuscript we will add a dedicated paragraph in §3 that enumerates the precise assumptions: halting scores are required to be exactly binary, the number of recurrent steps is allowed to be unbounded, and the model is assumed to operate with sufficient numerical precision to simulate an arbitrary Turing machine. These clarifications will directly support the claim that recurrence provides a useful inductive bias for length generalization. revision: yes

  2. Referee: [§4] Performance claims on algorithmic tasks and LAMBADA are presented without error bars, the number of random seeds, or an ablation isolating the contribution of recurrence from that of the halting mechanism, so it is impossible to assess whether the reported gains are statistically reliable or merely the result of increased capacity.

    Authors: We acknowledge that the experimental section lacks statistical reporting and targeted ablations. In the revised version we will add error bars computed from at least three independent random seeds for all algorithmic tasks and the LAMBADA results. We will also include an ablation study that compares (i) the full UT, (ii) a recurrent variant with fixed step count, and (iii) a non-recurrent Transformer with equivalent capacity, thereby isolating the contribution of recurrence and the dynamic halting mechanism from raw parameter count. revision: yes
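A configuration grid for the promised ablation might look like the following; every name and field is hypothetical, not taken from the paper or the planned revision.

```python
# Hypothetical ablation grid pinning down the promised controls; every name
# and field is illustrative, not from the paper or its revision.
ABLATIONS = [
    # (i) full model: weights shared across steps, ACT-style dynamic halting
    {"name": "ut_dynamic", "shared_weights": True,  "halting": "dynamic", "max_steps": 8},
    # (ii) recurrence without halting: same shared block, fixed step count
    {"name": "ut_fixed",   "shared_weights": True,  "halting": "fixed",   "steps": 6},
    # (iii) capacity-matched baseline: distinct layers, no recurrence; width
    # or depth chosen so total parameter counts agree within a few percent
    {"name": "baseline",   "shared_weights": False, "halting": "none",    "depth": 6},
]
SEEDS = [0, 1, 2]  # at least three seeds per config; report mean and std per task
```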

Circularity Check

0 steps flagged

No circularity: claims rest on design choices and experiments, not self-referential reductions.

full rationale

The paper introduces the Universal Transformer as a recurrent self-attentive model with dynamic halting, claims Turing-completeness only under unspecified external assumptions, and supports performance gains via experiments on algorithmic and language tasks. No quoted equations or sections reduce a central prediction to a fitted parameter, self-citation chain, or definitional tautology. The recurrent inductive bias is presented as an architectural addition rather than derived from the results it explains. The Turing-completeness statement is qualified and does not serve as a load-bearing premise that loops back to the model's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities; the Turing-completeness claim rests on unspecified assumptions.

pith-pipeline@v0.9.0 · 5595 in / 1033 out tokens · 31018 ms · 2026-05-13T08:00:29.425818+00:00 · methodology

discussion (0)


Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stability and Generalization in Looped Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...

  2. On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

    cs.LG 2026-03 unverdicted novelty 8.0

    Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 ti...

  3. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    cs.LG 2022-01 unverdicted novelty 8.0

    Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

  4. Neural Weight Norm = Kolmogorov Complexity

    cs.LG 2026-05 unverdicted novelty 7.0

    Minimal weight norm of fixed-precision looped neural networks equals Kolmogorov complexity of output string up to log factor, making weight decay match the optimal universal prior up to polynomial factor.

  5. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  6. SMolLM: Small Language Models Learn Small Molecular Grammar

    cs.LG 2026-05 unverdicted novelty 7.0

    A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.

  7. LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction

    cs.IR 2026-04 unverdicted novelty 7.0

    LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.

  8. Depth Adaptive Efficient Visual Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 7.0

    DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

  9. A Mechanistic Analysis of Looped Reasoning Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.

  10. LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition

    cs.CV 2026-03 unverdicted novelty 7.0

    LA-Sign achieves state-of-the-art skeleton-based sign language recognition on WLASL and MSASL by using recurrent looped transformers with adaptive hyperbolic geometry alignment.

  11. Scaling Latent Reasoning via Looped Language Models

    cs.CL 2025-10 unverdicted novelty 7.0

    Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.

  12. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  13. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  14. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    cs.CL 2019-09 accept novelty 7.0

    ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.

  15. Elastic Attention Cores for Scalable Vision Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...

  16. Sparse Layers are Critical to Scaling Looped Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.

  17. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  18. State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...

  19. Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement

    cs.LG 2026-04 unverdicted novelty 6.0

    RIC replaces single-pass label imitation with RL-driven iterative belief refinement, recovering cross-entropy optima while enabling adaptive halting via a value function.

  20. Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

    cs.LG 2026-04 conditional novelty 6.0

    Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

  21. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  22. Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

    cs.LG 2026-04 unverdicted novelty 6.0

    Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.

  23. ELT: Elastic Looped Transformers for Visual Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.

  24. LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling

    cs.CL 2026-03 unverdicted novelty 6.0

    LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.

  25. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  26. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  27. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  28. Dependency Parsing Across the Resource Spectrum: Evaluating Architectures on High and Low-Resource Languages

    cs.CL 2026-05 unverdicted novelty 5.0

    Biaffine LSTM outperforms transformer parsers like AfroXLMR and RemBERT in low-resource dependency parsing, with transformers gaining advantage as data increases and morphological complexity as a secondary predictor.

  29. Hyperloop Transformers

    cs.LG 2026-04 unverdicted novelty 5.0

    Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.

  30. Hierarchical Reasoning Model

    cs.AI 2025-06 unverdicted novelty 5.0

    HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples ...
