A Generalist Agent
Pith reviewed 2026-05-13 06:19 UTC · model grok-4.3
The pith
A single transformer with fixed weights can play Atari, caption images, chat, and control a robot arm by choosing the right output tokens
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
What carries the argument
A decoder-only transformer that tokenizes inputs and outputs from text, vision, and embodiment domains into one shared sequence space, allowing next-token prediction to produce actions or language as needed.
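A toy sketch of this shared-sequence idea (an illustration only, not the paper's implementation; the vocabulary offsets and helper names are hypothetical): tokens from different modalities share one integer vocabulary, and a loss mask restricts training to the positions the model must emit, here the action tokens.

```python
# Hypothetical vocabulary layout (offsets are illustrative, not the paper's):
# text, image-patch, and discretized-action tokens share one integer space.
TEXT_OFFSET, IMAGE_OFFSET, ACTION_OFFSET = 0, 32_000, 33_000

def encode_episode(steps):
    """Interleave per-timestep observation and action tokens into a single
    sequence, plus a mask marking the positions the model is trained to emit."""
    tokens, train_mask = [], []
    for obs_tokens, action_tokens in steps:
        tokens += obs_tokens
        train_mask += [False] * len(obs_tokens)    # observations: context only
        tokens += action_tokens
        train_mask += [True] * len(action_tokens)  # actions: prediction targets
    return tokens, train_mask

# One control timestep: two image-patch tokens, then one action token.
steps = [([IMAGE_OFFSET + 7, IMAGE_OFFSET + 42], [ACTION_OFFSET + 512])]
tokens, mask = encode_episode(steps)
print(tokens)  # → [32007, 32042, 33512]
print(mask)    # → [False, False, True]
```

Because all token types live in one sequence, a single next-token predictor trained with this mask can emit actions in control contexts and text in language contexts.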
Load-bearing premise
Training one transformer on a mixed collection of multi-modal and multi-embodiment data produces competent performance across domains without large negative interference between tasks.
What would settle it
A direct comparison showing that the joint model scores substantially lower on any single task than a model trained only on that task, or that the model frequently selects the wrong output modality for a given context.
read the original abstract
Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Gato, a single transformer network with fixed weights that functions as a multi-modal, multi-task, multi-embodiment generalist policy. The same model processes a unified token sequence and, depending on context, outputs text, joint torques, button presses, or other actions to perform tasks including Atari gameplay, image captioning, dialogue, and real-robot block stacking.
Significance. If the reported capabilities hold under fuller scrutiny, the work provides concrete empirical support for scaling transformer architectures to generalist agents that operate across modalities and embodiments without task-specific heads or weights, which could accelerate progress toward unified, context-adaptive AI systems.
major comments (1)
- [Evaluation] The manuscript documents capabilities through training and testing but supplies only limited quantitative results, few ablations on data-mixture proportions, and few head-to-head baselines against specialist models; this weakens the support for the central claim that a single set of weights achieves competent performance across domains without major interference.
minor comments (2)
- Figure captions and axis labels in the results plots would benefit from explicit units and clearer distinction between modalities to improve readability.
- [Model] The description of the tokenization scheme for continuous control signals (joint torques) could be expanded with a short equation or pseudocode for reproducibility.
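For reference, the binning scheme described in the paper's supplementary material (clip continuous values to [−1, 1], then discretize into uniform-width bins, 1024 of them) can be sketched as follows; this is a minimal illustration, and the function names are hypothetical:

```python
def tokenize_continuous(values, num_bins=1024, low=-1.0, high=1.0):
    """Clip each value into [low, high], then discretize into uniform-width
    bins (1024 bins on [-1, 1], per the supplementary material)."""
    tokens = []
    for v in values:
        v = min(max(v, low), high)        # clip into [low, high]
        frac = (v - low) / (high - low)   # position in [0, 1]
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens

def detokenize(tokens, num_bins=1024, low=-1.0, high=1.0):
    """Map each token back to the center of its bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]

print(tokenize_continuous([-1.0, 0.0, 0.99999]))  # → [0, 512, 1023]
```

Round-tripping through `detokenize` recovers each value to within half a bin width (about 0.001 here), which bounds the quantization error of the action tokens.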
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation for minor revision. We address the single major comment on evaluation below.
read point-by-point responses
-
Referee: [Evaluation] The manuscript documents capabilities through training and testing but supplies only limited quantitative results, few ablations on data-mixture proportions, and few head-to-head baselines against specialist models; this weakens the support for the central claim that a single set of weights achieves competent performance across domains without major interference.
Authors: We acknowledge the evaluation is primarily demonstration-oriented across many tasks and modalities. The manuscript does report quantitative metrics (e.g., Atari scores, captioning BLEU, robotics success rates) and includes direct comparisons to specialist models on several benchmarks, showing that the single set of weights reaches competent performance without task-specific heads. We agree that additional ablations on data-mixture proportions would further substantiate the absence of major interference. In the revised version we will expand the evaluation section with further head-to-head baselines where data permits and include a brief discussion of the mixture ratios used during training. Fully exhaustive ablations remain computationally prohibitive, but the existing results already illustrate that a unified transformer can handle the reported range of embodiments and modalities without catastrophic forgetting or degradation.
Revision: partial
Circularity Check
No significant circularity
full rationale
The paper is an empirical report describing the training and evaluation of a single transformer (Gato) on a mixture of multi-modal and multi-embodiment data. It makes no first-principles derivations, uniqueness theorems, or mathematical predictions that could reduce to fitted inputs by construction. Capabilities are documented via direct training runs and task-specific testing; the central claim (one network handling text, Atari, robotics, etc., via context) follows from the architecture and data mixture without self-referential definitions or load-bearing self-citations. This is a standard empirical result with no circular steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- Transformer size and hyperparameters
- Data mixture proportions
axioms (1)
- Domain assumption: A transformer can process tokenized sequences from multiple modalities and generate appropriate outputs for each.
Forward citations
Cited by 30 Pith papers
-
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
-
TokaMind for Power Grid: Cross-Domain Transfer from Fusion Plasma
TokaMind, pre-trained on MAST tokamak data, transfers to power grid PMU data for severe event classification with F1 0.837, where difficulty depends on grid topology and CSD indicators boost early-warning performance ...
-
Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent
Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent ...
-
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
-
Factorization Regret mediates compositional generalization in latent space
Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
Mastering Diverse Domains through World Models
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
-
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...
-
Geometric Pareto Control: Riemannian Gradient Flow of Energy Function via Lie Group Homotopy
Geometric Pareto Control embeds Pareto solutions in a Lie group submanifold and navigates via Riemannian gradient flow to achieve 100% feasibility and low suboptimality in control tasks without retraining.
-
RELO: Reinforcement Learning to Localize for Visual Object Tracking
RELO replaces handcrafted spatial priors with a reinforcement learning policy for target localization in visual tracking and reports 57.5% AUC on LaSOText without template updates.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
-
Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus
CMAT uses a transformer decoder to produce a high-level consensus vector in latent space, enabling simultaneous order-independent actions by all agents and optimization via single-agent PPO, with superior results on S...
-
ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.
-
ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification
ADAPT is a new pre-training paradigm that aligns physical properties of time-series data to allow simultaneous training on 162 diverse classification datasets, achieving new state-of-the-art performance.
-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
-
LLaVA-Video: Video Instruction Tuning With Synthetic Data
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
-
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.
-
TD-MPC2: Scalable, Robust World Models for Continuous Control
TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
-
PaLM-E: An Embodied Multimodal Language Model
PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...
-
Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning
Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by int...
-
Towards Systematic Generalization for Power Grid Optimization Problems
A shared graph neural network framework jointly solves ACOPF and SCUC problems using physics constraints and shows improved generalization to unseen grid topologies.
-
Neural Computers
Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives f...
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents
The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.
-
The Biggest Risk of Embodied AI is Governance Lag
Governance lag in observing, regulating, and distributing embodied AI is presented as the primary risk, appearing in observational, institutional, and distributive forms.
Reference graph
Works this paper leans on
-
[1]
Maximum a posteriori policy optimisation
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. Preprint arXiv:1806.06920.
-
[2]
Quantifying Attention Flow in Transformers
Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. ACL, 2020. Preprint arXiv:2005.00928.
-
[3]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as i can, not as i say: Grounding language in robotic affordances.Preprint arXiv:2204.01691,
-
[4]
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo ...
-
[5]
Concrete Problems in AI Safety
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety.Preprint arXiv:1606.06565,
-
[6]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.Preprint arXiv:1607.06450,
- [7]
-
[8]
Distributed distributional deterministic policy gradients
Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. Preprint arXiv:1804.08617,
-
[9]
Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind lab. Preprint arXiv:1612.03801,
-
[10]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. Preprint arXiv:2108.07258,
-
[11]
Improving Language Models by Retrieving from Trillions of Tokens
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens.Preprint arXiv:2112.04426,
-
[12]
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym.Preprint arXiv:1606.01540,
-
[13]
Language models are few-shot learners
Published in Transactions on Machine Learning Research (11/2022)
TB Brown, B Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, pp. 1877–1901, 2020.
-
[14]
Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning.Preprint arXiv:1909.12200,
-
[15]
Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from “in-the-wild” human videos. Preprint arXiv:2103.16817, 2021a. Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence model...
-
[16]
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. Preprint arXiv:1504.00325,
-
[17]
BabyAI: A platform to study the sample efficiency of grounded language learning
Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A platform to study the sample efficiency of grounded language learning. Preprint arXiv:1810.08272,
-
[18]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways.Preprint arXiv:2204.02311,
-
[19]
Leveraging procedural generation to benchmark reinforcement learning
Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. InInternational Conference on Machine Learning, pp. 2048–2056,
-
[20]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirec- tional transformers for language understanding.Preprint arXiv:1810.04805,
-
[21]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Un- terthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.Preprint arXiv:2010.11929,
-
[22]
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. Preprint arXiv:2004.07219.
-
[23]
Generalized Decision Transformer for Offline Hindsight Information Matching
Hiroki Furuta, Yutaka Matsuo, and Shixiang Shane Gu. Generalized decision transformer for offline hindsight information matching.Preprint arXiv:2111.10364,
-
[24]
Gaussian Error Linear Units (GELUs)
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs).Preprint arXiv:1606.08415,
-
[25]
Muesli: Combining Improvements in Policy Optimization
Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado van Hasselt. Muesli: Combining improvements in policy optimization.Preprint arXiv:2104.06159,
-
[26]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.Preprint arXiv:2203.15556,
-
[27]
Deep networks with stochastic depth
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. Preprint arXiv:1603.09382,
-
[28]
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents.Preprint arXiv:2201.07207,
-
[29]
David Yu-Tung Hui, Maxime Chevalier-Boisvert, Dzmitry Bahdanau, and Yoshua Bengio. BabyAI 1.1. Preprint arXiv:2007.12770.
-
[30]
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs.Preprint arXiv:2107.14795,
-
[31]
Massively multilingual neural machine translation
Melvin Johnson, Orhan Firat, and Roee Aharoni. Massively multilingual neural machine translation. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3874–3884.
-
[32]
One Model to Learn Them All
Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all.Preprint arXiv:1706.05137,
-
[33]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.Preprint arXiv:2001.08361,
-
[34]
Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents.Preprint arXiv:2103.14659,
-
[35]
CTRL: A Conditional Transformer Language Model for Controllable Generation
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation.Preprint arXiv:1909.05858,
-
[36]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.Preprint arXiv:1412.6980,
-
[37]
Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, and Shimon Whiteson. My body is a cage: the role of morphology in graph-based incompatible control.Preprint arXiv:2010.01856,
-
[38]
Alex X Lee, Coline Manon Devin, Jost Tobias Springenberg, Yuxiang Zhou, Thomas Lampe, Abbas Abdol- maleki, and Konstantinos Bousmalis. How to spend your robot time: Bridging kickstarting and offline reinforcement learning for vision-based robotic manipulation.Preprint arXiv:2205.03353,
-
[39]
Pre-trained Language Models for Interactive Decision-Making
Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, and Yuke Zhu. Pre-trained language models for interactive decision-making.Preprint arXiv:2202.01771, 2022a. Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser,...
-
[40]
Teaching Language Models to Support Answers with Verified Quotes
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. Preprint arXiv:2203.11147.
-
[41]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback.Preprint arXiv:2112.09332,
-
[42]
WaveNet: A Generative Model for Raw Audio
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. Preprint arXiv:1609.03499,
-
[43]
Pedro A Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, et al. Shaking the foundations: delusions in sequence models for interaction and control.Preprint arXiv:2110.10819,
-
[44]
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Preprint arXiv:2203.02155,
-
[45]
The unsurprising effectiveness of pre-trained vision models for control
Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effec- tiveness of pre-trained vision models for control.Preprint arXiv:2203.03580,
-
[46]
Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters
Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert. Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters.Preprint arXiv:2007.03001,
-
[47]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher.Preprint arXiv:2112.11446,
-
[48]
Can Wikipedia Help Offline Reinforcement Learning?
Machel Reid, Yutaro Yamada, and Shixiang Shane Gu. Can Wikipedia help offline reinforcement learning? Preprint arXiv:2201.12122,
-
[49]
Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. Preprint arXiv:1606.04671.
-
[50]
Multitask prompted training enables zero-shot task generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonat...
-
[51]
One Big Net for Everything
Jürgen Schmidhuber. One big net for everything.Preprint arXiv:1802.08864,
-
[52]
GLU Variants Improve Transformer
Noam Shazeer. GLU variants improve transformer. Preprint arXiv:2002.05202.
-
[53]
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting.Journal of Machine Learning Research, 15(56): 1929–1958,
-
[54]
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite.Preprint arXiv:1801.00690,
-
[55]
LaMDA: Language Models for Dialog Applications
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. Preprint arXiv:2201.08239,
-
[56]
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. Preprint arXiv:2108.10904.
-
[57]
Finetuned Language Models Are Zero-Shot Learners
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. Preprint arXiv:2109.01652.
-
[58]
Ethical and social risks of harm from Language Models
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. Preprint arXiv:2112.04359.
-
[59]
Online Decision Transformer
Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer.Preprint arXiv:2202.05607,
-
[60]
Offline learning from demonstrations and unlabeled experience
Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, and Scott Reed. Offline learning from demonstrations and unlabeled experience. Preprint arXiv:2011.13885,
-
[61]
Supplementary Material A: Model Card
We present a model card for Gato in Table 4.
-
[62]
Table 4:Gato Model Card.We follow the framework proposed in (Mitchell et al., 2019). Model details Organization DeepMind Model Date May 2022 Model Type Transformer with ResNet patch embedding for multi-task, multi-modal behavior cloning. Model Version Initial release. Feedback on the Model reedscot@google.com Intended Uses Primary Intended Uses Learn to a...
-
[63]
Finally, they are discretized using bins of uniform width on the domain [−1, 1]
(If the floating-point tensor is in the action set, we do not need to compand the elements in the sequence because actions are only defined in the range [−1, 1] for all our environments.) All the elements are subsequently clipped so that they fall in the set [−1, 1]. Finally, they are discretized using bins of uniform width on the domain [−1, 1]. We use 1024 b...
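The clip-then-bin step described here can be sketched as follows. This is a minimal reading of the text, not the paper's code: the function name is ours, and the mu-law companding applied to observations (mentioned for non-action tensors) is omitted.

```python
import numpy as np

def discretize(values, num_bins=1024):
    """Clip continuous values to [-1, 1], then map each to one of
    num_bins uniform-width bins on that domain (1024 bins in the text)."""
    clipped = np.clip(values, -1.0, 1.0)
    # Shift [-1, 1] to [0, 2], scale to [0, num_bins], and floor.
    bins = np.floor((clipped + 1.0) / 2.0 * num_bins).astype(np.int64)
    # The right edge (exactly 1.0) would land in bin num_bins; fold it back.
    return np.minimum(bins, num_bins - 1)

print(discretize(np.array([-1.0, 0.0, 0.999, 2.0])).tolist())
# [0, 512, 1023, 1023]
```

Note that out-of-range inputs (here 2.0) are clipped before binning, matching the order of operations in the excerpt.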
-
[64]
(instead of LayerNorm (Ba et al., 2016)) normalization, and GELU (Hendrycks & Gimpel,
-
[65]
C Model Architecture. C.1 Transformer Hyperparameters. Table 5: Gato transformer hyperparameters
(instead of RELU) activation functions. C Model Architecture. C.1 Transformer Hyperparameters.
Table 5: Gato transformer hyperparameters.
Hyperparameter: Gato 1.18B / 364M / 79M
Transformer blocks: 24 / 12 / 8
Attention heads: 16 / 12 / 24
Layer width: 2048 / 1536 / 768
Feedforward hidden size: 8192 / 6144 / 3072
Key/value size: 128 / 128 / 32
Shared embedding: True
Layer normalization: Pr...
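One regularity worth noting in Table 5: at every scale the feedforward hidden size is four times the layer width, a common transformer convention. A quick sanity check (the dict layout and field names are our own shorthand):

```python
# Table 5 values; field names are our own shorthand, not from the paper.
CONFIGS = {
    "1.18B": {"blocks": 24, "heads": 16, "width": 2048, "ffw": 8192, "kv": 128},
    "364M":  {"blocks": 12, "heads": 12, "width": 1536, "ffw": 6144, "kv": 128},
    "79M":   {"blocks": 8,  "heads": 24, "width": 768,  "ffw": 3072, "kv": 32},
}
for name, c in CONFIGS.items():
    # Feedforward hidden size is 4x the layer width at every scale.
    assert c["ffw"] == 4 * c["width"], name
```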
-
[66]
with 32 groups instead of LayerNorm (Ba et al., 2016), and GELU (Hendrycks & Gimpel,
-
[67]
C.3 Position Encodings. After tokens are mapped into token embeddings, two position encodings are added to the token embeddings (when applicable) to provide temporal and spatial information to the model. These are described below. Figure 17: Patch position encodings. Calculating patch positi...
-
[68]
The image is of resolution 80 × 64 and each patch is 16 × 16, meaning there are 5 × 4 = 20 patches total
We will follow the process with the patch highlighted in red on the left of the subfigure. The image is of resolution 80 × 64 and each patch is 16 × 16, meaning there are 5 × 4 = 20 patches total. The highlighted patch starts at pixel row interval [16, 32] and pixel column interval [32, 48]. Normalized, the row interval is therefore [0.25, 0.5] and the column in...
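The normalization arithmetic in this worked example (a 16 × 16 patch inside an 80 × 64 image) can be reproduced with a small helper; the function is ours and only covers the normalization step, not the later quantization or sampling of position indices that the excerpt truncates.

```python
def normalized_interval(start_px, patch_size, image_extent):
    """Normalized [start, end] interval of a patch along one image axis."""
    return start_px / image_extent, (start_px + patch_size) / image_extent

# Rows: the patch spans pixel rows [16, 32] of a 64-pixel-high image.
print(normalized_interval(16, 16, 64))   # (0.25, 0.5), as in the text
# Columns: pixel columns [32, 48] of an 80-pixel-wide image.
print(normalized_interval(32, 16, 80))   # (0.4, 0.6)
```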
-
[69]
each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1
during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1.
Table 6: Learning rate schedule hyperparameters for the different model scales.
Hyperparameter: Gato 1.18B / 364M / 79M
Maximum Learning Rate: 1e-4 / 2e-4 / 1e-4
Minimum Learning Rate: 1e-5 / 2e-5 / 1e-5
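The sub-layer skipping described here is a form of stochastic depth. A minimal sketch, assuming each sub-layer sits inside a residual connection (the function name is ours):

```python
import random

def residual_sublayer(sublayer, x, skip_prob=0.1, training=True):
    """Apply a transformer sub-layer with its residual connection,
    skipping the sub-layer entirely with probability skip_prob during
    training (stochastic depth, rate 0.1 in the text)."""
    if training and random.random() < skip_prob:
        return x  # sub-layer skipped: residual stream passes through
    return x + sublayer(x)
```

With skip_prob = 0.1, roughly one sub-layer application in ten is replaced by the identity during pretraining; at evaluation (training=False) every sub-layer runs.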
-
[70]
Evaluation: We evaluate the agent every 100 learning steps
with a rate of 0.1. Evaluation: We evaluate the agent every 100 learning steps. Each evaluation reports the average of 10 runs of a given checkpoint. The moving average of 5 such scores is computed (to gather 50 runs together). The final fine-tuning performance is defined as the maximum of these smoothed scores. Datasets: We generated data for the fine-tuning tas...
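The smoothing-and-max procedure above can be written out directly; the helper name is ours:

```python
def final_finetune_score(per_checkpoint_means, window=5):
    """per_checkpoint_means[i] is the average return of 10 evaluation runs
    of checkpoint i (one evaluation per 100 learning steps). The final
    score is the maximum over moving averages of `window` consecutive
    entries, pooling 5 x 10 = 50 runs per smoothed point."""
    smoothed = [
        sum(per_checkpoint_means[i:i + window]) / window
        for i in range(len(per_checkpoint_means) - window + 1)
    ]
    return max(smoothed)

print(final_finetune_score([1, 2, 3, 4, 5, 4, 3]))  # 3.8
```

Taking the max of smoothed scores rather than the last score makes the reported number robust to late-training degradation, at the cost of a mild upward bias.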
-
[71]
We record approximately 20,000 random episodes generated by the agent during training
agent for 200M total environment steps. We record approximately 20,000 random episodes generated by the agent during training. F.2 Sokoban Sokoban is a planning problem (Racanière et al., 2017), in which the agent has to push boxes to target locations. Some of the moves are irreversible and consequently mistakes can render the puzzle unsolvable. Planning ...
-
[72]
We collect 100,000 episodes for each level
for more details about the bot. We collect 100,000 episodes for each level. (Footnote 3: Basic Math, Breakout, Crossbow, Darkchambers, Entombed, ET, Flag Capture, Human Cannonball, Klax, Laser Gates, Ms. Pac-Man, Solaris, Space War.) F.4 DeepMind Control Suite. The DeepMind Control Suite (Tunyasuvunako...
-
[73]
agent jointly on a set of 18 parent DM Lab levels that generate maps procedurally for each new episode. Data was collected by executing the agent on these 18 levels, as well as an additional set of 237 levels handcrafted to test a diverse set of skills. The 18 parent levels are characterized by high diversity of generated maps. The difference between the l...
-
[74]
Walker2d-v2, Humanoid-v2, and Hopper-v2. Each variant is a morphological modification of the original body: the set of morphologies is generated by enumerating all possible subsets of limbs, and keeping only thos... (Footnote 4: Available at https://neurips.cc/virtual/2021/workshop/21865#wse-detail-22801.)
-
[75]
We also implemented a parallel sampling scheme where all the action tokens are zeroed out in the input sequences during training, so we can sample all tokens corresponding to a robot action in a single model inference step instead of autoregressively, as is done in other domains. We found that the 1.18B parameter model was able to run on the hardware accel...
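A sketch of the input-side trick: because action-token inputs are zeroed during training, the model never conditions on earlier action tokens within a timestep, so at inference every action token can be predicted in a single forward pass. The function name and the embedding layout (one row per sequence position) are our assumptions:

```python
import numpy as np

def zero_action_inputs(token_embeddings, action_positions):
    """Zero the input embeddings at action-token positions, matching the
    training-time masking that permits one-shot (non-autoregressive)
    sampling of a full robot action."""
    out = token_embeddings.copy()
    out[action_positions] = 0.0
    return out

x = np.ones((4, 3))                          # 4 positions, embedding dim 3
mask = np.array([False, True, False, True])  # positions holding action tokens
y = zero_action_inputs(x, mask)
print(y.sum(axis=1).tolist())                # [3.0, 0.0, 3.0, 0.0]
```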
-
[76]
Table 7: Success rates of specialist Meta-World agent. Averaged over 500 evaluations
We evaluated the agent 500 times for each task. Table 7: Success rates of specialist Meta-World agent, averaged over 500 evaluations.
Task name: Success rate
assembly-v2: 0.980
basketball-v2: 0.964
bin-picking-v2: 0.954
box-close-v2: 0.958
button-press-topdown-v2: 0.996
button-press-topdown-wall-v2: 0.998
button-press-v2: 0.996
button-press-wall-v2: 1.000
coffee-button-...