pith. machine review for the scientific record.

arXiv: 2205.06175 · v3 · submitted 2022-05-12 · 💻 cs.AI · cs.CL · cs.LG · cs.RO

A Generalist Agent

Pith reviewed 2026-05-13 06:19 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG · cs.RO
keywords generalist agent · multi-modal transformer · multi-task policy · multi-embodiment · sequence modeling · Atari · robotics

The pith

A single transformer with fixed weights can play Atari, caption images, chat, and control a robot arm by choosing the right output tokens

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gato, a multi-modal, multi-task, multi-embodiment policy built from one transformer. The model is trained on a mixture of data that includes game frames, images, text, and robot sensor readings, then uses the same weights to generate button presses, image descriptions, dialogue, or joint torques depending on the current context. A sympathetic reader cares because the result points toward AI systems that do not need many separate specialized models. The work describes the model and the data and documents the model's current capabilities across the included tasks.

Core claim

The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.

What carries the argument

A decoder-only transformer that tokenizes inputs and outputs from text, vision, and embodiment domains into one shared sequence space, allowing next-token prediction to produce actions or language as needed.
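The idea of a shared sequence space can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary sizes, the separator id, and the helper names (`value_token`, `modality_of`, `flatten_timestep`) are all assumptions made for illustration. The sketch places text ids and discretized observation/action values in disjoint ranges of one vocabulary, flattens a control timestep into a single token list, and builds a loss mask so that next-token prediction is trained only on the action positions.

```python
from typing import List, Tuple

# Assumed layout of the shared vocabulary (illustrative values, not from this review).
TEXT_VOCAB = 32_000               # ids [0, 32000) reserved for text sub-words
VALUE_BINS = 1_024                # ids [32000, 33024) reserved for discretized values
SEP = TEXT_VOCAB + VALUE_BINS     # hypothetical separator id between observation and action


def value_token(bin_index: int) -> int:
    """Map a discretized value (bin index) into the shared vocabulary."""
    if not 0 <= bin_index < VALUE_BINS:
        raise ValueError("bin index out of range")
    return TEXT_VOCAB + bin_index


def modality_of(token: int) -> str:
    """Recover which sub-vocabulary a token id belongs to."""
    if 0 <= token < TEXT_VOCAB:
        return "text"
    if TEXT_VOCAB <= token < SEP:
        return "value"
    if token == SEP:
        return "separator"
    raise ValueError("token id outside the shared vocabulary")


def flatten_timestep(obs_bins: List[int], action_bins: List[int]) -> Tuple[List[int], List[int]]:
    """Flatten one timestep into tokens plus a loss mask (1 = predict this token)."""
    tokens = [value_token(b) for b in obs_bins] + [SEP] + [value_token(b) for b in action_bins]
    loss_mask = [0] * (len(obs_bins) + 1) + [1] * len(action_bins)
    return tokens, loss_mask


if __name__ == "__main__":
    tokens, mask = flatten_timestep(obs_bins=[3, 7], action_bins=[1000])
    print(tokens, mask)
```

Because every modality ends up as integer ids in one range, the same output layer can emit text or actions; which one appears depends only on the preceding context.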

Load-bearing premise

Training one transformer on a mixed collection of multi-modal and multi-embodiment data produces competent performance across domains without large negative interference between tasks.

What would settle it

A direct comparison showing that the joint model scores substantially lower on any single task than a model trained only on that task, or that the model frequently selects the wrong output modality for a given context.
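The comparison described above could be organized as in the following sketch. The function names, the 0.5 threshold, and the scores in the usage example are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
from typing import Dict, List


def relative_scores(joint: Dict[str, float], specialist: Dict[str, float]) -> Dict[str, float]:
    """Ratio of the joint model's score to the specialist's score, per shared task."""
    return {
        task: joint[task] / specialist[task]
        for task in joint
        if task in specialist and specialist[task] != 0
    }


def flag_interference(joint: Dict[str, float], specialist: Dict[str, float],
                      threshold: float = 0.5) -> List[str]:
    """Tasks where the joint model falls below `threshold` of the specialist (assumed cutoff)."""
    ratios = relative_scores(joint, specialist)
    return sorted(task for task, ratio in ratios.items() if ratio < threshold)


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    joint = {"task_a": 80.0, "task_b": 30.0}
    specialist = {"task_a": 100.0, "task_b": 100.0}
    print(flag_interference(joint, specialist))
```

A non-empty result would point at tasks where joint training may cost performance, which is the kind of evidence this criterion asks for.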

Original abstract

Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Gato, a single transformer network with fixed weights that functions as a multi-modal, multi-task, multi-embodiment generalist policy. The same model processes a unified token sequence and, depending on context, outputs text, joint torques, button presses, or other actions to perform tasks including Atari gameplay, image captioning, dialogue, and real-robot block stacking.

Significance. If the reported capabilities hold under fuller scrutiny, the work provides concrete empirical support for scaling transformer architectures to generalist agents that operate across modalities and embodiments without task-specific heads or weights, which could accelerate progress toward unified, context-adaptive AI systems.

major comments (1)
  1. [Evaluation] Evaluation sections: the manuscript documents capabilities through training and testing but supplies only limited quantitative results, ablations on data-mixture proportions, and head-to-head baselines against specialist models; this weakens the support for the central claim that a single set of weights achieves competent performance across domains without major interference.
minor comments (2)
  1. Figure captions and axis labels in the results plots would benefit from explicit units and clearer distinction between modalities to improve readability.
  2. [Model] The description of the tokenization scheme for continuous control signals (joint torques) could be expanded with a short equation or pseudocode for reproducibility.
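As an illustration of what such pseudocode might look like, here is a minimal sketch of one common way to tokenize continuous control signals: mu-law companding followed by uniform binning. The constants (`MU`, `M`, `NUM_BINS`, `OFFSET`) and function names are assumptions for this sketch and should be checked against the paper before being relied on.

```python
import math

# Assumed constants for illustration; verify against the paper before reuse.
MU = 100.0          # mu-law compression strength
M = 256.0           # input magnitude that maps to +/-1 after companding
NUM_BINS = 1024     # number of uniform bins over [-1, 1]
OFFSET = 32_000     # shift so value tokens do not collide with text ids


def mu_law(x: float) -> float:
    """Compand x: sign(x) * log(|x| * MU + 1) / log(M * MU + 1)."""
    return math.copysign(math.log(abs(x) * MU + 1.0) / math.log(M * MU + 1.0), x)


def tokenize_continuous(x: float) -> int:
    """Compand, clip to [-1, 1], bin uniformly, and shift into the token range."""
    y = max(-1.0, min(1.0, mu_law(x)))
    bin_index = min(NUM_BINS - 1, int((y + 1.0) / 2.0 * NUM_BINS))
    return OFFSET + bin_index


def detokenize(token: int) -> float:
    """Approximate inverse: take the bin centre and undo the companding."""
    y = (token - OFFSET + 0.5) / NUM_BINS * 2.0 - 1.0
    magnitude = (math.exp(abs(y) * math.log(M * MU + 1.0)) - 1.0) / MU
    return math.copysign(magnitude, y)


if __name__ == "__main__":
    torque = 0.5
    token = tokenize_continuous(torque)
    print(token, detokenize(token))
```

Companding before binning spends more bins on small magnitudes, so fine adjustments near zero are reconstructed more precisely than large values.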

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address the single major comment on evaluation below.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation sections: the manuscript documents capabilities through training and testing but supplies only limited quantitative results, ablations on data-mixture proportions, and head-to-head baselines against specialist models; this weakens the support for the central claim that a single set of weights achieves competent performance across domains without major interference.

    Authors: We acknowledge the evaluation is primarily demonstration-oriented across many tasks and modalities. The manuscript does report quantitative metrics (e.g., Atari scores, captioning BLEU, robotics success rates) and includes direct comparisons to specialist models on several benchmarks, showing that the single set of weights reaches competent performance without task-specific heads. We agree that additional ablations on data-mixture proportions would further substantiate the absence of major interference. In the revised version we will expand the evaluation section with further head-to-head baselines where data permits and include a brief discussion of the mixture ratios used during training. Exhaustive ablations remain computationally prohibitive, but the existing results already illustrate that a unified transformer can handle the reported range of embodiments and modalities without catastrophic forgetting or degradation.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical report describing the training and evaluation of a single transformer (Gato) on a mixture of multi-modal and multi-embodiment data. It makes no first-principles derivations, uniqueness theorems, or mathematical predictions that could reduce to fitted inputs by construction. Capabilities are documented via direct training runs and task-specific testing; the central claim (one network handling text, Atari, robotics, etc., via context) follows from the architecture and data mixture without self-referential definitions or load-bearing self-citations. This is a standard empirical result with no circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work relies on standard transformer assumptions and empirical scaling from language models, with hyperparameters for architecture and data mixture chosen during development.

free parameters (2)
  • Transformer size and hyperparameters
    Number of layers, heads, and embedding dimensions selected to balance performance and compute.
  • Data mixture proportions
    Relative amounts of text, vision, game, and robotics data chosen to enable multi-task training.
axioms (1)
  • Domain assumption: A transformer can process tokenized sequences from multiple modalities and generate appropriate outputs for each.
    Assumed from prior success of transformers in language and vision tasks.
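To make the "data mixture proportions" free parameter concrete, the sketch below shows one simple way a training loop might pick which domain each batch element comes from. The domain names, weights, and function names are hypothetical; the review does not state the actual proportions used.

```python
import random
from collections import Counter
from typing import Dict, List


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative mixture weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("mixture weights must sum to a positive value")
    return {name: w / total for name, w in weights.items()}


def sample_batch_domains(weights: Dict[str, float], batch_size: int,
                         rng: random.Random) -> List[str]:
    """Draw the source domain for each batch element according to the mixture."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=batch_size)


if __name__ == "__main__":
    # Hypothetical proportions for illustration only.
    mixture = normalize({"text": 1.0, "vision": 1.0, "games": 4.0, "robotics": 2.0})
    batch = sample_batch_domains(mixture, batch_size=512, rng=random.Random(0))
    print(Counter(batch))
```

Changing these weights changes how often each modality is seen during training, which is why the ledger counts them as a free parameter.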

pith-pipeline@v0.9.0 · 5474 in / 1241 out tokens · 43527 ms · 2026-05-13T06:19:57.774067+00:00 · methodology

Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  2. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  3. TokaMind for Power Grid: Cross-Domain Transfer from Fusion Plasma

    physics.plasm-ph 2026-05 unverdicted novelty 7.0

    TokaMind, pre-trained on MAST tokamak data, transfers to power grid PMU data for severe event classification with F1 0.837, where difficulty depends on grid topology and CSD indicators boost early-warning performance ...

  4. Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

    cs.LG 2026-05 conditional novelty 7.0

    Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent ...

  5. KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

    cs.RO 2026-04 unverdicted novelty 7.0

    KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.

  6. Factorization Regret mediates compositional generalization in latent space

    cs.LG 2026-03 unverdicted novelty 7.0

    Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.

  7. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  8. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  9. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  10. StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

    cs.RO 2026-05 unverdicted novelty 6.0

    StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

  11. Geometric Pareto Control: Riemannian Gradient Flow of Energy Function via Lie Group Homotopy

    eess.SY 2026-05 unverdicted novelty 6.0

    Geometric Pareto Control embeds Pareto solutions in a Lie group submanifold and navigates via Riemannian gradient flow to achieve 100% feasibility and low suboptimality in control tasks without retraining.

  12. RELO: Reinforcement Learning to Localize for Visual Object Tracking

    cs.CV 2026-05 unverdicted novelty 6.0

    RELO replaces handcrafted spatial priors with a reinforcement learning policy for target localization in visual tracking and reports 57.5% AUC on LaSOText without template updates.

  13. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  14. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  15. $M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

    cs.RO 2026-04 unverdicted novelty 6.0

    M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.

  16. Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus

    cs.LG 2026-04 conditional novelty 6.0

    CMAT uses a transformer decoder to produce a high-level consensus vector in latent space, enabling simultaneous order-independent actions by all agents and optimization via single-agent PPO, with superior results on S...

  17. ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.

  18. ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification

    cs.LG 2026-04 unverdicted novelty 6.0

    ADAPT is a new pre-training paradigm that aligns physical properties of time-series data to allow simultaneous training on 162 diverse classification datasets, achieving new state-of-the-art performance.

  19. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    cs.RO 2024-10 unverdicted novelty 6.0

    GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

  20. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  21. Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    cs.RO 2023-12 conditional novelty 6.0

    A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.

  22. TD-MPC2: Scalable, Robust World Models for Continuous Control

    cs.LG 2023-10 conditional novelty 6.0

    TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.

  23. PaLM-E: An Embodied Multimodal Language Model

    cs.LG 2023-03 conditional novelty 6.0

    PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...

  24. Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 5.0

    Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by int...

  25. Towards Systematic Generalization for Power Grid Optimization Problems

    cs.LG 2026-05 unverdicted novelty 5.0

    A shared graph neural network framework jointly solves ACOPF and SCUC problems using physics constraints and shows improved generalization to unseen grid topologies.

  26. Neural Computers

    cs.LG 2026-04 unverdicted novelty 5.0

    Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives f...

  27. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  28. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  29. Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents

    cs.MA 2026-05 unverdicted novelty 4.0

    The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.

  30. The Biggest Risk of Embodied AI is Governance Lag

    cs.CY 2026-04 unverdicted novelty 3.0

    Governance lag in observing, regulating, and distributing embodied AI is presented as the primary risk, appearing in observational, institutional, and distributive forms.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 29 Pith papers · 25 internal anchors

  1. [1]

    Maximum a posteriori policy optimisation

    Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Ried- miller. Maximum a posteriori policy optimisation.Preprint arXiv:1806.06920,

  2. [2]

    ACL, 2020.ht tps://arxiv.org/abs/2005.00928

    Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers.Preprint arXiv:2005.00928,

  3. [3]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as i can, not as i say: Grounding language in robotic affordances.Preprint arXiv:2204.01691,

  4. [4]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, ZhitaoGong, SinaSamangooei, MarianneMonteiro, JacobMenick, SebastianBorgeaud, AndyBrock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo ...

  5. [5]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety.Preprint arXiv:1606.06565,

  6. [6]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.Preprint arXiv:1607.06450,

  7. [7]

    Baker, I

    Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Preprint arXiv::2206.11795,

  8. [8]

    Distributed Distributional Deterministic Policy Gradients

    Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. Preprint arXiv:1804.08617,

  9. [9]

    DeepMind Lab

    Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind lab. Preprint arXiv:1612.03801,

  10. [10]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. Preprint arXiv:2108.07258,

  11. [11]

    Improving language models by retrieving from trillions of tokens.Preprint arXiv:2112.04426,

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens.Preprint arXiv:2112.04426,

  12. [12]

    OpenAI Gym

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym.Preprint arXiv:1606.01540,

  13. [13]

    Language models are few-shot learners

    21 Published in Transactions on Machine Learning Research (11/2022) TB Brown, B Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems, pp. 1877–1901,

  14. [14]

    Scaling data-driven robotics with reward sketching and batch reinforcement learning.Preprint arXiv:1909.12200,

    Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning.Preprint arXiv:1909.12200,

  15. [15]

    in-the-wild

    Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from “in-the- wild" human videos.Preprint arXiv:2103.16817, 2021a. Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Ar- avind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence model...

  16. [16]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. Preprint arXiv:1504.00325,

  17. [17]

    BabyAI: A platform to study the sample efficiency of grounded language learning

    Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A platform to study the sample efficiency of grounded language learning. Preprint arXiv:1810.08272,

  18. [18]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways.Preprint arXiv:2204.02311,

  19. [19]

    Leveraging procedural generation to benchmark reinforcement learning

    Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. InInternational Conference on Machine Learning, pp. 2048–2056,

  20. [20]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirec- tional transformers for language understanding.Preprint arXiv:1810.04805,

  21. [21]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Un- terthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.Preprint arXiv:2010.11929,

  22. [22]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    22 Published in Transactions on Machine Learning Research (11/2022) Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data- driven reinforcement learning.Preprint arXiv:2004.07219,

  23. [23]

    arXiv preprint arXiv:2111.10364 , year=

    Hiroki Furuta, Yutaka Matsuo, and Shixiang Shane Gu. Generalized decision transformer for offline hindsight information matching.Preprint arXiv:2111.10364,

  24. [24]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs).Preprint arXiv:1606.08415,

  25. [25]

    Muesli: Combining improvements in policy optimization.Preprint arXiv:2104.06159,

    Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado van Hasselt. Muesli: Combining improvements in policy optimization.Preprint arXiv:2104.06159,

  26. [26]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.Preprint arXiv:2203.15556,

  27. [27]

    Deep networks with stochastic depth

    Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. Preprint arXiv:1603.09382,

  28. [28]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents.Preprint arXiv:2201.07207,

  29. [29]

    Babyai 1.1

    David Yu-Tung Hui, Maxime Chevalier-Boisvert, Dzmitry Bahdanau, and Yoshua Bengio. Babyai 1.1. Preprint arXiv:2007.12770,

  30. [30]

    Bowen Jing, Bonnie Berger, and Tommi Jaakkola

    Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs.Preprint arXiv:2107.14795,

  31. [31]

    Massively multilingual neural machine translation

    23 Published in Transactions on Machine Learning Research (11/2022) Melvin Johnson, Orhan Firat, and Roee Aharoni. Massively multilingual neural machine translation. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3874–3884,

  32. [32]

    One model to learn them all.Preprint arXiv:1706.05137,

    Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all.Preprint arXiv:1706.05137,

  33. [33]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.Preprint arXiv:2001.08361,

  34. [34]

    Alignment of language agents

    Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents.Preprint arXiv:2103.14659,

  35. [35]

    Varshney, Caiming Xiong, and Richard Socher

    Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation.Preprint arXiv:1909.05858,

  36. [36]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.Preprint arXiv:1412.6980,

  37. [37]

    My body is a cage: the role of morphology in graph-based incompatible control.Preprint arXiv:2010.01856,

    Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, and Shimon Whiteson. My body is a cage: the role of morphology in graph-based incompatible control.Preprint arXiv:2010.01856,

  38. [38]

    How to spend your robot time: Bridging kickstarting and offline reinforcement learning for vision-based robotic manipulation.Preprint arXiv:2205.03353,

    Alex X Lee, Coline Manon Devin, Jost Tobias Springenberg, Yuxiang Zhou, Thomas Lampe, Abbas Abdol- maleki, and Konstantinos Bousmalis. How to spend your robot time: Bridging kickstarting and offline reinforcement learning for vision-based robotic manipulation.Preprint arXiv:2205.03353,

  39. [39]

    Pre-trained language models for interactive decision-making.Preprint arXiv:2202.01771, 2022a

    Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, and Yuke Zhu. Pre-trained language models for interactive decision-making.Preprint arXiv:2202.01771, 2022a. Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser,...

  40. [40]

    Teaching Language Models to Support Answers with Verified Quotes.CoRR, abs/2203.11147,

    24 Published in Transactions on Machine Learning Research (11/2022) JacobMenick, MajaTrebacz, VladimirMikulik, JohnAslanides, FrancisSong, MartinChadwick, MiaGlaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes.Preprint arXiv:2203.11147,

  41. [41]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback.Preprint arXiv:2112.09332,

  42. [42]

    WaveNet: A Generative Model for Raw Audio

    Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. Preprint arXiv:1609.03499,

  43. [43]

    Shaking the foundations: delusions in sequence models for interaction and control.Preprint arXiv:2110.10819,

    Pedro A Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, et al. Shaking the foundations: delusions in sequence models for interaction and control.Preprint arXiv:2110.10819,

  44. [44]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Preprint arXiv:2203.02155,

  45. [45]

    The unsurprising effectiveness of pre-trained vision models for control

    Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effec- tiveness of pre-trained vision models for control.Preprint arXiv:2203.03580,

  46. [46]

    Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters.Preprint arXiv:2007.03001,

    Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert. Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters.Preprint arXiv:2007.03001,

  47. [47]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training Gopher. Preprint arXiv:2112.11446,

  48. [48]

    Can Wikipedia help offline reinforcement learning?

    Machel Reid, Yutaro Yamada, and Shixiang Shane Gu. Can Wikipedia help offline reinforcement learning? Preprint arXiv:2201.12122,

  49. [49]

    Progressive Neural Networks

    Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. Preprint arXiv:1606.04671,

  50. [50]

    Multitask prompted training enables zero-shot task generalization

    25 Published in Transactions on Machine Learning Research (11/2022) Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonat...

  51. [51]

    One big net for everything

    Jürgen Schmidhuber. One big net for everything. Preprint arXiv:1802.08864,

  52. [52]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer. Preprint arXiv:2002.05202,

  53. [53]

    Dropout: A simple way to prevent neural networks from overfitting

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56): 1929–1958,

  54. [54]

    DeepMind Control Suite

    Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. Preprint arXiv:1801.00690,

  55. [55]

    LaMDA: Language Models for Dialog Applications

    Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. Preprint arXiv:2201.08239,

  56. [56]

    SimVLM: Simple visual language model pretraining with weak supervision

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. Preprint arXiv:2108.10904,

  57. [57]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. Preprint arXiv:2109.01652,

  58. [58]

    Ethical and social risks of harm from Language Models

    Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. Preprint arXiv:2112.04359,

  59. [59]

    Online decision transformer

    Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. Preprint arXiv:2202.05607,

  60. [60]

    Offline learning from demonstrations and unlabeled experience

    Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, and Scott Reed. Offline learning from demonstrations and unlabeled experience. Preprint arXiv:2011.13885,

  61. [61]

    Supplementary Material. A Model card. We present a model card for Gato in Table

  62. [62]

    Model details. Organization: DeepMind. Model Date: May 2022. Model Type: Transformer with ResNet patch embedding for multi-task, multi-modal behavior cloning.

    Table 4: Gato Model Card. We follow the framework proposed in (Mitchell et al., 2019). Model details. Organization: DeepMind. Model Date: May 2022. Model Type: Transformer with ResNet patch embedding for multi-task, multi-modal behavior cloning. Model Version: Initial release. Feedback on the Model: reedscot@google.com. Intended Uses. Primary Intended Uses: Learn to a...

  63. [63]

    Finally, they are discretized using bins of uniform width on the domain [−1, 1]

    (If the floating-point tensor is in the action set, we do not need to compand the elements in the sequence because actions are only defined in the range [−1, 1] for all our environments.) All the elements are subsequently clipped so that they fall in the set [−1, 1]. Finally, they are discretized using bins of uniform width on the domain [−1, 1]. We use 1024 b...
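    The clip-then-bin step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `discretize`, the default of 1024 bins (read from the truncated sentence above), and the choice to fold the right edge 1.0 into the last bin are all assumptions, and any companding or token-offset step is deliberately left out.

```python
def discretize(values, n_bins=1024):
    """Clip each value to [-1, 1] and map it to a uniform-width bin index.

    Assumptions (not confirmed by the source): n_bins defaults to 1024 and
    the right edge 1.0 is assigned to the last bin, n_bins - 1.
    """
    width = 2.0 / n_bins  # uniform bin width on the domain [-1, 1]
    tokens = []
    for v in values:
        clipped = max(-1.0, min(1.0, v))     # clip into [-1, 1]
        idx = int((clipped + 1.0) / width)   # shift to [0, 2], then bin
        tokens.append(min(idx, n_bins - 1))  # keep 1.0 inside the last bin
    return tokens


example_tokens = discretize([-1.0, 0.0, 1.0, 2.5])
```

    Values outside [−1, 1], such as 2.5 in the example, land in the last bin because clipping happens before binning.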

  64. [64]

    (instead of LayerNorm (Ba et al., 2016)) normalization, and GELU (Hendrycks & Gimpel,

  65. [65]

    C Model Architecture, C.1 Transformer Hyperparameters. Table 5: Gato transformer hyperparameters

    (instead of ReLU) activation functions.

    C Model Architecture

    C.1 Transformer Hyperparameters

    Table 5: Gato transformer hyperparameters.

    Hyperparameter             Gato 1.18B   364M   79M
    Transformer blocks         24           12     8
    Attention heads            16           12     24
    Layer width                2048         1536   768
    Feedforward hidden size    8192         6144   3072
    Key/value size             128          128    32
    Shared embedding           True
    Layer normalization        Pr...

  66. [66]

    with 32 groups instead of LayerNorm (Ba et al., 2016), and GELU (Hendrycks & Gimpel,

  67. [67]

    These are described below

    C.3 Position Encodings

    After tokens are mapped into token embeddings, two position encodings are added to the token embeddings (when applicable) to provide temporal and spatial information to the model. These are described below.

    Figure 17: Patch position encodings. Calculating patch positi...

  68. [68]

    The image is of resolution 80 × 64 and each patch is 16 × 16, meaning there are 5 × 4 = 20 patches total

    We will follow the process with the patch highlighted in red on the left of the subfigure. The image is of resolution 80 × 64 and each patch is 16 × 16, meaning there are 5 × 4 = 20 patches total. The highlighted patch starts at pixel row interval [16, 32] and pixel column interval [32, 64]. Normalized, the row interval is therefore [0.25, 0.5] and the column in...
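    The patch count and the row normalization in this worked example can be checked with a short sketch. The function names are hypothetical, and treating 64 as the image height is an assumption inferred from the row interval [16, 32] normalizing to [0.25, 0.5]; the column interval is not recomputed here because the source sentence is truncated.

```python
def patch_grid(height, width, patch):
    """Number of patch rows and columns for a height x width image."""
    return height // patch, width // patch


def normalize_interval(start, end, extent):
    """Scale a pixel interval [start, end] to the unit range [0, 1]."""
    return start / extent, end / extent


# Assumed layout: 80 pixels wide, 64 pixels high, 16 x 16 patches.
rows, cols = patch_grid(height=64, width=80, patch=16)
total_patches = rows * cols
row_interval = normalize_interval(16, 32, extent=64)
```

    Under these assumptions the grid has 4 rows and 5 columns, which matches the 20 patches stated in the text.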

  69. [69]

    each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1

    during pretraining, where each of the transformer sub-layers (i.e., each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1.

    Table 6: Learning rate schedule hyperparameters for the different model scales.

    Hyperparameter           Gato 1.18B   364M   79M
    Maximum Learning Rate    1e-4         2e-4   1e-4
    Minimum Learning Rate    1e-5         2e-5   1e-5

    ...
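    The sub-layer skipping described above can be illustrated with a toy residual stack. This is a sketch under stated assumptions, not the paper's code: `apply_sublayers` is a hypothetical helper, sub-layers are modelled as plain functions on a float, each is assumed to sit on a residual branch, and no rescaling is applied at evaluation time because the source does not say whether one is used.

```python
import random


def apply_sublayers(x, sublayers, skip_prob=0.1, training=True, rng=None):
    """Apply residual sub-layers, skipping each with probability skip_prob.

    Assumption: a skipped sub-layer contributes nothing, so only the
    residual (identity) path carries x forward for that sub-layer.
    """
    rng = rng or random.Random()
    for layer in sublayers:
        if training and rng.random() < skip_prob:
            continue  # sub-layer skipped; x passes through unchanged
        x = x + layer(x)  # residual connection around the sub-layer
    return x


def add_one(_):
    """Toy sub-layer that always contributes 1.0."""
    return 1.0


out_eval = apply_sublayers(0.0, [add_one] * 4, training=False)
out_train = apply_sublayers(0.0, [add_one] * 4, rng=random.Random(0))
```

    With `training=False` every sub-layer runs, so the toy output equals the number of sub-layers; during training the output is at most that value.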

  70. [70]

    Evaluation: We evaluate the agent every 100 learning steps

    with a rate of 0.1. Evaluation: We evaluate the agent every 100 learning steps. Each evaluation reports the average of 10 runs of a given checkpoint. The moving average of 5 such scores is computed (to gather 50 runs together). The final fine-tuning performance is defined as the maximum of these smoothed scores. Datasets: We generated data for the fine-tuning tas...
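    The smoothing rule in the evaluation protocol above can be written out as a small sketch. The function name is hypothetical, each input score is taken to be the already-averaged result of 10 runs, and the handling of the first evaluations (no smoothed value until 5 scores exist) is an assumption, since the source does not specify it.

```python
def final_finetuning_score(eval_scores, window=5):
    """Maximum of the moving average over `window` consecutive evaluations.

    eval_scores: per-checkpoint scores, each assumed to be the mean of 10 runs.
    Assumption: windows are only formed once `window` scores are available.
    """
    if len(eval_scores) < window:
        raise ValueError("need at least `window` evaluation scores")
    smoothed = [
        sum(eval_scores[i:i + window]) / window
        for i in range(len(eval_scores) - window + 1)
    ]
    return max(smoothed)


best = final_finetuning_score([0, 10, 20, 30, 40, 50, 0])
```

    In the example the three windows average 20, 30, and 28, so the reported score is 30 even though a single evaluation reached 50.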

  71. [71]

    We record approximately 20,000 random episodes generated by the agent during training

    agent for 200M total environment steps. We record approximately 20,000 random episodes generated by the agent during training.

    F.2 Sokoban

    Sokoban is a planning problem (Racanière et al., 2017), in which the agent has to push boxes to target locations. Some of the moves are irreversible and consequently mistakes can render the puzzle unsolvable. Planning ...

  72. [72]

    We collect 100,000 episodes for each level

    for more details about the bot. We collect 100,000 episodes for each level.

    Footnote 3: Basic Math, Breakout, Crossbow, Darkchambers, Entombed, ET, Flag Capture, Human Cannonball, Klax, Laser Gates, Ms. Pac-Man, Solaris, Space War.

    F.4 DeepMind Control Suite

    The DeepMind Control Suite (Tunyasuvunako...

  73. [73]

    Data was collected by executing the agent on these 18 levels, as well as an additional set of 237 levels handcrafted to test a diverse set of skills

    agent jointly on a set of 18 parent DM Lab levels that generate maps procedurally for each new episode. Data was collected by executing the agent on these 18 levels, as well as an additional set of 237 levels handcrafted to test a diverse set of skills. The 18 parent levels are characterized by high diversity of generated maps. The difference between the l...

  74. [74]

    Each variant is a morphological modification of the original body

    Walker2d-v2, Humanoid-v2, and Hopper-v2. Each variant is a morphological modification of the original body: the set of morphologies is generated by enumerating all possible subsets of limbs, and keeping only thos...

    Footnote 4: Available at https://neurips.cc/virtual/2021/workshop/21865#wse-detail-22801.

  75. [75]

    We also implemented a parallel sampling scheme where all the action tokens are zeroed out in the input sequences during training, so we can sample all tokens corresponding to a robot action in a single model inference step instead of autoregressively, as is done in other domains. We found that the 1.18B parameter model was able to run on the hardware accel...
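    The input masking behind this parallel sampling scheme can be sketched as below. This is only an illustration: `zero_action_tokens` is a hypothetical helper, and zeroing integer token ids (rather than, say, their embeddings) is an assumption, since the source does not say at which level the zeroing is applied.

```python
def zero_action_tokens(tokens, is_action):
    """Return a copy of `tokens` with every action position set to 0.

    tokens:    list of integer token ids for one sequence.
    is_action: list of booleans, True where the token encodes an action.
    Assumption: zeroing is applied to token ids in the model input only.
    """
    if len(tokens) != len(is_action):
        raise ValueError("tokens and is_action must have the same length")
    return [0 if action else tok for tok, action in zip(tokens, is_action)]


masked = zero_action_tokens([5, 7, 9, 11], [False, False, True, True])
```

    Because no action position in the input depends on a previously sampled action token, all action tokens for a timestep can be predicted from one forward pass.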

  76. [76]

    Table 7: Success rates of specialist Meta-World agent. Averaged over 500 evaluations

    We evaluated the agent 500 times for each task.

    Table 7: Success rates of specialist Meta-World agent. Averaged over 500 evaluations.

    Task name                        Success rate
    assembly-v2                      0.980
    basketball-v2                    0.964
    bin-picking-v2                   0.954
    box-close-v2                     0.958
    button-press-topdown-v2          0.996
    button-press-topdown-wall-v2     0.998
    button-press-v2                  0.996
    button-press-wall-v2             1.000
    coffee-button-...