pith. machine review for the scientific record.

arxiv: 2605.08268 · v1 · submitted 2026-05-08 · 💻 cs.MA · cs.AI

Recognition: 2 theorem links · Lean Theorem

Insider Attacks in Multi-Agent LLM Consensus Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:45 UTC · model grok-4.3

classification 💻 cs.MA cs.AI
keywords insider attacks · multi-agent systems · large language models · consensus formation · reinforcement learning · world models · adversarial manipulation

The pith

A malicious insider in a multi-agent LLM system can learn surrogate dynamics over benign agents' latent states and use reinforcement learning to delay consensus more effectively than static malicious prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how one malicious agent participating in a group of language-model agents can manipulate their iterative natural-language exchanges to prevent or delay reaching a shared decision. It formalizes the attack as a sequential decision-making task and proposes learning a compact surrogate model of the benign agents' hidden behavioral patterns, then optimizing the attacker's messages against this model with reinforcement learning. If the approach holds, adaptive model-based attacks pose a greater threat to collaborative LLM systems than simple prompt injections, because they can exploit observed patterns in how the benign agents respond and update. Preliminary experiments show the learned attacker lowers the benign consensus rate and extends disagreement periods beyond what a direct malicious-prompt baseline achieves.

Core claim

A malicious insider learns surrogate dynamics over the latent behavioral states of benign agents and trains an attacker policy via reinforcement learning; this policy reduces the benign consensus rate and prolongs disagreement more effectively than direct malicious prompting.

What carries the argument

The world-model-based attack framework that learns surrogate dynamics over latent behavioral states of benign agents to enable reinforcement learning optimization of the attacker's message choices.
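The abstract gives no implementation detail, so the pipeline can only be illustrated, not reproduced. As a rough sketch of the described shape (encode benign messages into latent states, fit surrogate dynamics from logged transitions, then optimize the attacker's message choice against the learned model), where every name, dimension, and the linear model itself are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: an 8-dim latent behavioral state, 4 attacker
# message templates. Nothing here comes from the paper itself.
D_LATENT, N_ACTIONS = 8, 4

class SurrogateDynamics:
    """Linear stand-in for the learned world model: z' ~ W @ [z; a]."""
    def __init__(self):
        self.W = np.zeros((D_LATENT, D_LATENT + N_ACTIONS))

    def fit(self, Z, A_onehot, Z_next):
        # Least-squares fit on logged (state, attacker action, next state) triples.
        X = np.hstack([Z, A_onehot])
        self.W = np.linalg.lstsq(X, Z_next, rcond=None)[0].T

    def step(self, z, a_onehot):
        return self.W @ np.concatenate([z, a_onehot])

def disagreement(z):
    # Stand-in attacker reward: distance from a nominal consensus point z = 0.
    return float(np.linalg.norm(z))

def plan_attack(model, z, horizon=3):
    """Greedy model-based planning: pick the message template whose
    imagined rollout keeps predicted disagreement highest."""
    scores = []
    for a in range(N_ACTIONS):
        onehot = np.eye(N_ACTIONS)[a]
        zz, total = z.copy(), 0.0
        for _ in range(horizon):
            zz = model.step(zz, onehot)
            total += disagreement(zz)
        scores.append(total)
    return int(np.argmax(scores))
```

The paper trains a full RL policy rather than this greedy planner; the sketch only shows why a learned dynamics model turns the attacker's message choice into a tractable optimization problem.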

If this is right

  • The trained attacker reduces the benign consensus rate more effectively than the direct malicious-prompt baseline.
  • It prolongs disagreement among agents more than the baseline does.
  • Combining latent world models with reinforcement learning offers a promising direction for adaptive insider attacks in language-based multi-agent systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the surrogate-model approach proves robust, comparable techniques might apply to other multi-agent LLM tasks such as joint planning or negotiation.
  • Systems relying on LLM consensus may need defenses like message-pattern monitoring or agent-behavior verification to counter learned attacks.
  • The method could be tested by swapping the underlying LLMs used by benign agents to see whether the attack transfer holds.

Load-bearing premise

The surrogate world model learned over latent behavioral states of benign agents accurately captures the dynamics needed for effective RL-based attack optimization in the real system.

What would settle it

Deploy the RL attacker trained on the surrogate model against real benign LLM agents in a consensus task and check whether it produces a lower consensus rate and longer disagreement durations than the direct malicious-prompt baseline; failure on either metric would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.08268 by Xiaolin Sun, Yibin Hu, Zixuan Liu, Zizhan Zheng.

Figure 1
Figure 1. Distribution of episode rounds across attacker settings (Appendix D.1, Distribution of number of rounds).
read the original abstract

Large language models (LLMs) are increasingly deployed in multi-agent systems where agents communicate in natural language to solve tasks jointly. A key capability in such systems is consensus formation, where agents iteratively exchange messages and update decisions to reach a shared outcome. However, most existing multi-agent LLM frameworks assume that all participating agents are aligned with the system objective. In practice, a malicious insider may participate as a legitimate member of the group while pursuing a hidden adversarial goal. In this work, we study insider manipulation in multi-agent LLM consensus systems. We formalize the problem as a sequential decision-making task in which a malicious agent seeks to delay or prevent agreement among benign agents. To make attack optimization tractable, we propose a world-model-based framework that learns surrogate dynamics over the latent behavioral states of benign agents and then trains an attacker using reinforcement learning based on this learned model. Preliminary results show that the trained attacker reduces the benign consensus rate and prolongs disagreement more effectively than the direct malicious-prompt baseline. These results suggest that combining latent world models with reinforcement learning is a promising direction for adaptive insider attacks in language-based multi-agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper formalizes insider attacks in multi-agent LLM consensus systems as a sequential decision-making task where a malicious agent aims to delay or prevent agreement among benign agents. It proposes a world-model-based framework that first learns surrogate dynamics over the latent behavioral states of benign agents and then trains an RL attacker policy on this model. Preliminary results are reported showing that the trained attacker reduces the benign consensus rate and prolongs disagreement more effectively than a direct malicious-prompt baseline.

Significance. If the surrogate model is shown to be accurate and the RL policy transfers, the work would be significant for highlighting security vulnerabilities in language-based multi-agent systems and for introducing a tractable RL approach to adaptive insider attacks. It could inform the design of more robust consensus protocols. The preliminary nature of the results, however, makes the current significance tentative.

major comments (2)
  1. [Abstract] The effectiveness claim rests on 'preliminary results' showing the RL attacker outperforms the baseline, but no experimental details are provided (e.g., consensus task definition, metrics for consensus rate and disagreement duration, number of trials, statistical tests, or exact baseline implementation). This prevents assessment of whether the data support the central claim.
  2. Framework description (implied in abstract): The approach requires that the learned surrogate dynamics over latent behavioral states accurately capture real LLM interactions for the RL-optimized policy to transfer. No surrogate validation metrics, ablation on model fidelity, or discussion of sim-to-real gaps (e.g., stochastic response generation or semantic drift) are mentioned, which is load-bearing for the reported improvement over the baseline.
minor comments (1)
  1. [Abstract] The phrase 'latent behavioral states' is used without any indication of how these states are extracted or represented from natural-language messages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract] The effectiveness claim rests on 'preliminary results' showing the RL attacker outperforms the baseline, but no experimental details are provided (e.g., consensus task definition, metrics for consensus rate and disagreement duration, number of trials, statistical tests, or exact baseline implementation). This prevents assessment of whether the data support the central claim.

    Authors: We agree that the abstract omits key experimental details, limiting evaluation of the claims. The manuscript presents only preliminary results without these specifics. In revision we will expand the abstract to define the consensus task (iterative natural-language exchanges toward a shared binary decision), specify metrics (consensus rate as fraction of trials reaching agreement within a round limit; disagreement duration as average rounds to agreement or timeout), state the number of trials, note any statistical tests, and describe the baseline as a fixed adversarial prompt. A new experimental section will supply full methodology. revision: yes
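The two metrics promised here are mechanical once each trial is logged as the round at which agreement occurred (or a timeout). A minimal sketch, assuming that logging convention and counting timed-out trials at the full round limit:

```python
def consensus_metrics(agreement_rounds, round_limit):
    """agreement_rounds: per-trial round of agreement, or None on timeout.

    Returns (consensus rate, mean disagreement duration in rounds):
    consensus rate is the fraction of trials agreeing within round_limit,
    and timed-out trials contribute the full round_limit to duration.
    """
    n = len(agreement_rounds)
    reached = sum(1 for r in agreement_rounds if r is not None and r <= round_limit)
    durations = [r if (r is not None and r <= round_limit) else round_limit
                 for r in agreement_rounds]
    return reached / n, sum(durations) / n
```

For example, `consensus_metrics([3, None, 5, 2], round_limit=10)` returns `(0.75, 5.0)`.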

  2. Referee: [—] Framework description (implied in abstract): The approach requires that the learned surrogate dynamics over latent behavioral states accurately capture real LLM interactions for the RL-optimized policy to transfer. No surrogate validation metrics, ablation on model fidelity, or discussion of sim-to-real gaps (e.g., stochastic response generation or semantic drift) are mentioned, which is load-bearing for the reported improvement over the baseline.

    Authors: We concur that surrogate fidelity is essential for policy transfer and is not addressed in the current manuscript. We will add a subsection on world-model training that reports validation metrics (e.g., prediction error on held-out benign-agent transitions), includes ablations on latent-state dimensionality and model capacity, and discusses sim-to-real gaps such as LLM output stochasticity and semantic drift across extended dialogues. These additions will strengthen support for the observed gains over the baseline. revision: yes
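The promised surrogate-validation metric (prediction error on held-out benign-agent transitions) is most informative when normalized against a persistence baseline that predicts no change, so a low error cannot come from nearly static latents alone. A hedged sketch under that framing; the interface is invented here, not taken from the paper:

```python
import numpy as np

def one_step_validation(model_step, Z, A, Z_next):
    """Held-out one-step prediction error for a learned dynamics model.

    model_step(z, a) -> predicted next latent; Z, A, Z_next are aligned
    arrays of held-out transitions. Returns (MSE, MSE / persistence MSE);
    a ratio >= 1 means the surrogate is no better than predicting z' = z.
    """
    pred = np.stack([model_step(z, a) for z, a in zip(Z, A)])
    mse = float(np.mean((pred - Z_next) ** 2))
    persistence_mse = float(np.mean((Z - Z_next) ** 2))
    return mse, mse / persistence_mse
```

Ablations on latent dimensionality or model capacity would then report this ratio per configuration rather than raw MSE, which is not comparable across latent sizes.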

Circularity Check

0 steps flagged

No significant circularity; empirical results provide independent validation against baseline

full rationale

The paper describes a standard world-model + RL pipeline for training an insider attacker and reports preliminary empirical results comparing its performance to a direct malicious-prompt baseline on the real multi-agent LLM system. No derivation step reduces by construction to its own inputs: the surrogate is learned from observed benign trajectories, the RL policy is optimized on that model, and effectiveness is measured via actual consensus rates in the target environment. The central claim is falsifiable via the reported comparison and does not rely on self-citation chains, uniqueness theorems, or renaming of known results. This is the common case of a data-driven method whose validity rests on experimental transfer rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; full text would be required to audit these.

pith-pipeline@v0.9.0 · 5498 in / 951 out tokens · 43714 ms · 2026-05-12T00:45:43.178507+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

241 extracted references · 241 canonical work pages · 8 internal anchors
