arxiv: 1805.00899 · v2 · submitted 2018-05-02 · 📊 stat.ML · cs.LG

Recognition: 2 theorem links

AI safety via debate

Geoffrey Irving, Paul Christiano, Dario Amodei

Pith reviewed 2026-05-13 21:16 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords: AI safety, debate, alignment, complexity theory, self-play, MNIST, machine learning

The pith

Training AIs via self-play debate lets human judges handle questions too complex for direct evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training AI agents through self-play on a zero-sum debate game to specify complex human goals that are hard to judge directly. Two agents alternate short statements about a question or action, after which a human judge selects the side that provided the most true and useful information. This approach draws on a complexity-theory analogy: optimal debate can resolve any PSPACE question with only polynomial-time judges, whereas direct judgment is limited to NP questions. The authors test the idea on an MNIST task and show accuracy gains for a sparse classifier, from 59.4 percent to 88.9 percent with six pixels and from 48.2 percent to 85.2 percent with four pixels. They also discuss scaling challenges and call for further human and computer experiments.

Core claim

By training agents to compete in a zero-sum debate where they take turns making short statements and a human judge picks the more truthful and useful side, the system can extract correct answers to questions in PSPACE using only polynomial-time judgment, exceeding the NP limit of direct evaluation.
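
To make the complexity analogy concrete, here is a minimal Python sketch of the standard alternation game for quantified Boolean formulas, whose truth problem is PSPACE-complete. It is an illustration under stated assumptions, not necessarily the paper's own construction. The names `judge`, `optimal_debate`, `qbf_value`, `phi`, and `psi` are hypothetical. The debater who claims the formula is true picks the existential variables, the opponent picks the universal ones, and the judge only evaluates the quantifier-free part on the single assignment the debate produces.

```python
from typing import Callable, Sequence, Tuple

Matrix = Callable[[Sequence[bool]], bool]  # quantifier-free part of the formula


def judge(matrix: Matrix, assignment: Sequence[bool]) -> int:
    """Cheap judge: checks one assignment. Returns 0 if the 'true' side wins, else 1."""
    return 0 if matrix(assignment) else 1


def optimal_debate(quantifiers: str, matrix: Matrix,
                   prefix: Tuple[bool, ...] = ()) -> Tuple[bool, ...]:
    """Assignment reached when both debaters play optimally (exhaustive search).

    'E' turns belong to debater 0 (claims the formula is true),
    'A' turns belong to debater 1 (claims it is false).
    The debaters, not the judge, pay the exponential search cost.
    """
    i = len(prefix)
    if i == len(quantifiers):
        return prefix
    mover = 0 if quantifiers[i] == "E" else 1
    leaves = [optimal_debate(quantifiers, matrix, prefix + (v,)) for v in (False, True)]
    for leaf in leaves:
        if judge(matrix, leaf) == mover:  # a winning move exists: take it
            return leaf
    return leaves[0]  # the mover loses whichever value is chosen


def qbf_value(quantifiers: str, matrix: Matrix,
              prefix: Tuple[bool, ...] = ()) -> bool:
    """Brute-force truth value of the quantified formula, for comparison only."""
    i = len(prefix)
    if i == len(quantifiers):
        return bool(matrix(prefix))
    branches = (qbf_value(quantifiers, matrix, prefix + (v,)) for v in (False, True))
    return any(branches) if quantifiers[i] == "E" else all(branches)


# Hypothetical example formulas.
# phi: exists x, for all y, exists z: (x or y) and (z == y)   -> true
phi = lambda a: (a[0] or a[1]) and (a[2] == a[1])
# psi: for all x, exists y: x and y                            -> false
psi = lambda a: a[0] and a[1]

for name, quants, matrix in (("phi", "EAE", phi), ("psi", "AE", psi)):
    final = optimal_debate(quants, matrix)
    print(name, "debate winner:", judge(matrix, final),
          "| formula true:", qbf_value(quants, matrix))
```

In both examples the judge's verdict on the single final assignment agrees with the brute-force truth value, which is the sense in which optimal play lets a cheap judge settle a question with nested quantifiers. The sketch says nothing about human judges or about learned, non-optimal debaters.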

What carries the argument

Zero-sum debate game in which two agents alternate short statements and a human judge selects the winner on truth and usefulness.
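
The turn structure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `run_debate`, `agent_says`, `toy_judge`, the six-statement limit, and the 80-character bound are all hypothetical, and the toy judge is a hard-coded stand-in for a human.

```python
from typing import Callable, List, Tuple

Statement = Tuple[int, str]                      # (speaker index, text)
Agent = Callable[[str, List[Statement]], str]    # sees the question and transcript
Judge = Callable[[str, List[Statement]], int]    # returns the winning speaker index


def run_debate(question: str, agents: Tuple[Agent, Agent], judge: Judge,
               max_statements: int = 6, max_chars: int = 80):
    """Alternate short statements up to a limit, then let the judge pick a winner.

    Returns (winner, rewards, transcript); rewards are zero-sum (+1 / -1).
    """
    transcript: List[Statement] = []
    for turn in range(max_statements):
        speaker = turn % 2                                         # agents take turns
        text = agents[speaker](question, transcript)[:max_chars]   # enforce "short"
        transcript.append((speaker, text))
    winner = judge(question, transcript)
    rewards = (1, -1) if winner == 0 else (-1, 1)
    return winner, rewards, transcript


def agent_says(answer: int) -> Agent:
    """Hypothetical scripted agent that always asserts the same answer."""
    def agent(question: str, transcript: List[Statement]) -> str:
        return f"The answer to {question!r} is {answer}."
    return agent


def toy_judge(question: str, transcript: List[Statement]) -> int:
    """Hypothetical stand-in for a human judge on the question '2 + 2'."""
    for speaker, text in transcript:
        if " is 4." in text:
            return speaker
    return 1


winner, rewards, transcript = run_debate(
    "2 + 2", (agent_says(4), agent_says(5)), toy_judge)
print(winner, rewards, len(transcript))
```

In self-play training, the two entries of `rewards` would be the signals fed back to the two copies of the agent; how that update is done is outside this sketch.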

If this is right

Optimal play in the debate game solves any PSPACE question with polynomial-time judges.
Self-play training on debate can help AIs learn complex goals that direct human feedback cannot specify.
The MNIST experiment shows debate raises sparse-classifier accuracy from 59.4 percent to 88.9 percent with six pixels.
The approach requires empirical checks on human judges and on tasks that scale beyond the initial demonstration.
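
The mechanism behind the MNIST result in the list above can be illustrated with a toy pixel-revealing game. This is a sketch under loud assumptions and does not reproduce the paper's setup or numbers: the two "digit" templates are invented, the judge is a hand-written counting rule rather than a trained sparse classifier, and both debaters play greedily one step ahead. All names (`TEMPLATES`, `judge_score`, `debate`, `random_baseline`) are hypothetical.

```python
from itertools import combinations

# Hypothetical toy "digits": sets of nonzero pixel indices (not MNIST data).
TEMPLATES = {
    "A": frozenset({0, 1, 2, 3, 4, 5}),
    "B": frozenset({3, 4, 5, 6, 7, 8}),   # shares pixels 3, 4, 5 with "A"
}


def judge_score(label, revealed):
    """Sparse judge: counts revealed nonzero pixels consistent with `label`."""
    return len(TEMPLATES[label] & set(revealed))


def honest_wins(true_label, lie_label, revealed):
    """The judge sides with the honest debater only on a strict majority."""
    return judge_score(true_label, revealed) > judge_score(lie_label, revealed)


def debate(image, true_label, lie_label, n_pixels):
    """Debaters alternately reveal one nonzero pixel of `image` (honest first)."""
    revealed = []
    for turn in range(n_pixels):
        honest_turn = turn % 2 == 0
        candidates = sorted(image - set(revealed))

        def margin(pixel):
            shown = revealed + [pixel]
            m = judge_score(true_label, shown) - judge_score(lie_label, shown)
            return m if honest_turn else -m   # the liar minimizes the margin

        revealed.append(max(candidates, key=margin))
    return revealed


def random_baseline(image, true_label, lie_label, n_pixels):
    """Fraction of all n_pixels-subsets of nonzero pixels on which the judge is right."""
    subsets = list(combinations(sorted(image), n_pixels))
    return sum(honest_wins(true_label, lie_label, s) for s in subsets) / len(subsets)


image = TEMPLATES["A"]                      # the true class is "A"
revealed = debate(image, "A", "B", n_pixels=2)
print("revealed:", revealed, "| honest wins:", honest_wins("A", "B", revealed))
print("judge accuracy on uniformly chosen pixel pairs:",
      random_baseline(image, "A", "B", n_pixels=2))
```

The point of the toy is only qualitative: the liar can reveal real pixels but cannot invent ones that are absent from the image, so the honest debater can steer the judge toward discriminative evidence that uniformly chosen pixels sometimes miss.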

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could combine with other alignment techniques to address tasks beyond PSPACE.
Controlled human trials on real decision problems would test whether judges stay reliable against stronger agents.
If debate works, it might reduce the need for fully automated oversight in early AI systems.

Load-bearing premise

Human judges can reliably pick the more truthful and useful side even when the question is too complex for them to evaluate directly.

What would settle it

Run a controlled experiment on a question whose correct answer is known in advance but cannot be judged directly; if human judges consistently select the agent arguing for the wrong answer, the method fails.

Original abstract

To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment. We report results on an initial MNIST experiment where agents compete to convince a sparse classifier, boosting the classifier's accuracy from 59.4% to 88.9% given 6 pixels and from 48.2% to 85.2% given 4 pixels. Finally, we discuss theoretical and practical aspects of the debate model, focusing on potential weaknesses as the model scales up, and we propose future human and computer experiments to test these properties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The debate protocol is a new oversight idea with a suggestive PSPACE analogy and a toy MNIST result, but the human judgment assumption stays untested for hard cases.

The paper's core contribution is a self-play debate setup where two agents argue in short turns and a human judge picks the more truthful side. This is positioned as a way to oversee AI on tasks too complex for direct human evaluation, with an informal analogy claiming that optimal debate can handle PSPACE questions using only polynomial-time judges while direct judgment is limited to NP. The authors back it with a small MNIST experiment showing accuracy gains for a sparse classifier when debate is used to select informative pixels. That experiment is straightforward and demonstrates that the basic mechanism can improve performance in a controlled setting where ground truth is easy to check. The proposal itself is new in the alignment literature the authors cite and gives a clear training game without obvious circularity.

The main limitation is that the MNIST results use a regime where the judge can directly verify claims, so they do not test whether humans can reliably spot misleading but locally plausible arguments on complex problems. The complexity analogy remains suggestive rather than formal, and the paper itself flags that scaling behavior and human reliability need further experiments.

This is aimed at researchers thinking about scalable oversight and alignment techniques. It deserves peer review because the framework is concrete enough to discuss and the preliminary data are worth examining, even though the central assumptions require more evidence.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes training AI agents to learn complex human goals via a zero-sum debate game: given a question or action, two agents alternate short statements up to a fixed limit, after which a human judge selects the agent that provided the most true and useful information. The authors draw an analogy to complexity theory asserting that optimal-play debate can resolve any PSPACE question with only polynomial-time judges (while direct judgment is limited to NP). They report an MNIST experiment in which debate improves a sparse classifier from 59.4% to 88.9% accuracy with 6 pixels and from 48.2% to 85.2% with 4 pixels. The paper discusses theoretical and practical aspects of the model, potential scaling weaknesses, and directions for future human and computational experiments.

Significance. If the debate protocol functions as described, it would offer a concrete mechanism for scalable oversight on tasks exceeding direct human judgment, addressing a central challenge in AI alignment. The complexity-theoretic analogy supplies an intriguing theoretical motivation, and the MNIST results constitute preliminary empirical evidence of accuracy gains under a simplified regime. The manuscript's explicit identification of scaling issues and call for targeted experiments are constructive contributions that can guide subsequent work.

major comments (2)

[§3] §3 (complexity-theoretic analogy): the claim that optimal-play debate answers PSPACE questions with polynomial-time judges rests on the unproven assumption that the protocol forces all nested quantifiers and implicit facts into short, locally checkable statements. No explicit reduction or proof sketch is supplied showing how an arbitrary PSPACE instance is encoded so that a human judge can verify truthfulness in poly time; the analogy therefore remains informal and does not yet support the central separation from NP.
[§4] §4 (MNIST experiment): the reported accuracy improvements (59.4% to 88.9% with 6 pixels) are obtained in a regime where the judge has direct access to ground-truth labels. This setup does not test multi-turn debate on complex reasoning tasks where the correct answer depends on unverifiable subclaims, leaving the key assumption about reliable human judgment for hard questions unexamined and limiting support for the PSPACE claim.

minor comments (2)

[§2] The description of the debate protocol in §2 would benefit from a concise pseudocode listing of the turn order, statement length bound, and judge decision rule to improve reproducibility.
[§4] Figure captions for the MNIST results should report the number of independent runs and any error bars or statistical tests performed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the debate protocol for scalable oversight. We address each major comment below, clarifying the scope of our claims and indicating revisions where appropriate.

Referee: [§3] §3 (complexity-theoretic analogy): the claim that optimal-play debate answers PSPACE questions with polynomial-time judges rests on the unproven assumption that the protocol forces all nested quantifiers and implicit facts into short, locally checkable statements. No explicit reduction or proof sketch is supplied showing how an arbitrary PSPACE instance is encoded so that a human judge can verify truthfulness in poly time; the analogy therefore remains informal and does not yet support the central separation from NP.

Authors: We agree that the manuscript presents the PSPACE connection as a high-level analogy rather than a formal proof with an explicit reduction. The statement draws motivation from known results such as IP=PSPACE, adapted to a two-agent debate setting, but does not derive or encode an arbitrary PSPACE instance into the protocol. Our primary focus is the AI alignment application and the initial empirical demonstration; a complete formal mapping is left as future work. We will revise §3 to state more explicitly that the claim is analogical, to reference the underlying complexity results, and to note the absence of a detailed reduction.

Revision: yes

Referee: [§4] §4 (MNIST experiment): the reported accuracy improvements (59.4% to 88.9% with 6 pixels) are obtained in a regime where the judge has direct access to ground-truth labels. This setup does not test multi-turn debate on complex reasoning tasks where the correct answer depends on unverifiable subclaims, leaving the key assumption about reliable human judgment for hard questions unexamined and limiting support for the PSPACE claim.

Authors: We acknowledge that the MNIST experiment uses a judge with direct access to ground-truth labels and therefore operates in a simplified regime that does not examine unverifiable subclaims or fully test the human-judgment assumptions underlying the PSPACE analogy. The experiment serves only as a controlled proof-of-concept that debate can improve accuracy under information constraints. The manuscript already describes it as an initial result and proposes future human experiments on more complex tasks. We will revise the experimental section and discussion to highlight this limitation more explicitly and to clarify its implications for the theoretical claims.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in the debate protocol or PSPACE analogy

The paper proposes a zero-sum debate game for training and draws an explicit analogy to complexity theory (PSPACE vs NP) without deriving the separation from any fitted parameters, self-definitional equations, or load-bearing self-citations. The MNIST results are presented as separate empirical measurements of accuracy gains under direct observation, not as predictions forced by the same inputs. No steps reduce by construction to prior author work or rename known results; the central claim rests on an independent assumption about human judges that is stated openly rather than smuggled in via citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested assumption that optimal play in the debate game produces truthful revelations that human judges can correctly evaluate on complex tasks.

axioms (1)

standard math: Debate with optimal play solves PSPACE questions using polynomial-time judges
Invoked as the key complexity-theoretic justification for why the method can handle harder questions than direct judgment.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions).
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Risks from Learned Optimization in Advanced Machine Learning Systems
cs.AI 2019-06 accept novelty 9.0

Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
cs.CV 2026-05 conditional novelty 7.0

AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domai...
MathDuels: Evaluating LLMs as Problem Posers and Solvers
cs.CL 2026-04 unverdicted novelty 7.0

Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery
cs.CR 2026-04 unverdicted novelty 7.0

Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes ac...
Fine-Tuning Language Models from Human Preferences
cs.CL 2019-09 unverdicted novelty 7.0

Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
cs.LG 2026-05 unverdicted novelty 6.0

Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG 2026-05 unverdicted novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
CHAL: Council of Hierarchical Agentic Language
cs.AI 2026-05 unverdicted novelty 6.0

CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
Positive Alignment: Artificial Intelligence for Human Flourishing
cs.AI 2026-05 unverdicted novelty 6.0

Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.
The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting
cs.GT 2026-05 unverdicted novelty 6.0

Non-affine approval functions create unavoidable miscalibration in proper scoring rules for strategic agents, but step-function thresholds enable first-best screening without it, uniquely for the Brier score.
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
Automated alignment is harder than you think
cs.AI 2026-05 unverdicted novelty 6.0

Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.
Automated alignment is harder than you think
cs.AI 2026-05 unverdicted novelty 6.0

Automating alignment research with AI agents risks generating hard-to-detect errors in fuzzy tasks, producing misleading safety evaluations even without deliberate sabotage.
Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery
cs.AI 2026-05 unverdicted novelty 6.0

Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.
Stayin' Aligned Over Time: Towards Longitudinal Human-LLM Alignment via Contextual Reflection and Privacy-Preserving Behavioral Data
cs.HC 2026-05 unverdicted novelty 6.0

A methodological framework and browser system BITE for collecting evolving user preferences on LLM outputs through context-triggered reflections and privacy-preserving data over time.
The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact ...
AI Alignment via Incentives and Correction
cs.LG 2026-05 unverdicted novelty 6.0

AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM ...
AI Alignment via Incentives and Correction
cs.LG 2026-05 unverdicted novelty 6.0

AI alignment is framed as inducing equilibrium behavior in a solver-auditor interaction via adaptive rewards found by bandit optimization, yielding improved oversight and reduced errors in LLM coding experiments.
Causal Foundations of Collective Agency
cs.AI 2026-04 unverdicted novelty 6.0

Collective agency arises when a group's joint actions are faithfully captured by a simpler causal model of unified rational behavior.
From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling
math.OC 2026-04 unverdicted novelty 6.0

Agora-Opt uses decentralized debate among LLM agent teams plus a read-write memory bank to produce more accurate optimization models from text than prior LLM methods.
Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture
cs.AI 2026-04 unverdicted novelty 6.0

A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.
Improving Factuality and Reasoning in Language Models through Multiagent Debate
cs.CL 2023-05 unverdicted novelty 6.0

Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG 2026-04 unverdicted novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
Extrapolating Volition with Recursive Information Markets
cs.GT 2026-04 unverdicted novelty 5.0

Recursive information markets with forgetful LLM buyers can align information prices with true value and extend to scalable oversight in AI alignment.
Positive Alignment: Artificial Intelligence for Human Flourishing
cs.AI 2026-05 unverdicted novelty 4.0

Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.
Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems
cs.AI 2026-05 unverdicted novelty 4.0

Frontier AI needs contextual multi-objective optimization to select and balance multiple context-dependent objectives rather than relying on single stable goals.
AICCE: AI Driven Compliance Checker Engine
cs.CR 2026-04 unverdicted novelty 4.0

AICCE combines RAG-based retrieval of protocol specs with dual LLM pipelines for debate-driven explanations or fast script execution, reporting up to 99% accuracy on IPv6 samples.
Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT -- Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding
cs.AI 2026-04 unverdicted novelty 2.0

Squirrel behaviors supply a comparative template for a hierarchical control model that integrates latent dynamics, episodic memory, observer beliefs, and delayed verification in agentic AI.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 26 Pith papers · 2 internal anchors

[1]

Research priorities for robust and beneficial artificial intelligence

Stuart J. Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. CoRR, abs/1602.03506, 2016. URL https://arxiv.org/abs/1602.03506

work page arXiv 2016
[2]

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dandelion Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016. URL https://arxiv.org/abs/1606.06565

work page internal anchor Pith review Pith/arXiv arXiv 2016
[3]

Mirror mirror: Reflections on quantitative fairness

Shira Mitchell and Jackie Shadlen. Mirror mirror: Reflections on quantitative fairness. https://speak-statistics-to-power.github.io/fairness, 2018

work page 2018
[4]

Deep reinforcement learning from human preferences

Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4302--4310, 2017

work page 2017
[5]

Mastering the game of Go with deep neural networks and tree search

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484--489, 2016

work page 2016
[6]

Mastering the game of Go without human knowledge

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017a

work page 2017
[7]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017b

work page Pith review arXiv 2017
[8]

More on Dota 2

OpenAI. More on Dota 2. https://blog.openai.com/more-on-dota-2, 2017

work page 2017
[9]

Supervising strong learners by amplifying weak experts

Paul Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575, 2018

work page Pith review arXiv 2018
[10]

Towards an automatic Turing test: Learning to evaluate dialogue responses

Ryan Lowe, Michael Noseworthy, Iulian V Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. Towards an automatic Turing test: Learning to evaluate dialogue responses. arXiv preprint arXiv:1708.07149, 2017a

work page arXiv 2017
[11]

Introduction to the Theory of Computation

Michael Sipser. Introduction to the Theory of Computation. Course Technology, Boston, MA, third edition, 2013. ISBN 113318779X

work page 2013
[12]

Jeffrey C Lagarias and Andrew M. Odlyzko. Computing π(x): An analytic method. Journal of Algorithms, 8(2):173--191, 1987

work page 1987
[13]

A simple neural attentive meta-learner

Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In NIPS 2017 Workshop on Meta-Learning, 2017

work page 2017
[14]

Interpretable and pedagogical examples

Smitha Milli, Pieter Abbeel, and Igor Mordatch. Interpretable and pedagogical examples. arXiv preprint arXiv:1711.00694, 2017

work page arXiv 2017
[15]

Efficient selectivity and backup operators in monte-carlo tree search

Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pages 72--83. Springer, 2006

work page 2006
[16]

Combinatorics of Go

John Tromp and Gunnar Farnebäck. Combinatorics of Go. In International Conference on Computers and Games, pages 84--99. Springer, 2006

work page 2006
[17]

Debatable

Radiolab. Debatable. https://www.radiolab.org/story/debatable, March 2016

work page 2016
[18]

Emergent Complexity via Multi-Agent Competition

Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748, 2017

work page Pith review arXiv 2017
[19]

A unified game-theoretic approach to multiagent reinforcement learning

Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Julien Perolat, David Silver, Thore Graepel, et al. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4193--4206, 2017

work page 2017
[20]

Generative adversarial networks

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In Advances in Neural Information Processing Systems, pages 2672--2680, 2014

work page 2014
[21]

Progressive Growing of GANs for Improved Quality, Stability, and Variation

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

Reasoning independently of prior belief and individual differences in actively open-minded thinking

Keith E Stanovich and Richard F West. Reasoning independently of prior belief and individual differences in actively open-minded thinking. Journal of Educational Psychology, 89(2):342, 1997

work page 1997
[23]

Individual differences and the belief bias effect: Mental models, logical necessity, and abstract reasoning

Donna Torrens. Individual differences and the belief bias effect: Mental models, logical necessity, and abstract reasoning. Thinking & Reasoning, 5(1):1--28, 1999

work page 1999
[24]

Weekend update: You'd have to be science illiterate to think "belief in evolution" measures science literacy

Dan Kahan. Weekend update: You'd have to be science illiterate to think "belief in evolution" measures science literacy. http://www.culturalcognition.net/blog/2014/5/24/weekend-update-youd-have-to-be-science-illiterate-to-think-b.html, May 2014

work page 2014
[25]

Jonathan St. B. T. Evans and Jodie Curtis-Holmes. Rapid responding increases belief bias: Evidence for the dual-process theory of reasoning. Thinking & Reasoning, 11(4):382--389, 2005

work page 2005
[26]

Belief-based and analytic processing in transitive inference depends on premise integration difficulty

Glenda Andrews. Belief-based and analytic processing in transitive inference depends on premise integration difficulty. Memory & Cognition, 38(7):928--940, 2010

work page 2010
[27]

Reasoning under time pressure: A study of causal conditional inference

Jonathan St BT Evans, Simon J Handley, and Alison M Bacon. Reasoning under time pressure: A study of causal conditional inference. Experimental Psychology, 56(2):77, 2009

work page 2009
[28]

Negative emotions can attenuate the influence of beliefs on logical reasoning

Vinod Goel and Oshin Vartanian. Negative emotions can attenuate the influence of beliefs on logical reasoning. Cognition and Emotion, 25(1):121--131, 2011

work page 2011
[29]

The superintelligent will: Motivation and instrumental rationality in advanced artificial agents

Nick Bostrom. The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. Minds and Machines, 22(2):71--85, 2012

work page 2012
[30]

The AI-box experiment

Eliezer Yudkowsky. The AI-box experiment. http://yudkowsky.net/singularity/aibox, 2002

work page 2002
[31]

Superintelligence

Nick Bostrom. Superintelligence. Dunod, 2017

work page 2017
[32]

Faulty reward functions in the wild

OpenAI. Faulty reward functions in the wild. https://blog.openai.com/faulty-reward-functions, 2016

work page 2016
[33]

Multi-agent actor-critic for mixed cooperative-competitive environments

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6382--6393, 2017b

work page 2017