pith. machine review for the scientific record.

arxiv: 1805.00899 · v2 · submitted 2018-05-02 · 📊 stat.ML · cs.LG

AI safety via debate

Pith reviewed 2026-05-13 21:16 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords AI safety, debate, alignment, complexity theory, self-play, MNIST, machine learning

The pith

Training AIs via self-play debate lets human judges handle questions too complex for direct evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training AI agents through self-play on a zero-sum debate game to specify complex human goals that are hard to judge directly. Two agents alternate short statements about a question or action, after which a human judge selects the side that provided the most true and useful information. This approach draws on a complexity-theory analogy: optimal debate can resolve any PSPACE question with only polynomial-time judges, whereas direct judgment is limited to NP questions. The authors test the idea on an MNIST task and show accuracy gains for a sparse classifier, from 59.4 percent to 88.9 percent with six pixels and from 48.2 percent to 85.2 percent with four pixels. They also discuss scaling challenges and call for further human and computer experiments.

Core claim

By training agents to compete in a zero-sum debate where they take turns making short statements and a human judge picks the more truthful and useful side, the system can extract correct answers to questions in PSPACE using only polynomial-time judgment, exceeding the NP limit of direct evaluation.
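One standard way to see why alternating moves can reach PSPACE is the true quantified Boolean formula (TQBF) problem, which is PSPACE-complete. The Python sketch below illustrates that framing and is not taken from the paper: the names `optimal_debate`, `judge`, `prefix`, and `matrix` are hypothetical. One debater chooses values for the existential variables, the other chooses values for the universal variables, and the judge evaluates the quantifier-free formula once on the final assignment. The exhaustive search stands in for optimal play and takes exponential time; only the judge's work is polynomial. Variable names are assumed to be distinct.

```python
from typing import Callable, Dict, List, Optional, Tuple

Assignment = Dict[str, bool]
# Quantifier prefix, outermost first: ("exists" | "forall", variable name).
Prefix = List[Tuple[str, str]]
Matrix = Callable[[Assignment], bool]


def judge(matrix: Matrix, assignment: Assignment) -> bool:
    """Cheap judge: evaluate the quantifier-free formula once."""
    return matrix(assignment)


def optimal_debate(prefix: Prefix, matrix: Matrix,
                   assignment: Optional[Assignment] = None) -> bool:
    """Return True if the debater claiming 'the formula is true' wins under optimal play."""
    assignment = dict(assignment or {})
    if len(assignment) == len(prefix):
        # All moves have been made; the judge decides from the final assignment.
        return judge(matrix, assignment)
    quantifier, var = prefix[len(assignment)]
    outcomes = [
        optimal_debate(prefix, matrix, {**assignment, var: value})
        for value in (False, True)
    ]
    # The proponent moves on "exists" and needs one winning value;
    # the opponent moves on "forall" and needs one refuting value.
    return any(outcomes) if quantifier == "exists" else all(outcomes)


def differs(a: Assignment) -> bool:
    return a["x"] != a["y"]


# "for all x there exists y with x != y" is true; swapping the quantifiers makes it false.
true_case = optimal_debate([("forall", "x"), ("exists", "y")], differs)
false_case = optimal_debate([("exists", "y"), ("forall", "x")], differs)
```

Here `true_case` is `True` and `false_case` is `False`: the order of moves matters, which is what distinguishes this setting from checking a single NP-style witness.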

What carries the argument

Zero-sum debate game in which two agents alternate short statements and a human judge selects the winner on truth and usefulness.
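The game described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the agents and the judge are plain callables; `run_debate`, `max_statements`, `max_chars`, and the toy agents are hypothetical names, and the character cap is only one possible reading of "short statements". In the paper's setting a human would take the place of `toy_judge`.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
Transcript = List[Tuple[int, str]]         # (agent index, statement)
Agent = Callable[[str, Transcript], str]   # (question, transcript so far) -> statement
Judge = Callable[[str, Transcript], int]   # returns the index (0 or 1) of the winner


def run_debate(question: str, agents: Tuple[Agent, Agent], judge: Judge,
               max_statements: int, max_chars: int) -> Tuple[int, Tuple[int, int]]:
    """Play one debate and return (winner index, zero-sum rewards)."""
    transcript: Transcript = []
    for turn in range(max_statements):
        speaker = turn % 2  # agents alternate; agent 0 speaks first
        statement = agents[speaker](question, transcript)
        transcript.append((speaker, statement[:max_chars]))  # enforce the length bound
    winner = judge(question, transcript)
    rewards = (1, -1) if winner == 0 else (-1, 1)  # one agent's gain is the other's loss
    return winner, rewards


def honest(question: str, transcript: Transcript) -> str:
    return "2 + 2 = 4"


def liar(question: str, transcript: Transcript) -> str:
    return "2 + 2 = 5"


def toy_judge(question: str, transcript: Transcript) -> int:
    # Toy stand-in for a human judge: sides with whoever stated the correct sum.
    return next(speaker for speaker, text in transcript if text.endswith("= 4"))


winner, rewards = run_debate("What is 2 + 2?", (honest, liar), toy_judge,
                             max_statements=4, max_chars=80)
```

With these toy players, `winner` is `0` and `rewards` is `(1, -1)`. In self-play training, the two entries of `rewards` would be the learning signals for the two copies of the agent.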

If this is right

  • Optimal play in the debate game solves any PSPACE question with polynomial-time judges.
  • Self-play training on debate can help AIs learn complex goals that direct human feedback cannot specify.
  • The MNIST experiment shows debate raises sparse-classifier accuracy from 59.4 percent to 88.9 percent with six pixels.
  • The approach requires empirical checks on human judges and on tasks that scale beyond the initial demonstration.
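To put the reported MNIST numbers on a common scale, the short Python sketch below computes the gain in percentage points and the share of baseline errors removed. The accuracies are copied from the abstract; the function name `summarize` and the choice of "relative error reduction" as a summary statistic are additions for illustration, not quantities the paper reports.

```python
from typing import Tuple

# Accuracies in percent, copied from the abstract: pixels -> (baseline, with debate).
REPORTED = {6: (59.4, 88.9), 4: (48.2, 85.2)}


def summarize(baseline: float, debate: float) -> Tuple[float, float]:
    """Return (gain in percentage points, fraction of baseline errors removed)."""
    points = debate - baseline
    relative_error_reduction = points / (100.0 - baseline)
    return round(points, 1), round(relative_error_reduction, 3)


for pixels, (baseline, debate) in REPORTED.items():
    points, reduction = summarize(baseline, debate)
    print(f"{pixels} pixels: +{points} points, {reduction:.1%} of baseline errors removed")
```

The gains are 29.5 points with six pixels and 37.0 points with four pixels, which in both cases removes roughly 70 percent of the sparse classifier's baseline errors.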

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could combine with other alignment techniques to address tasks beyond PSPACE.
  • Controlled human trials on real decision problems would test whether judges stay reliable against stronger agents.
  • If debate works, it might reduce the need for fully automated oversight in early AI systems.

Load-bearing premise

Human judges can reliably pick the more truthful and useful side even when the question is too complex for them to evaluate directly.

What would settle it

Run a controlled experiment on a question whose correct answer is known in advance but cannot be judged directly; if human judges consistently select the agent arguing for the wrong answer, the method fails.
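One way to score such an experiment is sketched below in Python. It is an illustration under stated assumptions rather than a protocol from the paper: `wrong_side_rate` and `binomial_upper_tail` are hypothetical helper names, each verdict is treated as an independent trial, and a judge who guesses at random (p = 0.5) is assumed as the reference point. No experimental data is included.

```python
from math import comb
from typing import Sequence


def wrong_side_rate(verdicts: Sequence[bool]) -> float:
    """Fraction of debates where the judge picked the agent arguing for the wrong answer.

    Each entry of `verdicts` is True when the judge sided with the wrong answer.
    """
    return sum(verdicts) / len(verdicts)


def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p).

    With p = 0.5 this is the chance of seeing k or more wrong verdicts
    out of n if the judge were guessing (an assumed reference point).
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
```

A wrong-side rate well above one half, with a small upper-tail probability, would indicate that judges are being systematically misled rather than merely uninformed.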

Original abstract

To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment. We report results on an initial MNIST experiment where agents compete to convince a sparse classifier, boosting the classifier's accuracy from 59.4% to 88.9% given 6 pixels and from 48.2% to 85.2% given 4 pixels. Finally, we discuss theoretical and practical aspects of the debate model, focusing on potential weaknesses as the model scales up, and we propose future human and computer experiments to test these properties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes training AI agents to learn complex human goals via a zero-sum debate game: given a question or action, two agents alternate short statements up to a fixed limit, after which a human judge selects the agent that provided the most true and useful information. The authors draw an analogy to complexity theory asserting that optimal-play debate can resolve any PSPACE question with only polynomial-time judges (while direct judgment is limited to NP). They report an MNIST experiment in which debate improves a sparse classifier from 59.4% to 88.9% accuracy with 6 pixels and from 48.2% to 85.2% with 4 pixels. The paper discusses theoretical and practical aspects of the model, potential scaling weaknesses, and directions for future human and computational experiments.

Significance. If the debate protocol functions as described, it would offer a concrete mechanism for scalable oversight on tasks exceeding direct human judgment, addressing a central challenge in AI alignment. The complexity-theoretic analogy supplies an intriguing theoretical motivation, and the MNIST results constitute preliminary empirical evidence of accuracy gains under a simplified regime. The manuscript's explicit identification of scaling issues and call for targeted experiments are constructive contributions that can guide subsequent work.

major comments (2)
  1. [§3] §3 (complexity-theoretic analogy): the claim that optimal-play debate answers PSPACE questions with polynomial-time judges rests on the unproven assumption that the protocol forces all nested quantifiers and implicit facts into short, locally checkable statements. No explicit reduction or proof sketch is supplied showing how an arbitrary PSPACE instance is encoded so that a human judge can verify truthfulness in poly time; the analogy therefore remains informal and does not yet support the central separation from NP.
  2. [§4] §4 (MNIST experiment): the reported accuracy improvements (59.4% to 88.9% with 6 pixels) are obtained in a regime where the judge has direct access to ground-truth labels. This setup does not test multi-turn debate on complex reasoning tasks where the correct answer depends on unverifiable subclaims, leaving the key assumption about reliable human judgment for hard questions unexamined and limiting support for the PSPACE claim.
minor comments (2)
  1. [§2] The description of the debate protocol in §2 would benefit from a concise pseudocode listing of the turn order, statement length bound, and judge decision rule to improve reproducibility.
  2. [§4] Figure captions for the MNIST results should report the number of independent runs and any error bars or statistical tests performed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the debate protocol for scalable oversight. We address each major comment below, clarifying the scope of our claims and indicating revisions where appropriate.

Point-by-point responses
  1. Referee: [§3] §3 (complexity-theoretic analogy): the claim that optimal-play debate answers PSPACE questions with polynomial-time judges rests on the unproven assumption that the protocol forces all nested quantifiers and implicit facts into short, locally checkable statements. No explicit reduction or proof sketch is supplied showing how an arbitrary PSPACE instance is encoded so that a human judge can verify truthfulness in poly time; the analogy therefore remains informal and does not yet support the central separation from NP.

    Authors: We agree that the manuscript presents the PSPACE connection as a high-level analogy rather than a formal proof with an explicit reduction. The statement draws motivation from known results such as IP=PSPACE, adapted to a two-agent debate setting, but does not derive or encode an arbitrary PSPACE instance into the protocol. Our primary focus is the AI alignment application and the initial empirical demonstration; a complete formal mapping is left as future work. We will revise §3 to state more explicitly that the claim is analogical, to reference the underlying complexity results, and to note the absence of a detailed reduction.

    revision: yes

  2. Referee: [§4] §4 (MNIST experiment): the reported accuracy improvements (59.4% to 88.9% with 6 pixels) are obtained in a regime where the judge has direct access to ground-truth labels. This setup does not test multi-turn debate on complex reasoning tasks where the correct answer depends on unverifiable subclaims, leaving the key assumption about reliable human judgment for hard questions unexamined and limiting support for the PSPACE claim.

    Authors: We acknowledge that the MNIST experiment uses a judge with direct access to ground-truth labels and therefore operates in a simplified regime that does not examine unverifiable subclaims or fully test the human-judgment assumptions underlying the PSPACE analogy. The experiment serves only as a controlled proof-of-concept that debate can improve accuracy under information constraints. The manuscript already describes it as an initial result and proposes future human experiments on more complex tasks. We will revise the experimental section and discussion to highlight this limitation more explicitly and to clarify its implications for the theoretical claims.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity in the debate protocol or PSPACE analogy

The paper proposes a zero-sum debate game for training and draws an explicit analogy to complexity theory (PSPACE vs NP) without deriving the separation from any fitted parameters, self-definitional equations, or load-bearing self-citations. The MNIST results are presented as separate empirical measurements of accuracy gains under direct observation, not as predictions forced by the same inputs. No steps reduce by construction to prior author work or rename known results; the central claim rests on an independent assumption about human judges that is stated openly rather than smuggled in via citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested assumption that optimal play in the debate game produces truthful revelations that human judges can correctly evaluate on complex tasks.

axioms (1)
  • [standard math] Debate with optimal play solves PSPACE questions using polynomial-time judges
    Invoked as the key complexity-theoretic justification for why the method can handle harder questions than direct judgment.

pith-pipeline@v0.9.0 · 5534 in / 1101 out tokens · 60891 ms · 2026-05-13T21:16:37.707928+00:00 · methodology

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Risks from Learned Optimization in Advanced Machine Learning Systems

    cs.AI 2019-06 accept novelty 9.0

    Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

  2. AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

    cs.CV 2026-05 conditional novelty 7.0

    AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domai...

  3. MathDuels: Evaluating LLMs as Problem Posers and Solvers

    cs.CL 2026-04 unverdicted novelty 7.0

    Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.

  4. Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes ac...

  5. Fine-Tuning Language Models from Human Preferences

    cs.CL 2019-09 unverdicted novelty 7.0

    Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.

  6. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  7. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  8. CHAL: Council of Hierarchical Agentic Language

    cs.AI 2026-05 unverdicted novelty 6.0

    CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.

  9. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 6.0

    Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.

  10. The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting

    cs.GT 2026-05 unverdicted novelty 6.0

    Non-affine approval functions create unavoidable miscalibration in proper scoring rules for strategic agents, but step-function thresholds enable first-best screening without it, uniquely for the Brier score.

  11. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  12. Automated alignment is harder than you think

    cs.AI 2026-05 unverdicted novelty 6.0

    Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.

  13. Automated alignment is harder than you think

    cs.AI 2026-05 unverdicted novelty 6.0

    Automating alignment research with AI agents risks generating hard-to-detect errors in fuzzy tasks, producing misleading safety evaluations even without deliberate sabotage.

  14. Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery

    cs.AI 2026-05 unverdicted novelty 6.0

    Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.

  15. Stayin' Aligned Over Time: Towards Longitudinal Human-LLM Alignment via Contextual Reflection and Privacy-Preserving Behavioral Data

    cs.HC 2026-05 unverdicted novelty 6.0

    A methodological framework and browser system BITE for collecting evolving user preferences on LLM outputs through context-triggered reflections and privacy-preserving data over time.

  16. The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact ...

  17. AI Alignment via Incentives and Correction

    cs.LG 2026-05 unverdicted novelty 6.0

    AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM ...

  18. AI Alignment via Incentives and Correction

    cs.LG 2026-05 unverdicted novelty 6.0

    AI alignment is framed as inducing equilibrium behavior in a solver-auditor interaction via adaptive rewards found by bandit optimization, yielding improved oversight and reduced errors in LLM coding experiments.

  19. Causal Foundations of Collective Agency

    cs.AI 2026-04 unverdicted novelty 6.0

    Collective agency arises when a group's joint actions are faithfully captured by a simpler causal model of unified rational behavior.

  20. From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling

    math.OC 2026-04 unverdicted novelty 6.0

    Agora-Opt uses decentralized debate among LLM agent teams plus a read-write memory bank to produce more accurate optimization models from text than prior LLM methods.

  21. Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

    cs.AI 2026-04 unverdicted novelty 6.0

    A separation-of-powers system architecture for AI agents uses independent layers, cryptographic capability tokens, and a formal verification framework to maintain goal integrity even under model compromise.

  22. Improving Factuality and Reasoning in Language Models through Multiagent Debate

    cs.CL 2023-05 unverdicted novelty 6.0

    Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.

  23. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  24. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

  25. Extrapolating Volition with Recursive Information Markets

    cs.GT 2026-04 unverdicted novelty 5.0

    Recursive information markets with forgetful LLM buyers can align information prices with true value and extend to scalable oversight in AI alignment.

  26. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 4.0

    Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.

  27. Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems

    cs.AI 2026-05 unverdicted novelty 4.0

    Frontier AI needs contextual multi-objective optimization to select and balance multiple context-dependent objectives rather than relying on single stable goals.

  28. AICCE: AI Driven Compliance Checker Engine

    cs.CR 2026-04 unverdicted novelty 4.0

    AICCE combines RAG-based retrieval of protocol specs with dual LLM pipelines for debate-driven explanations or fast script execution, reporting up to 99% accuracy on IPv6 samples.

  29. Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT -- Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding

    cs.AI 2026-04 unverdicted novelty 2.0

    Squirrel behaviors supply a comparative template for a hierarchical control model that integrates latent dynamics, episodic memory, observer beliefs, and delayed verification in agentic AI.
