arxiv: 2605.00751 · v1 · submitted 2026-05-01 · 💻 cs.LG

Recognition: unknown

NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search

Sizhe Tang , Zuyuan Zhang , Mahdi Imani , Tian Lan

Authors on Pith no claims yet

Pith reviewed 2026-05-09 19:04 UTC · model grok-4.3

classification 💻 cs.LG

keywords multi-agent MCTSinteraction-guided explorationbandit proposal rulelocal regret boundscooperative planningsurrogate-guided searchjoint action space

0 comments

The pith

NonZero guides multi-agent MCTS proposals by scoring single-agent and pairwise local deviations with a surrogate, reaching graph-local optima under sublinear local regret without joint-action enumeration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that cooperative multi-agent Monte Carlo Tree Search becomes tractable by replacing direct expansion over the full joint-action space with an interaction-guided proposal rule. Single-agent deviations are ranked by predicted gain while two-agent deviations use a mixed-difference measure to detect coordination benefits that single agents cannot achieve alone. The rule is derived by casting candidate proposal as a bandit problem over these local deviations, yielding a sublinear local-regret guarantee that the search reaches approximate graph-local optima. This matters because it keeps the method practical under realistic search budgets where the joint space grows exponentially with the number of agents. Experiments on MatGame, SMAC, and SMACv2 show improved sample efficiency and final performance relative to matched model-based and model-free baselines.

Core claim

NonZero formalizes candidate proposal as a bandit problem over local deviations in a low-dimensional nonlinear representation and derives an interaction-guided rule with a sublinear local-regret guarantee for reaching approximate graph-local optima without enumerating the joint-action space.

What carries the argument

The NonZero proposal rule, which ranks single-agent deviations by predicted gain and two-agent deviations by mixed-difference interaction scores to select candidates from the surrogate representation.

If this is right

MCTS search budgets can be spent on higher-value joint actions instead of uniform enumeration.
The same interaction scores can be reused across tree nodes to further amortize surrogate cost.
Performance gains appear under fixed compute in cooperative control tasks such as StarCraft multi-agent scenarios.
The local-regret bound holds even when no single agent can improve the current joint action but pairs can.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method suggests that many real-world coordination problems admit low-order interaction structure that can be exploited without full joint modeling.
Surrogate model accuracy directly limits how tightly the local-regret bound translates into actual policy improvement.
The same proposal mechanism could be grafted onto other tree-search or planning algorithms that suffer from joint-action explosion.

Load-bearing premise

Single-agent and pairwise interaction scores from the surrogate model suffice to identify near-optimal joint actions without missing critical higher-order coordination patterns.

What would settle it

A cooperative domain in which optimal joint actions depend on three-or-more-agent synergies that pairwise scores systematically undervalue, causing NonZero to converge to a suboptimal graph-local point while exhaustive joint-action MCTS finds a better policy.

Figures

Figures reproduced from arXiv: 2605.00751 by Mahdi Imani, Sizhe Tang, Tian Lan, Zuyuan Zhang.

**Figure 1.** Figure 1: Evaluations on 3 SMAC tasks/maps. Y-axis denotes the win rate and X-axis denotes training steps. Each algorithm is executed with 3 random seeds. NonZero achieves over 96% winning rate on all 3 maps, outperforming all baselines and also gets high winning rate much faster. 0.0 0.2 0.4 0.6 0.8 1.0 Training Steps 1e6 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 Win Rate Terran_5_vs_5 NonZero MAZero MAZero-NP M… view at source ↗

**Figure 2.** Figure 2: Comparisons on 3 SMACv2 tasks/maps.Y-axis denotes the win rate and X-axis denotes the training steps. NonZero nearly doubles the winning rate on these challenging maps in SMACv2 and consistently outperforms all baselines. Each algorithm is executed with 3 random seeds. 5. Conclusion We proposed NONZERO, a scalable multi-agent MCTS framework that avoids enumerating the exponential jointaction space |A| = d… view at source ↗

**Figure 3.** Figure 3: Ablation study of NonZero by removing various design components, including the mixing net for initializing the parameter θ and the Hessian for curvature-aware optimization. Full NonZero shows large performance gain compared with that without any of the key modules. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequence… view at source ↗

read the original abstract

Monte Carlo Tree Search (MCTS) scales poorly in cooperative multi-agent domains because expansion must consider an exponentially large set of joint actions, severely limiting exploration under realistic search budgets. We propose NonZero, which keeps multi-agent MCTS tractable by running surrogate-guided selection over a low-dimensional nonlinear representation using an interaction-guided proposal rule, instead of directly exploring the full joint-action space. Our exploration uses an interaction score: single-agent deviations are ranked by predicted gain, while two-agent deviations are scored by a mixed-difference measure that reveals coordination benefits even when no single agent can improve alone. We formalize candidate proposal as a bandit problem over local deviations and derive a proposal rule, NonZero, with a sublinear local-regret guarantee for reaching approximate graph-local optima without enumerating the joint-action space. Empirically, NonZero improves sample efficiency and final performance on MatGame, SMAC, and SMACv2 relative to strong model-based and model-free baselines under matched search budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NonZero gives multi-agent MCTS a concrete interaction score and bandit proposal rule to avoid joint-action blowup, but the sublinear local-regret claim rests on unshown steps and may not survive higher-order coordination.

read the letter

NonZero keeps multi-agent MCTS tractable by scoring single-agent deviations on predicted gain and two-agent deviations on a mixed-difference measure, then feeding those into a bandit-style proposal rule instead of expanding the full joint space. The paper claims this yields a sublinear local-regret guarantee to approximate graph-local optima and reports better sample efficiency on MatGame, SMAC, and SMACv2 against matched baselines.

Referee Report

1 major / 1 minor

Summary. The paper proposes NonZero for multi-agent MCTS, which avoids exponential joint-action enumeration by using a surrogate-guided proposal rule over single-agent and pairwise local deviations. It formalizes proposal as a bandit problem over these deviations and derives NonZero with a claimed sublinear local-regret guarantee to reach approximate graph-local optima. Empirically, it reports improved sample efficiency and performance versus model-based and model-free baselines on MatGame, SMAC, and SMACv2 under matched budgets.

Significance. If the sublinear local-regret guarantee holds and pairwise scores suffice to surface near-optimal joint actions, NonZero would address a core scalability barrier in cooperative multi-agent search, enabling deeper exploration in high-dimensional action spaces. The empirical gains on standard benchmarks indicate potential practical impact in multi-agent RL and game AI.

major comments (1)

[Abstract] Abstract (formalization of bandit over local deviations): the sublinear local-regret guarantee for approximate graph-local optima is load-bearing for the central claim, yet the proposal rule scores only single-agent gains and two-agent mixed differences. No bound is given on the approximation error when irreducible higher-order (k>2) coordination patterns exist, which could allow convergence to graph-local points far from global optima while still satisfying the local-regret bound.

minor comments (1)

[Abstract] The abstract asserts empirical gains but provides no details on surrogate model training, hyperparameter sensitivity, or ablations isolating the mixed-difference term; these would strengthen attribution of results to the interaction-guided rule.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and the insightful comment on the scope of our local-regret guarantee. We respond point-by-point below.

read point-by-point responses

Referee: [Abstract] Abstract (formalization of bandit over local deviations): the sublinear local-regret guarantee for approximate graph-local optima is load-bearing for the central claim, yet the proposal rule scores only single-agent gains and two-agent mixed differences. No bound is given on the approximation error when irreducible higher-order (k>2) coordination patterns exist, which could allow convergence to graph-local points far from global optima while still satisfying the local-regret bound.

Authors: We agree that the sublinear local-regret guarantee is stated with respect to approximate graph-local optima defined over the local moves we consider (single-agent deviations and pairwise mixed-difference deviations). Because the proposal rule is deliberately restricted to these low-order terms to keep the bandit dimension linear rather than exponential, we do not derive, and cannot derive without additional structural assumptions on the payoff function, a general bound on the suboptimality gap to a global optimum when irreducible k>2 coordination patterns are present. This is an inherent limitation of any local-search approach that avoids full joint-action enumeration. At the same time, the mixed-difference term is explicitly constructed to surface pairwise coordination benefits that would be invisible to single-agent scores alone, and the empirical results on MatGame, SMAC, and SMACv2 demonstrate that the resulting graph-local points are competitive with or superior to strong baselines under identical search budgets. We will revise the abstract and add a short paragraph in the discussion section that (i) reiterates the graph-local scope of the guarantee and (ii) explicitly notes the absence of a higher-order approximation bound as a limitation of the current proposal rule. revision: partial

Circularity Check

0 steps flagged

No circularity: derivation of NonZero is a direct mathematical formalization from bandit setup

full rationale

The paper's central derivation formalizes candidate proposal as a bandit problem over local deviations (single-agent and pairwise) and derives the NonZero rule plus its sublinear local-regret guarantee directly from that setup. No steps reduce by construction to fitted inputs, self-citations, or renamed empirical patterns; the guarantee is presented as following from the bandit analysis without circular dependence on the target result or higher-order terms. The provided text contains no load-bearing self-citations or ansatzes that collapse the claim, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are identifiable. The surrogate model and interaction scores are introduced but their precise functional forms and any fitted constants are not specified.

pith-pipeline@v0.9.0 · 5477 in / 1211 out tokens · 63902 ms · 2026-05-09T19:04:17.015407+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry
cs.LG 2026-05 unverdicted novelty 7.0

MSRL represents trajectory segments as PSD matrices to prove additive composition properties and bootstrap value functions for better transfer, reaching 0.73 AUC versus 0.57-0.65 baselines.

Reference graph

Works this paper leans on

75 extracted references · 16 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000
[2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980
[3]

HyperNetworks

Hypernetworks , author=. arXiv preprint arXiv:1609.09106 , year=

work page internal anchor Pith review arXiv
[4]

, author=

The Isotron Algorithm: High-Dimensional Isotonic Regression. , author=. COLT , volume=
[5]

Proceedings of the 2015 ACM SIGMOD international conference on management of data , pages=

Learning generalized linear models over normalized data , author=. Proceedings of the 2015 ACM SIGMOD international conference on management of data , pages=

2015
[6]

M. J. Kearns , title =
[7]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983
[8]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000
[9]

Cochain perspectives on temporal-difference signals for learning beyond markov dynamics.arXiv preprint arXiv:2602.06939, 2026a

Cochain Perspectives on Temporal-Difference Signals for Learning Beyond Markov Dynamics , author=. arXiv preprint arXiv:2602.06939 , year=

work page arXiv
[10]

Suppressed for Anonymity , author=
[11]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981
[12]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959
[13]

2024 , eprint=

Monte Carlo Tree Search with Boltzmann Exploration , author=. 2024 , eprint=

2024
[14]

MINT: Minimal Information Neuro-Symbolic Tree for Objective-Driven Knowledge-Gap Reasoning and Active Elicitation

MINT: Minimal Information Neuro-Symbolic Tree for Objective-Driven Knowledge-Gap Reasoning and Active Elicitation , author=. arXiv preprint arXiv:2602.05048 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

2024 , eprint=

Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning , author=. 2024 , eprint=

2024
[16]

arXiv preprint arXiv:2406.00614 , year=

Efficient Monte Carlo tree search via on-the-fly state-conditioned action abstraction , author=. arXiv preprint arXiv:2406.00614 , year=

work page arXiv
[17]

Acdzero: Graph-embedding-based tree search for mastering automated cyber defense.arXiv preprint arXiv:2601.02196,

Acdzero: Graph-embedding-based tree search for mastering automated cyber defense , author=. arXiv preprint arXiv:2601.02196 , year=

work page arXiv
[18]

Bandit Algorithms , url =

Lattimore, Tor and Szepesvari, Csaba , biburl =. Bandit Algorithms , url =
[19]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Learning to collaborate with unknown agents in the absence of reward , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[20]

2023 , eprint=

PAC: Assisted Value Factorisation with Counterfactual Predictions in Multi-Agent Reinforcement Learning , author=. 2023 , eprint=

2023
[21]

Value Functions Factorization With Latent State Information Sharing in Decentralized Multi-Agent Policy Gradients , volume=

Zhou, Hanhan and Lan, Tian and Aggarwal, Vaneet , year=. Value Functions Factorization With Latent State Information Sharing in Decentralized Multi-Agent Policy Gradients , volume=. IEEE Transactions on Emerging Topics in Computational Intelligence , publisher=. doi:10.1109/tetci.2023.3293193 , number=

work page doi:10.1109/tetci.2023.3293193 2023
[22]

Advances in Neural Information Processing Systems , volume=

Pac: Assisted value factorization with counterfactual predictions in multi-agent reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=
[23]

Advances in neural information processing systems , volume=

Weighted qmix: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning , author=. Advances in neural information processing systems , volume=
[24]

arXiv preprint arXiv:2309.12951 , year=

Boosting Studies of Multi-Agent Reinforcement Learning on Google Research Football Environment: the Past, Present, and Future , author=. arXiv preprint arXiv:2309.12951 , year=

work page arXiv
[25]

2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=

Autonomous Swarm Robot Coordination via Mean-Field Control Embedding Multi-Agent Reinforcement Learning , author=. 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=. 2023 , organization=

2023
[26]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Mastering complex control in moba games with deep reinforcement learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[27]

MILCOM 2025-2025 IEEE Military Communications Conference (MILCOM) , pages=

Agentic AI for Cyber Defense: LLM-Guided Hierarchical Multi-Agent Reinforcement Learning , author=. MILCOM 2025-2025 IEEE Military Communications Conference (MILCOM) , pages=. 2025 , organization=

2025
[28]

IEEE Communications Magazine , volume=

Cnn partitioning and offloading for vehicular edge networks in web3 , author=. IEEE Communications Magazine , volume=. 2023 , publisher=

2023
[29]

Autonomous Agents and Multi-Agent Systems , volume=

A survey and critique of multiagent deep reinforcement learning , author=. Autonomous Agents and Multi-Agent Systems , volume=. 2019 , publisher=

2019
[30]

Nature , volume=

Mastering atari, go, chess and shogi by planning with a learned model , author=. Nature , volume=. 2020 , publisher=

2020
[31]

Agent alpha: Tree search unifying generation, exploration and evaluation for computer-use agents.arXiv preprint arXiv:2602.02995,

Agent Alpha: Tree Search Unifying Generation, Exploration and Evaluation for Computer-Use Agents , author=. arXiv preprint arXiv:2602.02995 , year=

work page arXiv
[32]

arXiv preprint arXiv:2302.10418 , year=

Mac-po: Multi-agent experience replay via collective priority optimization , author=. arXiv preprint arXiv:2302.10418 , year=

work page arXiv
[33]

Value-Decomposition Networks For Cooperative Multi-Agent Learning

Value-decomposition networks for cooperative multi-agent learning , author=. arXiv preprint arXiv:1706.05296 , year=

work page Pith review arXiv
[34]

Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization , author=. arXiv preprint arXiv:2604.07165 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[35]

arXiv preprint arXiv:1910.05366 , year=

Learning nearly decomposable value functions via communication minimization , author=. arXiv preprint arXiv:1910.05366 , year=

work page arXiv 1910
[36]

arXiv preprint arXiv:2405.11778 , year=

Efficient Multi-agent Reinforcement Learning by Planning , author=. arXiv preprint arXiv:2405.11778 , year=

work page arXiv
[37]

IEEE Transactions on Computational Intelligence and AI in games , volume=

A survey of monte carlo tree search methods , author=. IEEE Transactions on Computational Intelligence and AI in games , volume=. 2012 , publisher=

2012
[38]

Journal of Machine Learning Research , volume=

Monotonic value function factorisation for deep multi-agent reinforcement learning , author=. Journal of Machine Learning Research , volume=
[39]

2016 , publisher=

A concise introduction to decentralized POMDPs , author=. 2016 , publisher=

2016
[40]

2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA) , pages=

A Crowdsensing Service Pricing Method in Vehicular Edge Computing , author=. 2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA) , pages=. 2024 , organization=

2024
[41]

The Annals of Mathematical Statistics , volume=

Adjustment of an inverse matrix corresponding to a change in one element of a given matrix , author=. The Annals of Mathematical Statistics , volume=. 1950 , publisher=

1950
[42]

Mathematical programming , volume=

An analysis of approximations for maximizing submodular set functions—I , author=. Mathematical programming , volume=. 1978 , publisher=

1978
[43]

SIAM Journal on Computing , volume=

Maximizing a monotone submodular function subject to a matroid constraint , author=. SIAM Journal on Computing , volume=. 2011 , publisher=

2011
[44]

Machine learning , volume=

Finite-time analysis of the multiarmed bandit problem , author=. Machine learning , volume=. 2002 , publisher=

2002
[45]

Proceedings of the 19th international conference on World wide web , pages=

A contextual-bandit approach to personalized news article recommendation , author=. Proceedings of the 19th international conference on World wide web , pages=
[46]

International Conference on Machine Learning , pages=

Learning and planning in complex action spaces , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021
[47]

Advances in Neural Information Processing Systems , volume=

Smacv2: An improved benchmark for cooperative multi-agent reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=
[48]

Mikayel Samvelyan and Tabish Rashid and Christian Schroeder de Witt and Gregory Farquhar and Nantas Nardelli and Tim G. J. Rudner and Chia-Man Hung and Philiph H. S. Torr and Jakob Foerster and Shimon Whiteson , journal =
[49]

Advances in neural information processing systems , volume=

The surprising effectiveness of ppo in cooperative multi-agent games , author=. Advances in neural information processing systems , volume=
[50]

IEEE Transactions on Games , volume=

Self-adaptive Monte Carlo tree search in general game playing , author=. IEEE Transactions on Games , volume=. 2018 , publisher=

2018
[51]

arXiv preprint arXiv:2511.06142 , year=

MALinZero: Efficient Low-Dimensional Search for Mastering Complex Multi-Agent Planning , author=. arXiv preprint arXiv:2511.06142 , year=

work page arXiv
[52]

2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=

An MCTS-DRL based obstacle and occlusion avoidance methodology in robotic follow-ahead applications , author=. 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=. 2023 , organization=

2023
[53]

arXiv preprint arXiv:2312.15864 , year=

BalMCTS: Balancing Objective Function and Search Nodes in MCTS for Constraint Optimization Problems , author=. arXiv preprint arXiv:2312.15864 , year=

work page arXiv
[54]

European conference on machine learning , pages=

Bandit based monte-carlo planning , author=. European conference on machine learning , pages=. 2006 , organization=

2006
[55]

nature , volume=

Mastering the game of go without human knowledge , author=. nature , volume=. 2017 , publisher=

2017
[56]

Advances in neural information processing systems , volume=

Multi-agent actor-critic for mixed cooperative-competitive environments , author=. Advances in neural information processing systems , volume=
[57]

International conference on learning representations , year=

Dop: Off-policy multi-agent decomposed policy gradients , author=. International conference on learning representations , year=
[58]

, author=

Edge Intelligence with Distributed Processing of DNNs: A Survey. , author=. CMES-Computer Modeling in Engineering & Sciences , volume=
[59]

International conference on machine learning , pages=

Fop: Factorizing optimal joint policy of maximum-entropy multi-agent reinforcement learning , author=. International conference on machine learning , pages=. 2021 , organization=

2021
[60]

Summer school on machine learning , pages=

Concentration inequalities , author=. Summer school on machine learning , pages=. 2003 , publisher=

2003
[61]

IEEE Transactions on Automatic Control , volume=

Bandit problems with side observations , author=. IEEE Transactions on Automatic Control , volume=. 2005 , publisher=

2005
[62]

International conference on machine learning , pages=

Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning , author=. International conference on machine learning , pages=. 2019 , organization=

2019
[63]

arXiv preprint arXiv:2101.04788 , year=

Scalable anytime planning for multi-agent mdps , author=. arXiv preprint arXiv:2101.04788 , year=

work page arXiv
[64]

2021 IEEE international conference on robotics and automation (ICRA) , pages=

Human-robot collaborative multi-agent path planning using Monte Carlo tree search and social reward sources , author=. 2021 IEEE international conference on robotics and automation (ICRA) , pages=. 2021 , organization=

2021
[65]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Decentralized monte carlo tree search for partially observable multi-agent pathfinding , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[66]

2020 , publisher=

Bandit algorithms , author=. 2020 , publisher=

2020
[67]

Applied Soft Computing , volume=

Intelligent monitoring for infectious diseases with fuzzy systems and edge computing: A survey , author=. Applied Soft Computing , volume=. 2022 , publisher=

2022
[68]

1978 , publisher=

An analysis of approximations for maximizing submodular set functions—II , author=. 1978 , publisher=

1978
[69]

Advances in neural information processing systems , volume=

Mastering atari games with limited data , author=. Advances in neural information processing systems , volume=
[70]

Advances in neural information processing systems , volume=

Parametric bandits: The generalized linear case , author=. Advances in neural information processing systems , volume=
[71]

Advances in Neural Information Processing Systems , volume=

Scalable generalized linear bandits: Online computation and hashing , author=. Advances in Neural Information Processing Systems , volume=
[72]

International Conference on Machine Learning , pages=

Provably optimal algorithms for generalized linear contextual bandits , author=. International Conference on Machine Learning , pages=. 2017 , organization=

2017
[73]

Advances in neural information processing systems , volume=

Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature , author=. Advances in neural information processing systems , volume=
[74]

Journal of machine learning research , volume=

Rademacher and gaussian complexities: Risk bounds and structural results , author=. Journal of machine learning research , volume=
[75]

Mathematical programming , volume=

Cubic regularization of Newton method and its global performance , author=. Mathematical programming , volume=. 2006 , publisher=

2006