pith. machine review for the scientific record.

arxiv: 2401.01335 · v3 · submitted 2024-01-02 · 💻 cs.LG · cs.AI · cs.CL · stat.ML

Recognition: 2 theorem links

· Lean Theorem

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models


Pith reviewed 2026-05-14 22:55 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · stat.ML
keywords self-play fine-tuning · language model self-improvement · supervised fine-tuning · preference optimization · LLM alignment · iterative refinement

The pith

Self-play fine-tuning turns a weak supervised LLM into a strong one by iteratively contrasting its own generations against fixed human data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Self-Play fIne-tuNing (SPIN), which begins with a supervised fine-tuned language model and has the checkpoint from the previous iteration generate responses. The training objective then teaches the model to favor the original human-annotated responses over these self-generated ones. The process repeats across iterations, and the theory shows that the objective reaches its global optimum only when the model's policy matches the target human data distribution. Experiments on standard benchmarks show consistent gains, and SPIN even outperforms models trained with direct preference optimization (DPO) on additional GPT-4 preference data.

Core claim

SPIN refines an LLM policy through repeated self-play in which the model produces its own training examples from previous checkpoints and learns to distinguish them from the fixed set of human demonstrations; the resulting objective has a unique global optimum achieved exclusively when the policy aligns with the target data distribution.

What carries the argument

The self-play mechanism in which responses generated by the model at iteration t are contrasted against the unchanging human-annotated demonstrations to update the policy at iteration t+1.
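
The contrast described here can be sketched as a pairwise logistic loss over log-probability ratios. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `spin_pair_loss`, the scale `lam`, and the toy log-probabilities are hypothetical, and the loss form (a DPO-style logistic loss that uses the previous iterate as the reference model) is assumed for illustration rather than quoted from this page.

```python
import math

def spin_pair_loss(logp_new_human, logp_old_human,
                   logp_new_self, logp_old_self, lam=0.1):
    """Logistic loss on the gap between two log-probability ratios.

    logp_new_* : log-probability under the policy being trained (iteration t+1)
    logp_old_* : log-probability under the frozen previous iterate (iteration t)
    *_human    : evaluated on the human-annotated response
    *_self     : evaluated on the response generated by the previous iterate
    lam        : hypothetical scale on the margin (an assumption)
    """
    margin = lam * ((logp_new_human - logp_old_human)
                    - (logp_new_self - logp_old_self))
    # log(1 + exp(-margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# At the start of a round the new policy equals the old one, so the margin is 0.
loss_at_start = spin_pair_loss(-12.0, -12.0, -9.0, -9.0)

# Raising the human response's likelihood and lowering the self-generated
# response's likelihood makes the margin positive and the loss smaller.
loss_after_update = spin_pair_loss(-10.0, -12.0, -11.0, -9.0)
```

In this sketch the loss only depends on how the trained policy moves relative to the previous iterate, so a response pair contributes a learning signal only while the two can still be told apart.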

If this is right

  • Performance rises on the HuggingFace Open LLM Leaderboard, MT-Bench, and Big-Bench tasks without any new human annotations.
  • SPIN surpasses direct preference optimization even when the latter receives supplementary GPT-4 preference data.
  • The initial supervised fine-tuning dataset alone suffices to reach higher capability levels through iterative refinement.
  • The training objective attains its global optimum only when the policy matches the target distribution, which gives a natural stopping criterion.
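
The stopping behaviour in the last bullet can be checked on a toy problem. The sketch below assumes a softmax policy over three possible responses to a single prompt and a logistic pairwise loss; both are assumptions made for illustration, and `spin_gradient_at_start`, `lam`, and the distributions are hypothetical. It evaluates the gradient of the expected loss at the start of a round, where the trained policy still equals the previous iterate.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def spin_gradient_at_start(p_data, logits_t, lam=1.0):
    """Gradient of the expected pairwise logistic loss w.r.t. the logits,
    evaluated where the trained policy equals the previous iterate."""
    p_t = softmax(logits_t)
    k = len(logits_t)
    grad = [0.0] * k
    slope = -0.5  # derivative of log(1 + exp(-m)) at margin m = 0
    for y in range(k):            # human response, drawn from p_data
        for y_neg in range(k):    # self-generated response, drawn from p_t
            weight = p_data[y] * p_t[y_neg]
            for j in range(k):
                # d/dz_j [log p(y) - log p(y_neg)] = 1[j == y] - 1[j == y_neg]
                d_margin = lam * ((j == y) - (j == y_neg))
                grad[j] += weight * slope * d_margin
    return grad

p_data = [0.7, 0.2, 0.1]  # made-up target distribution

# Previous iterate differs from the data: the gradient is nonzero.
grad_mismatch = spin_gradient_at_start(p_data, [0.0, 0.0, 0.0])

# Previous iterate already matches the data: the gradient vanishes.
logits_match = [math.log(p) for p in p_data]
grad_match = spin_gradient_at_start(p_data, logits_match)
```

The check only shows that a policy matching the data is a stationary point at the start of a round in this toy setting; it does not by itself establish uniqueness of the optimum, which is what the paper's theorem claims.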

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach implies that limited human demonstration data can be amplified through internal generation loops rather than external collection.
  • Self-play of this form may extend to other sequence-generation tasks where synthetic examples are inexpensive to produce.
  • If the contrast remains informative across many rounds, the method could reduce dependence on large-scale preference labeling pipelines.

Load-bearing premise

Responses generated by earlier model versions supply clean contrastive signals that steadily move the policy toward the human data distribution without accumulating biases or shifts that would block further gains.

What would settle it

If repeated self-play iterations produce no measurable reduction in divergence between the model's output distribution and the human-annotated distribution, or if benchmark scores plateau or degrade, the alignment claim would be refuted.
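
A check of this kind can be sketched on a toy problem by recording the divergence after every round and flagging the first round where it stops shrinking. Everything below is an illustrative assumption rather than a result from the paper: the three-outcome distributions are made up, the policy is a plain softmax, each round takes a single descent step, and the names `run_toy_self_play` and `first_stall` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def run_toy_self_play(p_data, logits, rounds=50, step=0.5):
    """Record KL(p_data || p_t) after each toy self-play round."""
    history = [kl_divergence(p_data, softmax(logits))]
    for _ in range(rounds):
        p_t = softmax(logits)
        # One descent step per round. For a softmax policy and a logistic
        # pairwise loss, the descent direction at the start of a round is
        # proportional to (p_data - p_t).
        logits = [z + step * (d - q) for z, d, q in zip(logits, p_data, p_t)]
        history.append(kl_divergence(p_data, softmax(logits)))
    return history

def first_stall(history, tol=1e-6):
    """Index of the first round whose KL reduction is below tol, else None."""
    for t in range(1, len(history)):
        if history[t - 1] - history[t] < tol:
            return t
    return None

history = run_toy_self_play([0.7, 0.2, 0.1], [0.0, 0.0, 0.0])
stall_round = first_stall(history)
```

For a real language model the divergence cannot be computed exactly over all responses, so a practical version would need an estimate (for example, held-out log-likelihood on the human data) in place of `kl_divergence`.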

Original abstract

Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents. Codes are available at https://github.com/uclaml/SPIN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes Self-Play fIne-tuNing (SPIN), which starts from an SFT model and iteratively generates responses from previous policy iterates to create contrastive training pairs against the fixed human-annotated data; the model is then fine-tuned to prefer the human data. It proves that the resulting objective has a unique global optimum precisely when the policy matches the target data distribution, and reports empirical gains on the HuggingFace Open LLM Leaderboard, MT-Bench, and Big-Bench tasks that exceed those of DPO trained with additional GPT-4 preference data.

Significance. If the empirical improvements are robust to compute-matched controls and the iterative dynamics reliably reach the claimed optimum, the result would be significant: it offers a route to strengthen LLMs using only existing SFT demonstrations, without further human or GPT-4 annotations. The theoretical statement is a standard consequence of imitation-learning objectives, but the self-play mechanism itself is the novel element whose practical reliability remains to be fully established.

major comments (1)
  1. [Abstract / Theoretical Analysis] Abstract and theoretical section: the proof establishes that the static objective attains its global minimum only at exact alignment with the target distribution, yet no analysis (contraction mapping, Lyapunov function, or fixed-point convergence argument) is supplied for the sequence of policies generated by the self-play iteration. Early weak generations could therefore induce a persistent distribution shift or suboptimal fixed point from which gradient updates cannot escape, so the static guarantee does not automatically transfer to the training trajectory.
minor comments (2)
  1. [Experiments] Empirical section: the comparison with DPO augmented by GPT-4 data should explicitly state total training tokens, learning-rate schedules, and whether the extra preference data is matched in volume to the self-generated data used by SPIN.
  2. [Method] The manuscript would benefit from a short discussion of how the self-generated negative examples are sampled (temperature, top-p, number of samples per prompt) and whether any filtering is applied to avoid degenerate outputs.
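
For readers unfamiliar with the sampling settings named in the second minor comment, the sketch below shows what temperature and top-p control when drawing one token. It is a generic illustration, not the paper's generation setup: `sample_top_p`, the toy logits, and the chosen values of `temperature` and `top_p` are all hypothetical.

```python
import math
import random

def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=random):
    """Draw one token index using temperature scaling and nucleus (top-p) filtering."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample from the kept tokens in proportion to their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the toy run is repeatable
toy_logits = [2.0, 1.0, 0.0, -3.0]
draws = [sample_top_p(toy_logits, temperature=0.7, top_p=0.9, rng=rng)
         for _ in range(200)]
```

Lower temperatures and smaller `top_p` values concentrate the draws on the most likely tokens, which changes how hard the self-generated negatives are to distinguish from the human responses; that is why the comment asks for these settings to be reported.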

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract / Theoretical Analysis] Abstract and theoretical section: the proof establishes that the static objective attains its global minimum only at exact alignment with the target distribution, yet no analysis (contraction mapping, Lyapunov function, or fixed-point convergence argument) is supplied for the sequence of policies generated by the self-play iteration. Early weak generations could therefore induce a persistent distribution shift or suboptimal fixed point from which gradient updates cannot escape, so the static guarantee does not automatically transfer to the training trajectory.

    Authors: We acknowledge that the theoretical analysis establishes the global optimum of the static objective but does not include a formal convergence argument (e.g., contraction mapping or Lyapunov function) for the sequence of policies produced by the iterative self-play procedure. This is a valid observation. The self-play iteration is motivated by the fact that each step contrasts the current policy's outputs against fixed human-annotated data, which in principle reduces distribution shift over time; however, we do not claim or prove that the iteration is guaranteed to reach the global optimum from arbitrary initializations. In the revised manuscript we will add a dedicated paragraph in the theoretical section and an accompanying figure in the experiments section that plots benchmark performance versus iteration number. These additions will empirically document monotonic improvement and the absence of observable stagnation on the evaluated tasks, thereby clarifying the practical behavior of the iteration while preserving the paper's primary contribution of the self-play objective and its static optimality result.

    revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper proves that the global optimum of its training objective occurs precisely when the policy matches the target human data distribution. This is a direct consequence of the standard form of the preference-based loss (human responses preferred over self-generated ones), which has its unique minimum at the target distribution by construction of the objective itself rather than by any self-referential fit or redefinition. The iterative self-play generates negatives from prior policy iterates, but the static objective's minimum is independently fixed by the human-annotated data and does not reduce to the outputs of the iteration. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes are invoked to establish the claim. The derivation remains self-contained; the lack of a separate convergence argument for the sequence of iterates is a gap in dynamics analysis, not a circular reduction in the stated theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions from reinforcement learning and imitation learning that a policy can be improved by contrasting against a fixed target distribution; no new free parameters or invented entities are introduced in the abstract description.

axioms (1)
  • [standard math] A global optimum of the training objective exists and is achieved precisely when the learned policy matches the target human data distribution.
    This is invoked as the theoretical foundation for why the self-play objective drives improvement.

pith-pipeline@v0.9.0 · 5614 in / 1298 out tokens · 51513 ms · 2026-05-14T22:55:44.631426+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.LawOfExistence · defect_zero_iff_one · contradicts

    CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

    Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

    cs.LG 2026-05 unverdicted novelty 7.0

    The paper establishes the first $\tilde{O}(\epsilon^{-1})$ upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general functi...

  2. RewardHarness: Self-Evolving Agentic Post-Training

    cs.AI 2026-05 unverdicted novelty 7.0

    RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

  3. IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...

  4. Structural Verification for Reliable EDA Code Generation without Tool-in-the-Loop Debugging

    cs.SE 2026-04 unverdicted novelty 7.0

    Structural dependency graphs and staged pre-execution verification raise LLM-based EDA code pass rates to 82.5% (single-step) and 70-84% (multi-step) while halving tool calls by catching dependency violations before runtime.

  5. KTO: Model Alignment as Prospect Theoretic Optimization

    cs.LG 2024-02 conditional novelty 7.0

    KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

  6. Self-Rewarding Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

  7. Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

  8. Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...

  9. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...

  10. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...

  11. Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

  12. G-Zero: Self-Play for Open-Ended Generation from Zero Data

    cs.LG 2026-05 unverdicted novelty 6.0

    G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.

  13. Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

    cs.LG 2026-05 unverdicted novelty 6.0

    GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.

  14. PaT: Planning-after-Trial for Efficient Test-Time Code Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.

  15. Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models

    cs.LG 2026-05 conditional novelty 6.0

    Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.

  16. SignDPO: Multi-level Direct Preference Optimisation for Skeleton-based Gloss-free Sign Language Translation

    cs.CL 2026-04 unverdicted novelty 6.0

    SignDPO uses hierarchical perturbations, self-guided attention-based sampling, and an automated language-level preference generator to align skeleton trajectories with linguistic semantics, outperforming prior gloss-f...

  17. GroupDPO: Memory efficient Group-wise Direct Preference Optimization

    cs.CL 2026-04 unverdicted novelty 6.0

    GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.

  18. π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

    cs.LG 2026-04 unverdicted novelty 6.0

    π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...

  19. Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

    cs.CL 2026-04 unverdicted novelty 6.0

    Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.

  20. Autogenesis: A Self-Evolving Agent Protocol

    cs.AI 2026-04 unverdicted novelty 5.0

    Autogenesis Protocol defines resource and evolution layers for LLM agents, enabling a system that shows performance gains on long-horizon planning benchmarks.

  21. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.


  71. [71]

    International Conference on Machine Learning , pages=

    Expressiveness of rectifier networks , author=. International Conference on Machine Learning , pages=

  72. [72]

    IEEE Transactions on Information theory , volume=

    Universal approximation bounds for superpositions of a sigmoidal function , author=. IEEE Transactions on Information theory , volume=. 1993 , publisher=

  73. [73]

    The Landscape of Empirical Risk for Non-convex Losses

    The landscape of empirical risk for non-convex losses , author=. arXiv preprint arXiv:1607.06534 , year=

  74. [74]

    Advances in neural information processing systems , pages=

    Exponentially many local minima for single neurons , author=. Advances in neural information processing systems , pages=

  75. [75]

    Learning One-hidden-layer Neural Networks with Landscape Design

    Learning One-hidden-layer Neural Networks with Landscape Design , author=. arXiv preprint arXiv:1711.00501 , year=

  76. [76]

    , author=

    The Isotron Algorithm: High-Dimensional Isotonic Regression. , author=. COLT , year=

  77. [77]

    arXiv preprint arXiv:1802.06463 , year=

    Local Geometry of One-Hidden-Layer Neural Networks for Logistic Regression , author=. arXiv preprint arXiv:1802.06463 , year=

  78. [78]

    Advances in Neural Information Processing Systems , pages=

    Efficient learning of generalized linear and single index models with isotonic regression , author=. Advances in Neural Information Processing Systems , pages=

  79. [79]

    Proceedings of the forty-fifth annual ACM symposium on Theory of computing , pages=

    Low-rank matrix completion using alternating minimization , author=. Proceedings of the forty-fifth annual ACM symposium on Theory of computing , pages=. 2013 , organization=

  80. [80]

    International conference on machine learning , pages=

    On the importance of initialization and momentum in deep learning , author=. International conference on machine learning , pages=
