{"total":31,"items":[{"citing_arxiv_id":"2606.00539","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GNMR: Runtime Stability Control for Low-Precision Large Language Model Training","primary_cat":"cs.LG","submitted_at":"2026-05-30T05:11:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21557","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scalable Reinforcement Learning via Adaptive Batch Scaling","primary_cat":"stat.ML","submitted_at":"2026-05-20T13:46:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16844","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Artificial Adaptive Intelligence: The Missing Stage Between Narrow and General Intelligence","primary_cat":"cs.AI","submitted_at":"2026-05-16T07:04:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes Artificial Adaptive Intelligence as the regime between narrow and general AI, defined by elimination of human-specified hyperparameters, and introduces an adaptivity index plus parametric minimality principle grounded in minimum description length.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14200","ref_index":69,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How to 
Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization","primary_cat":"cs.LG","submitted_at":"2026-05-13T23:32:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11255","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model","primary_cat":"cs.CL","submitted_at":"2026-05-11T21:27:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09154","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Predicting Large Model Test Losses with a Noisy Quadratic System","primary_cat":"cs.LG","submitted_at":"2026-05-09T20:35:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"$\\sum_{n=1}^{N} \\langle w^{(k-1)} - w^*, v_n \\rangle \\lambda_n v_n + \\gamma \\sum_{n=1}^{N} \\xi_n^{(k)} v_n$. (18) For each $n \\le N$, $\\langle w^{(k)} - w^{(k-1)}, v_n \\rangle = -\\gamma \\langle w^{(k-1)} - w^*, v_n \\rangle \\lambda_n + \\gamma \\xi_n^{(k)}$ (19) 
$\\langle w^{(k)} - w^*, v_n \\rangle = \\langle w^{(k)} - w^{(k-1)}, v_n \\rangle + \\langle w^{(k-1)} - w^*, v_n \\rangle = (1 - \\gamma \\lambda_n) \\langle w^{(k-1)} - w^*, v_n \\rangle + \\gamma \\xi_n^{(k)}$. (20) Thus $\\mathbb{E}[\\langle w^{(k)} - w^*, v_n \\rangle^2] = (1 - \\gamma \\lambda_n)^2 \\mathbb{E}[\\langle w^{(k-1)} - w^*, v_n \\rangle^2] + \\gamma^2 \\mathbb{E}[(\\xi_n^{(k)})^2]$ (21) Apply recursively, we get $\\mathbb{E}[\\langle w^{(k)} - w^*, v_n \\rangle^2] = (1 - \\gamma \\lambda_n)^{2k} \\mathbb{E}[\\langle w^{(0)} - w^*, v_n \\rangle^2] + \\sum_{j=1}^{k} (1 - \\gamma \\lambda_n)^{2(k-j)} \\gamma^2 \\mathbb{E}[(\\xi_n^{(j)})^2]$ (22) $= (1 - \\gamma \\lambda_n)^{2k} \\frac{1}{\\lambda_n} \\frac{P}{n^p} + \\gamma^2 \\sum_{j=1}^{k} (1 - \\gamma \\lambda_n)^{2(k-j)} \\frac{R}{n^r} \\frac{1}{B}$ (23) We also know $w^{(k)} - w^{(0)} \\in \\mathrm{span}\\{v_1, \\ldots, v_N\\}$, so $\\langle w^{(k)} - w^{(0)}, v_n \\rangle = 0$ for any $n > N$. $\\mathbb{E}[\\langle w^{(k)} - w^*, H(w^{(k)} - w^*) \\rangle]$ (24) $= \\mathbb{E}[\\sum_{n=1}^{N} \\lambda_n \\langle w^{(k)} - w^{(0)}, v_n$"},{"citing_arxiv_id":"2605.05683","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization","primary_cat":"stat.ML","submitted_at":"2026-05-07T05:19:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A solvable model of neural scaling laws, 2022. URL https://arxiv.org/abs/2210.16859. [22] Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1024-1034. PMLR, 2020. [23] Abdulkadir Canatar, Blake Bordelon, and Cengiz Pehlevan. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications, 12(2914), 2021. 
[24] Stefano Spigler, Mario Geiger, and Matthieu Wyart. Asymptotic learning curves of kernel methods: Empirical data versus teacher-student paradigm, 2019."},{"citing_arxiv_id":"2605.02850","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Quantum Tilted Loss in Variational Optimization: Theory and Applications","primary_cat":"quant-ph","submitted_at":"2026-05-04T17:25:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QTL unifies expectation-value minimization with CVaR and Gibbs heuristics under one tunable operator, amplifying gradients in structured cases while preserving global minima and shifting the bottleneck to measurement variance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.28118","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hierarchical Fault Detection and Diagnosis for Transformer Architectures","primary_cat":"cs.SE","submitted_at":"2026-04-30T17:07:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the prototype-distance objective and the contrastive separation loss both require gradients to flow through the group-level weights during training. As encoder and decoder architectures differ in the number of feature groups ($G = 12$ vs. $G = 13$), we train separate encoder and decoder diagnostic models. Given the feature vector $z \\in \\mathbb{R}^d$ partitioned into $G$ groups $\\{z_{g_1}, \\ldots, z_{g_G}\\}$ (see Table 8), each group $g$ is encoded by a dedicated MLP (Equation (20)). $h_g = \\mathrm{MLP}_g(z_g) \\in \\mathbb{R}^h, \\quad g = 1, \\ldots, G,$ (20) where $h = 32$ is the hidden dimension per group (selected via grid search on the inner validation fold). The group embeddings form the matrix $H = [h_1; \\ldots; h_G] \\in \\mathbb{R}^{G \\times h}$. 
A projection layer maps the concatenated group embeddings to a shared representation (Equation (21)). $z_{\\mathrm{proj}} = W_{\\mathrm{proj}} \\, \\mathrm{vec}(H) \\in \\mathbb{R}^e,$ (21) Algorithm 2 DEFault++ hierarchical diagnosis for one input instance"},{"citing_arxiv_id":"2604.26687","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training","primary_cat":"cs.DC","submitted_at":"2026-04-29T13:52:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Empirical CBS methods avoid this calibration problem by measuring the batch-size threshold directly. For example, branching-based methods launch several short training branches from a checkpoint, each with a different batch size and learning-rate scaling rule, and select the largest batch size whose loss remains close to smaller-batch branches after a fixed token window [29, 55]. This provides a more direct statistical target than raw GNS, but it requires additional training runs at each measurement point and returns only a 𝐵𝑔 target. It does not decide which 3D parallel layout or micro-batch decomposition maximizes wall-clock progress, so it is complementary to our Goodput controller rather than a replacement. 
4 Copus System Design"},{"citing_arxiv_id":"2604.21691","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"There Will Be a Scientific Theory of Deep Learning","primary_cat":"stat.ML","submitted_at":"2026-04-23T13:58:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21215","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Recurrent Transformer: Greater Effective Depth and Efficient Decoding","primary_cat":"cs.LG","submitted_at":"2026-04-23T02:12:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.28743","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking Language Model Scaling under Transferable Hypersphere Optimization","primary_cat":"cs.LG","submitted_at":"2026-03-30T17:51:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency 
gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.22347","ref_index":73,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Intelligence Inertia: Physical Isomorphism and Applications","primary_cat":"cs.AI","submitted_at":"2026-03-22T03:37:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Intelligence Inertia models the computational resistance to structural change in neural networks via a heuristic relativistic analogy, yielding a J-shaped cost curve that diverges from classical approximations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.17771","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers","primary_cat":"cs.LG","submitted_at":"2026-03-18T14:31:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.11178","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence","primary_cat":"cs.AI","submitted_at":"2026-03-11T18:00:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 
gains on AIME tasks with reduced forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.04686","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How does the optimizer implicitly bias the model merging loss landscape?","primary_cat":"cs.LG","submitted_at":"2025-10-06T10:56:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00432","ref_index":240,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2025-07-01T05:23:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.24275","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GradPower: Powering Gradients for Faster Language Model Pre-Training","primary_cat":"cs.LG","submitted_at":"2025-05-30T06:49:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GradPower applies sign-power to gradients before optimization and achieves lower terminal loss in language model pre-training across architectures, scales, datasets, and 
schedules.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.06708","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free","primary_cat":"cs.CL","submitted_at":"2025-05-10T17:15:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.04434","ref_index":155,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","primary_cat":"cs.CL","submitted_at":"2024-05-07T15:56:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.02954","ref_index":157,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek LLM: Scaling Open-Source Language Models with Longtermism","primary_cat":"cs.CL","submitted_at":"2024-01-05T18:59:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and 
DPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2307.06435","ref_index":107,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Comprehensive Overview of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-07-12T20:01:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"fine-tuned using adapter layers [106] for downstream tasks. GPT-3 [6]: The GPT-3 architecture is the same as the GPT-2 [5] but with dense and sparse attention in transformer layers similar to the Sparse Transformer [67]. It shows that large models can train on larger batch sizes with a lower learning rate. To decide the batch size during training, GPT-3 uses the gradient noise scale as in [107]. Overall, GPT-3 increases model parameters to 175B showing that the performance of large language models improves with the scale and is competitive with the fine-tuned models. [Figure 7: Unified text-to-text training example, source image from [10]. Figure 8: The image is the article of [108], showing an example of PanGu-α architecture.] 
mT5 [11]: A multilingual T5 model [10] trained on the mC4"},{"citing_arxiv_id":"2304.01373","ref_index":126,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling","primary_cat":"cs.CL","submitted_at":"2023-04-03T20:58:15+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2207.05221","ref_index":182,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Language Models (Mostly) Know What They Know","primary_cat":"cs.CL","submitted_at":"2022-07-11T22:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.02311","ref_index":95,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PaLM: Scaling Language Modeling with Pathways","primary_cat":"cs.CL","submitted_at":"2022-04-05T16:11:45+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on 
BIG-bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2112.00861","ref_index":124,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A General Language Assistant as a Laboratory for Alignment","primary_cat":"cs.CL","submitted_at":"2021-12-01T22:24:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2102.01293","ref_index":92,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Laws for Transfer","primary_cat":"cs.LG","submitted_at":"2021-02-02T04:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2010.14701","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling Laws for Autoregressive Generative Modeling","primary_cat":"cs.LG","submitted_at":"2020-10-28T02:17:24+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"estimates the entropy of the underlying data 
distribution, while the reducible loss approximates the KL divergence between the data and model distributions. In the case of language we use results from [BMR+20], and only show the full loss L. 1 Introduction Large scale models, datasets, and compute budgets have driven rapid progress in machine learning. Recent work [HNA+17, RRBS19, LWS+20, RDG+20, KMH+20, SK20, BMR+20] suggests that the benefits of scale are also highly predictable. When the cross-entropy loss L of a language model is bottlenecked by either the compute budget C, dataset size D, or model size N, the loss scales with each of these quantities as a simple power-law. Sample efficiency also improves with model size. These results raise a number of questions."},{"citing_arxiv_id":"1912.06680","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dota 2 with Large Scale Deep Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2019-12-13T19:56:40+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"a master-level Go player [4]. Building upon this work, AlphaGoZero, AlphaZero, and ExIt discard imitation learning in favor of using Monte-Carlo Tree Search during training to obtain higher quality trajectories [12, 37, 38] and apply this to Go, Chess, Shogi, and Hex. Most recently, human-level play has been demonstrated in 3D first-person multi-player environments [30], professional-level play in the real-time strategy game StarCraft 2 using AlphaStar [7], and superhuman performance 15 in Poker [39]. AlphaStar is particularly relevant to this paper. 
In that effort, which ran concurrently to our own, researchers trained agents to play Starcraft 2, another complex game with real-time performance requirements, imperfect information, and long time horizons."},{"citing_arxiv_id":"1910.02054","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ZeRO: Memory Optimizations Toward Training Trillion Parameter Models","primary_cat":"cs.LG","submitted_at":"2019-10-04T17:29:39+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}