{"total":28,"items":[{"citing_arxiv_id":"2605.23885","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions","primary_cat":"cs.CL","submitted_at":"2026-05-22T17:45:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LINK improves cross-lingual knowledge transfer via lexical substitutions in English pretraining data, yielding notable downstream gains and up to 2x training speedup across eight languages and five model sizes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11011","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models","primary_cat":"cs.LG","submitted_at":"2026-05-10T11:05:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(d) Replacing the decay gate with sigmoid gating also makes optimization less stable and yields a higher final LLM, indicating that the decay-style gate better supports stable long-loop training. (e) Among the tested activation functions for the monotonicity term, SiLU [41] provides the most reliable optimization behavior.
ReLU [42], SELU [43], and SoftPlus [58] each lead to less stable or less favorable trajectories, which supports the design choice used in the main LoopUS recipe. (f) TBPTT [59] incurs higher computational cost while plateauing at a substantially higher LLM than the standard LoopUS training recipe, indicating lower efficiency and worse performance in this setting. 5 Conclusion This paper presents Looped Depth Up-Scaling (LoopUS), a post-training framework that recasts a"},{"citing_arxiv_id":"2605.08894","ref_index":4,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-09T11:19:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"URL https://doi.org/10.18653/v1/2022.acl-long.508. [3] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in neural information processing systems (NeurIPS), 30:6240-6249, 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/hash/b22b257ad0519d4500539da3c8bcf4dd-Abstract.html. [4] Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence (AAAI), pages 7432-7439, 2020. URL https://doi.org/10.48550/arXiv.1911.11641. [5] A. Chan, Y. Tay, and Y.-S. Ong.
What it thinks is important is important: Robustness transfers through input gradients."},{"citing_arxiv_id":"2604.18738","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-20T18:43:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06169","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"In-Place Test-Time Training","primary_cat":"cs.LG","submitted_at":"2026-04-07T17:59:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Perplexity on a validation set comprised of Pile [20] and Proof-Pile-2 [40]. This metric measures perplexity on a fixed final block of tokens when extending the preceding context, where a decreasing perplexity trend indicates effective context usage. For the 4B models, we conduct a broader evaluation on a suite of downstream tasks, including common sense reasoning benchmarks (HellaSwag [66], ARC [12], MMLU [26, 27], PIQA [7]) and the long-context RULER benchmark [28]. Results and Discussion. In Figure 2, we plot the sliding window perplexity against context length for 500M/1.5B model.
It can be easily seen that our In-Place TTT consistently achieves lower validation perplexity than all competitive baselines, with its performance steadily improving up to the full 32k context."},{"citing_arxiv_id":"2603.28239","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network","primary_cat":"cs.AR","submitted_at":"2026-03-30T09:59:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In the following, we address these challenges and detail the microarchitecture of SCIN. Furthermore, we present a prototype based on multiple FPGAs to validate the feasibility and effectiveness of SCIN in a real hardware system. 3.2 Network Protocol Extension In shared-memory systems designed for AI accelerators, dedicated communication protocols such as NVLink [27], Scale-Up Ethernet (SUE) [13], and Ultra Accelerator Link (UALink) [17] have been proposed to support memory transactions between AI accelerators. However, enabling the proposed ISA to directly access endpoint accelerator memory via the communication fabric inevitably necessitates modifications to existing communication protocols and their hardware implementations.
To address this challenge, we propose a hardware microarchi-"},{"citing_arxiv_id":"2605.20189","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation","primary_cat":"cs.AI","submitted_at":"2026-03-23T07:18:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SOLAR introduces a self-optimizing agent using meta-learning on model weights and RL-driven strategy discovery for lifelong adaptation in LLMs, claiming superior performance on reasoning tasks across domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.20856","ref_index":98,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NVIDIA Nemotron 3: Efficient and Open Intelligence","primary_cat":"cs.CL","submitted_at":"2025-12-24T00:24:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24552","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Short window attention enables long-term memorization","primary_cat":"cs.LG","submitted_at":"2025-09-29T10:04:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.18629","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HyperAdapt: Simple High-Rank Adaptation","primary_cat":"cs.LG","submitted_at":"2025-09-23T04:29:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.12119","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource","primary_cat":"cs.CL","submitted_at":"2025-06-13T17:59:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.16155","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PRIMETIME : Limits of LLMs in Temporal Primitives","primary_cat":"cs.NE","submitted_at":"2025-04-22T17:52:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.19786","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Gemma 3 Technical Report","primary_cat":"cs.CL","submitted_at":"2025-03-25T15:52:34+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.01743","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs","primary_cat":"cs.CL","submitted_at":"2025-03-03T17:05:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.12120","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws","primary_cat":"cs.LG","submitted_at":"2025-02-17T18:45:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited 
impact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.05465","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)","primary_cat":"cs.CL","submitted_at":"2025-01-03T19:53:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[11] Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023). [12] Bhendawade, N., Belousova, I., Fu, Q., Mason, H., Rastegari, M., and Najibi, M. Speculative streaming: Fast llm inference without auxiliary models. arXiv preprint arXiv:2402.11131 (2024). [13] Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. Piqa: Reasoning about physical commonsense in natural language. ArXiv abs/1911.11641 (2019). [14] Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., and Dao, T. Medusa: Simple llm inference acceleration framework with multiple decoding heads. 
arXiv preprint arXiv:2401.10774 (2024)."},{"citing_arxiv_id":"2412.00069","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning","primary_cat":"cs.LG","submitted_at":"2024-11-26T00:56:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CD-MoE condenses fine-grained MoE layers with shared experts into dense layers, retaining 90% accuracy with 27.5% memory cut and 1.26x speedup on DeepSeekMoE-16B, recovering 98% via brief fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.00118","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gemma 2: Improving Open Language Models at a Practical Size","primary_cat":"cs.CL","submitted_at":"2024-07-31T19:13:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.14219","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone","primary_cat":"cs.CL","submitted_at":"2024-04-22T14:32:33+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision 
capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.08295","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gemma: Open Models Based on Gemini Research and Technology","primary_cat":"cs.CL","submitted_at":"2024-03-13T06:59:16+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.04652","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Yi: Open Foundation Models by 01.AI","primary_cat":"cs.CL","submitted_at":"2024-03-07T16:52:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"window attention - the model remains using the full attention even the input is 200K. 5 Safety To enhance the model's trustworthiness and safety, we develop a full-stack Responsible AI Safety Engine (RAISE). RAISE ensures safe pretraining, alignment, and deployment. This section discusses our safety measures in the pretraining and alignment stages. 
Safety in Pretraining Aligning with standard pretraining data safety practices [5, 58, 77], we build a set of filters based on heuristic rules, keyword matching, and learned classifiers to remove text containing personal identifiers and private data, and reduce sexual, violent, and extremist content. Safety in Alignment Informed by existing research in [24, 35], we first build a comprehensive safety taxonomy. This taxonomy covers a broad spectrum of potential concerns, including environmental"},{"citing_arxiv_id":"2402.17764","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits","primary_cat":"cs.CL","submitted_at":"2024-02-27T18:56:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.17762","ref_index":106,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Massive Activations in Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-02-27T18:55:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.06196","ref_index":196,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large Language Models: A 
Survey","primary_cat":"cs.CL","submitted_at":"2024-02-09T05:37:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"SIQA dataset has 38,000 multiple-choice questions designed to assess emotional and social intelligence in everyday circumstances. This dataset covers a wide variety of social scenarios. In SIQA, the potential answers is a mixture of human-selected responses and machine-generated ones that have been filtered through adversarial processes. • OpenBookQA (OBQA) [196] is a new kind of question-answering dataset where answering its questions requires additional common and commonsense knowledge not contained in the book and rich text comprehension. This dataset includes around 6,000 multiple-choice questions. Each question is linked to one core fact, as well as an additional collection of over 6000 facts.
The questions were developed"},{"citing_arxiv_id":"2309.05463","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Textbooks Are All You Need II: phi-1.5 technical report","primary_cat":"cs.CL","submitted_at":"2023-09-11T14:01:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.02311","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PaLM: Scaling Language Modeling with Pathways","primary_cat":"cs.CL","submitted_at":"2022-04-05T16:11:45+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"8B and 62B models can be found in Appendix H.1. 
0-shot 1-shot Few-shot Task Prior SOTA PaLM 540B Prior SOTA PaLM 540B Prior SOTA PaLM 540B TriviaQA (EM) 71 .3a 76.9 75.8a 81.4 75.8a (1) 81.4 (1) Natural Questions (EM) 24.7a 21.2 26 .3a 29.3 32.5a (1) 39.6 (64) Web Questions (EM) 19.0a 10.6 25.3b 22.6 41 .1b (64) 43.5 (64) Lambada (EM) 77 .7f 77.9 80.9a 81.8 87.2c (15) 89.7 (8) HellaSwag 80 .8f 83.4 80.2c 83.6 82.4c (20) 83.8 (5) StoryCloze 83 .2b 84.6 84.7b 86.1 87.7b (70) 89.0 (5) Winograd 88 .3b 90.1 89.7 b 87.5 88 .6a (2) 89.4 (5) Winogrande 74 .9f 81.1 73.7c 83.7 79.2a (16) 85.1 (5) Drop (F1) 57 .3a 69.4 57.8a 70.8 58.6a (2) 70.8 (1) CoQA (F1) 81.5b 77.6 84.0b 79.9 85.0b (5) 81.5 (5) QuAC (F1) 41 .5b 45.2 43."},{"citing_arxiv_id":"2009.03300","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Measuring Massive Multitask Language Understanding","primary_cat":"cs.CY","submitted_at":"2020-09-07T17:59:25+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2005.14165","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Language Models are Few-Shot Learners","primary_cat":"cs.CL","submitted_at":"2020-05-28T17:29:03+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"with the hope of stimulating further study of test-time 
behavior of language models. 3.9.1 Arithmetic To test GPT-3's ability to perform simple arithmetic operations without task-specific training, we developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language: • 2 digit addition (2D+) - The model is asked to add two integers sampled uniformly from [0, 100), phrased in the form of a question, e.g. \"Q: What is 48 plus 76? A: 124.\" • 2 digit subtraction (2D-) - The model is asked to subtract two integers sampled uniformly from [0, 100); the answer may be negative. Example: \"Q: What is 34 minus 53? A: -19\". • 3 digit addition (3D+) - Same as 2 digit addition, except numbers are uniformly sampled from [0, 1000). 21 Figure 3.10: Results on all 10 arithmetic tasks in the few-shot settings for models of different sizes. There is a significant jump from the second largest model (GPT-3 13B) to the largest model (GPT-3 175), with the latter being able to reliably accurate 2 digit arithmetic, usually accurate 3 digit arithmetic, and correct answers a significant fraction of the time on 4-5 digit arithmetic, 2 digit multiplication, and compound operations. Results for one-shot and zero-shot are shown in the appendix. • 3 digit subtraction (3D-) - Same as 2 digit subtraction, except numbers are uniformly sampled from [0, 1000). • 4 digit addition (4D+) - Same as 3 digit addition, except uniformly sampled from [0, 10000). • 4 digit subtraction (4D-) - Same as 3 digit subtraction, except uniformly sampled from [0, 10000). • 5 digit addition (5D+) - Same as 3 digit addition, except uniformly sampled from [0, 100000). • 5 digit subtraction (5D-) - Same as 3 digit subtraction, except uniformly sampled from [0, 100000). • 2 digit multiplication (2Dx) - The model is asked to multiply two integers sampled uniformly from [0, 100), e.g. \"Q: What is 24 times 42? A: 1008\". • One-digit composite (1DC) - The model is asked to perform a"}],"limit":50,"offset":0}