{"total":31,"items":[{"citing_arxiv_id":"2605.21147","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SMoA: Spectrum Modulation Adapter for Parameter-Efficient Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-20T13:19:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SMoA is a new PEFT adapter that uses block-wise Hadamard-modulated low-rank branches on spectral partitions to cover more pretrained spectral directions than standard LoRA under a smaller parameter budget.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18607","ref_index":99,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Forecasting Downstream Performance of LLMs With Proxy Metrics","primary_cat":"cs.CL","submitted_at":"2026-05-18T16:17:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17997","ref_index":7,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization","primary_cat":"cs.LG","submitted_at":"2026-05-18T07:51:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MARR uses per-module adaptive residual scaling updated by PID feedback to balance error correction against Hessian-approximation bias in low-bit PTQ.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14738","ref_index":60,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"TAPIOCA: Why Task- Aware Pruning Improves OOD model Capability","primary_cat":"cs.LG","submitted_at":"2026-05-14T12:01:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Task-aware pruning improves OOD performance by removing layers that distort task-adapted representation profiles, realigning OOD inputs with the geometry observed on ID data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13429","ref_index":123,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment","primary_cat":"cs.CL","submitted_at":"2026-05-13T12:23:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08894","ref_index":6,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-09T11:19:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"What it thinks is important is important: Robustness transfers through input gradients. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 332-341, 2020. URL https://openaccess. thecvf.com/content_CVPR_2020/html/Chan_What_It_Thinks_Is_Important_Is_ Important_Robustness_Transfers_Through_CVPR_2020_paper.html. [6] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2924-2936, 2019. URL https://doi. org/10."},{"citing_arxiv_id":"2605.08636","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints","primary_cat":"cs.CL","submitted_at":"2026-05-09T03:02:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"datasets into three categories:Verify,Choose, andReason. Verifyevaluates whether a model can determine if a fact, condition, or semantic relation holds. This category reflects edge scenarios such as checking whether a notification is urgent, whether a sensor event satisfies an alert condition, or whether a context supports a candidate conclusion. We instantiate this category with BoolQ [6] and QNLI [25, 20], which evaluate boolean question answering and question-answer entailment, respectively. Chooseevaluates whether a model can select the most appropriate option from multiple candidates. This category reflects scenarios where an edge assistant or local controller needs to choose among candidate replies, recommendations, actions, or explanations based on contextual information."},{"citing_arxiv_id":"2605.06366","ref_index":5,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Layer Collapse in Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-07T14:39:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01046","ref_index":22,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Learning in the Fisher Subspace: A Guided Initialization for LoRA Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-01T19:20:25+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25578","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling","primary_cat":"cs.CL","submitted_at":"2026-04-28T12:45:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Marco-MoE delivers open multilingual MoE models with 5% activation sparsity that outperform similarly sized dense models on English and multilingual benchmarks through efficient upcycling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21100","ref_index":70,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences","primary_cat":"cs.LG","submitted_at":"2026-04-22T21:38:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19728","ref_index":14,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"VLA Foundry: A Unified Framework for Training Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-21T17:51:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"\"Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks\". In:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024, pp. 24185-24198. [13] Cheng Chi et al. \"Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots\". In:Proceedings of Robotics: Science and Systems (RSS). 2024. [14] Christopher Clark et al. \"BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions\". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp."},{"citing_arxiv_id":"2604.19520","ref_index":23,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SimDiff: Depth Pruning via Similarity and Difference","primary_cat":"cs.AI","submitted_at":"2026-04-21T14:43:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19398","ref_index":64,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-04-21T12:26:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibration sequences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17396","ref_index":176,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Representation-Guided Parameter-Efficient LLM Unlearning","primary_cat":"cs.CL","submitted_at":"2026-04-19T11:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06291","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-07T14:57:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TalkLoRA equips MoE-LoRA experts with a communication module that smooths routing dynamics and improves performance on language tasks under similar parameter budgets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"yi =B iEiAix,(4) whereiranges from 1 ton. Here,rdenotes the total rank of TalkLoRA, andnrepresents the num- ber of experts. Talking Module:This module guides the router in weight allocation and relaxes the inde- pendence assumption among experts. It enables information exchange among experts prior to rout- ing. Formally, we define: ˜hi = nX j=1 Cijhj,(5) whereC∈R n×n is a learnable communication matrix andh j =A jx∈R r n serves as the internal representation of expertj. This operation allows each expert to integrate compact, task-relevant signals from other experts while preserving its own specialization. The Talk- ing Module is lightweight, adding onlyO(n 2)pa- rameters. Routing:Unlike traditional routing, which re-"},{"citing_arxiv_id":"2512.02764","ref_index":17,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-12-02T13:44:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21285","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark","primary_cat":"cs.CL","submitted_at":"2025-11-26T11:18:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.23818","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2025-10-27T19:59:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ScaLoRA analytically derives per-update column scalings that let low-rank increments accumulate into high-rank weight updates, yielding faster convergence and higher accuracy than prior LoRA variants on LLMs up to 12B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.15707","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?","primary_cat":"cs.CL","submitted_at":"2025-07-21T15:15:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LLM accuracy on reasoning tasks differs significantly by question type, with step-by-step reasoning accuracy often uncorrelated to final answer selection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00432","ref_index":213,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2025-07-01T05:23:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.00663","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Titans: Learning to Memorize at Test Time","primary_cat":"cs.LG","submitted_at":"2024-12-31T22:32:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) . Ed. by Jill Burstein, Christy Doran, and Thamar Solorio. Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 2924-2936. doi: 10.18653/v1/N19-1300. url: https: //aclanthology.org/N19-1300/. 18 [21] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. \"Think you have solved question answering? try arc, the ai2 reasoning challenge\". In:arXiv preprint arXiv:1803.05457 (2018). [22] Nelson Cowan. \"What are the differences between long-term, short-term, and working memory?\" In:Progress in brain research 169 (2008), pp."},{"citing_arxiv_id":"2412.06464","ref_index":100,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Gated Delta Networks: Improving Mamba2 with Delta Rule","primary_cat":"cs.CL","submitted_at":"2024-12-09T13:09:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.11794","ref_index":43,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"DataComp-LM: In search of the next generation of training sets for language models","primary_cat":"cs.LG","submitted_at":"2024-06-17T17:42:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"formats, considering varied domains like mathematics, text-book knowledge, and common-sense reasoning. To evaluate data curation algorithms, we focus on three main performance metrics. First, we consider MMLU 5-shot accuracy [78], which is widely used to compare state-of-the-art models like GPT-4 [122] and Llama 3 70B [4]. Second, we propose the CORE centered accuracy, computed over a subset of 22 tasks (e.g., HellaSwag [195] and ARC-E [43]) that provide a low-variance signal even at small scales, linearly rescaling the accuracy per task so that 0 corresponds to random guessing and 1 corresponds to perfect accuracy. Finally, we report the EXTENDED centered accuracy, which averages the centered performance for all of our 53 tasks. For more metric details, see Appendix G. 4 Building high-quality training datasets with DCLM"},{"citing_arxiv_id":"2405.04434","ref_index":128,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","primary_cat":"cs.CL","submitted_at":"2024-05-07T15:56:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.02954","ref_index":132,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"DeepSeek LLM: Scaling Open-Source Language Models with Longtermism","primary_cat":"cs.CL","submitted_at":"2024-01-05T18:59:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.12983","ref_index":86,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GAIA: a benchmark for General AI Assistants","primary_cat":"cs.CL","submitted_at":"2023-11-21T20:34:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.14233","ref_index":122,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Enhancing Chat Language Models by Scaling High-quality Instructional Conversations","primary_cat":"cs.CL","submitted_at":"2023-05-23T16:49:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.17564","ref_index":23,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"BloombergGPT: A Large Language Model for Finance","primary_cat":"cs.LG","submitted_at":"2023-03-30T17:30:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2110.01552","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA","primary_cat":"cs.CL","submitted_at":"2021-10-04T16:45:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2104.08691","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Power of Scale for Parameter-Efficient Prompt Tuning","primary_cat":"cs.CL","submitted_at":"2021-04-18T03:19:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"prompt initialization by training models at all sizes while ﬁxing other hyperparameters to their default values. For random initialization, we sample uni- 6Going past 100 tokens appears mildly detrimental for larger models. A similar pattern of diminishing performance past a certain preﬁx length is observed by Li and Liang (2021). formly from the range [−0.5, 0.5]. When initial- izing from sampled vocabulary, we restrict to the 5,000 most \"common\" tokens in T5's Sentence- Piece vocabulary (Kudo and Richardson, 2018), which is ordered by likelihood in the pre-training corpus. For \"class label\" initialization, we take the embeddings for the string representations of each class in the downstream task and use them to"}],"limit":50,"offset":0}