{"total":19,"items":[{"citing_arxiv_id":"2512.11013","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data","primary_cat":"cs.CL","submitted_at":"2025-12-11T16:55:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with modest compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.16155","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PRIMETIME : Limits of LLMs in Temporal Primitives","primary_cat":"cs.NE","submitted_at":"2025-04-22T17:52:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.02737","ref_index":203,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model","primary_cat":"cs.CL","submitted_at":"2025-02-04T21:43:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage 
pipeline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.01574","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark","primary_cat":"cs.CL","submitted_at":"2024-06-03T17:53:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"capabilities have rapidly advanced, leaderboard scores have become increasingly concentrated at the top, with models like GPT-4 achieving near-perfect scores on multiple benchmarks. This trend highlights the urgent need for more challenging benchmarks to fully test the limits of LLM capabilities. Recent studies have revealed that the performance of Large Language Models (LLMs) on current benchmarks is not robust to minor perturbations [25, 31]. Specifically, slight variations in the style or phrasing of prompts can lead to significant shifts in model scores. Beyond the inherent non-robustness of the models themselves, the typical four-option format of multiple-choice questions (MCQs) also contributes to this instability in model scoring. 
This format may not sufficiently challenge the models"},{"citing_arxiv_id":"2405.14782","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Lessons from the Trenches on Reproducible Evaluation of Language Models","primary_cat":"cs.CL","submitted_at":"2024-05-23T16:50:49+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.07974","ref_index":184,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","primary_cat":"cs.SE","submitted_at":"2024-03-12T17:58:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.13228","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive","primary_cat":"cs.CL","submitted_at":"2024-02-20T18:42:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM 
Leaderboard.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.06196","ref_index":134,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2024-02-09T05:37:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"instructions through prompts. This is the so-called instruction tuning [133]. We dive into the details of how to design and engineer prompts in section IV-B, but in the context of instruction tuning, it is important to understand that the instruction is a prompt that specifies the task that the LLM should accomplish. Instruction tuning datasets such as Natural Instructions [134] include not only the task definition but other components such as positive/negative examples or things to avoid. The specific approach and instruction datasets used to instruction-tune an LLM varies, but, generally speaking, instruction tuned models outperform their original foundation models they are based on. 
For example, InstructGPT [59] outperforms GPT-3 on most benchmarks."},{"citing_arxiv_id":"2401.10020","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Self-Rewarding Language Models","primary_cat":"cs.CL","submitted_at":"2024-01-18T14:43:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.11805","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Gemini: A Family of Highly Capable Multimodal Models","primary_cat":"cs.CL","submitted_at":"2023-12-19T02:39:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.03958","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Simple synthetic data reduces sycophancy in large language models","primary_cat":"cs.CL","submitted_at":"2023-08-07T23:48:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out 
prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.16355","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PandaGPT: One Model To Instruction-Follow Them All","primary_cat":"cs.CL","submitted_at":"2023-05-25T04:16:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.11206","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LIMA: Less Is More for Alignment","primary_cat":"cs.CL","submitted_at":"2023-05-18T17:45:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.06355","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VideoChat: Chat-Centric Video Understanding","primary_cat":"cs.CV","submitted_at":"2023-05-10T17:59:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction 
dataset.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"els [32] are finetuned using datasets containing prompts with corresponding human-annotated desired behavior. This results in better alignment with users, improved output quality compared to GPT-3, increased truthfulness, and reduced risks. Instruction-tuned models also present remarkable generalization capacity for zero-shot tasks. Therefore, instruction-tuning [28, 8] is crucial in leveraging LLMs' potential. Besides of GPT family [30, 29, 3], there are multiple LLMs, including OPT [57], LLaMA [42], MOSS [9], and GLM [56], providing high-performance, open-source resources that can be finetuned for various purposes. For instance, Alpaca [40] proposes a self-instruct framework to instruction-tune LLaMA models without heavily relying on human-authored instruction data."},{"citing_arxiv_id":"2303.09014","ref_index":159,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ART: Automatic multi-step reasoning and tool-use for large language models","primary_cat":"cs.CL","submitted_at":"2023-03-16T01:04:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2301.13688","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Flan Collection: Designing Data and Methods for Effective Instruction Tuning","primary_cat":"cs.AI","submitted_at":"2023-01-31T15:03:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The Flan Collection 
demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2205.01068","ref_index":205,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OPT: Open Pre-trained Transformer Language Models","primary_cat":"cs.CL","submitted_at":"2022-05-02T17:49:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2202.12837","ref_index":228,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?","primary_cat":"cs.CL","submitted_at":"2022-02-25T17:25:19+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Randomly replacing labels in in-context demonstrations barely hurts performance, showing that label space, input distribution, and sequence format drive in-context learning more than ground-truth labels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2110.08207","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Multitask Prompted Training Enables Zero-Shot Task Generalization","primary_cat":"cs.LG","submitted_at":"2021-10-15T17:08:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multitask fine-tuning of an encoder-decoder model on prompted datasets 
produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}