{"total":49,"items":[{"citing_arxiv_id":"2605.20830","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech","primary_cat":"eess.AS","submitted_at":"2026-05-20T07:21:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19407","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Bitter Lesson for Data Filtering","primary_cat":"cs.LG","submitted_at":"2026-05-19T06:02:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"With enough compute, large models benefit from training on unfiltered data that includes low-quality and distractor examples instead of requiring high-quality filtered data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15104","ref_index":113,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-14T17:22:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice 
gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14062","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection","primary_cat":"cs.AI","submitted_at":"2026-05-13T19:35:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12395","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles","primary_cat":"cs.CL","submitted_at":"2026-05-12T16:57:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Re-evaluating controlled text generation systems under standardized conditions reveals that many published performance claims do not hold, highlighting the need for consistent evaluation practices.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Additionally, different work explores the limitations of the means-based evaluation metrics commonly used in NLP. Gehrmann et al. [16] survey the major obstacles in evaluating generated text, identifying inconsistencies in metric use, a lack of correlation with human judgements, and a general absence of standardised evaluation protocols. Peyrard et al. [43] argue that aggregating scores can obscure instance-level variations, resulting in evaluations that fail to fully capture the strengths and weaknesses of different systems. Colombo et al . 
[10] propose a novel approach to NLP benchmarking by introducing a method that aggregates system performances across multiple tasks. Their method aims to provide a more robust and fair comparison framework for evaluating"},{"citing_arxiv_id":"2605.06901","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reflections and New Directions for Human-Centered Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-07T20:02:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Model developers must address human concerns, preferences, values, and goals with rigor at every stage of the LLM pipeline rather than only in post-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06897","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes","primary_cat":"cs.CL","submitted_at":"2026-05-07T19:57:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"\"Refrigerator\": \"placements\": [\"Kitchen\"], \"capabilities\": \"mode\": [\"eco\", \"normal\"], \"ice_maker\": [\"on\", \"off\"], \"Oven\": \"placements\": [\"Kitchen\"], \"capabilities\": \"power\": [\"on\", \"off\"], \"temperature\": list(range(200, 451, 25)), \"mode\": [\"bake\", \"broil\", \"convection\"], \"Microwave\": \"placements\": [\"Kitchen\"], \"capabilities\": \"power\": [\"on\", \"off\"], \"duration_seconds\": [30, 60, 90, 120], \"Coffee Maker\": \"placements\": 
[\"Kitchen\", \"Office / Study\"], \"capabilities\": \"power\": [\"on\", \"off\"], \"brew_strength\": [\"mild\", \"medium\", \"strong\"], \"Dishwasher\": \"placements\": [\"Kitchen\"], \"capabilities\": \"power\": [\"on\", \"off\"], \"cycle\": [\"normal\", \"heavy\", \"rinse\"], # Entertainment \"TV\": \"placements\": [\"Living Room\", \"Bedroom\", \"Home Theater\"], \"capabilities\": \"power\": [\"on\", \"off\"], \"volume\": list(range(0, 51, 5)), \"source\": [\"HDMI 1\", \"Netflix\","},{"citing_arxiv_id":"2605.05365","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ZAYA1-8B Technical Report","primary_cat":"cs.AI","submitted_at":"2026-05-06T18:44:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02757","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation","primary_cat":"cs.CV","submitted_at":"2026-05-04T15:57:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18272","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language 
Model","primary_cat":"cs.CE","submitted_at":"2026-04-20T13:48:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17803","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition","primary_cat":"cs.AI","submitted_at":"2026-04-20T04:51:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05227","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods","primary_cat":"cs.LG","submitted_at":"2026-04-19T14:23:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16923","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Alignment Imprint: Zero-Shot AI-Generated Text Detection via 
Provable Preference Discrepancy","primary_cat":"cs.AI","submitted_at":"2026-04-18T09:12:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LAPD, derived from the provable preference discrepancy in aligned LLMs, improves zero-shot AI text detection by 45.82% over baselines with claimed statistical dominance over Fast-DetectGPT.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16475","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Spike-driven Large Language Model","primary_cat":"cs.NE","submitted_at":"2026-04-11T17:58:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SDLLM is a spike-driven LLM that uses gamma-SQP two-step encoding, bidirectional symmetric quantization, and membrane potential clipping to achieve 7x lower energy consumption and 4.2% higher accuracy than prior spike-based language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08448","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages","primary_cat":"cs.CL","submitted_at":"2026-04-09T16:45:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"AfriVoices-KE is a 3,000-hour multilingual speech dataset for Dholuo, Kikuyu, Kalenjin, Maasai, and Somali with 750 hours scripted and 2,250 hours spontaneous speech from 4,777 speakers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03199","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning the Signature of Memorization in Autoregressive 
Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-03T17:17:51+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03298","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs","primary_cat":"cs.AR","submitted_at":"2026-03-28T16:11:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21613","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining","primary_cat":"cs.CL","submitted_at":"2025-11-26T17:36:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22075","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoSpaDi: 
Compressing LLMs via Calibration-Guided Sparse Dictionary Learning","primary_cat":"cs.CL","submitted_at":"2025-09-26T08:55:09+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoSpaDi introduces a training-free sparse dictionary learning framework for post-training LLM compression that optimizes functional reconstruction error via activation-derived orthonormalization and achieves improved accuracy-compression trade-offs over SVD and pruning baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.07177","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector","primary_cat":"cs.CL","submitted_at":"2025-09-08T19:48:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00432","ref_index":235,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Does Math Reasoning Improve General LLM Capabilities? 
Understanding Transferability of LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2025-07-01T05:23:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.08223","ref_index":156,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices","primary_cat":"cs.DC","submitted_at":"2025-03-11T09:41:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"14619, 2024. [154] Guilherme Penedo, Anis Crnisanin, Ethan Shen, et al. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data. arXiv preprint arXiv:2306.01116, 2023. [155] Leo Gao, Stella Biderman, Sid Black, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020. [156] Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M Anwer, Michael Felsberg, Tim Baldwin, Eric P Xing, and Fahad Shahbaz Khan. Mobillama: Towards accurate and lightweight fully transparent gpt. arXiv preprint arXiv:2402.16840, 2024. 
[157] Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al."},{"citing_arxiv_id":"2502.12120","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws","primary_cat":"cs.LG","submitted_at":"2025-02-17T18:45:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.11794","ref_index":139,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DataComp-LM: In search of the next generation of training sets for language models","primary_cat":"cs.LG","submitted_at":"2024-06-17T17:42:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"URL http: //jmlr.org/papers/v21/20-074.html. [138] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1-140:67, 2020. URL http: //jmlr.org/papers/v21/20-074.html. [139] Nazneen Rajani, Lewis Tunstall, Edward Beeching, Nathan Lambert, Alexander M. Rush, and Thomas Wolf. No robots. https://huggingface.co/datasets/HuggingFaceH4/ no_robots, 2023. 
25 [140] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on"},{"citing_arxiv_id":"2406.04952","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Quantifying Geospatial in the Common Crawl Corpus","primary_cat":"cs.CL","submitted_at":"2024-06-07T14:16:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Analysis estimates 18.7% of Common Crawl documents contain geospatial information like coordinates and addresses, with little difference by language.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.04434","ref_index":150,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","primary_cat":"cs.CL","submitted_at":"2024-05-07T15:56:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.18930","ref_index":133,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hallucination of Multimodal Large Language Models: A Survey","primary_cat":"cs.CV","submitted_at":"2024-04-29T17:59:41+00:00","verdict":"ACCEPT","verdict_confidence":"UNKNOWN","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open 
questions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ESREAL [85], ConVis [128] Mitigating Inference-relatedHallucinations (§5.4) Generation Intervention Contrastive Decodinge.g.VCD [94], IBD [226], ICD [166] Guided Decoding e.g.MARINE [212], GCD [38], DeCo [158] Visual Amplificatione.g.M3ID [41], IBD [226], AGLA [1] Others e.g.OPERA [66], Skip'\\n' [56] Visual Prompting e.g.SoM-LLaVA [179] RAG e.g.ARA [133], FilterRAG [141] Ensembling e.g.RITUAL [169], MAD [107], MVP [134] Post-hoc Correction e.g.Woodpecker [188], Volcano [93], LURE [224], VFC [45] Fig. 1. The main content flow and categorization of this survey. Preprint, Vol. 1, No. 1, Article . Publication date: April 2025. Hallucination of Multimodal Large Language Models: A Survey 5 Vision InputVision ModelLLMImageVideo…CLIP DINO-v2Linear…LLaMAVicunaChatGLMFuyu"},{"citing_arxiv_id":"2403.17297","ref_index":209,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InternLM2 Technical Report","primary_cat":"cs.CL","submitted_at":"2024-03-26T00:53:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.07691","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ORPO: Monolithic Preference Optimization without Reference Model","primary_cat":"cs.CL","submitted_at":"2024-03-12T14:34:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, 
allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"generated per input, we sample the first item for each input and examine their inter cosine similarity with Equation 15 for across-input diversity. Un- like per-input diversity, it is noteworthy that Phi-2 (ORPO) has lower average cosine similarity in the second row of Table 4. We can infer that ORPO triggers the model to generate more instruction- specific responses than DPO. AIDD(θ) = D N[ i=1 Oi,θ,j=1 ! (15) Per Input↓ Across Input↓ Phi-2 + SFT + DPO 0.8012 0.6019 Phi-2 + ORPO 0.8909 0.5173 Llama-2 + SFT + DPO 0.8889 0.5658 Llama-2 + ORPO 0.9008 0.5091 Table 4: Lexical diversity of Phi-2 and Llama-2 fine- tuned with DPO and ORPO. Lower cosine similarity is equivalent to higher diversity. The highest value in each column within the same model family is bolded. 7 Discussion In this section, we expound on the theoretical and computational details of ORPO. The theoretical anal- ysis of ORPO is studied in Section 7.1, which will be supported with the empirical analysis in Section 7.2. Then, we compare the computational load of DPO and ORPO in Section 7.3. 7.1 Comparison to Probability Ratio The rationale for selecting the odds ratio instead of the probability ratio lies in its stability. The prob- ability ratio for generating the favored response yw over the disfavored response yl given an input sequence x can be defined as Equation 16. PRθ(yw, yl) = Pθ(yw|x) Pθ(yl|x) (16) While this formulation has been used in previous preference alignment methods that precede SFT (Rafailov et al., 2023; Azar et al., 2023), the odds ratio is a better choice in the setting where the preference alignment is incorporated in SFT as the odds ratio is more sensitive to the model's prefer- ence understanding. 
In other words, the probability ratio leads to more extreme discrimination of the disfavored responses than the odds ratio. We visualize this through the sample distribu- tions of the log probability ratio log PR(X2|X1) and log odds ratio log OR(X2|X1). We sample 50,000 samp"},{"citing_arxiv_id":"2402.06196","ref_index":124,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2024-02-09T05:37:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"for tasks about generating new sentences conditioned on a given input, such as summarization, translation, or generative question answering. B. Data Cleaning Data quality is crucial to the performance of language models trained on them. Data cleaning techniques such as filtering, deduplication, are shown to have a big impact on the model performance. As an example, in Falcon40B [124], Penedo et al. showed that properly filtered and deduplicated web data alone can lead to powerful models; even significantly outperforming models from the state-of-the-art trained on The Pile. Despite extensive filtering, they were able to obtain five trillion tokens from How LLMs Are Built? 
Data Cleaning Tokenizations BytePairEncoding WordPieceEncoding"},{"citing_arxiv_id":"2402.02750","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache","primary_cat":"cs.CL","submitted_at":"2024-02-05T06:06:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.15947","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MoE-LLaVA: Mixture of Experts for Large Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2024-01-29T08:13:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.02954","ref_index":152,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DeepSeek LLM: Scaling Open-Source Language Models with Longtermism","primary_cat":"cs.CL","submitted_at":"2024-01-05T18:59:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and 
DPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.14238","ref_index":115,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","primary_cat":"cs.CV","submitted_at":"2023-12-21T18:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Flamingo [3] uses the visual and language inputs as prompts and shows remarkable few-shot performance for visual question answering. Subsequently, GPT-4 [110], LLaV A series [91, 92, 100] and MiniGPT-4 [187] have brought in visual instruction tuning, to improve the instruction-following ability of VLLMs. Concurrently, models such as VisionLLM [147], KOSMOS-2 [115], and Qwen-VL et al. [5, 21, 149] have improved VLLMs with visual grounding capabilities, facilitating tasks such as re- gion description and localization. Many API-based meth- ods [96, 97, 125, 133, 155, 163, 166] have also attempted to integrate vision APIs with LLMs for solving vision-centric tasks. 
Additionally, PaLM-E [43] and EmbodiedGPT [108]"},{"citing_arxiv_id":"2311.16867","ref_index":116,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Falcon Series of Open Language Models","primary_cat":"cs.CL","submitted_at":"2023-11-28T15:12:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.15127","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","primary_cat":"cs.CV","submitted_at":"2023-11-25T22:28:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tion on large-scale data. ArXiv, 2020. 15 [61] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models, 2023. 15 [62] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts, 2022. 15 [63] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobei- dli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Lau- nay. 
The RefinedWeb dataset for Falcon LLM: outperform- ing curated corpora with web data, and web data only.arXiv preprint arXiv:2306.01116, 2023. 3 [64] Dustin Podell, Zion English, Kyle Lacey, Andreas"},{"citing_arxiv_id":"2311.07575","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models","primary_cat":"cs.CV","submitted_at":"2023-11-13T18:59:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPHINX improves multi-modal LLMs through joint mixing of weights, tasks, and visual embeddings from varied sources to achieve stronger alignment and multi-purpose capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.05232","ref_index":254,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","primary_cat":"cs.CL","submitted_at":"2023-11-09T09:25:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"herently non-uniform, which results in LLMs demonstrating varying levels of proficiency across different types of knowledge. Recent studies have highlighted a strong correlation between the model's accuracy on general domain questions and the volume of relevant documents [145] or entity popularity [204] within the pre-training corpora. 
Furthermore, given that LLMs are predominantly trained on extensive general domain corpora [93, 243, 254], they may exhibit deficits in domain-specific knowledge. This limitation becomes particularly evident when LLMs are confronted with tasks that require domain-specific expertise, such as medical [179, 279] and legal [149, 353] questions, these models may exhibit pronounced hallucinations, often manifesting as factual fabrication. Up-to-date Knowledge."},{"citing_arxiv_id":"2310.09478","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning","primary_cat":"cs.CV","submitted_at":"2023-10-14T03:22:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.06114","ref_index":140,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Interactive Real-World Simulators","primary_cat":"cs.AI","submitted_at":"2023-10-09T19:42:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14871-14881, 2021. OpenAI. Gpt-4 technical report, 2023. 
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1-67, 2020. Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. In Thirty-fifth Conference on Neural Information Pro-"},{"citing_arxiv_id":"2309.17452","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving","primary_cat":"cs.CL","submitted_at":"2023-09-29T17:59:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.12284","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-09-21T17:45:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large 
margins.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"closed-source models GPT-4 [54] - 92.0 42.5 GPT-3.5-Turbo [53] - 80.8 34.1 PaLM [12] 8B 4.1 1.5 PaLM [12] 62B 33.0 4.4 PaLM [12] 540B 56.5 8.8 PaLM-2 [2] 540B 80.7 34.3 Flan-PaLM 2 [2] 540B 84.7 33.2 Minerva [35] 8B 16.2 14.1 Minerva [35] 62B 52.4 27.6 Minerva [35] 540B 58.8 33.6 open-source models (1-10B) LLaMA-2 [70] 7B 14.6 2.5 MPT [49] 7B 6.8 3.0 Falcon [57] 7B 6.8 2.3 Code-LLaMA [61] 7B 25.2 13.0 InternLM [29] 7B 31.2 - GPT-J [71] 6B 34.9 - ChatGLM 2 [81] 6B 32.4 - Qwen [1] 7B 51.6 - Baichuan-2 [4] 7B 24.5 5.6 SFT [70] 7B 41.6 - RFT [79] 7B 50.3 - MAmooTH-CoT [80] 7B 50.5 10.4 WizardMath [43] 7B 54.9 10.7 MetaMath 7B 66.5 19.8 open-source models (11-50B) LLaMA-2 [70] 13B 28.7 3.9 LLaMA-2 [70] 34B 42."},{"citing_arxiv_id":"2309.10305","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Baichuan 2: Open Large-scale Language Models","primary_cat":"cs.CL","submitted_at":"2023-09-19T04:13:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.05653","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning","primary_cat":"cs.CL","submitted_at":"2023-09-11T17:47:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on 
MATH for 7B and 34B scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.00614","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Baseline Defenses for Adversarial Attacks Against Aligned Language Models","primary_cat":"cs.LG","submitted_at":"2023-09-01T17:59:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2307.15043","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","primary_cat":"cs.CL","submitted_at":"2023-07-27T17:49:12+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2306.00978","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration","primary_cat":"cs.CL","submitted_at":"2023-06-01T17:59:10+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods 
on language, coding, math, and multi-modal benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.16264","ref_index":93,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Data-Constrained Language Models","primary_cat":"cs.CL","submitted_at":"2023-05-25T17:18:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505-3506. [92] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pages 90-95. [93] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99-106. [94] Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. 
Multitask Prompted"},{"citing_arxiv_id":"2303.18223","ref_index":173,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-03-31T17:28:46+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"much larger than BookCorpus but have not been publicly released so far. • Academic Data. In addition to book data, scientific publication data such as paper is also important for model pre-training. arXiv Dataset [172] is a corpus of 1.7 million academic papers, covering a wide range of papers in the fields of physics, mathematics, and computer science. S2ORC [173] is a corpora that consists of 136M academic 21. https://www.tensorflow.org/datasets/catalog/c4 14 papers collected by Semantic Scholar. It also releases a derivative dataset peS2o [174], which contains about 42B tokens. Wikipedia. Wikipedia [164] is an online encyclopedia containing a large volume of high-quality articles on diverse topics. Most of these articles are composed in an expository"}],"limit":50,"offset":0}