{"total":19,"items":[{"citing_arxiv_id":"2606.01279","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-31T15:03:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ANDES equips AI agents with an interactive data-synthesis skill using World Tree routing to reach SOTA automated alignment on PostTrainBench under compute limits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11663","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability","primary_cat":"cs.CL","submitted_at":"2026-05-12T07:22:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08738","ref_index":34,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training","primary_cat":"cs.LG","submitted_at":"2026-05-09T06:50:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus 
multi-token KD improves the final 23A2B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05940","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing","primary_cat":"cs.LG","submitted_at":"2026-05-07T09:50:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20549","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection","primary_cat":"cs.CL","submitted_at":"2026-04-22T13:31:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Multilingual pooling for quality classifiers outperforms monolingual baselines in rank stability and accuracy for LLM pretraining data selection across high- and low-resource languages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05688","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion","primary_cat":"cs.CL","submitted_at":"2026-04-07T10:40:16+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise 
teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Following distillation practices used in the Nemotron/Minitron line [Bercovich et al., 2025b], we introduce a low-weight cosine similarity loss on a selected set of intermediate layers $M \\subseteq \\{1, \\ldots, L\\}$: $\\mathcal{L}_{\\mathrm{cos}} = \\frac{1}{|M| \\sum_{i=1}^{N} n_i} \\sum_{i=1}^{N} \\sum_{t=1}^{n_i} \\sum_{\\ell \\in M} \\left( 1 - \\frac{\\langle h^{S,\\ell}_{i,t}, h^{T,\\ell}_{i,t} \\rangle}{\\| h^{S,\\ell}_{i,t} \\|_2 \\, \\| h^{T,\\ell}_{i,t} \\|_2} \\right)$. (27) The final model-level objective is $\\mathcal{L}_{\\mathrm{model}} = \\mathcal{L}_{\\mathrm{KD}} + \\lambda_{\\mathrm{cos}} \\mathcal{L}_{\\mathrm{cos}}$, (28) where $\\lambda_{\\mathrm{cos}}$ is a small coefficient. Intuitively, $\\mathcal{L}_{\\mathrm{KD}}$ restores the teacher's predictive behavior at the token level, while $\\mathcal{L}_{\\mathrm{cos}}$ provides a weak geometric regularization on the internal representation trajectory, which is especially helpful during the early phase of end-to-end training. Discussions. The overall procedure can be viewed as a progressive path from intermediate-state distillation to"},{"citing_arxiv_id":"2507.20534","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kimi K2: Open Agentic Intelligence","primary_cat":"cs.LG","submitted_at":"2025-07-28T05:35:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"additional instructions that probe specific failure modes or edge cases. This multipronged approach ensures both breadth and depth in instruction coverage. Faithfulness. Faithfulness is essential for an agentic model operating in scenarios such as multi-turn tool use, self-generated reasoning chains, and open-environment interactions. 
Inspired by the evaluation framework from FACTS Grounding [31], we train a sentence-level faithfulness judge model to perform automated verification. The judge is effective in detecting sentences that make a factual claim without supporting evidence in context. It serves as a reward model to enhance overall faithfulness performance. Coding & Software Engineering. To enhance our capability in tackling competition-level programming problems,"},{"citing_arxiv_id":"2410.14702","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark","primary_cat":"cs.AI","submitted_at":"2024-10-06T20:35:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.06624","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio","primary_cat":"cs.CL","submitted_at":"2024-09-10T16:26:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Empirical practice of continual pre-training Llama-3 models with optimized additional language mixture ratios to enhance Chinese capabilities, showing gains in benchmarks and domains like math and coding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.11931","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code 
Intelligence","primary_cat":"cs.SE","submitted_at":"2024-06-17T13:51:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An open-source MoE code model matches GPT-4 Turbo on coding and math benchmarks while expanding to 338 languages and 128K context length.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.04434","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","primary_cat":"cs.CL","submitted_at":"2024-05-07T15:56:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"def is_not_prime(n): result = False for i in range(2,int(math.sqrt(n)) + 1): if n % i == 0: result = True return result [DONE] You are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. 
Your code should pass these tests: assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35] [BEGIN] import heapq as hq def heap_queue_largest(nums,n): largest_nums = hq.nlargest(n, nums) return largest_nums [DONE] You are an expert Python programmer, and here is your task: Write a function"},{"citing_arxiv_id":"2404.18930","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hallucination of Multimodal Large Language Models: A Survey","primary_cat":"cs.CV","submitted_at":"2024-04-29T17:59:41+00:00","verdict":"ACCEPT","verdict_confidence":"UNKNOWN","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"accurate and non-hallucinatory, a text-to-image generative model should be capable of reversing this process to produce a similar image from that response. 5.3.3 Unlearning. Unlearning refers to a technique designed to induce a model to 'forget' specific behaviors or data, primarily through the application of gradient ascent methods [ 15]. Recently, unlearning for LLMs has been receiving increasing attention [72], effectively eliminating privacy vulnerabilities in LLMs. In the context of MLLMs, a recent work [ 177] introduces the Efficient Fine-grained Unlearning Framework (EFUF), applying an unlearning framework to address the hallucination problem. 
Specifically, it utilizes the CLIP model to construct a dataset comprised of both positive samples and negative (hallucinated) samples."},{"citing_arxiv_id":"2403.17297","ref_index":182,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternLM2 Technical Report","primary_cat":"cs.CL","submitted_at":"2024-03-26T00:53:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.04652","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Yi: Open Foundation Models by 01.AI","primary_cat":"cs.CL","submitted_at":"2024-03-07T16:52:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Drawing inspiration from the open-sourced LLaVA [46, 47], we present Yi Vision Language (Yi-VL) models, i.e., Yi-VL-6B and Yi-VL-34B, based on Yi-6B-Chat and Yi-34B-Chat language models. The architecture of Yi-VL models, as illustrated in Figure 7, comprises three primary modules. The Vision Transformer (ViT), used for image encoding, is initialized with CLIP ViT-H/14 model [33]. A Projection Module, designed to align image features with text feature space, consists of a two-layer Multilayer Perceptron (MLP) with layer normalizations. 
Finally, the large language model, initialized with the Yi-Chat models, demonstrating exceptional proficiency in understanding and generating both English and Chinese. To enhance the performance of Yi-VL models in bilingual multimodal understanding and"},{"citing_arxiv_id":"2401.06066","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models","primary_cat":"cs.CL","submitted_at":"2024-01-11T17:31:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.02954","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek LLM: Scaling Open-Source Language Models with Longtermism","primary_cat":"cs.CL","submitted_at":"2024-01-05T18:59:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.05232","ref_index":129,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","primary_cat":"cs.CL","submitted_at":"2023-11-09T09:25:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper surveys hallucination 
in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net. https://openreview.net/pdf?id=4oYUGeGBPm [128] Siqing Huo, Negar Arabzadeh, and Charles L. A. Clarke. 2023. Retrieving Supporting Evidence for LLMs Generated Answers. ArXiv preprint abs/2306.13781 (2023). https://arxiv.org/abs/2306.13781 [129] Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al. 2022. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. ArXiv preprint abs/2212.12017 (2022). https://arxiv.org/abs/2212.12017 [130] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard"},{"citing_arxiv_id":"2310.02255","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts","primary_cat":"cs.CV","submitted_at":"2023-10-03T17:57:24+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MathVista benchmark shows GPT-4V achieves 49.9% accuracy on visual mathematical reasoning tasks, outperforming other models but trailing humans by 10.4%.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"position of the number in the sequence (2). The third number (7) is 1 more than the second number (5) plus the position of the number in the sequence (3). Following this pattern, the missing number should be 1 more than the last number (32) plus the position of the number in the sequence (9). 
32 + 9 + 1 = 42 So, the missing number is 42. PoT GPT-4: numbers = [4, 5, 7, 10, 14, 19, 25, 32] diffs = [numbers[i+1] - numbers[i] for i in range(len(numbers)-1)] # [1, 2, 3, 4, 5, 6, 7] missing_number = numbers[-1] + diffs[-1] print(missing_number) Output: 39 Figure 51: The LLM baselines augmented with Bard image captions and OCR text also fail on this question. CoT GPT-4 generates explanations with incorrect reasoning."},{"citing_arxiv_id":"2309.10305","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Baichuan 2: Open Large-scale Language Models","primary_cat":"cs.CL","submitted_at":"2023-09-19T04:13:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}