{"total":98,"items":[{"citing_arxiv_id":"2606.27632","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety","primary_cat":"cs.CL","submitted_at":"2026-06-26T01:12:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05976","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Self-Correction Illusion: LLMs Correct Others but Not Themselves","primary_cat":"cs.AI","submitted_at":"2026-06-04T10:17:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31268","ref_index":70,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mellum2 Technical Report","primary_cat":"cs.CL","submitted_at":"2026-05-29T13:01:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Mellum 2 is a 12B MoE model with 2.5B active parameters, trained on 10.6T tokens with MoE, GQA, SWA, and MTP, then post-trained into Instruct and Thinking variants, claimed competitive with 4B-14B models at 2.5B 
compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19093","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts","primary_cat":"cs.AI","submitted_at":"2026-05-18T20:28:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReElicit uses LLMs to elicit adaptive feature embeddings for Gaussian process Bayesian optimization of system prompts under aggregate-only feedback, outperforming baselines across ten tasks with a 30-evaluation budget.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12928","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Efficiency Gap in Byte Modeling","primary_cat":"cs.LG","submitted_at":"2026-05-13T03:03:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11663","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability","primary_cat":"cs.CL","submitted_at":"2026-05-12T07:22:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, 
diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11290","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReAD: Reinforcement-Guided Capability Distillation for Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-11T22:17:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Distilling reasoning capabilities into smaller language models. Findings of the Association for Computational Linguistics: ACL 2023, pages 7059-7073, 2023. [28] Zhihong Sun, Chen Lyu, Bolun Li, Yao Wan, Hongyu Zhang, Ge Li, and Zhi Jin. Enhancing code generation performance of smaller models by distilling the reasoning ability of llms. arXiv preprint arXiv:2403.13271, 2024. [29] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022. 
[30] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy"},{"citing_arxiv_id":"2605.10933","ref_index":54,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices","primary_cat":"cs.LG","submitted_at":"2026-05-11T17:58:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10516","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability","primary_cat":"cs.AI","submitted_at":"2026-05-11T13:06:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"agent verifiers with sequential hypothesis testing, 2025. URL https://arxiv.org/abs/2512.03109. [25] R. J. Serfling. Approximation Theorems of Mathematical Statistics. John Wiley & Sons, 1980. [26] F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. Chi, N. Schärli, and D. Zhou. Large language models can be easily distracted by irrelevant context. 2023. URL https://arxiv.org/abs/2302.00093. [27] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. 
Zhou, and J. Wei. Challenging big-bench tasks and whether chain-of-thought can solve them, 2022. URL https://arxiv.org/abs/2210.09261. [28] SWE-bench Team. Swe-bench leaderboard. https://www.swebench.com, 2024. Accessed October 2024. [29] SWE-bench Team."},{"citing_arxiv_id":"2605.10405","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization","primary_cat":"cs.LG","submitted_at":"2026-05-11T11:43:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"the budget that each method requires to reach 95% level of accuracy. Datasets. We use the evaluation dataset released by [26], which provides binary correctness scores for 4.4K models across six benchmarks with 21.5K total number of questions. Following [26], we organize the data into two benchmark datasets: Bench 1 which contains MMLU-Pro [4], with 12K questions; and Bench 2 - a composite dataset of BBH [28], GPQA [29], IFEval [30], MATH [31], and MuSR [32], with 9.5K questions. 
For each dataset, we split the models chronologically accord- ing to the models release date: the older half is used as training data to warm-start the low-rank factorization model (as described next), and the remaining half forms the test pool over which best-model identification is performed."},{"citing_arxiv_id":"2605.08904","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08738","ref_index":58,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training","primary_cat":"cs.LG","submitted_at":"2026-05-09T06:50:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08704","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm 
Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-09T05:38:21+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"(1) For mathematical reasoning, we use DeepMath [16], MATH [24], AIME25 [49], and Minerva [23], which require multi-step reasoning, symbolic manipulation, and numerical problem solving. The resulting optimized agent skills are evaluated on the DeepMath test set and further applied to out-of- distribution mathematical benchmarks, including MATH, AIME25, and Minerva. (2) For general reasoning, we use BigBenchHard (BBH) [ 37], a challenging benchmark composed of 23 diverse reasoning tasks, including logical, symbolic, commonsense, and multi-step reasoning. We randomly split BBH into training, validation, and test sets and report the final performance on the BBH test split. Further details on the datasets and splits are provided in Appendix B. 
Baselines.We compare AgentPSO with baselines from three main categories."},{"citing_arxiv_id":"2605.08346","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sanity Checks for Long-Form Hallucination Detection","primary_cat":"cs.CL","submitted_at":"2026-05-08T18:00:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hallucination detectors on LLM reasoning traces often rely on final-answer artifacts rather than reasoning validity; once controlled, lightweight lexical trajectory features suffice for robust detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07268","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-08T05:33:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LogiHard hardens reasoning benchmarks by transforming 0-order selection into 2-order judgment, causing 31-56% accuracy drops in 12 frontier LLMs and a 47% drop on zero-shot MMLU, revealing a combinatorial reasoning gap rather than knowledge deficits.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and extended chain-of-thought (CoT) with reflection [3] to achieve unprecedented performance on complex reasoning tasks. Supervised fine-tuning on logical reasoning datasets like LogiQA [18, 19] has become common practice for instilling foundational reasoning capabilities [25, 26]. Yet the development of LRMs has led to the rapid saturation of reasoning benchmarks. 
MMLU falls to GPT-5 at 92.5% [32], Sonnet 3.5 exceeds 93.1% on BBH [34], and OpenAI o1 achieves 90.0% average accuracy on LogiQA [15]. These numbers signal not the resolution of machine reasoning, but the failure of static evaluation [22]. Contemporary models achieve superhuman accuracy partly through training set memorization and exploitation of surface patterns (position bias, lexical overlap, stylistic cues) [40]. In response, ad-hoc hardening methods have proliferated: None-of-the-Above"},{"citing_arxiv_id":"2605.07053","ref_index":78,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations","primary_cat":"cs.CL","submitted_at":"2026-05-08T00:02:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06165","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:51:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05893","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Logic-Regularized Verifier Elicits Reasoning from LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-07T09:03:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LOVER 
creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05810","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-07T07:46:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Diagnostic multimodal benchmarks further motivate targeted probes. Winoground [31], NegBench [1], MMBench [22], MMMU [42], MM-Vet [41], and SEED-Bench [15] show that strong average capability can hide brittleness under controlled tests. The same lesson appears in faithfulness and safety evaluation, including POPE [18], TruthfulQA [20], HaluEval [ 17], FaithDial [ 9], HELM [ 19], BIG-Bench Hard [ 29], MMLU-Pro [34], and DecodingTrust [32]. CXR-ContraBench brings that diagnostic philosophy into a clinically grounded chest-X-ray setting and couples it with both retrospective and direct polarity probes. A key distinction from prior negation benchmarks lies in our treatment of chain-of-thought prompting. 
Reasoning is often assumed to improve reliability, but we show that CoT is not a reliable remedy and"},{"citing_arxiv_id":"2605.02395","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Verifiable Counterfactual Supervision for Process Reward Models","primary_cat":"cs.AI","submitted_at":"2026-05-04T09:36:57+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01566","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling","primary_cat":"cs.AI","submitted_at":"2026-05-02T18:31:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points respectively at equal compute budgets on MMLU-Pro and BBH, with advantages that continue at higher scales while self-consistency saturates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00419","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking LLM Ensembling from the Perspective of Mixture Models","primary_cat":"cs.LG","submitted_at":"2026-05-01T05:31:18+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22575","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform 
Inference","primary_cat":"cs.LG","submitted_at":"2026-04-24T14:07:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and dual quantization paths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19301","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Models Exhibit Normative Conformity","primary_cat":"cs.AI","submitted_at":"2026-04-21T10:06:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Large language models exhibit normative conformity in addition to informational conformity, and subtle social context can direct which group they conform to.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18473","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-04-20T16:24:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17937","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ContraPrompt: Contrastive Prompt Optimization 
via Dyadic Reasoning Trace Analysis","primary_cat":"cs.AI","submitted_at":"2026-04-20T08:17:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05227","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods","primary_cat":"cs.LG","submitted_at":"2026-04-19T14:23:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16646","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agentic Frameworks for Reasoning Tasks: An Empirical Study","primary_cat":"cs.AI","submitted_at":"2026-04-17T19:02:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"role-based multi-agent, hierarchical, modular, and graph-based architectures. 
We also identify their adopted reasoning strategies and memory mechanisms. These 22 frame- works are empirically assessed in a unified experimental setting to compare their reasoning performance, trade-offs, and consistency across three widely recognized rea- soning benchmarks: Big-Bench Hard (BBH) [10], GSM8K [17], and the AI2 Reasoning Challenge (ARC) [18]. These benchmarks were selected because they contain diverse datasets designed to evaluate the complex reasoning capabilities of AI systems. Contributions: The key contributions of this paper are summarized as follows: • Systematic selection of agentic frameworks and creation of a taxonomy based"},{"citing_arxiv_id":"2604.15972","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Weak-Link Optimization for Multi-Agent Reasoning and Collaboration","primary_cat":"cs.AI","submitted_at":"2026-04-17T11:36:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"rigor, semantic accuracy, and reasoning coherence, which are detailed in the supplementary materials. The overall workflow is summarized in Pseudocode 1. IV. EXPERIMENTS A. Experimental Setup Datasets.To evaluate WORC's reasoning ability, we conduct experiments on six benchmark datasets includ- ing MATH [52] for advanced mathematical reasoning, GSM8K [53] for grade-school mathematical word problems, BBH [54] for logical and algorithmic reasoning tasks, MMLU- CF [55] for commonsense and factual knowledge evalua- tion, HotpotQA [56] for multi-hop question answering, and LongBench [57] for long-context reasoning scenarios. 
These datasets collectively cover a wide spectrum of reasoning tasks and provide a comprehensive evaluation testbed for the framework."},{"citing_arxiv_id":"2604.09258","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima","primary_cat":"cs.LG","submitted_at":"2026-04-10T12:17:18+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We further verified these settings by sweeping the learning rate with a multiplier of2(i.e., verifying0.5× and2 .0×), confirming that their configurations remain optimal for our dataset. See Sec. H.1 for the detailed hyperparameters in each experiment. Benchmarks.We evaluate on diverse benchmarks encompassing general knowledge (MMLU [13]), reasoning (GPQA, GPQA Diamond [34], BBH [39]), math (GSM8k [5], MATH500 [14]), and coding (HumanEval [11], MBPP [1]). Beyond discrete accuracies, we also track downstream task losses and out-of-distribution (OOD) loss. The OOD loss is evaluated on a strictly cleaned proprietary in-house corpus, which exhibits a strong correlation with downstream benchmark capabilities. 
Highlighting Strategy."},{"citing_arxiv_id":"2604.07655","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs","primary_cat":"cs.LG","submitted_at":"2026-04-08T23:47:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"age, we carefully assign data separately within each dataset's train and test/eval partitions. 4 Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs Topic Dataset Counts Topic Dataset Counts ♂leafHarmlessai2_arc[10] 3119 ♂leafHarmlessalpaca-cleaned[83] 5000 ♂leafHarmlessbbh[69] 6511 ♂leafHarmlesscode_contest[40] 3000 ♂leafHarmlesscommonsense_qa[70] 5000 ♂leafHarmlessgsm8k[11] 5000 ♂leafHarmlessmath_instruct[86] 5000 ♂leafHarmlessmedical_reasoning[47] 5000 ♂leafHarmlessmmlu[24] 5000 ♂leafHarmlessnatural_instructions[54] 5000 ♂leafHarmlessopenbook_qa[53] 4000 ♂leafHarmlessscience_exam[45] 5000 ♂leafHarmlessself_instruct[78] 5000 ♂leafHarmlesssquad[61] 5000 ♂leafHarmlesstrivia_qa[37] 5000 ♂leafHarmlessultrachat[15] 5000"},{"citing_arxiv_id":"2604.07023","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MARS: Enabling Autoregressive Models Multi-Token Generation","primary_cat":"cs.CL","submitted_at":"2026-04-08T12:41:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while 
matching baseline accuracy and supporting real-time speed adjustment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06819","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge","primary_cat":"cs.DC","submitted_at":"2026-04-08T08:37:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ChainFed achieves memory-efficient private LLM fine-tuning on edge devices through sequential layer-by-layer adapter training with dynamic co-tuning, perceptive optimization, and adaptive starting point selection, improving accuracy by up to 46.46%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.15031","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attention Residuals","primary_cat":"cs.CL","submitted_at":"2026-03-16T09:32:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04942","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-03-13T13:01:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TDA-RC embeds topological patterns from 
multi-round reasoning into CoT via persistent homology and a repair agent, yielding better accuracy-efficiency trade-offs than ToT or GoT on tested datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.10477","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses","primary_cat":"cs.CL","submitted_at":"2026-03-11T07:00:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.10144","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When LLMs get significantly worse: A statistical approach to detect model degradations","primary_cat":"stat.ML","submitted_at":"2026-02-09T10:45:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A McNemar-based statistical test detects real degradations in optimized LLMs with controlled false positives, even for accuracy changes as small as 0.3%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.24880","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"mHC: Manifold-Constrained Hyper-Connections","primary_cat":"cs.CL","submitted_at":"2025-12-31T14:16:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"mHC projects hyper-connection residual spaces onto a manifold to restore identity mapping, 
enabling stable large-scale training with performance gains over standard HC.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.04570","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm","primary_cat":"cs.CV","submitted_at":"2025-11-06T17:25:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.26692","ref_index":97,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Kimi Linear: An Expressive, Efficient Attention Architecture","primary_cat":"cs.CL","submitted_at":"2025-10-30T16:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"\"Challenging big-bench tasks and whether chain-of-thought can solve them\". In:arXiv preprint arXiv:2210.09261(2022). [95] Kimi Team et al.Kimi k1.5: Scaling Reinforcement Learning with LLMs. 2025. arXiv: 2501.12599 [cs.AI]. URL:https://arxiv.org/abs/2501.12599. [96] MiniCPM Team et al.MiniCPM4: Ultra-Efficient LLMs on End Devices. 2025. arXiv: 2506.07900 [cs.CL]. URL:https://arxiv.org/abs/2506.07900. [97] Tencent Hunyuan Team et al. 
\"Hunyuan-turbos: Advancing large language models through mamba-transformer synergy and adaptive chain-of-thought\". In:arXiv preprint arXiv:2505.15431(2025). [98] Hugo Touvron et al.LLaMA: Open and Efficient Foundation Language Models. 2023. arXiv: 2302.13971 [cs.CL]. [99] Ashish Vaswani et al. \"Attention is All you Need\"."},{"citing_arxiv_id":"2510.25741","ref_index":84,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Latent Reasoning via Looped Language Models","primary_cat":"cs.CL","submitted_at":"2025-10-29T17:45:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.05528","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization","primary_cat":"cs.LG","submitted_at":"2025-10-07T02:39:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARMOR is a one-shot post-training algorithm that factorizes weight matrices into a 2:4 sparse core wrapped by adaptive block-diagonal matrices, outperforming existing semi-structured pruning on Llama and Qwen models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.25699","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AIM-CoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language 
Reasoning","primary_cat":"cs.CV","submitted_at":"2025-09-30T02:57:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIM-CoT enhances interleaved multimodal chain-of-thought reasoning by adding context-enhanced attention generation, active visual probing via information foraging, and dynamic attention-shift triggering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.18265","ref_index":117,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","primary_cat":"cs.CV","submitted_at":"2025-08-25T17:58:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"14 Evaluation on Language Capability To evaluate the language capabilities of InternVL3.5, we use benchmarks covering comprehensive assessments in general knowledge (MMLU [44], CMMLU [61], C-Eval [49], GAOKAO-Bench [177]), linguistic understanding (TriviaQA [52], NaturalQuestions [56], C3 [115], RACE [57]), reasoning (WinoGrande [107], HellaSwag [172], BigBench Hard [ 117]), mathematics (GSM8K-Test [ 18], MATH [ 45], AIME24 [ 84], AIME25 [ 85]), and 20 Model Text2SVG Img2SVG FID ↓ FID-C ↓ CLIP ↑ DINO ↑ SSIM ↑ LPIPS ↓ PSNR ↑ InternVL3.5-1B 22.50 12.16 72.43 0.79 0.57 0.35 7.35 InternVL3.5-2B 20.98 11.26 72.71 0.81 0.56 0.34 7.44 InternVL3.5-4B 17.06 7.54 74.35 0.84 0.61 0.30 8.37 Llama-3.1-8B [32] 19.43 11.25 71.86 - - - 
-"},{"citing_arxiv_id":"2508.15487","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dream 7B: Diffusion Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-08-21T12:09:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and quality-speed tradeoffs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.23009","ref_index":76,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead","primary_cat":"cs.LG","submitted_at":"2025-07-30T18:14:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.20534","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Kimi K2: Open Agentic Intelligence","primary_cat":"cs.LG","submitted_at":"2025-07-28T05:35:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"For Chinese 
language capabilities, we evaluate on C-Eval [30], CMMLU [41], and CSimpleQA [23]. BaselinesWe benchmark against leading open-source foundation models: DeepSeek-V3-Base [11], Qwen2.5-72B- Base [60] (Note that Qwen3-235B-A22B-Base is not open-sourced, and the largest open-sourced base model in the Qwen series is Qwen2.5-72B-Base), and Llama 4-Maverick [71] (Llama 4-Behemoth is also not open-sourced). All models are evaluated under identical configurations to ensure fair comparison. Evaluation ConfigurationsWe employ perplexity-based evaluation for MMLU, MMLU-Redux, GPQA-Diamond, HellaSwag, ARC-Challenge, C-Eval, and CMMLU. Generation-based evaluation is used for MMLU-Pro, SuperGPQA, TriviaQA, BBH, CSimpleQA, MATH, CMATH, GSM8K, GSM8K-Platinum, CRUXEval, LiveCodeBench, and"},{"citing_arxiv_id":"2507.00432","ref_index":187,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2025-07-01T05:23:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.18315","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Effective LLM Code Refinement via Property-Oriented and Structurally Minimal Feedback","primary_cat":"cs.SE","submitted_at":"2025-06-23T06:01:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PGS generates property-oriented, structurally minimal feedback from high-level program properties to refine LLM code, yielding up to 13.4% pass@1 gains and 1.4-1.6x higher 
bug-fix rates than prior TDD and debugging baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.13674","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention","primary_cat":"cs.CL","submitted_at":"2025-06-16T16:30:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}