{"total":98,"items":[{"citing_arxiv_id":"2606.24790","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Grad Detect: Gradient-Based Hallucination Detection in LLMs","primary_cat":"cs.LG","submitted_at":"2026-06-23T16:46:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Grad Detect uses internal gradient patterns from one inference pass to predict LLM hallucinations and abstention, outperforming confidence and sampling baselines on Q&A benchmarks with most signal in the final five layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13814","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TASR: Training-Free Adaptive Stopping for Iterative Retrieval","primary_cat":"cs.IR","submitted_at":"2026-06-11T18:35:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TASR provides a training-free predicate that stops iterative retrieval on repeated normalized answers plus calibrated logit margin above 0.25, retaining 94.8% of fixed-k=5 F1 at 62.6% of the calls across 32 configurations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10657","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Are We Evaluating Knowledge or Phrasing? 
Mitigating MCQA Sensitivity with ParaEval","primary_cat":"cs.CL","submitted_at":"2026-06-09T10:05:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ParaEval reduces false performance gaps in MCQA benchmarks from over 2 points to below 1 point by scoring models on multiple paraphrases per answer option instead of single surface forms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00819","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mitigating Hallucinations in Large Language Models Via Decoder Layer Skipping","primary_cat":"cs.AI","submitted_at":"2026-05-30T17:40:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeLask dynamically skips hallucination-prone decoder layers in LLMs by measuring gradient driftance via cosine similarity and partially aggregating states instead of full skipping.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18597","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Action Reparameterization for Efficient Agent Inference","primary_cat":"cs.AI","submitted_at":"2026-05-18T16:07:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LAR learns a compact latent action space from trajectories that shortens the effective decision horizon for LLM agents, reducing token count and inference time while preserving task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12340","ref_index":56,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Online Learning-to-Defer with Varying 
Experts","primary_cat":"stat.ML","submitted_at":"2026-05-12T16:19:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10202","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Task-Aware Calibration: Provably Optimal Decoding in LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-11T08:48:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"calibration error of Hu and Wu [19] to an LLM's latent beliefs. In our work, we connect concepts 8 from the existing literature on calibration for classification with LLM's by interpreting their output distribution in a task-dependent latent space, and improve decision making such as decoding. Calibration in LLMs.Recent work has shown that LLMs are often miscalibrated [ 22, 38]. Conse- quently, methods emerged that recalibrate token probabilities or verbalized confidence scores [50, 29]. Some recent work also moves beyond token-level probabilities: Nakkiran et al. [33] show that language models can exhibit confidence calibration in semantic answer spaces. 
This semantic answer grouping can be seen as a special case of our proposed task confidence calibration by instantiating"},{"citing_arxiv_id":"2605.09492","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation","primary_cat":"cs.CL","submitted_at":"2026-05-10T11:57:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"APCD adaptively branches LLM decoding paths based on token entropy and contrasts divergent paths to improve factual accuracy while preserving efficiency.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"rejection rate (Reject). Since the OpenAI Curie model is no longer available, we use the OpenAI recommended replacement model, GPT-4o-mini, to train GPT-Judge and GPT-Info models, making sure that the hyperparameters and training data remain consistent. For the evaluation of TriviaQA, HotpotQA, and Natural Questions (NQ), we adopt the evaluation method from Joshi (2017), which compares the final model output, model_answer, with the ground_truth. If model_answer contains ground_truth, then the answer is considered correct; otherwise, it is considered incorrect. The accuracy is then calculated as the EM (Exact Match) score, and the F1 score is computed using the standard formula. 
The model_answer and ground_truth"},{"citing_arxiv_id":"2605.07234","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reformulating KV Cache Eviction Problem for Long-Context LLM Inference","primary_cat":"cs.CL","submitted_at":"2026-05-08T04:37:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"\"RULER: What's the Real Context Size of Your Long-Context Language Models?\" In:arXiv preprint arXiv:2404.06654(2024). [22] Luyang Huang et al. \"Efficient attentions for long document summarization\". In:arXiv preprint arXiv:2104.02112(2021). [23] Albert Q. Jiang et al.Mistral 7B. 2023. arXiv: 2310.06825 [cs.CL].URL: https://arxiv. org/abs/2310.06825. [24] Mandar Joshi et al. \"Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension\". In:arXiv preprint arXiv:1705.03551(2017). 10 [25] Ehsan Kamalloo et al. \"Evaluating open-domain question answering in the era of large language models\". 
In:Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)."},{"citing_arxiv_id":"2605.06765","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing","primary_cat":"cs.CL","submitted_at":"2026-05-07T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05777","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-07T07:09:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02178","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-04T03:15:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to 
guide exploration and avoid wasted rollouts.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"while preserving realistic user intent. At the end of each episode, the agent receives a terminal reward r=R(s T , a), where a=choose[buy] , y is the product selected in the final state sT , and Yatt, Yopt, and yprice denote the attributes, options, and price ofy. The reward is defined as: r=r type · |Uatt ∩Y att|+|U opt ∩Y opt|+1[y price ≤u price] |Uatt|+|U opt|+ 1 ,(9) where the type reward rtype =TextMatch(¯y,¯y∗) penalizes category mismatches between the predicted product y and target product y∗. Specifically, rtype assigns a low score when y and y∗ share similar attributes or options but belong to different product categories. For example, \"butter\" and \"plant-based meat\" may both exhibit attributes such as \"cruelty-free\""},{"citing_arxiv_id":"2605.02105","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting","primary_cat":"cs.LG","submitted_at":"2026-05-04T00:02:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For a given pretrained checkpointθPT, we define the learning-forgetting tradeoff set as T(θPT) = {( LFT(θFT),L PT(θFT) )⏐⏐θFT∈ΘFT(θPT) } .(8) To enable loss-matched comparison across pretrained checkpoints, we define a common fine-tuning loss threshold as follows. 
For each checkpointθ(i) PT, we first compute the minimum fine-tuning loss achieved within its tradeoff set: L(i) min = min (LFT,L PT)∈T(θ(i) PT) LFT.(9) We then define the global fine-tuning thresholdτas the maximum over these per-checkpoint minima: τ= max i L(i) min.(10) For each pretrained checkpoint, we report the retained pretraining loss LPT corresponding to the model on its tradeoff frontier whose fine-tuning loss satisfiesL FT≤τ. D Additional results for OLMo-2-1B experiments D.1 Post-training quantization"},{"citing_arxiv_id":"2604.26525","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PRAG: End-to-End Privacy-Preserving Retrieval-Augmented Generation","primary_cat":"cs.CR","submitted_at":"2026-04-29T10:46:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRAG delivers end-to-end private RAG with 72-74% recall via non-interactive homomorphic approximations, interactive client assistance, and operation-error estimation to preserve ranking quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24623","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation","primary_cat":"cs.AI","submitted_at":"2026-04-27T15:52:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph 
centrality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23108","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mixture of Heterogeneous Grouped Experts for Language Modeling","primary_cat":"cs.CL","submitted_at":"2026-04-25T02:05:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22271","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals","primary_cat":"cs.LG","submitted_at":"2026-04-24T06:33:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21231","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference","primary_cat":"cs.NI","submitted_at":"2026-04-23T02:55:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping 
execution and adjusting for runtime conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20452","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HaS: Accelerating RAG through Homology-Aware Speculative Retrieval","primary_cat":"cs.IR","submitted_at":"2026-04-22T11:15:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HaS accelerates RAG retrieval via homology-aware speculative retrieval and homologous query re-identification validation, cutting latency 24-37% with 1-2% accuracy drop on tested datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20051","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text","primary_cat":"cs.CL","submitted_at":"2026-04-21T23:21:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instruction following.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"LLM-as-a-judge benchmarks. OOD Evaluations.To ensure that our trained model's performance does not degrade on other tasks, especially on verifiable tasks, we evaluate POP on ten out-of-distribution (OOD) benchmarks: on scientific reasoning (GPQA-Diamond [29]); math reasoning (GSM8K [6]; Math500 [13]; AIME2024; AIME2025); and factoid QA (NaturalQueries [19]; TriviaQA [18]; TruthfulQA [20]; MMLU-Pro [43]); MedMCQA [26]. 
We use 0-shot prompting for the benchmarks. Models.We experiment with both a pretrained base model (Qwen-2.5-7B [ 40]) and an instruction- finetuned model (Qwen-2.5-7B-Instruct [40]) as our reference modelπ ref . Sampling.We sample I= 4096 queries (except for Creative Writing, where we set I= 8192 ), each"},{"citing_arxiv_id":"2604.19899","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Reproducibility Study of Metacognitive Retrieval-Augmented Generation","primary_cat":"cs.IR","submitted_at":"2026-04-21T18:22:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19444","ref_index":167,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation","primary_cat":"cs.LG","submitted_at":"2026-04-21T13:25:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18738","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language 
Models","primary_cat":"cs.CL","submitted_at":"2026-04-20T18:43:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18159","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-20T12:22:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FreezeEmpath achieves emotionally expressive speech output and strong performance on empathetic dialogue, speech emotion recognition, and spoken QA tasks by training with a frozen LLM on existing speech datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17112","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification","primary_cat":"cs.AI","submitted_at":"2026-04-18T19:00:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12529","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BackFlush: Knowledge-Free Backdoor 
Detection and Elimination with Watermark Preservation in Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-04-15T10:56:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12056","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-13T20:53:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LoSA caches prefix attention for stable tokens in block-wise DLMs and applies sparse attention only to active tokens, preserving near-dense accuracy while achieving 1.54x lower attention density and up to 4.14x speedup.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08519","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts","primary_cat":"cs.CL","submitted_at":"2026-04-09T17:55:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full 
dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22782","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing","primary_cat":"cs.LG","submitted_at":"2026-04-03T14:56:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02934","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PolyReal: A Benchmark for Real-World Polymer Science Workflows","primary_cat":"cs.CV","submitted_at":"2026-04-03T10:05:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InProceedings of the AAAI conference on artificial intelligence, pages 590-597, 2019. 3 [22] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1-9, 2016. 3 [23] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettle- moyer. Triviaqa: A large scale distantly supervised chal- lenge dataset for reading comprehension.arXiv preprint arXiv:1705.03551, 2017. 
1 [24] Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. Are you smarter than a sixth grader? textbook question answer-"},{"citing_arxiv_id":"2604.09666","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems","primary_cat":"cs.IR","submitted_at":"2026-04-01T07:21:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A Benchmark for Question Answering Research.Transactions of the Association for Computational Linguistics7 (2019), 452-466. doi:10.1162/tacl_a_00276 [19] Meng-Chieh Lee, Qi Zhu, Costas Mavromatis, Zhen Han, Soji Adeshina, Vassilis N Ioannidis, Huzefa Rangwala, and Christos Faloutsos. 2025. Hybgrag: Hybrid retrieval-augmented generation on textual and relational knowledge bases. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 
879-893."},{"citing_arxiv_id":"2603.18297","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Path-Constrained Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-03-18T21:35:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.17839","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How do LLMs Compute Verbal Confidence","primary_cat":"cs.CL","submitted_at":"2026-03-18T15:31:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mechanistic experiments on Gemma 3 27B, Qwen 2.5 7B and Magistral Small 24B show verbal confidence is cached at post-answer positions from answer tokens and captures richer answer-quality information beyond token log-probabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.17837","ref_index":16,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning","primary_cat":"eess.AS","submitted_at":"2026-03-18T15:30:29+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.15031","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attention 
Residuals","primary_cat":"cs.CL","submitted_at":"2026-03-16T09:32:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.08022","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization","primary_cat":"cs.LG","submitted_at":"2026-03-09T06:58:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CAMEL is a scaling law capturing nonlinear model-size and mixture interactions to extrapolate optimal data mixtures for large LLMs from small-model experiments, reducing optimization cost by 50% and improving benchmarks by up to 3%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.23516","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens","primary_cat":"cs.CL","submitted_at":"2026-03-06T02:29:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% 
degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20854","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ERA: Evidence-based Reliability Alignment for Honest Retrieval-Augmented Generation","primary_cat":"cs.IR","submitted_at":"2026-02-24T01:45:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ERA models internal and external knowledge as independent Dirichlet belief masses and uses Dempster-Shafer Theory to quantify conflicts, enabling better abstention decisions in RAG systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01203","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse","primary_cat":"cs.CL","submitted_at":"2026-02-01T12:45:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.08584","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ministral 3","primary_cat":"cs.CL","submitted_at":"2026-01-13T14:06:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding 
capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.23213","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process","primary_cat":"cs.CL","submitted_at":"2025-12-29T05:25:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.15745","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLaDA2.0: Scaling Up Diffusion Language Models to 100B","primary_cat":"cs.LG","submitted_at":"2025-12-10T09:26:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.06655","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Graph-Regularized Sparse Autoencoders for LLM Safety Steering","primary_cat":"cs.LG","submitted_at":"2025-12-07T04:46:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GSAE improves selective refusal on safety benchmarks by smoothing SAE directions over a co-activation graph and applying them via a two-gate controller, outperforming 
standard SAEs and baselines on Llama-3 and other models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.09803","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG","primary_cat":"cs.CL","submitted_at":"2025-11-12T23:09:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TARG uses uncertainty scores from a short no-context draft to gate retrieval in RAG, matching Always-RAG accuracy while cutting retrievals by 70-90% on QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.00739","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective","primary_cat":"cs.AI","submitted_at":"2025-11-01T23:46:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency by up to 2.49x on two hardware systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.26692","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Kimi Linear: An Expressive, Efficient Attention Architecture","primary_cat":"cs.CL","submitted_at":"2025-10-30T16:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks 
while slashing KV cache by 75% and speeding decoding up to 6x.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"interactions, and complex decision spaces at inference time, exposes fundamental inefficiencies in standard attention mechanisms. In particular, the quadratic time complexity and the linearly growing key-value (KV) cache of softmax attention introduce substantial computational and memory overheads, hindering throughput, context-length scaling, and real-time interactivity. Linear attention [48] offers a principled approach to reducing computational complexity but has historically underperformed softmax attention in language modeling-even for short sequences-due to limited expressivity. Recent advances have significantly narrowed this gap, primarily through two innovations: gating or decay mechanisms [92, 16, 114] and the delta rule [84, 112, 111, 71]."},{"citing_arxiv_id":"2511.00066","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sharpness-Guided Group Relative Policy Optimization via Probability Shaping","primary_cat":"cs.LG","submitted_at":"2025-10-29T08:07:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.16079","ref_index":31,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle","primary_cat":"cs.CL","submitted_at":"2025-10-17T12:03:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EvolveR 
enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding better results on multi-hop QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.12539","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations","primary_cat":"cs.IR","submitted_at":"2025-09-16T00:41:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LEAF distills teacher-aligned student embedding models that achieve new SOTA results on BEIR and MTEB for their size class while requiring only modest data and compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.05276","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SpikingBrain: Spiking Brain-inspired Large Models","primary_cat":"cs.LG","submitted_at":"2025-09-05T17:34:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}