{"total":18,"items":[{"citing_arxiv_id":"2606.27786","ref_index":100,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SHIFT: Gate-Modulated Activation Steering for Knowledge Conflict Mitigation in Retrieval-Augmented Generation","primary_cat":"cs.CL","submitted_at":"2026-06-26T07:17:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SHIFT reformulates neuron editing as learnable gate modulation on under 0.01% parameters to let LLMs adaptively balance contextual and parametric knowledge during RAG generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12397","ref_index":37,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Redesign Mixture-of-Experts Routers with Manifold Power Iteration","primary_cat":"cs.LG","submitted_at":"2026-06-10T17:57:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Manifold Power Iteration aligns MoE router rows with principal singular directions of experts via a power-then-retract process, with theory showing convergence and experiments on 1B-11B models showing gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08347","ref_index":31,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs","primary_cat":"cs.CL","submitted_at":"2026-06-06T21:36:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TN-gram replaces per-order hash tables in n-gram memory modules with a CP tensor factorization that shares token-position factors and uses order-absorption vectors, achieving comparable or better performance with fewer 
parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19220","ref_index":66,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering","primary_cat":"cs.CL","submitted_at":"2026-05-19T00:47:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23051","ref_index":31,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Evaluating Temporal Consistency in Multi-Turn Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-24T22:44:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21106","ref_index":48,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"How Much Is One Recurrence Worth? 
Iso-Depth Scaling Laws for Looped Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-22T21:51:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Association for Computational Linguistics. doi: 10.18653/v1/N19-1246. URL https://aclanthology.org/N19-1246/. 13 [47] Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249-266, 2019. doi: 10.1162/tacl_a_00266. URL https://aclanthology.org/Q19-1016/. [48] Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080-"},{"citing_arxiv_id":"2604.20267","ref_index":60,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ATIR: Towards Audio-Text Interleaved Contextual Retrieval","primary_cat":"cs.SD","submitted_at":"2026-04-22T07:11:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19520","ref_index":22,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SimDiff: Depth Pruning via Similarity and Difference","primary_cat":"cs.AI","submitted_at":"2026-04-21T14:43:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19245","ref_index":68,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-21T08:50:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18770","ref_index":49,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Agentic GraphRAG: Navigating Unstructured Financial Data with Collaborative AI","primary_cat":"cs.IR","submitted_at":"2026-04-15T16:16:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agentic GraphRAG constructs a Neo4j graph via deterministic structured ingestion plus LLM extraction from notices, then deploys modular agents with tool access and reflection to outperform vector-RAG baselines on Swiss commercial gazette data 
across entity resolution, answer quality, and multi-turn ","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05350","ref_index":9,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"DQA: Diagnostic Question Answering for IT Support","primary_cat":"cs.CL","submitted_at":"2026-04-07T02:42:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DQA maintains persistent diagnostic state and aggregates retrievals at the root-cause level to reach 78.7% success on 150 enterprise IT scenarios versus 41.3% for standard multi-turn RAG while cutting average turns from 8.4 to 3.9.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.14427","ref_index":44,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-02-20T10:25:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.10813","ref_index":87,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory","primary_cat":"cs.CL","submitted_at":"2024-10-14T17:59:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and 
proposing indexing-retrieval-reading optimizations that boost performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.10692","ref_index":35,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-08-20T09:42:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A regression model using attention features and recurrent uncertainty scores improves selective generation in LLMs over unsupervised and supervised baselines on ten datasets and three models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.11794","ref_index":152,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"DataComp-LM: In search of the next generation of training sets for language models","primary_cat":"cs.LG","submitted_at":"2024-06-17T17:42:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Glu variants improve transformer. ArXiv preprint, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202. 26 [151] Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. ArXiv preprint, abs/2310.16789, 2023. URL https://arxiv.org/abs/2310.16789. [152] Igor Shilov, Matthieu Meeus, and Yves-Alexandre de Montjoye. 
Mosaic memory: Fuzzy duplication in copyright traps for large language models. ArXiv preprint, abs/2405.15523, 2024. URL https://arxiv.org/abs/2405.15523. [153] Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron T Parisi, Abhishek"},{"citing_arxiv_id":"2311.12983","ref_index":134,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GAIA: a benchmark for General AI Assistants","primary_cat":"cs.CL","submitted_at":"2023-11-21T20:34:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.06161","ref_index":233,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"StarCoder: may the source be with you!","primary_cat":"cs.CL","submitted_at":"2023-05-09T08:16:42+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2207.14255","ref_index":61,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Efficient Training of Language Models to Fill in the Middle","primary_cat":"cs.CL","submitted_at":"2022-07-28T17:40:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Autoregressive language models trained on data 
with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}