{"total":14,"items":[{"citing_arxiv_id":"2606.29182","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evidence-Informed LLM Beliefs for Continual Scientific Discovery","primary_cat":"cs.AI","submitted_at":"2026-06-28T04:06:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Evidence-informed belief updates make Bayesian surprise non-stationary in LLM hypothesis search, with embedding-based RAG identifying 37.5% spurious static surprisals and modified search (filtering plus diversity) yielding 30.62% higher accumulated non-stationary surprisal across five domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11437","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Power of Test-Time Training for Approximate Sampling","primary_cat":"cs.DS","submitted_at":"2026-06-09T20:48:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Establishes a quadratic lower bound on query complexity for sampling from large classes of distributions given approximate density oracles, answers an open question on optimality of random walks, and shows circumvention for bounded classes as an abstraction of TTT.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06906","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering","primary_cat":"cs.CL","submitted_at":"2026-06-05T04:49:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EASE-TTT creates a soft attention target from evidence chunks to guide query-side test-time 
adaptation, yielding higher macro-average scores than full-context, retrieval-only, and standard qTTT baselines on six LongBench QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05513","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts","primary_cat":"cs.AI","submitted_at":"2026-06-03T23:40:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EpiEvolve achieves 0.629 accuracy in streaming COVID-19 forecasting by using episodic memory, reflection on delayed labels, and regime-aware retrieval, outperforming static LLMs (0.561) and CDC ensembles (0.325) while halving recovery lag after regime shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04536","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling Self-Evolving Agents via Parametric Memory","primary_cat":"cs.AI","submitted_at":"2026-06-03T07:18:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TMEM lets LLM agents evolve their policy mid-episode by absorbing distilled supervision into online LoRA updates, outperforming summary and retrieval baselines on several long-context benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28349","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HMARS: A Hierarchical Multi-Agent Memory System for Long-Context 
Reasoning","primary_cat":"cs.IR","submitted_at":"2026-06-03T07:15:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HMARS introduces a hierarchical multi-agent memory system that outperforms standard retrieval and other baselines on long-document and multi-turn reasoning tasks through improved evidence coverage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22478","ref_index":58,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeliCIR: Deliberative Test-Time Evolutionary Hierarchical Multi-Agents for Composed Image Retrieval","primary_cat":"cs.CV","submitted_at":"2026-05-21T13:36:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes PDF, a hierarchical multi-agent Perception-to-Deliberation Framework that adds experience self-evolution and test-time scaling to composed image retrieval, claiming SOTA on CIRR, CIRCO, and FashionIQ.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13369","ref_index":9,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Query-Conditioned Test-Time Self-Training for Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-13T11:27:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11328","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Epistemic Uncertainty for Test-Time 
Discovery","primary_cat":"cs.LG","submitted_at":"2026-05-11T23:26:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00405","ref_index":15,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BOLT: Online Lightweight Adaptation for Preparation-Free Heterogeneous Cooperative Perception","primary_cat":"cs.CV","submitted_at":"2026-05-01T04:53:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BOLT is a 0.9M-parameter plug-and-play module that uses ego-as-teacher distillation on high-confidence predictions to align neighbor features online, raising AP@50 by up to 32.3 points over unadapted fusion while beating ego-only baselines on DAIR-V2X and OPV2V.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18131","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration","primary_cat":"cs.AI","submitted_at":"2026-04-20T11:54:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Locas: 
Your models are principled initializers of locally-supported parametric memories. arXiv preprint arXiv:2602.05085, 2026. [43] Mohammad Mahdi Moradi, Hossam Amer, Sudhir Mudur, Weiwei Zhang, Yang Liu, and Walid Ahmed. Continuous self-improvement of large language models by test-time training with verifier-driven sample selection. arXiv preprint arXiv:2505.19475, 2025. [44] Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models. arXiv preprint arXiv:2505.20633, 2025. [45] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention."},{"citing_arxiv_id":"2604.14142","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space","primary_cat":"cs.LG","submitted_at":"2026-04-15T17:59:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We employ Qwen3-4B and Qwen3-8B (Yang et al., 2025) as base models and train on the MATH dataset (Lewkowycz et al., 2022), which contains 7,500 problems. Detailed training hyperparameters and setups are provided in Appendix B.1. Evaluation Setup. We evaluate DSRL against several RL baselines, including GRPO (Shao et al., 2024), PPO (Schulman et al., 2017), Reinforce++ (Hu, 2025) and RLOO (Ahmadian et al., 2024). 
Also with the optimized version of GRPO: Dr.GRPO (Liu et al., 2025b) and DAPO (Yu et al., 2025) with only clip higher mechanism. The benchmarks cover: MATH500 (Lightman et al., 2023), AMC23 (MAA, 2023), AIME24 (MAA, 2024), AIME25 (MAA, 2025), Minerva (Lewkowycz et al., 2022) and OlympiadBench (He et al., 2024)."},{"citing_arxiv_id":"2604.06169","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"In-Place Test-Time Training","primary_cat":"cs.LG","submitted_at":"2026-04-07T17:59:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024. URL https://arxiv.org/abs/2404.06654. [29] Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models. arXiv preprint arXiv:2505.20633, 2025. URL https://arxiv.org/abs/2505.20633. Accepted at ICML 2025. [30] Kazuki Irie and Samuel J. Gershman. Fast weight programming and linear transformers: from machine learning to neurobiology, 2025. URL https://arxiv.org/abs/2508.08435. [31] Albert Q. 
Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne"},{"citing_arxiv_id":"2605.20189","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation","primary_cat":"cs.AI","submitted_at":"2026-03-23T07:18:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SOLAR introduces a self-optimizing agent using meta-learning on model weights and RL-driven strategy discovery for lifelong adaptation in LLMs, claiming superior performance on reasoning tasks across domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}