Test-time learning for large language models
14 Pith papers cite this work.
years: 2026 (14)
representative citing papers
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
citing papers explorer
- The Power of Test-Time Training for Approximate Sampling
Establishes a quadratic lower bound on query complexity for sampling from large classes of distributions given approximate density oracles, answers an open question on optimality of random walks, and shows circumvention for bounded classes as an abstraction of TTT.
- Evidence-Informed LLM Beliefs for Continual Scientific Discovery
Evidence-informed belief updates make Bayesian surprise non-stationary in LLM hypothesis search, with embedding-based RAG identifying 37.5% spurious static surprisals and modified search (filtering plus diversity) yielding 30.62% higher accumulated non-stationary surprisal across five domains.
- EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts
EpiEvolve achieves 0.629 accuracy in streaming COVID-19 forecasting by using episodic memory, reflection on delayed labels, and regime-aware retrieval, outperforming static LLMs (0.561) and CDC ensembles (0.325) while halving recovery lag after regime shifts.
- Scaling Self-Evolving Agents via Parametric Memory
TMEM lets LLM agents evolve their policy mid-episode by absorbing distilled supervision into online LoRA updates, outperforming summary and retrieval baselines on several long-context benchmarks.
- HMARS: A Hierarchical Multi-Agent Memory System for Long-Context Reasoning
HMARS introduces a hierarchical multi-agent memory system that outperforms standard retrieval and other baselines on long-document and multi-turn reasoning tasks through improved evidence coverage.
- DeliCIR: Deliberative Test-Time Evolutionary Hierarchical Multi-Agents for Composed Image Retrieval
Proposes PDF, a hierarchical multi-agent Perception-to-Deliberation Framework that adds experience self-evolution and test-time scaling to composed image retrieval, claiming SOTA on CIRR, CIRCO, and FashionIQ.
- Epistemic Uncertainty for Test-Time Discovery
UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.
- BOLT: Online Lightweight Adaptation for Preparation-Free Heterogeneous Cooperative Perception
BOLT is a 0.9M-parameter plug-and-play module that uses ego-as-teacher distillation on high-confidence predictions to align neighbor features online, raising AP@50 by up to 32.3 points over unadapted fusion while beating ego-only baselines on DAIR-V2X and OPV2V.
- Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.
- From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.
- EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering
EASE-TTT creates a soft attention target from evidence chunks to guide query-side test-time adaptation, yielding higher macro-average scores than full-context, retrieval-only, and standard qTTT baselines on six LongBench QA tasks.
- SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation
SOLAR introduces a self-optimizing agent using meta-learning on model weights and RL-driven strategy discovery for lifelong adaptation in LLMs, claiming superior performance on reasoning tasks across domains.