End-to-end test-time training for long context
14 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (14)
roles: background (3)
polarities: background (3)
representative citing papers
Test-time training with KV binding reduces to learned linear attention.
citing papers explorer
- Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments
  CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.
- Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference
  A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.
- MemDLM: Memory-Enhanced DLM Training
  MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.
- Learning to Discover at Test Time
  TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.
- T3R: Deeper Test-Time Adaptation for Graph Neural Networks via Gradient Rotation
  T3R applies multiple Rotograd matrices and a rotation technique to create surrogate gradients, enabling deeper test-time adaptation in GNNs and yielding 0.172 MAE reduction plus 9.37% relative gains on OGB benchmarks.
- Scaling Self-Evolving Agents via Parametric Memory
  TMEM lets LLM agents evolve their policy mid-episode by absorbing distilled supervision into online LoRA updates, outperforming summary and retrieval baselines on several long-context benchmarks.
- AgentOdyssey: Open-Ended Long-Horizon Text Game Generation for Test-Time Continual Learning Agents
  Introduces AgentOdyssey, a procedural generator of open-ended long-horizon text games, to evaluate test-time continual learning agents and diagnose limits in exploration, memory, and planning.
- In-context learning to predict critical transitions in dynamical systems
  TipPFN uses prior-data fitted networks and in-context learning on synthetic bifurcation data to detect proximity to critical transitions in unseen dynamical systems and real observations.
- FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
  FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, yielding gains on long-context benchmarks.
- Fast Spatial Memory with Elastic Test-Time Training
  Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.
- IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference
  IndexMem proposes a learned KV importance predictor paired with a latent memory module to enable bounded KV cache size for long-context inference, reporting gains on RULER, Needle-in-a-Haystack, and LongBench across multiple LLMs.
- PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
  PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.
- Bilevel learning
  Bilevel learning methods rely on implicit differentiation but are restricted by assumptions of unique lower-level solutions and struggle with constraints, and connections to broader bilevel optimization literature may enable more scalable general-purpose algorithms.