hub

End-to-end test-time training for long context

Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun · 2025 · arXiv 2512.23675

14 Pith papers cite this work. Polarity classification is still indexing.

14 Pith papers citing it

read on arXiv browse 14 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

cs.AI · 2026-06-04 · unverdicted · novelty 8.0

CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.

MemDLM: Memory-Enhanced DLM Training

cs.CL · 2026-03-23 · unverdicted · novelty 7.0

MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.

Learning to Discover at Test Time

cs.LG · 2026-01-22 · unverdicted · novelty 7.0

TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.

T3R: Deeper Test-Time Adaptation for Graph Neural Networks via Gradient Rotation

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

T3R applies multiple Rotograd matrices and a rotation technique to create surrogate gradients, enabling deeper test-time adaptation in GNNs and yielding 0.172 MAE reduction plus 9.37% relative gains on OGB benchmarks.

Scaling Self-Evolving Agents via Parametric Memory

cs.AI · 2026-06-03 · unverdicted · novelty 6.0

TMEM lets LLM agents evolve their policy mid-episode by absorbing distilled supervision into online LoRA updates, outperforming summary and retrieval baselines on several long-context benchmarks.

AgentOdyssey: Open-Ended Long-Horizon Text Game Generation for Test-Time Continual Learning Agents

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

Introduces AgentOdyssey, a procedural generator of open-ended long-horizon text games, to evaluate test-time continual learning agents and diagnose limits in exploration, memory, and planning.

In-context learning to predict critical transitions in dynamical systems

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

TipPFN uses prior-data fitted networks and in-context learning on synthetic bifurcation data to detect proximity to critical transitions in unseen dynamical systems and real observations.

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, yielding gains on long-context benchmarks.

Fast Spatial Memory with Elastic Test-Time Training

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

IndexMem proposes a learned KV importance predictor paired with a latent memory module to enable bounded KV cache size for long-context inference, reporting gains on RULER, Needle-in-a-Haystack, and LongBench across multiple LLMs.

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.

Bilevel learning

math.OC · 2026-05-02 · unverdicted · novelty 2.0

Bilevel learning methods rely on implicit differentiation but are restricted by assumptions of unique lower-level solutions and struggle with constraints, and connections to broader bilevel optimization literature may enable more scalable general-purpose algorithms.

citing papers explorer

Showing 13 of 13 citing papers after filters.

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments cs.AI · 2026-06-04 · unverdicted · none · ref 31
CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.
Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference cs.CL · 2026-05-25 · unverdicted · none · ref 55
A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.
MemDLM: Memory-Enhanced DLM Training cs.CL · 2026-03-23 · unverdicted · none · ref 57
MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.
Learning to Discover at Test Time cs.LG · 2026-01-22 · unverdicted · none · ref 71
TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.
T3R: Deeper Test-Time Adaptation for Graph Neural Networks via Gradient Rotation cs.LG · 2026-06-29 · unverdicted · none · ref 15
T3R applies multiple Rotograd matrices and a rotation technique to create surrogate gradients, enabling deeper test-time adaptation in GNNs and yielding 0.172 MAE reduction plus 9.37% relative gains on OGB benchmarks.
Scaling Self-Evolving Agents via Parametric Memory cs.AI · 2026-06-03 · unverdicted · none · ref 14
TMEM lets LLM agents evolve their policy mid-episode by absorbing distilled supervision into online LoRA updates, outperforming summary and retrieval baselines on several long-context benchmarks.
AgentOdyssey: Open-Ended Long-Horizon Text Game Generation for Test-Time Continual Learning Agents cs.CL · 2026-05-29 · unverdicted · none · ref 27
Introduces AgentOdyssey, a procedural generator of open-ended long-horizon text games, to evaluate test-time continual learning agents and diagnose limits in exploration, memory, and planning.
In-context learning to predict critical transitions in dynamical systems cs.LG · 2026-05-12 · unverdicted · none · ref 50
TipPFN uses prior-data fitted networks and in-context learning on synthetic bifurcation data to detect proximity to critical transitions in unseen dynamical systems and real observations.
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning cs.CL · 2026-05-11 · unverdicted · none · ref 32
FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, yielding gains on long-context benchmarks.
Fast Spatial Memory with Elastic Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 45
Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.
IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference cs.CL · 2026-05-25 · unverdicted · none · ref 23
IndexMem proposes a learned KV importance predictor paired with a latent memory module to enable bounded KV cache size for long-context inference, reporting gains on RULER, Needle-in-a-Haystack, and LongBench across multiple LLMs.
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents cs.LG · 2026-05-07 · unverdicted · none · ref 38
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.
Bilevel learning math.OC · 2026-05-02 · unverdicted · none · ref 28
Bilevel learning methods rely on implicit differentiation but are restricted by assumptions of unique lower-level solutions and struggle with constraints, and connections to broader bilevel optimization literature may enable more scalable general-purpose algorithms.

End-to-end test-time training for long context

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer