Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.
Leave no context behind: Efficient infinite context transformers with infini-attention
2 Pith papers cite this work.