{"paper":{"title":"Test-Time Training Done Right","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Large-chunk updates during inference make test-time training efficient enough to scale nonlinear states to 40 percent of model parameters.","cross_cats":["cs.CL","cs.CV"],"primary_cat":"cs.LG","authors_text":"Fujun Luan, Hao Tan, Kai Zhang, Kalyan Sunkavalli, Sai Bi, Songlin Yang, Tianyuan Zhang, William T. Freeman, Yicong Hong","submitted_at":"2025-05-29T17:50:34Z","abstract_excerpt":"Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (referred to as fast weights) during inference. This fast weight, akin to recurrent states in RNNs, stores temporary memories of past tokens in the current sequence. Existing TTT methods struggled to show effectiveness in handling long-context data, due to their inefficiency on modern GPUs. The TTT layers in many of these approaches operate with extremely low FLOPs utilization (often <5%) because they deliberately apply small online minibatch sizes (e.g., updating fast weights every 16 or 64 tokens). M"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"LaCT improves hardware utilization by orders of magnitude, facilitates scaling of nonlinear state size (up to 40% of model parameters), and enables 14B-parameter AR video diffusion on 56K tokens and 1M-token novel view synthesis without custom kernels.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That performing weight updates on extremely large chunks (2K–1M tokens) preserves or improves modeling quality compared with the fine-grained causal updates used in prior TTT work.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Large-chunk updates during inference make test-time training efficient enough to scale nonlinear states to 40 percent of model parameters.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"45fd7ad930b4f0f9de25eda15400f3da3be0caaecde495345ee79f5310d14927"},"source":{"id":"2505.23884","kind":"arxiv","version":1},"verdict":{"id":"38c3eaa0-b718-4de0-af64-d6d1fe20b0d4","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T11:21:09.142492Z","strongest_claim":"LaCT improves hardware utilization by orders of magnitude, facilitates scaling of nonlinear state size (up to 40% of model parameters), and enables 14B-parameter AR video diffusion on 56K tokens and 1M-token novel view synthesis without custom kernels.","one_line_summary":"Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That performing weight updates on extremely large chunks (2K–1M tokens) preserves or improves modeling quality compared with the fine-grained causal updates used in prior TTT work.","pith_extraction_headline":"Large-chunk updates during inference make test-time training efficient enough to scale nonlinear states to 40 percent of model parameters."},"references":{"count":75,"sample":[{"doi":"","year":2017,"title":"Attention is all you need","work_id":"7a952c5f-f5c4-4a0a-bc7f-ed068389d046","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Learning to (Learn at Test Time): RNNs with Expressive Hidden States","work_id":"c682430c-e7a2-4699-b82d-55287448dbba","ref_index":2,"cited_arxiv_id":"2407.04620","is_internal_anchor":true},{"doi":"","year":2021,"title":"Linear transformers are secretly fast weight programmers","work_id":"ae448930-6886-42c9-805c-97ecbc17cbbc","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Ke Alexander Wang, Jiaxin Shi, and Emily B. Fox. Test-time regression: a unifying framework for designing sequence models with associative memory, 2025","work_id":"44a6409a-3393-4270-aa11-8d7bc801effe","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Titans: Learning to Memorize at Test Time","work_id":"fb2b7625-b733-43cb-af52-00b0a31a8d7f","ref_index":5,"cited_arxiv_id":"2501.00663","is_internal_anchor":true}],"resolved_work":75,"snapshot_sha256":"4ba522b94a6af93d8af7b6df2144a50c6d67ae7d1aa6c0b12c145ece4f5d9063","internal_anchors":19},"formal_canon":{"evidence_count":3,"snapshot_sha256":"0a1c32f2d7dca4ba20ad95b79fae0d9fbcbbdf98a56bfd7bfcbd04eca39c731f"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}