{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:RCPETT5HFXXTYZCIDGIVXYAWDV","short_pith_number":"pith:RCPETT5H","schema_version":"1.0","canonical_sha256":"889e49cfa72def3c644819915be0161d71812901998c79e2d764dfbfa76e92d6","source":{"kind":"arxiv","id":"2407.04620","version":4},"attestation_state":"computed","paper":{"title":"Learning to (Learn at Test Time): RNNs with Expressive Hidden States","license":"http://creativecommons.org/licenses/by/4.0/","headline":"RNNs can match long-context performance by updating a learnable hidden-state model via self-supervised steps at test time.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Arjun Vikram, Carlos Guestrin, Genghan Zhang, Jiarui Xu, Karan Dalal, Sanmi Koyejo, Tatsunori Hashimoto, Xiaolong Wang, Xinhao Li, Xinlei Chen, Yann Dubois, Yu Sun","submitted_at":"2024-07-05T16:23:20Z","abstract_excerpt":"Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden states. We present a practical framework for instantiating sequence modeling layers with linear complexity and expressive hidden states. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2407.04620","kind":"arxiv","version":4},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2024-07-05T16:23:20Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"28bf260612ef235043aafc2f64009b40780baf577faf3678da216f6c9231734f","abstract_canon_sha256":"05b4d8152342b055af443082b4000e3e33ae32d46334dbf1752401c3572d0c9e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:53.408752Z","signature_b64":"cK95Vz4jMmKFC42dlOfMtHvEm/WTQ9En+dsKpBh0dy/o4nrwA3XE7e6THxtHVGrWIRqcfSaDnAFsbkbYKUvnBA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"889e49cfa72def3c644819915be0161d71812901998c79e2d764dfbfa76e92d6","last_reissued_at":"2026-05-17T23:38:53.408085Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:53.408085Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Learning to (Learn at Test Time): RNNs with Expressive Hidden States","license":"http://creativecommons.org/licenses/by/4.0/","headline":"RNNs can match long-context performance by updating a learnable hidden-state model via self-supervised steps at test time.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Arjun Vikram, Carlos Guestrin, Genghan Zhang, Jiarui Xu, Karan Dalal, Sanmi Koyejo, Tatsunori Hashimoto, Xiaolong Wang, Xinhao Li, Xinlei Chen, Yann Dubois, Yu Sun","submitted_at":"2024-07-05T16:23:20Z","abstract_excerpt":"Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden states. We present a practical framework for instantiating sequence modeling layers with linear complexity and expressive hidden states. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That performing gradient-based self-supervised updates on the hidden-state model at test time remains stable, computationally tractable, and beneficial without overfitting or excessive overhead at scale.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"RNNs can match long-context performance by updating a learnable hidden-state model via self-supervised steps at test time.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9095f13a6182e0b5955486320b67b84c814e7b8aa639ebaf1f48e65d1eb93dcd"},"source":{"id":"2407.04620","kind":"arxiv","version":4},"verdict":{"id":"e57d0e53-4244-4e10-b83a-5f1649bddba6","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T05:15:14.448221Z","strongest_claim":"TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context.","one_line_summary":"TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That performing gradient-based self-supervised updates on the hidden-state model at test time remains stable, computationally tractable, and beneficial without overfitting or excessive overhead at scale.","pith_extraction_headline":"RNNs can match long-context performance by updating a learnable hidden-state model via self-supervised steps at test time."},"references":{"count":85,"sample":[{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2016,"title":"Learning to learn by gradient descent by gradient descent","work_id":"10a07384-39c3-4ed9-9876-b91a06e77edc","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"You just found out your book was used to train ai","work_id":"177c1a74-b066-4e47-a0dd-3a374c625bdc","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"o ppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, G \\","work_id":"cd8ee2fb-957f-4f9d-bdf9-e168fba4c2b8","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1990,"title":"Learning a synaptic learning rule","work_id":"a61250e2-c21e-4545-9b59-e49d185d4f4e","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":85,"snapshot_sha256":"a5e266a013436bb700daa9fde2a862c497152e65109a0e936d186157f3c5aa39","internal_anchors":15},"formal_canon":{"evidence_count":2,"snapshot_sha256":"bd30cfad859502468e2d539bb2bf092191e9ddf5781c7d3128f227b3de8e073d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2407.04620","created_at":"2026-05-17T23:38:53.408199+00:00"},{"alias_kind":"arxiv_version","alias_value":"2407.04620v4","created_at":"2026-05-17T23:38:53.408199+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2407.04620","created_at":"2026-05-17T23:38:53.408199+00:00"},{"alias_kind":"pith_short_12","alias_value":"RCPETT5HFXXT","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"RCPETT5HFXXTYZCI","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"RCPETT5H","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":38,"internal_anchor_count":38,"sample":[{"citing_arxiv_id":"2410.04960","citing_title":"On Efficient Variants of Segment Anything Model: A Survey","ref_index":195,"is_internal_anchor":true},{"citing_arxiv_id":"2502.14644","citing_title":"LIFT: A Novel Framework for Enhancing Long-Context Understanding of LLMs via Long Input Fine-Tuning","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12770","citing_title":"WriteSAE: Sparse Autoencoders for Recurrent State","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2603.15031","citing_title":"Attention Residuals","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16350","citing_title":"Federated Nested Learning: Collaborative Training of Self-Referential Memories for Test-Time Adaptation","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17478","citing_title":"Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19444","citing_title":"Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2507.18809","citing_title":"Test-time Offline Reinforcement Learning on Goal-related Experience","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22630","citing_title":"StateX: Enhancing RNN Recall via Post-training State Expansion","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2509.24552","citing_title":"Short window attention enables long-term memorization","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2510.26083","citing_title":"Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2510.27258","citing_title":"Higher-order Linear Attention","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2509.26645","citing_title":"TTT3R: 3D Reconstruction as Test-Time Training","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2512.10267","citing_title":"Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2601.03323","citing_title":"Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2509.19349","citing_title":"ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution","ref_index":147,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23884","citing_title":"Test-Time Training Done Right","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15178","citing_title":"SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer","ref_index":77,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12770","citing_title":"WriteSAE: Sparse Autoencoders for Recurrent State","ref_index":67,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14477","citing_title":"Test-Time Learning with an Evolving Library","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2501.00663","citing_title":"Titans: Learning to Memorize at Test Time","ref_index":103,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12770","citing_title":"WriteSAE: Sparse Autoencoders for Recurrent State","ref_index":67,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13473","citing_title":"OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2603.29002","citing_title":"Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2510.26692","citing_title":"Kimi Linear: An Expressive, Efficient Attention Architecture","ref_index":93,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/RCPETT5HFXXTYZCIDGIVXYAWDV","json":"https://pith.science/pith/RCPETT5HFXXTYZCIDGIVXYAWDV.json","graph_json":"https://pith.science/api/pith-number/RCPETT5HFXXTYZCIDGIVXYAWDV/graph.json","events_json":"https://pith.science/api/pith-number/RCPETT5HFXXTYZCIDGIVXYAWDV/events.json","paper":"https://pith.science/paper/RCPETT5H"},"agent_actions":{"view_html":"https://pith.science/pith/RCPETT5HFXXTYZCIDGIVXYAWDV","download_json":"https://pith.science/pith/RCPETT5HFXXTYZCIDGIVXYAWDV.json","view_paper":"https://pith.science/paper/RCPETT5H","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2407.04620&json=true","fetch_graph":"https://pith.science/api/pith-number/RCPETT5HFXXTYZCIDGIVXYAWDV/graph.json","fetch_events":"https://pith.science/api/pith-number/RCPETT5HFXXTYZCIDGIVXYAWDV/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/RCPETT5HFXXTYZCIDGIVXYAWDV/action/timestamp_anchor","attest_storage":"https://pith.science/pith/RCPETT5HFXXTYZCIDGIVXYAWDV/action/storage_attestation","attest_author":"https://pith.science/pith/RCPETT5HFXXTYZCIDGIVXYAWDV/action/author_attestation","sign_citation":"https://pith.science/pith/RCPETT5HFXXTYZCIDGIVXYAWDV/action/citation_signature","submit_replication":"https://pith.science/pith/RCPETT5HFXXTYZCIDGIVXYAWDV/action/replication_record"}},"created_at":"2026-05-17T23:38:53.408199+00:00","updated_at":"2026-05-17T23:38:53.408199+00:00"}