{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:JKNNJPLSCSGNRU22IWJJDO5F3S","short_pith_number":"pith:JKNNJPLS","schema_version":"1.0","canonical_sha256":"4a9ad4bd72148cd8d35a459291bba5dcacfabdc1c6ad4938c7d7095b52ad75b7","source":{"kind":"arxiv","id":"2605.16899","version":1},"attestation_state":"computed","paper":{"title":"LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A dual-memory architecture with contrastive spatio-temporal learning builds latent cognitive maps for embodied agents.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Jinzhou Tang, Keze Wang, Sidi Liu, Waikit Xiu, Weixing Chen","submitted_at":"2026-05-16T09:21:56Z","abstract_excerpt":"A fundamental challenge in embodied AI is verifying if agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical as foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both ep"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2605.16899","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2026-05-16T09:21:56Z","cross_cats_sorted":[],"title_canon_sha256":"0402443ec61ebc687c11219e67b4f7113293ef06658ea8b2079b07c075110e0a","abstract_canon_sha256":"a0868a8cc29d741c352a9586f2097b2fc0b66b488480360a47911409940c3283"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-20T00:03:29.056103Z","signature_b64":"Q0wyHBOw3WgmNQK0ivFuMc+HggIV+UKHJwwqVb2wnc4CObH3I6qy9P+S8VfAdoXReIuZwN+mTESm7ezVouk8Bw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"4a9ad4bd72148cd8d35a459291bba5dcacfabdc1c6ad4938c7d7095b52ad75b7","last_reissued_at":"2026-05-20T00:03:29.055379Z","signature_status":"signed_v1","first_computed_at":"2026-05-20T00:03:29.055379Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A dual-memory architecture with contrastive spatio-temporal learning builds latent cognitive maps for embodied agents.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Jinzhou Tang, Keze Wang, Sidi Liu, Waikit Xiu, Weixing Chen","submitted_at":"2026-05-16T09:21:56Z","abstract_excerpt":"A fundamental challenge in embodied AI is verifying if agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical as foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both ep"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experiments demonstrate that our method achieves 2%-3.5% gains in both zero-shot generalization on standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"Foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences; the ST-CRL objective is assumed to supply exactly that missing signal.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LASAR pairs a dual-memory system with spatio-temporal contrastive learning to induce latent cognitive maps, reporting 2-3.5% zero-shot gains on VLN-CE and VSI-Bench plus high map self-consistency.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A dual-memory architecture with contrastive spatio-temporal learning builds latent cognitive maps for embodied agents.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7412bee171c37671f7cc0661cef794d8e13d4625c897e103e95c634d9e2c94c1"},"source":{"id":"2605.16899","kind":"arxiv","version":1},"verdict":{"id":"47f39333-f267-418c-b6f0-ecef42137a8c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T21:31:15.424475Z","strongest_claim":"Experiments demonstrate that our method achieves 2%-3.5% gains in both zero-shot generalization on standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.","one_line_summary":"LASAR pairs a dual-memory system with spatio-temporal contrastive learning to induce latent cognitive maps, reporting 2-3.5% zero-shot gains on VLN-CE and VSI-Bench plus high map self-consistency.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"Foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences; the ST-CRL objective is assumed to supply exactly that missing signal.","pith_extraction_headline":"A dual-memory architecture with contrastive spatio-temporal learning builds latent cognitive maps for embodied agents."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.16899/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"doi_title_agreement","ran_at":"2026-05-19T22:01:19.531210Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T21:40:53.102671Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"cited_work_retraction","ran_at":"2026-05-19T20:52:41.004959Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T18:41:56.278020Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T18:33:26.356921Z","status":"skipped","version":"1.0.0","findings_count":0}],"snapshot_sha256":"ee862382319554abfa35cc9514f219bd8ea129982404d03313bfb2b53803fae7"},"references":{"count":73,"sample":[{"doi":"","year":2023,"title":"Self-supervised object detection from egocentric videos","work_id":"a6a4b4eb-a1cd-4f11-b912-299e5b0c32ef","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Vision- and-language navigation: Interpreting visually-grounded navigation instructions in real environments","work_id":"a9820bf7-7e1b-470c-b002-2c2beab07ef3","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"org/abs/2510.19818","work_id":"4849b0f4-7516-4161-abbb-421b009d9a71","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Clark, and Aaron Wilber","work_id":"b74f8447-77fa-4bc0-ad60-190a22b70eb2","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Rt-2: Vision-language-action mod- els transfer web knowledge to robotic control, 2023","work_id":"e158e303-e769-4cf6-8788-77ac3138d07c","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":73,"snapshot_sha256":"fe607f4417e0e79b8a0219f94e5fb3810383f6aa623ad5eda4f584ef6ba56e83","internal_anchors":12},"formal_canon":{"evidence_count":2,"snapshot_sha256":"baea8f8b72b99bb0582a2a9fbd6a5b20075529f2953c6d3f28fc7c998a3111ad"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.16899","created_at":"2026-05-20T00:03:29.055472+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.16899v1","created_at":"2026-05-20T00:03:29.055472+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.16899","created_at":"2026-05-20T00:03:29.055472+00:00"},{"alias_kind":"pith_short_12","alias_value":"JKNNJPLSCSGN","created_at":"2026-05-20T00:03:29.055472+00:00"},{"alias_kind":"pith_short_16","alias_value":"JKNNJPLSCSGNRU22","created_at":"2026-05-20T00:03:29.055472+00:00"},{"alias_kind":"pith_short_8","alias_value":"JKNNJPLS","created_at":"2026-05-20T00:03:29.055472+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/JKNNJPLSCSGNRU22IWJJDO5F3S","json":"https://pith.science/pith/JKNNJPLSCSGNRU22IWJJDO5F3S.json","graph_json":"https://pith.science/api/pith-number/JKNNJPLSCSGNRU22IWJJDO5F3S/graph.json","events_json":"https://pith.science/api/pith-number/JKNNJPLSCSGNRU22IWJJDO5F3S/events.json","paper":"https://pith.science/paper/JKNNJPLS"},"agent_actions":{"view_html":"https://pith.science/pith/JKNNJPLSCSGNRU22IWJJDO5F3S","download_json":"https://pith.science/pith/JKNNJPLSCSGNRU22IWJJDO5F3S.json","view_paper":"https://pith.science/paper/JKNNJPLS","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.16899&json=true","fetch_graph":"https://pith.science/api/pith-number/JKNNJPLSCSGNRU22IWJJDO5F3S/graph.json","fetch_events":"https://pith.science/api/pith-number/JKNNJPLSCSGNRU22IWJJDO5F3S/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/JKNNJPLSCSGNRU22IWJJDO5F3S/action/timestamp_anchor","attest_storage":"https://pith.science/pith/JKNNJPLSCSGNRU22IWJJDO5F3S/action/storage_attestation","attest_author":"https://pith.science/pith/JKNNJPLSCSGNRU22IWJJDO5F3S/action/author_attestation","sign_citation":"https://pith.science/pith/JKNNJPLSCSGNRU22IWJJDO5F3S/action/citation_signature","submit_replication":"https://pith.science/pith/JKNNJPLSCSGNRU22IWJJDO5F3S/action/replication_record"}},"created_at":"2026-05-20T00:03:29.055472+00:00","updated_at":"2026-05-20T00:03:29.055472+00:00"}