{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:QIWSSMM3UF4CRTSAZCYHFZSAQL","short_pith_number":"pith:QIWSSMM3","schema_version":"1.0","canonical_sha256":"822d29319ba17828ce40c8b072e64082e1c0f3ddfe5483683d12be7640f3fee3","source":{"kind":"arxiv","id":"2605.13328","version":1},"attestation_state":"computed","paper":{"title":"What Limits Vision-and-Language Navigation ?","license":"http://creativecommons.org/licenses/by/4.0/","headline":"StereoNav uses target-location priors and stereo vision to achieve robust real-world vision-and-language navigation with fewer parameters and less data.","cross_cats":["cs.AI","cs.CL","cs.CV"],"primary_cat":"cs.RO","authors_text":"Jiaxi Zhang, Junzhe Xu, Kun Liu, Lusong Li, Renjing Xu, Taowen Wang, Wei Lu, Yixiao Feng, Yuetong Fang, Yunheng Wang, Zecui Zeng, Zizhao Yuan","submitted_at":"2026-05-13T10:41:24Z","abstract_excerpt":"Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action f"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2605.13328","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.RO","submitted_at":"2026-05-13T10:41:24Z","cross_cats_sorted":["cs.AI","cs.CL","cs.CV"],"title_canon_sha256":"ee0b99c94fee81331fa8b475bc09421d04906f0460a087bfdc634a7f9ebbd079","abstract_canon_sha256":"6d926be6e47a166eb417ceaf37a090f2d36bc628b3a6683386eb044e6963ccbf"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T02:44:48.590965Z","signature_b64":"GCVoChVRQ9QRRBP19Yzb9PJjUx3IA9Bd6EMV3pP+9oBptJPtFxNUQ9RrQMsVU3WATPKMsVam49InlAHnheZ9Dg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"822d29319ba17828ce40c8b072e64082e1c0f3ddfe5483683d12be7640f3fee3","last_reissued_at":"2026-05-18T02:44:48.590477Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T02:44:48.590477Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"What Limits Vision-and-Language Navigation ?","license":"http://creativecommons.org/licenses/by/4.0/","headline":"StereoNav uses target-location priors and stereo vision to achieve robust real-world vision-and-language navigation with fewer parameters and less data.","cross_cats":["cs.AI","cs.CL","cs.CV"],"primary_cat":"cs.RO","authors_text":"Jiaxi Zhang, Junzhe Xu, Kun Liu, Lusong Li, Renjing Xu, Taowen Wang, Wei Lu, Yixiao Feng, Yuetong Fang, Yunheng Wang, Zecui Zeng, Zizhao Yuan","submitted_at":"2026-05-13T10:41:24Z","abstract_excerpt":"Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action f"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the introduced Target-Location Priors remain invariant and useful across simulation-to-real domain shifts and that stereo vision reliably supplies depth cues that overcome motion blur and illumination changes without additional calibration or post-processing.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"StereoNav uses target-location priors and stereo vision to achieve robust real-world vision-and-language navigation with fewer parameters and less data.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a4b8385422623a48717818b495ac02a37394d09ebaf2a1bbcc1bf0c122b15adc"},"source":{"id":"2605.13328","kind":"arxiv","version":1},"verdict":{"id":"de9bb94e-4461-4073-91d6-de99ec619ad2","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T17:55:13.413265Z","strongest_claim":"StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments.","one_line_summary":"StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the introduced Target-Location Priors remain invariant and useful across simulation-to-real domain shifts and that stereo vision reliably supplies depth cues that overcome motion blur and illumination changes without additional calibration or post-processing.","pith_extraction_headline":"StereoNav uses target-location priors and stereo vision to achieve robust real-world vision-and-language navigation with fewer parameters and less data."},"references":{"count":59,"sample":[{"doi":"","year":2022,"title":"Vision-and-language navigation: A survey of tasks, methods, and future directions","work_id":"eb53cdc4-bf8b-4169-a97f-a90d52b6483b","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"arXiv preprint arXiv:2407.07035 , year=","work_id":"7543774d-9741-4892-b882-61376d9cf153","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"HomeRobot: Open-vocabulary mobile manipulation","work_id":"e30032a1-8971-4c74-9e96-365850cf0dce","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Navila: Legged robot vision-language-action model for navigation","work_id":"e88f8ab4-e2a0-4de1-8eed-bf6e07ae40cc","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Navid: Video-based vlm plans the next step for vision-and-language navigation","work_id":"bff28982-ff8e-4bb6-b489-ce19e4b5d8b6","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":59,"snapshot_sha256":"afc552b10de407913279d443f5d2accf3505cae14b057ff85a7673a99d9db687","internal_anchors":9},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.13328","created_at":"2026-05-18T02:44:48.590569+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.13328v1","created_at":"2026-05-18T02:44:48.590569+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.13328","created_at":"2026-05-18T02:44:48.590569+00:00"},{"alias_kind":"pith_short_12","alias_value":"QIWSSMM3UF4C","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"QIWSSMM3UF4CRTSA","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"QIWSSMM3","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/QIWSSMM3UF4CRTSAZCYHFZSAQL","json":"https://pith.science/pith/QIWSSMM3UF4CRTSAZCYHFZSAQL.json","graph_json":"https://pith.science/api/pith-number/QIWSSMM3UF4CRTSAZCYHFZSAQL/graph.json","events_json":"https://pith.science/api/pith-number/QIWSSMM3UF4CRTSAZCYHFZSAQL/events.json","paper":"https://pith.science/paper/QIWSSMM3"},"agent_actions":{"view_html":"https://pith.science/pith/QIWSSMM3UF4CRTSAZCYHFZSAQL","download_json":"https://pith.science/pith/QIWSSMM3UF4CRTSAZCYHFZSAQL.json","view_paper":"https://pith.science/paper/QIWSSMM3","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.13328&json=true","fetch_graph":"https://pith.science/api/pith-number/QIWSSMM3UF4CRTSAZCYHFZSAQL/graph.json","fetch_events":"https://pith.science/api/pith-number/QIWSSMM3UF4CRTSAZCYHFZSAQL/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/QIWSSMM3UF4CRTSAZCYHFZSAQL/action/timestamp_anchor","attest_storage":"https://pith.science/pith/QIWSSMM3UF4CRTSAZCYHFZSAQL/action/storage_attestation","attest_author":"https://pith.science/pith/QIWSSMM3UF4CRTSAZCYHFZSAQL/action/author_attestation","sign_citation":"https://pith.science/pith/QIWSSMM3UF4CRTSAZCYHFZSAQL/action/citation_signature","submit_replication":"https://pith.science/pith/QIWSSMM3UF4CRTSAZCYHFZSAQL/action/replication_record"}},"created_at":"2026-05-18T02:44:48.590569+00:00","updated_at":"2026-05-18T02:44:48.590569+00:00"}