{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:WV4JHJDGDHCP4RJJFNGAJT53HW","short_pith_number":"pith:WV4JHJDG","schema_version":"1.0","canonical_sha256":"b57893a46619c4fe45292b4c04cfbb3d973179f5e76182f3d967260db12446c0","source":{"kind":"arxiv","id":"2605.15477","version":1},"attestation_state":"computed","paper":{"title":"EgoExo-WM: Unlocking Exo Video for Ego World Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Converting exocentric videos into egocentric views via body pose extraction allows training of more capable action-conditioned world models.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Danny Tran, Kristen Grauman, Roberto Mart\\'in-Mart\\'in","submitted_at":"2026-05-14T23:35:54Z","abstract_excerpt":"Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, inform"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2605.15477","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2026-05-14T23:35:54Z","cross_cats_sorted":[],"title_canon_sha256":"899deb79d3bf1d6a3350d96bfca6c87dca3762fe822fec8e7705abb7135b011d","abstract_canon_sha256":"4b369f3408f675104d157b77023414f44239dac8b2b2a6ae6d5950575340b8be"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-20T00:01:00.614569Z","signature_b64":"swS/dc05ZFhb7FZrpRuUg8e7IM8K1DbHwhWnDL/6RdE8SVIdYrY3Ixjzdt6P6mjHk9I78LaBgnL73m8ptJoVDg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"b57893a46619c4fe45292b4c04cfbb3d973179f5e76182f3d967260db12446c0","last_reissued_at":"2026-05-20T00:01:00.613441Z","signature_status":"signed_v1","first_computed_at":"2026-05-20T00:01:00.613441Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"EgoExo-WM: Unlocking Exo Video for Ego World Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Converting exocentric videos into egocentric views via body pose extraction allows training of more capable action-conditioned world models.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Danny Tran, Kristen Grauman, Roberto Mart\\'in-Mart\\'in","submitted_at":"2026-05-14T23:35:54Z","abstract_excerpt":"Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, inform"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The exocentric-to-egocentric video transformation, informed by a human kinematics prior, produces training data whose action representation and visual statistics remain sufficiently faithful to real egocentric observations that downstream world-model training and planning gains are not artifacts of the conversion process.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Converting exocentric video to egocentric format via body-pose extraction and kinematics prior enables training of action-conditioned egocentric world models that improve prediction quality and goal-directed planning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Converting exocentric videos into egocentric views via body pose extraction allows training of more capable action-conditioned world models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b434afafd1167bad17ef19e0e36cd14406a3213a364a23ca7f6327d4d418bda2"},"source":{"id":"2605.15477","kind":"arxiv","version":1},"verdict":{"id":"392a1e69-35e7-498f-9c49-44a82bd1303e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T14:28:18.801641Z","strongest_claim":"training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state.","one_line_summary":"Converting exocentric video to egocentric format via body-pose extraction and kinematics prior enables training of action-conditioned egocentric world models that improve prediction quality and goal-directed planning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The exocentric-to-egocentric video transformation, informed by a human kinematics prior, produces training data whose action representation and visual statistics remain sufficiently faithful to real egocentric observations that downstream world-model training and planning gains are not artifacts of the conversion process.","pith_extraction_headline":"Converting exocentric videos into egocentric views via body pose extraction allows training of more capable action-conditioned world models."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.15477/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"cited_work_retraction","ran_at":"2026-05-19T15:22:05.040577Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T15:01:17.557556Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T14:37:41.769120Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T14:21:54.082837Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"citation_quote_validity","ran_at":"2026-05-19T13:49:41.407528Z","status":"skipped","version":"0.1.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T13:33:22.658430Z","status":"skipped","version":"1.0.0","findings_count":0}],"snapshot_sha256":"da1b7d9557560a1bbc3f375829340d34318d61c82f89b5fedc4746185ffe3daa"},"references":{"count":84,"sample":[{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2025,"title":"Cosmos-transfer1: Conditional world generation with adaptive multimodal control","work_id":"f6758fe8-a1a6-4b00-9094-edba697f4c67","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Fiction: 4d future interaction prediction from video","work_id":"a82941b6-44c2-408f-ac2a-be0b77a3c0ef","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning","work_id":"a9c28401-f16a-4933-89f0-788e2f94e52b","ref_index":4,"cited_arxiv_id":"2506.09985","is_internal_anchor":true},{"doi":"","year":2025,"title":"Safemimic: Towards safe and autonomous human-to-robot imitation for mobile manipulation","work_id":"16fa32fe-60e5-47d9-97a7-0ae28e4508d1","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":84,"snapshot_sha256":"27f65732d413d110af9303e7ad30e965e7813e1e42b3fc33a849d827fd4aa7c6","internal_anchors":12},"formal_canon":{"evidence_count":2,"snapshot_sha256":"df11030e0f95acbded20b761ff6db8e0c787d8bd1930462294d6e03afc8bdf43"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.15477","created_at":"2026-05-20T00:01:00.613915+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.15477v1","created_at":"2026-05-20T00:01:00.613915+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.15477","created_at":"2026-05-20T00:01:00.613915+00:00"},{"alias_kind":"pith_short_12","alias_value":"WV4JHJDGDHCP","created_at":"2026-05-20T00:01:00.613915+00:00"},{"alias_kind":"pith_short_16","alias_value":"WV4JHJDGDHCP4RJJ","created_at":"2026-05-20T00:01:00.613915+00:00"},{"alias_kind":"pith_short_8","alias_value":"WV4JHJDG","created_at":"2026-05-20T00:01:00.613915+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/WV4JHJDGDHCP4RJJFNGAJT53HW","json":"https://pith.science/pith/WV4JHJDGDHCP4RJJFNGAJT53HW.json","graph_json":"https://pith.science/api/pith-number/WV4JHJDGDHCP4RJJFNGAJT53HW/graph.json","events_json":"https://pith.science/api/pith-number/WV4JHJDGDHCP4RJJFNGAJT53HW/events.json","paper":"https://pith.science/paper/WV4JHJDG"},"agent_actions":{"view_html":"https://pith.science/pith/WV4JHJDGDHCP4RJJFNGAJT53HW","download_json":"https://pith.science/pith/WV4JHJDGDHCP4RJJFNGAJT53HW.json","view_paper":"https://pith.science/paper/WV4JHJDG","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.15477&json=true","fetch_graph":"https://pith.science/api/pith-number/WV4JHJDGDHCP4RJJFNGAJT53HW/graph.json","fetch_events":"https://pith.science/api/pith-number/WV4JHJDGDHCP4RJJFNGAJT53HW/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/WV4JHJDGDHCP4RJJFNGAJT53HW/action/timestamp_anchor","attest_storage":"https://pith.science/pith/WV4JHJDGDHCP4RJJFNGAJT53HW/action/storage_attestation","attest_author":"https://pith.science/pith/WV4JHJDGDHCP4RJJFNGAJT53HW/action/author_attestation","sign_citation":"https://pith.science/pith/WV4JHJDGDHCP4RJJFNGAJT53HW/action/citation_signature","submit_replication":"https://pith.science/pith/WV4JHJDGDHCP4RJJFNGAJT53HW/action/replication_record"}},"created_at":"2026-05-20T00:01:00.613915+00:00","updated_at":"2026-05-20T00:01:00.613915+00:00"}