{"paper":{"title":"TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A video diffusion transformer can be repurposed as a feed-forward dense 3D tracker that follows every pixel from a reference frame across a monocular video.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Honggyu An, Jaewoo Jung, Jahyeok Koo, Jisu Nam, Junhwa Hur, Seungryong Kim, Soowon Son","submitted_at":"2026-05-12T17:59:27Z","abstract_excerpt":"Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from interne"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We present TrackCraft3R, the first method to repurpose a video DiT as a feed-forward dense 3D tracker... achieves state-of-the-art performance on standard sparse and dense 3D tracking benchmarks, while running 1.3x faster and using 4.6x less peak memory than the strongest prior method.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the frame-anchored generative priors in pre-trained video DiTs can be converted into reliable reference-anchored dense 3D tracking through a dual-latent representation and temporal RoPE alignment with only LoRA fine-tuning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A video diffusion transformer can be repurposed as a feed-forward dense 3D tracker that follows every pixel from a reference frame across a monocular video.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4db706481e3eeff8d99cbdbf7b5ec30c4fd711a0d9e04ee77fffa59baba3323c"},"source":{"id":"2605.12587","kind":"arxiv","version":1},"verdict":{"id":"573bce5a-3d6e-4872-bba0-92ff116d4d5e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T21:25:08.872212Z","strongest_claim":"We present TrackCraft3R, the first method to repurpose a video DiT as a feed-forward dense 3D tracker... achieves state-of-the-art performance on standard sparse and dense 3D tracking benchmarks, while running 1.3x faster and using 4.6x less peak memory than the strongest prior method.","one_line_summary":"TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the frame-anchored generative priors in pre-trained video DiTs can be converted into reliable reference-anchored dense 3D tracking through a dual-latent representation and temporal RoPE alignment with only LoRA fine-tuning.","pith_extraction_headline":"A video diffusion transformer can be repurposed as a feed-forward dense 3D tracker that follows every pixel from a reference frame across a monocular video."},"references":{"count":87,"sample":[{"doi":"","year":2020,"title":"Mapillary planet-scale depth dataset","work_id":"4b55c119-e053-4b1f-9c10-30689584b3f1","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Track2Act: Predicting point tracks from internet videos enables generalizable robot manipulation","work_id":"18934b86-52a4-471c-a168-cf6e16d1500e","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","ref_index":3,"cited_arxiv_id":"2311.15127","is_internal_anchor":true},{"doi":"","year":2020,"title":"Virtual KITTI 2","work_id":"c0d9c030-aa25-44e7-9cc4-72d7403f1447","ref_index":4,"cited_arxiv_id":"2001.10773","is_internal_anchor":true},{"doi":"","year":2025,"title":"Videojam: Joint appearance-motion representations for enhanced motion generation in video models","work_id":"d22ef704-e6df-4caf-a9b3-f220ad768f8b","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":87,"snapshot_sha256":"bf148f163c9c8a93d3269521617b066f7a68f90ea76ecfe861f3ddcd4bda329f","internal_anchors":12},"formal_canon":{"evidence_count":2,"snapshot_sha256":"28b40a7bbbced35982456e3abb4503adc9ecc8ff5b4590f7cc76dadb964bb1d5"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}