{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:GQ7AFIA2KIS563MEOYXCERTLKJ","short_pith_number":"pith:GQ7AFIA2","schema_version":"1.0","canonical_sha256":"343e02a01a5225df6d84762e22466b52617a475fbcaf461a2972c81236a1a4b4","source":{"kind":"arxiv","id":"2509.02560","version":2},"attestation_state":"computed","paper":{"title":"FastVGGT: Training-Free Acceleration of Visual Geometry Transformer","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A 3D-specific token partitioning strategy lets token merging accelerate VGGT fourfold on thousand-image sequences without retraining.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Jiayi Ji, Liujuan Cao, Shengchuan Zhang, Xiawu Zheng, Yansong Qu, You Shen, Zhipeng Zhang","submitted_at":"2025-09-02T17:54:21Z","abstract_excerpt":"Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence image inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in the attention maps. Motivated by these findings, we explore the potential of token merging in the feed-forward visual geometry model. Owing to the unique architectur"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2509.02560","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2025-09-02T17:54:21Z","cross_cats_sorted":[],"title_canon_sha256":"775dee16a98c51d2dca3277c76c502537f8c521554eeb5ce1246ba918d1be400","abstract_canon_sha256":"b5183e4a8ae8fe8f06347e2d74565b2c0c9067442ce470bb14af321b1d5f9d74"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:49.752619Z","signature_b64":"6BJv8XeoB39XZT0+VlmWo2+QkzxZPYFzYu1RhlmB1uUjky/iZmNf4E2YpnhEMFC1e4xREbr6+8MZ9DeWhfNtCA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"343e02a01a5225df6d84762e22466b52617a475fbcaf461a2972c81236a1a4b4","last_reissued_at":"2026-05-17T23:38:49.752098Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:49.752098Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"FastVGGT: Training-Free Acceleration of Visual Geometry Transformer","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A 3D-specific token partitioning strategy lets token merging accelerate VGGT fourfold on thousand-image sequences without retraining.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Jiayi Ji, Liujuan Cao, Shengchuan Zhang, Xiawu Zheng, Yansong Qu, You Shen, Zhipeng Zhang","submitted_at":"2025-09-02T17:54:21Z","abstract_excerpt":"Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence image inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in the attention maps. Motivated by these findings, we explore the potential of token merging in the feed-forward visual geometry model. Owing to the unique architectur"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Notably, with 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the newly devised 3D-specific token partitioning strategy can remove redundant computation while fully preserving VGGT's reconstruction capacity, even though directly applying existing merging techniques is stated to be challenging due to architectural and task-specific properties.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"FastVGGT achieves 4x speedup on VGGT for 1000-image inputs using training-free token merging tailored to 3D architectures while reducing error accumulation.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A 3D-specific token partitioning strategy lets token merging accelerate VGGT fourfold on thousand-image sequences without retraining.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3d10fba8fe40e5c7ecbf125beace250215d7db342aa9ac59f305d10df01611b1"},"source":{"id":"2509.02560","kind":"arxiv","version":2},"verdict":{"id":"fad21ddd-9b7c-410d-ba62-094860d8edd2","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T23:32:44.742619Z","strongest_claim":"Notably, with 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios.","one_line_summary":"FastVGGT achieves 4x speedup on VGGT for 1000-image inputs using training-free token merging tailored to 3D architectures while reducing error accumulation.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the newly devised 3D-specific token partitioning strategy can remove redundant computation while fully preserving VGGT's reconstruction capacity, even though directly applying existing merging techniques is stated to be challenging due to architectural and task-specific properties.","pith_extraction_headline":"A 3D-specific token partitioning strategy lets token merging accelerate VGGT fourfold on thousand-image sequences without retraining."},"references":{"count":33,"sample":[{"doi":"","year":null,"title":"Token merging for fast sta- ble diffusion","work_id":"c2a6db5b-1e5d-46b2-bd17-02ef709ec2b4","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Token Merging: Your ViT But Faster","work_id":"528509bc-2611-4e7f-a772-ea14d25b6dae","ref_index":2,"cited_arxiv_id":"2210.09461","is_internal_anchor":true},{"doi":"","year":null,"title":"Pumer: Pruning and merging tokens for efficient vision language models.arXiv preprint arXiv:2305.17530,","work_id":"91361f0f-8572-41d1-a5eb-06f62fd5b464","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Emerg- ing properties in self-supervised vision transformers","work_id":"8a5a47d5-320f-431e-953a-e774104b82a0","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"vid-tldr: Training free token merging for light-weight video transformer","work_id":"aba45603-5ed7-40ed-9c87-a8fc5e7de13d","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":33,"snapshot_sha256":"203434c8ade2d746ee1f4491a2948ffb92dbaba8e303131c946409924889260a","internal_anchors":8},"formal_canon":{"evidence_count":3,"snapshot_sha256":"25739931bf32a146934c67e37139adaf5374be9b2e37ecfe90b2541c0e4150c1"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2509.02560","created_at":"2026-05-17T23:38:49.752177+00:00"},{"alias_kind":"arxiv_version","alias_value":"2509.02560v2","created_at":"2026-05-17T23:38:49.752177+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2509.02560","created_at":"2026-05-17T23:38:49.752177+00:00"},{"alias_kind":"pith_short_12","alias_value":"GQ7AFIA2KIS5","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"GQ7AFIA2KIS563ME","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"GQ7AFIA2","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":24,"internal_anchor_count":24,"sample":[{"citing_arxiv_id":"2605.15828","citing_title":"Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23889","citing_title":"HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21131","citing_title":"UniT: Unified Geometry Learning with Group Autoregressive Transformer","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06270","citing_title":"Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15828","citing_title":"Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16981","citing_title":"Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2603.04385","citing_title":"ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2603.20284","citing_title":"STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2603.18943","citing_title":"VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2603.27222","citing_title":"HD-VGGT: High-Resolution Visual Geometry Transformer","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12027","citing_title":"4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26341","citing_title":"SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08371","citing_title":"PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09644","citing_title":"Attention Itself Could Retrieve.RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23798","citing_title":"ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06270","citing_title":"Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08542","citing_title":"Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09366","citing_title":"Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07279","citing_title":"Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14141","citing_title":"Geometric Context Transformer for Streaming 3D Reconstruction","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13476","citing_title":"RobotPan: A 360$^\\circ$ Surround-View Robotic Vision System for Embodied Perception","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14025","citing_title":"Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective","ref_index":166,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15237","citing_title":"StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04435","citing_title":"Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes","ref_index":36,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/GQ7AFIA2KIS563MEOYXCERTLKJ","json":"https://pith.science/pith/GQ7AFIA2KIS563MEOYXCERTLKJ.json","graph_json":"https://pith.science/api/pith-number/GQ7AFIA2KIS563MEOYXCERTLKJ/graph.json","events_json":"https://pith.science/api/pith-number/GQ7AFIA2KIS563MEOYXCERTLKJ/events.json","paper":"https://pith.science/paper/GQ7AFIA2"},"agent_actions":{"view_html":"https://pith.science/pith/GQ7AFIA2KIS563MEOYXCERTLKJ","download_json":"https://pith.science/pith/GQ7AFIA2KIS563MEOYXCERTLKJ.json","view_paper":"https://pith.science/paper/GQ7AFIA2","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2509.02560&json=true","fetch_graph":"https://pith.science/api/pith-number/GQ7AFIA2KIS563MEOYXCERTLKJ/graph.json","fetch_events":"https://pith.science/api/pith-number/GQ7AFIA2KIS563MEOYXCERTLKJ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/GQ7AFIA2KIS563MEOYXCERTLKJ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/GQ7AFIA2KIS563MEOYXCERTLKJ/action/storage_attestation","attest_author":"https://pith.science/pith/GQ7AFIA2KIS563MEOYXCERTLKJ/action/author_attestation","sign_citation":"https://pith.science/pith/GQ7AFIA2KIS563MEOYXCERTLKJ/action/citation_signature","submit_replication":"https://pith.science/pith/GQ7AFIA2KIS563MEOYXCERTLKJ/action/replication_record"}},"created_at":"2026-05-17T23:38:49.752177+00:00","updated_at":"2026-05-17T23:38:49.752177+00:00"}