{"total":13,"items":[{"citing_arxiv_id":"2606.30347","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FFAvatar: Feed-Forward 4D Head Avatar Reconstruction from Sparse Portrait Images","primary_cat":"cs.CV","submitted_at":"2026-06-29T14:21:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FFAvatar uses a Transformer-based 3D Gaussian model with alternating attention and sparse-to-dense learning to enable feed-forward, incremental reconstruction of animatable 4D head avatars from sparse portrait images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10656","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-06-09T10:04:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Envision4D presents a feed-forward 4D Gaussian Splatting framework with future pose prediction, temporal attention, and conditioned motion lifting for pose-free extrapolation in autonomous driving scenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04593","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"4D Reconstruction from Sparse Dynamic Cameras","primary_cat":"cs.CV","submitted_at":"2026-06-03T08:31:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Presents a 3D track initialization method, depth-ordering regularization, and batch sampling for 4D reconstruction from sparse dynamic cameras, plus the LetCamsGo dataset showing gains in dynamic 
regions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31595","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-05-29T17:57:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"C4G introduces compact timestamp-conditioned Gaussian query tokens that aggregate full temporal context to decode 3D Gaussians with timestamp-modulated positions for feed-forward 4D reconstruction from monocular video, plus a diffusion-based rendering module and extension to 4D feature fields.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17303","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos","primary_cat":"cs.CV","submitted_at":"2026-05-17T07:41:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LongDPM introduces an overlap-aware chunk-based framework that registers and fuses local dynamic reconstructions to achieve coherent long-range 4D geometry and tracking from monocular video.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15239","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens","primary_cat":"cs.CV","submitted_at":"2026-04-16T17:12:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 
3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow. 1. Introduction The field of feed-forward neural reconstruction has recently seen great progress in terms of reconstruction quality [47, 54], scalability to large datasets [57], and support for dynamic scenes [23, 24, 49]. In certain scenarios, these methods are even starting to approach the quality of computationally intensive per-scene optimization methods. arXiv:2604.15239v1 [cs.CV] 16 Apr 2026 Despite this rapid progress, the dominant paradigm of using a large encoder-only 1 Transformer backbone to predict pixel-aligned 3D Gaussian primitives, still faces several"},{"citing_arxiv_id":"2604.14025","ref_index":189,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective","primary_cat":"cs.CV","submitted_at":"2026-04-15T16:07:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":",Puzzles [174], MegaSynth [175], Aug3D [176], MVBoost [177]. Visual Augmentation(§4.4.2) e.g.,MVSplat360 [178], ProSplat [179], LatentSplat [180], DIFIX3D+ [181]. Temporal-aware Models(§4.5) Online Streaming(§4.5.1) e.g.,StreamSplat [182], Cut3R [107], DGS-LRM [183], Stream3R [184], LongStream [185]. 
Offline Processing(§4.5.2) e.g.,L4GM [186], MonST3R [187], EgoMono4D [188], BTimer [189], 4D-LRM [190], Easi3R [191], 4DGT [192], 4Real-Video-V2 [193], MoVieS [194], MonoFusion [195], SEA-RAFT [196], EgoMono4D [188], BTimer [189]. Interactive Modeling(§4.5.3) e.g.,PIXIE [197], PhysGM [198]. Specialized Tasks(§4.5.4) e.g.,DAS3R [199], St4RTrack [200]. Figure 2. A taxonomy of feed-forward 3D reconstruction methods. This taxonomy summarizes the"},{"citing_arxiv_id":"2604.05182","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows","primary_cat":"cs.CV","submitted_at":"2026-04-06T21:21:12+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tion·3D foundation model 1 Introduction Recent years have witnessed rapid progress in the application of 3D foundation models-typically built on large-scale transformer architectures [60]-to tackle 3D tasks previously considered intractable. These tasks include joint estimation of geometry and camera parameters [33,47,61,62], dynamic scene reconstruction [44,46,49,69,78], and sparse-view reconstruction [22,26,63,79,84] and inverse rendering [41,75]. In object-centric reconstruction and inverse rendering arXiv:2604.05182v1 [cs.CV] 6 Apr 2026 2 Z. Li et al. 
Fig. 1: High-fidelity 3D reconstruction. Given 12-18 images (left), LSRM adapts Native Sparse Attention (NSA) to generate explicit meshes and textures in a single"},{"citing_arxiv_id":"2604.01204","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-04-01T17:48:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Neural Harmonic Textures add periodic feature interpolation and deferred neural decoding to primitive representations, achieving state-of-the-art real-time novel-view synthesis and bridging primitive and neural-field methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This shift is driven not only by significantly improved rendering speed, but also by the structural advantages of explicit representations. Primitive-based methods naturally adapt to scene detail, scale more gracefully, and readily support motion, deformation, and editing [5,62]. Furthermore, they align well with feed-forward reconstruction pipelines [31,72] and closely resemble widely adopted point-map representations [59,61]. Despite these advantages, the limited expressive power of individual primitives remains a fundamental bottleneck. 
Geometry and appearance are tightly coupled within each primitive, forcing high-frequency spatial detail to be represented by increasing the number of primitives, which directly increases memory"},{"citing_arxiv_id":"2511.00503","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models","primary_cat":"cs.CV","submitted_at":"2025-11-01T11:16:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.17568","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation","primary_cat":"cs.CV","submitted_at":"2025-10-20T14:17:16+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.10934","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViPE: Video Pose Engine for 3D Geometric Perception","primary_cat":"cs.CV","submitted_at":"2025-08-12T18:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated 
videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.03575","ref_index":111,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cosmos World Foundation Model Platform for Physical AI","primary_cat":"cs.CV","submitted_at":"2025-01-07T06:55:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}