{"total":48,"items":[{"citing_arxiv_id":"2605.23098","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UfM*: Uncertainty from Motion* for DNN Depth Estimation Using Gaussians","primary_cat":"cs.RO","submitted_at":"2026-05-21T23:08:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UfM* uses Gaussian mixtures to compute multiview disagreement for uncertainty in depth estimation with single inference per image, reducing energy and memory use.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22629","ref_index":86,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning","primary_cat":"cs.CV","submitted_at":"2026-05-21T15:38:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"H-Flow learns dense human scene flow from monocular video via joint pose and depth prediction in a multi-head transformer, using physics-inspired geometric and biomechanical priors for self-supervision, and introduces the DynAct4D synthetic benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21414","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction","primary_cat":"cs.RO","submitted_at":"2026-05-20T17:10:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained 
VLAs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20461","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Understanding Model Behavior in Monocular Polyp Sizing","primary_cat":"cs.CV","submitted_at":"2026-05-19T20:14:14+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Monocular polyp sizing models achieve moderate performance by exploiting examination behavior cues rather than true metric scales, with scale information and segmentation robustness acting as independent bottlenecks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19797","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth","primary_cat":"cs.CV","submitted_at":"2026-05-19T12:59:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18599","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:09:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decouples semantic and spatial tokens in NVS transformers to resolve representation ambiguity, yielding consistent 
gains with near-zero added latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18052","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Efficient 3D Content Reconstruction and Generation","primary_cat":"cs.CV","submitted_at":"2026-05-18T08:41:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"peatedly samples diverse multi-view images for curated creative prompts, computes MRC rewards, and updates the diffusion model (fig. 3.7). This diversity- and quality-preserving finetuning is not feasible with SFT alone, since collecting ground-truth multi-view images for such prompts is prohibitively expensive. We make three specific improvements to the RLFT algorithm [20]: we use a purely on-policy policy-gradient method [265] instead of partially on-policy PPO [214] to improve stability; we include KL regularization [64, 171] to stay close to the base model and avoid distribution shift; and we scale compute to reach optimal rewards using diffusion-model RLFT scaling laws identified empirically [20, 64]. 
Applying Carve3D RLFT to Instant3D-10K [120] (a multi-view diffusion model SFT from"},{"citing_arxiv_id":"2605.17661","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping","primary_cat":"cs.RO","submitted_at":"2026-05-17T21:36:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mono-Hydra++ is a monocular RGB-IMU pipeline that constructs hierarchical 3D scene graphs in real time while reporting lower trajectory error than some RGB-D baselines on indoor datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15876","ref_index":5,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unlocking Dense Metric Depth Estimation in VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-15T11:54:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09425","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation","primary_cat":"cs.CV","submitted_at":"2026-05-10T08:56:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated 
driving-scene images retain richer structural cues from the original annotations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Monocular depth estimation is important for 3D geometry in driving scenes. Eigen et al. proposed an early multiscale CNN approach [18]; Monodepth2 improved self-supervised depth [21]; DPT and MiDaS improved dense prediction and zero-shot transfer [62, 63]; Depth Anything and Depth Anything V2 used large-scale unlabeled and synthetic data [93, 94]; and ZoeDepth and Metric3Dv2 advanced relative and metric depth estimation [3, 33]. This work uses Metric3Dv2 as the depth projector. Because the camera intrinsics of all input images are not treated as known in the pipeline, the output is used as relative depth for geometry consistency rather than absolute metric distance. Figure 4 compares representative depth maps. Edges. Edges represent intensity changes and fine contours."},{"citing_arxiv_id":"2605.05390","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World","primary_cat":"cs.CV","submitted_at":"2026-05-06T19:23:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LAMP tracks 3D human motion from moving multi-camera headsets by converting 2D detections to a unified metric 3D world frame via device localization and fitting with an end-to-end spatio-temporal transformer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01852","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DP-SfM: Dual-Pixel Structure-from-Motion without Scale 
Ambiguity","primary_cat":"cs.CV","submitted_at":"2026-05-03T12:45:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Dual-pixel defocus blur enables absolute scale estimation in SfM without reference objects or calibration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00345","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Pose-Aware Diffusion for 3D Generation","primary_cat":"cs.CV","submitted_at":"2026-05-01T02:05:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the need for additional post-optimization or manual viewpoint alignment. 2.2 3D scene generation and composition Generating 3D scenes from a single or sparse set of images remains a challenging task. One group of methods adopts an explore-and-inpaint strategy [6,59,60]. These approaches typically utilize visual perceptual models like depth estimators [2,47] or dense stereo models [45,48] to warp input images into novel viewpoints, followed by 2D image or video diffusion models [50,54] to inpaint missing regions and incorporate them into 3D structure through optimization [4,36,44] or reconstruction models [22,61,62]. 
While such methods can produce visually plausible results from training views, their reliance on 2D diffusion priors"},{"citing_arxiv_id":"2605.00051","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning from the Unseen: Generative Data Augmentation for Geometric-Semantic Accident Anticipation","primary_cat":"cs.CV","submitted_at":"2026-04-29T15:29:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A generative video synthesis pipeline paired with a semantic graph neural network yields gains in accident anticipation accuracy and lead time on driving datasets, accompanied by a new benchmark release.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26488","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners","primary_cat":"cs.CV","submitted_at":"2026-04-29T09:51:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26454","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation","primary_cat":"cs.CV","submitted_at":"2026-04-29T09:11:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Layer 
analysis of DINOv3 shows non-uniform 3D geometric knowledge concentrated in deeper layers, enabling a last-layer-centric recombination module that improves monocular depth estimation accuracy to state-of-the-art levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22686","ref_index":4,"ref_count":3,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SS3D: End2End Self-Supervised 3D from Web Videos","primary_cat":"cs.CV","submitted_at":"2026-04-24T16:12:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This motivates evaluating depth, pose, and intrinsics jointly from a single trained feed-forward model checkpoint under a unified protocol that reflects end-to-end 3D consistency. Scaling to a large and heterogeneous video corpus. Recent supervised depth methods have improved substantially by training on mixtures of large labeled datasets [35,42,60] and scaling model capacity [4,41], and recent \"foundation-style\" 3D models leverage large amounts of annotated multi-domain 3D data [23,31,53,54,57] to reach outstanding performance. In contrast, SfM-based self-supervision methods remain tied to narrowly curated data or closely related mixtures (e.g., KITTI + Cityscapes [5] or mixture of indoor datasets [8]). 
However, scaling SfM-based self-supervision methods to unconstrained, multi-domain web"},{"citing_arxiv_id":"2604.16284","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan","primary_cat":"cs.CV","submitted_at":"2026-04-17T17:46:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A new wildlife-specific hazy image dataset and IncepDehazeGan model that reports state-of-the-art dehazing metrics and more than doubles downstream animal detection performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12592","ref_index":1,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-04-14T11:17:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ELoG-GS integrates geometry-aware initialization and luminance-guided photometric adaptation into Gaussian Splatting, achieving PSNR 18.66 and SSIM 0.69 on the NTIRE 2026 Track 1 low-light 3D reconstruction benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12309","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors","primary_cat":"cs.CV","submitted_at":"2026-04-14T05:35:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and 
consistent orbital videos from one image.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06576","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation","primary_cat":"cs.CV","submitted_at":"2026-04-08T01:52:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LiftFormer transforms monocular depth prediction into depth-oriented geometric and edge-aware subspace representations via lifting and frame theory, achieving state-of-the-art results on standard datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05715","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting","primary_cat":"cs.CV","submitted_at":"2026-04-07T11:15:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03339","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction","primary_cat":"cs.CV","submitted_at":"2026-04-03T07:59:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA 
adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduced error and fast inference.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[33] introduced the Swin Transformer with hierarchical feature maps and shifted window attention mechanisms, enabling efficient local-global information exchange. Subsequent works, including deformable attention modules [57] that select key-value positions in a data-dependent manner, skip attention modules [2] for pixel-level query refinement, and ZoeDepth [6] that pioneered the combination of relative and metric depth estimation, have collectively pushed the state of the art. DDP [25] innovatively integrated denoising diffusion processes with the Swin Transformer encoder, demonstrating robust performance in noisy and extreme scenarios. EVP [27] introduced inverse multi-attentive feature refinement to aggregate high-level spatial informa-"},{"citing_arxiv_id":"2603.28980","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas","primary_cat":"cs.CV","submitted_at":"2026-03-30T20:26:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.24577","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D 
Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-03-25T17:53:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EndoVGGT uses a dynamic DeGAT graph attention module to improve depth estimation and non-rigid 3D reconstruction in surgery, reporting 24.6% PSNR and 9.1% SSIM gains on SCARED with zero-shot generalization to new domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.18943","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation","primary_cat":"cs.CV","submitted_at":"2026-03-19T14:18:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.11566","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection","primary_cat":"cs.CV","submitted_at":"2026-03-12T05:41:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"R4Det fuses 4D radar and camera inputs via panoramic depth fusion, deformable gated temporal fusion without ego pose, and instance-guided refinement to reach state-of-the-art 3D detection on TJ4DRadSet and VoD.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.19035","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OpenVO: 
Open-World Visual Odometry with Temporal Dynamics Awareness","primary_cat":"cs.CV","submitted_at":"2026-02-22T04:18:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenVO estimates ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras by encoding temporal dynamics in a two-frame regression framework and using 3D priors from foundation models, delivering over 20% gains and 46-92% lower errors on KITTI, nuScenes, and A","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.09532","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes","primary_cat":"cs.CV","submitted_at":"2026-02-10T08:44:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RAD retrieves semantically similar RGB-D context samples for low-confidence regions and fuses them via matched cross-attention to cut relative absolute depth error by 29.2% on NYU Depth v2 underrepresented classes while staying competitive on standard benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.03209","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Depth Completion in Unseen Field Robotics Environments Using Extremely Sparse Depth Measurements","primary_cat":"cs.RO","submitted_at":"2026-02-03T07:24:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A depth completion network trained on synthetic field-robotics scenes predicts dense metric depth from extremely sparse real measurements and runs in real time on embedded hardware in unseen outdoor 
environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.19216","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bridging Visual and Wireless Sensing via a Unified Radiation Field for 3D Radio Map Construction","primary_cat":"cs.NI","submitted_at":"2026-01-27T05:35:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"URF-GS creates a single radiation field from visual and wireless observations via 3D Gaussian splatting to predict radio signals at any location and configuration with higher accuracy and fewer samples than prior NeRF approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.22274","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure","primary_cat":"cs.CV","submitted_at":"2025-12-25T03:28:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GeCo is a new geometry-based metric that produces dense maps of motion and structure inconsistencies in video generation by fusing residual motion and depth priors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.03454","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles","primary_cat":"cs.CV","submitted_at":"2025-12-03T05:14:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ThinkDeeper introduces a world-model-based reasoning step that predicts future spatial states to improve multimodal visual 
grounding for autonomous vehicles, achieving top results on Talk2Car and other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.17568","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation","primary_cat":"cs.CV","submitted_at":"2025-10-20T14:17:16+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.09880","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Geometry-Aware Scene Configurations for Novel View Synthesis","primary_cat":"cs.CV","submitted_at":"2025-10-10T21:36:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Geometry-guided adaptive placement of bases and virtual viewpoints improves rendering quality and memory use over uniform arrangements in scalable NeRF for large indoor scenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.06687","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Geometry-Aware Cross Modal Alignment for Light Field-LiDAR Semantic Segmentation","primary_cat":"cs.CV","submitted_at":"2025-10-08T06:15:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes the first light field-LiDAR semantic segmentation dataset and the Mlpfseg network, which improves mIoU by 1.71 over image-only and 2.38 over point-cloud-only baselines via feature completion and depth perception 
modules.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.13977","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2025-08-19T16:13:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ROVR is a new diverse depth dataset for autonomous driving with 200K frames, released pipelines, and ablations showing sparse ground truth supports model training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.02546","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details","primary_cat":"cs.CV","submitted_at":"2025-07-03T11:40:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2 Related Works Monocular metric depth estimation. Early works in this field [13, 15, 70, 4, 20] primarily focused on predicting metric depth in specific domains like indoor environments or street views, using limited data from certain RGBD cameras or LiDAR sensors. With the increasing availability of depth data from various sources, recent methods [5, 73, 23, 66, 67, 44, 7] have aimed to predict metric depth in open-domain settings. For example, Metric3D [73, 23] utilized numerous metric depth datasets and introduced a canonical camera transformation module to address metric ambiguity from diverse data sources. 
ZoeDepth [5] built on a relative depth estimation framework [47, 6] that is pre-trained"},{"citing_arxiv_id":"2504.17761","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Step1X-Edit: A Practical Framework for General Image Editing","primary_cat":"cs.CV","submitted_at":"2025-04-24T17:25:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models on the new GEdit-Bench.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"for these tasks, we utilize Qwen2.5-VL [3] and the Recognize-Anything Model [69] to identify target objects or keywords, followed by Flux-Fill [6] for content-aware inpainting. The instructions are automatically generated by Step-1o and the triplets are human-verified. Color Alteration & Material Modification: After detecting objects in the image, we employ ZoeDepth [4] for depth estimation to understand object geometry. Based on the identified target transformation (e.g., change of color or material), we use ControlNet [67] with diffusion model [1] to generate new images that preserve object identity while altering appearance attributes such as texture or color. 
Text Modification: For text-editing tasks, we differentiate between valid and invalid text edits."},{"citing_arxiv_id":"2502.20110","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler","primary_cat":"cs.CV","submitted_at":"2025-02-27T14:03:15+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UniDepthV2 predicts metric 3D points directly from single images using a self-promptable camera module, pseudo-spherical representation, and new losses for improved cross-domain generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.15830","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2025-01-27T07:34:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 million real-world episodes.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"More details of the model architecture and action encoding can be found in Appendix. B. Ego3D Position Encoding. The proposed Ego3D position encoding integrates depth information from the camera frame and image pixels to construct an egocentric 3D coordinate system, which eliminates the need for robot-camera extrinsic calibration and is agnostic to specific robot setups. 
Specifically, we use ZoeDepth [4] to estimate depth map D and obtain a)Action DistributionofΔRandΔT b) Action Grid Split from Distribution c) ActionGridsofΔRandΔTΔrollΔpitch Δyaw Δθ Δφ Δr Fig. 3: Illustration of adaptive action grids. (a) Statistics of translation and rotation action movements on the whole pre- training mixture, (b) grids are split on each action variable according to the probability density function of fitted Gaussian"},{"citing_arxiv_id":"2501.03717","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Materialist: Physically Based Editing Using Single-Image Inverse Rendering","primary_cat":"cs.CV","submitted_at":"2025-01-07T11:52:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Materialist performs single-image inverse rendering via neural-initialized progressive differentiable rendering to enable physically consistent material editing, object insertion, relighting, and transparency edits without full scene geometry.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.02576","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DepthMaster: Taming Diffusion Models for Monocular Depth Estimation","primary_cat":"cs.CV","submitted_at":"2025-01-05T15:18:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DepthMaster proposes a single-step diffusion model with Feature Alignment and Fourier Enhancement modules in a two-stage training process to improve generalization and detail preservation in monocular depth estimation over prior diffusion 
methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.03825","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion","primary_cat":"cs.CV","submitted_at":"2024-10-04T18:00:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"By fine-tuning DUST3R to output per-timestep pointmaps on scarce dynamic video datasets, MonST3R achieves stronger video depth and pose estimation without explicit motion modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.13107","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Robust Surgical Automation via Digital Twin Representations from Foundation Models","primary_cat":"cs.RO","submitted_at":"2024-09-19T22:24:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Digital twin representations from vision foundation models enable LLM-based planning for robust peg transfer and gauze retrieval on the dVRK surgical platform with claimed generalizability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.09414","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Depth Anything V2","primary_cat":"cs.CV","submitted_at":"2024-06-13T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student 
training.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"δ1 δ2 δ3 AbsRel RMSE log10 AdaBins [5] 0.903 0.984 0.997 0.103 0.364 0.044 DPT [55] 0.904 0.988 0.998 0.110 0.357 0.045 P3Depth [51] 0.898 0.981 0.996 0.104 0.356 0.043 SwinV2 [44] 0.949 0.994 0.999 0.083 0.287 0.035 AiT [49] 0.954 0.994 0.999 0.076 0.275 0.033 VPD [102] 0.964 0.995 0.999 0.069 0.254 0.030 IEBins [67] 0.936 0.992 0.998 0.087 0.314 0.038 ZoeDepth [6] 0.951 0.994 0.999 0.077 0.282 0.033 Ours (ViT-S) 0.961 0.996 0.999 0.073 0.261 0.032 Ours (ViT-B) 0.977 0.997 1.000 0.063 0.228 0.027 Ours (ViT-L) 0.984 0.998 1.000 0.056 0.206 0.024 (a) NYU-D dataset Method Higher is better ↑ Lower is better ↓ δ1 δ2 δ3 AbsRel RMSE RMSE log AdaBins [5] 0.964 0.995 0.999 0.058 2.360 0.088 P3Depth [51] 0.953 0.993 0."},{"citing_arxiv_id":"2406.04301","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Neural Surface Reconstruction from Sparse Views Using Epipolar Geometry","primary_cat":"cs.CV","submitted_at":"2024-06-06T17:47:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EpiS improves generalizable neural surface reconstruction from sparse views by guiding epipolar feature aggregation with cost volumes, using an epipolar transformer, and applying pretrained monocular depth constraints, outperforming prior methods on DTU and BlendedMVS.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.09631","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"3D-VLA: A 3D Vision-Language-Action Generative World Model","primary_cat":"cs.CV","submitted_at":"2024-03-14T17:58:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"3D-VLA is a new embodied foundation model that uses a 3D LLM plus 
aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", Hao, Y ., Singhal, S., Ma, S., Lv, T., Cui, L., Mohammed, O. K., Liu, Q., et al. Lan- guage is not all you need: Aligning perception with lan- guage models. arXiv preprint arXiv:2302.14045, 2023c. James, S., Ma, Z., Arrojo, D. R., and Davison, A. J. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters , 5(2):3019-3026, 2020. Jang, E., Irpan, A., Khansari, M., Kappler, D., Ebert, F., Lynch, C., Levine, S., and Finn, C. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning , pp. 991-1002. PMLR, 2022. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. Li, J., Li, D."}],"limit":50,"offset":0}