Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D

Jonah Philion; Sanja Fidler

arxiv: 2008.05711 · v1 · pith:XLDAHXALnew · submitted 2020-08-13 · 💻 cs.CV

classification 💻 cs.CV

keywords birds-eye-view · representations · camera · model · motion · planning · approach

The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single "bird's-eye-view" coordinate frame for consumption by motion planning. We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a rasterized bird's-eye-view grid. By training on the entire camera rig, we provide evidence that our model is able to learn not only how to represent images but how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error. On standard bird's-eye-view tasks such as object segmentation and map segmentation, our model outperforms all baselines and prior work. In pursuit of the goal of learning dense representations for motion planning, we show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network. We benchmark our approach against models that use oracle depth from lidar. Project page with code: https://nv-tlabs.github.io/lift-splat-shoot .
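
The three steps named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes that "lift" weights a per-pixel context vector by a softmax over discrete depth bins, that "splat" sum-pools features into a square grid centered on the ego vehicle, and that "shoot" selects the template trajectory with the lowest summed cost. The function names (`lift`, `splat`, `shoot`, `cell_index`) and all parameters are hypothetical, and the frustum points are assumed to be already expressed in ego-frame metres.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cell_index(x, y, grid_size, cell):
    """Map ego-frame metres to a grid cell, or None if outside the grid."""
    half = grid_size * cell / 2.0
    i = math.floor((x + half) / cell)
    j = math.floor((y + half) / cell)
    if 0 <= i < grid_size and 0 <= j < grid_size:
        return i, j
    return None


def lift(depth_logits, context):
    """Lift one pixel: one weighted copy of `context` per depth bin (assumed scheme)."""
    weights = softmax(depth_logits)
    return [[w * c for c in context] for w in weights]


def splat(points, feats, grid_size, cell, channels):
    """Sum-pool point features from any number of cameras into one BEV grid."""
    bev = [[[0.0] * channels for _ in range(grid_size)] for _ in range(grid_size)]
    for (x, y), feat in zip(points, feats):
        idx = cell_index(x, y, grid_size, cell)
        if idx is None:
            continue  # points outside the grid are dropped
        i, j = idx
        for k in range(channels):
            bev[i][j][k] += feat[k]
    return bev


def shoot(cost_map, templates, cell):
    """Return the index of the template trajectory with the lowest summed cost."""
    grid_size = len(cost_map)

    def trajectory_cost(traj):
        total = 0.0
        for x, y in traj:
            idx = cell_index(x, y, grid_size, cell)
            if idx is not None:  # waypoints outside the grid are ignored (assumption)
                total += cost_map[idx[0]][idx[1]]
        return total

    costs = [trajectory_cost(t) for t in templates]
    return costs.index(min(costs))
```

Because `splat` only sees a flat list of points and features, the same call handles one camera or many, which mirrors the "arbitrary camera rigs" framing of the title. In a learned model the depth logits, context vectors, and cost map would all be network outputs rather than hand-set values.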

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes
cs.CV 2026-04 unverdicted novelty 6.0

GOLD-BEV learns dense BEV semantic maps including dynamic agents from ego-centric sensors by using synchronized aerial imagery for training supervision and pseudo-label generation.

InfiniVerse: Occupancy Guided Unbounded Scene Generation for Autonomous Driving
cs.CV 2026-06 unverdicted novelty 5.0

InfiniVerse reconstructs 3D occupancy from one frame, extends scenes autoregressively, converts to video via diffusion, and uses re-projection feedback to achieve SOTA FID 6.4 and FVD 67.97 on Waymo and nuScenes.

MapATM: Enhancing HD Map Construction through Actor Trajectory Modeling
cs.CV 2026-04 unverdicted novelty 5.0

MapATM improves lane divider AP by 4.6 and mAP by 2.6 on nuScenes by treating actor trajectories as structural priors for road geometry.