Stereo Magnification: Learning View Synthesis using Multiplane Images
Pith reviewed 2026-05-12 23:58 UTC · model grok-4.3
The pith
A deep network predicts a multiplane image from a stereo pair to synthesize novel views that extrapolate far beyond the input baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a multiplane image inferred from a narrow-baseline stereo pair is sufficient to synthesize a continuous range of novel views, including those that extrapolate significantly beyond the input baseline, and that a network trained on mined YouTube video data can produce such MPIs accurately enough to outperform prior view-synthesis baselines on this task.
What carries the argument
Multiplane images (MPIs): a fixed set of parallel RGB-alpha planes at discrete depths that together represent the scene and support differentiable rendering from new viewpoints.
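That rendering step can be written compactly: after each plane is warped into the target view, the planes are alpha-composited back to front with the standard over operator. The notation below is ours, a minimal sketch rather than the paper's own equations.

```latex
% Planes indexed d = 1,...,D from back (farthest) to front (nearest);
% c_d and \alpha_d are a plane's warped color and alpha at a given pixel.
\hat{I} \;=\; \sum_{d=1}^{D} c_d \, \alpha_d \prod_{i=d+1}^{D} \left( 1 - \alpha_i \right)
```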
If this is right
- Stereo pairs from everyday dual-lens phones become usable for wide-baseline view synthesis without additional hardware.
- View extrapolation becomes possible without building an explicit 3D mesh or point cloud.
- A single forward pass through the network yields a representation that supports rendering at arbitrary novel positions and distances.
- Large-scale training on unlabeled online video replaces the need for calibrated multi-view datasets.
Where Pith is reading between the lines
- The same MPI output could be used as an intermediate representation for other tasks such as depth estimation or light-field rendering.
- Because the planes are fixed in depth, the method may degrade for scenes with very large depth ranges or thin structures.
- Extending the approach to video input sequences could enable temporally coherent view synthesis.
Load-bearing premise
An MPI predicted from narrow-baseline stereo input is sufficient to accurately synthesize significantly extrapolated novel views.
What would settle it
Capture real ground-truth images from a camera baseline several times wider than the input stereo pair and compare them pixel-for-pixel or perceptually to the views rendered from the predicted MPI.
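As one concrete (and assumed, not prescribed by the paper) way to score such a comparison, standard pixel-wise and perceptual metrics such as PSNR and SSIM could be computed between each rendered view and the corresponding wide-baseline capture. The sketch below assumes scikit-image is available and that images are float arrays in [0, 1]; names are illustrative.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def compare_views(rendered: np.ndarray, ground_truth: np.ndarray):
    """Score agreement between a rendered extrapolated view and a real
    capture from the wider baseline (both HxWx3 float arrays in [0, 1])."""
    psnr = peak_signal_noise_ratio(ground_truth, rendered, data_range=1.0)
    ssim = structural_similarity(ground_truth, rendered,
                                 channel_axis=-1, data_range=1.0)
    return psnr, ssim
```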
original abstract
The view synthesis problem--generating novel views of a scene from known imagery--has garnered recent attention due in part to compelling applications in virtual and augmented reality. In this paper, we explore an intriguing scenario for view synthesis: extrapolating views from imagery captured by narrow-baseline stereo cameras, including VR cameras and now-widespread dual-lens camera phones. We call this problem stereo magnification, and propose a learning framework that leverages a new layered representation that we call multiplane images (MPIs). Our method also uses a massive new data source for learning view extrapolation: online videos on YouTube. Using data mined from such videos, we train a deep network that predicts an MPI from an input stereo image pair. This inferred MPI can then be used to synthesize a range of novel views of the scene, including views that extrapolate significantly beyond the input baseline. We show that our method compares favorably with several recent view synthesis methods, and demonstrate applications in magnifying narrow-baseline stereo images.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces multiplane images (MPIs) as a layered scene representation for the stereo magnification task: synthesizing novel views that extrapolate significantly beyond the narrow baseline of an input stereo pair. A deep network is trained to predict an MPI (per-plane RGB and alpha) from the stereo input, using supervision mined from adjacent frames in YouTube videos; the MPI is then rendered via homography warping to produce the extrapolated views. The abstract states that the approach compares favorably with prior view-synthesis methods and demonstrates applications to magnifying narrow-baseline stereo imagery.
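To make the rendering step concrete, the sketch below warps each RGBA plane into the target view with its plane-induced homography and over-composites back to front. It is a minimal NumPy/OpenCV illustration under assumed conventions (fronto-parallel planes in the source camera, X_t = R X_s + t), not the paper's implementation.

```python
import numpy as np
import cv2

def render_mpi(colors, alphas, depths, K_src, K_tgt, R, t, out_hw):
    """Render a target view from an MPI defined in the source camera.

    colors: (D, H, W, 3) float32 per-plane RGB; alphas: (D, H, W) float32;
    depths: (D,) plane depths, ordered back (farthest) to front (nearest).
    R, t map source-camera coordinates to target-camera coordinates.
    """
    n = np.array([0.0, 0.0, 1.0])  # fronto-parallel plane normal in the source frame
    out = np.zeros((*out_hw, 3), dtype=np.float32)
    for color, alpha, d in zip(colors, alphas, depths):
        # Homography mapping source pixels to target pixels for the plane z = d.
        H = K_tgt @ (R + np.outer(t, n) / d) @ np.linalg.inv(K_src)
        warped_c = cv2.warpPerspective(color, H, (out_hw[1], out_hw[0]))
        warped_a = cv2.warpPerspective(alpha, H, (out_hw[1], out_hw[0]))
        # Standard "over" compositing, back to front.
        out = warped_c * warped_a[..., None] + out * (1.0 - warped_a[..., None])
    return out
```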
Significance. If the quantitative claims hold, the work is significant for VR/AR pipelines that rely on consumer dual-lens cameras. The MPI representation offers a compact, differentiable alternative to explicit depth or mesh reconstruction, and the use of large-scale mined video data provides a scalable training regime that avoids the need for calibrated wide-baseline captures or ground-truth depth.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the claim that the method 'compares favorably with several recent view synthesis methods' is not supported by any quantitative numbers, error metrics, or baseline details in the abstract; without these, the central empirical claim cannot be evaluated and the soundness assessment remains low.
- [§3.2 and §5] §3.2 (MPI Rendering) and §5 (Discussion): the homography-based warping of predicted MPIs accumulates errors linearly with baseline distance; because training supervision is drawn only from adjacent YouTube frames (modest parallax), it is unclear whether the learned per-plane RGB/alpha values remain accurate for the 'significantly beyond the input baseline' regime asserted in the abstract.
minor comments (1)
- [§3.1] Notation: the number and spacing of MPI planes are treated as fixed hyperparameters; their sensitivity should be reported in an ablation.
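For context on that hyperparameter, MPI-style methods commonly place planes uniformly in inverse depth (disparity); the schedule below is an assumed illustration of that convention, not necessarily the spacing the paper uses.

```python
import numpy as np

def sample_plane_depths(d_near: float = 1.0, d_far: float = 100.0,
                        num_planes: int = 32) -> np.ndarray:
    """Plane depths spaced uniformly in disparity (inverse depth),
    from nearest to farthest."""
    disparities = np.linspace(1.0 / d_near, 1.0 / d_far, num_planes)
    return 1.0 / disparities
```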
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.
point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim that the method 'compares favorably with several recent view synthesis methods' is not supported by any quantitative numbers, error metrics, or baseline details in the abstract; without these, the central empirical claim cannot be evaluated and the soundness assessment remains low.
Authors: We agree that the abstract would be strengthened by including specific quantitative support for the comparison claim. Section 4 already contains the full set of error metrics (PSNR, SSIM, and others), baseline details, and comparisons against recent view-synthesis methods. In the revised manuscript we will update the abstract to concisely reference these key results, allowing the empirical claim to be evaluated directly from the abstract without lengthening it beyond standard limits. revision: yes
- Referee: [§3.2 and §5] §3.2 (MPI Rendering) and §5 (Discussion): the homography-based warping of predicted MPIs accumulates errors linearly with baseline distance; because training supervision is drawn only from adjacent YouTube frames (modest parallax), it is unclear whether the learned per-plane RGB/alpha values remain accurate for the 'significantly beyond the input baseline' regime asserted in the abstract.
Authors: This is a legitimate concern about generalization. Although supervision comes from modest-parallax adjacent frames, the network is trained on a large and diverse set of YouTube videos; the resulting per-plane RGB and alpha values are shown in Section 4 to support plausible synthesis at baselines larger than those seen during training. We will expand the discussion in the revised Section 5 to explicitly address error accumulation under homography warping and to clarify how the learned MPI representation enables the reported extrapolation results. revision: partial
Circularity Check
No significant circularity in the MPI prediction and rendering pipeline
full rationale
The paper describes a supervised learning pipeline: a deep network is trained on external YouTube video frames (treated as stereo pairs) to regress an MPI representation, which is then rendered via homography warping to novel views. The central claim—that the learned MPI enables significant extrapolation—rests on empirical training and held-out evaluation rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or steps reduce the output to the input by construction; the network parameters are optimized against independent video data, and the MPI rendering equations are standard homography compositions independent of the training regime.
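For reference, the "standard homography compositions" invoked here take the textbook plane-induced form: for a fronto-parallel source plane at depth $d$ (normal $n = (0,0,1)^\top$) and a target camera related to the source by $X_t = R X_s + t$, source and target pixels are related up to scale as below, and rendering samples the MPI through the inverse (target-to-source) mapping. This is the generic formula, not quoted from the paper.

```latex
x_t \;\sim\; K_t \left( R + \frac{t\, n^{\top}}{d} \right) K_s^{-1} \, x_s ,
\qquad n = (0, 0, 1)^{\top}
```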
Axiom & Free-Parameter Ledger
free parameters (2)
- number and depths of MPI planes
- neural network parameters
axioms (1)
- Domain assumption: a scene can be approximated by a stack of fronto-parallel planes, each carrying color and alpha.
invented entities (1)
- Multiplane Image (MPI): no independent evidence
Forward citations
Cited by 33 Pith papers
- GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
- $h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
h-control introduces block-conditional pseudo-Gibbs refinement for training-free camera control in flow-matching video generators, achieving superior FVD scores on RealEstate10K and DAVIS benchmarks.
- SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis
SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.
- AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
- GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds
GSCompleter completes sparse 3D Gaussian Splatting scenes via a distillation-free generate-then-register pipeline using Stereo-Anchor lifting and Ray-Constrained Registration, delivering SOTA results on three benchmarks.
- URoPE: Universal Relative Position Embedding across Geometric Spaces
URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, trac...
- UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
- GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
- TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.
- CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% b...
- Novel View Synthesis as Video Completion
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
- Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
- OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
- $\pi^3$: Permutation-Equivariant Visual Geometry Learning
π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and d...
- Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
- Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
- Long-tail Internet photo reconstruction
Finetuning 3D foundation models on simulated sparse subsets from MegaDepth-X produces robust reconstructions from extremely sparse, noisy internet photos while preserving performance on dense benchmarks.
- Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
- UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
- Self-Improving 4D Perception via Self-Distillation
SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...
- SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction
A feed-forward model regresses accurate Gaussian surfel geometry from sparse views using Nyquist-guided cross-view feature aggregation, achieving 100x speedup over optimization-based 3DGS surface methods on DTU benchmarks.
- INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
- NavCrafter: Exploring 3D Scenes from a Single Image
NavCrafter generates controllable novel-view videos from one image via video diffusion, geometry-aware expansion, and enhanced 3D Gaussian Splatting to achieve state-of-the-art synthesis under large viewpoint changes.
- HD-VGGT: High-Resolution Visual Geometry Transformer
HD-VGGT achieves state-of-the-art high-resolution 3D reconstruction from image collections via a dual-branch architecture that predicts coarse geometry at low resolution and refines details at high resolution while mo...
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
- SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
- RoSplat: Robust Feed-Forward Pixel-wise Gaussian Splatting for Varying Input Views and High-Resolution Rendering
RoSplat adds alpha normalization for brightness consistency across varying input views and a 3D sampling regularizer to mitigate hole artifacts in high-resolution feed-forward Gaussian splatting.
- AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting
AdaptSplat adds a lightweight Frequency-Preserving Adapter to vision foundation models that extracts direction-aware high-frequency priors and integrates them via positional encodings and residual modulation to improv...
- ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
- Video Generation with Predictive Latents
PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
- Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
UniSplat learns consistent 3D geometry, appearance, and semantics from unposed images using dual masking, progressive Gaussian splatting, and recalibration to align predictions across tasks.
- InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model
InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretra...
- Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...