Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling

Junchi Yan; Shaofeng Zhang; Xiaosong Jia; Yihang Sun; Yihang Wu; Yu-Gang Jiang; Zuxuan Wu

arxiv: 2605.18599 · v1 · pith:INUO5E76new · submitted 2026-05-18 · 💻 cs.CV

Yihang Wu, Yihang Sun, Shaofeng Zhang, Zuxuan Wu, Junchi Yan, Xiaosong Jia, Yu-Gang Jiang

Pith reviewed 2026-05-20 10:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords: novel view synthesis, feedforward NVS, transformer, semantic-spatial decoupling, representation ambiguity, Plücker rays, shared attention, rendering fidelity

The pith

Decoupling semantic and spatial tokens in feedforward NVS transformers resolves representation ambiguity and improves fidelity with no added latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current transformer-based models for feedforward novel view synthesis mix semantic information such as RGB values with spatial information such as Plücker rays inside one shared feature space. The lattice structure in the rays creates a spatial bias that interferes with accurate appearance modeling and lowers final rendering quality. The paper introduces a decoupled architecture that maintains separate branches for semantic tokens and spatial tokens while still allowing interaction through shared attention routing. Optional categorized supervision gives each branch its own training signal, and bidirectional modulation strengthens the exchange between branches. The base version of this design adds virtually zero inference latency (12.3 ms against 12.2 ms for the baseline in the paper's latency table), which the abstract attributes to the architectural design.
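To make the "spatial information" concrete, here is a minimal sketch of how a per-pixel Plücker ray can be computed. It assumes the common convention of storing a ray as its unit direction `d` followed by the moment `o × d`; the paper's exact ordering, normalisation and camera model are not given on this page, and the helper names (`plucker_ray`, `pixel_rays`) are hypothetical.

```python
import math

def cross(a, b):
    """3-D cross product of two tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def plucker_ray(origin, direction):
    """6-D Plücker coordinates (d, o x d). ASSUMPTION: direction-first ordering."""
    d = normalize(direction)
    return d + cross(origin, d)

def pixel_rays(width, height, focal, origin=(0.0, 0.0, 0.0)):
    """Rays through pixel centres of a pinhole camera looking down +z.

    ASSUMPTION: camera axes coincide with world axes, principal point at the
    image centre. Real pipelines would apply the camera-to-world rotation.
    """
    cx, cy = width / 2.0, height / 2.0
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            direction = ((u + 0.5 - cx) / focal, (v + 0.5 - cy) / focal, 1.0)
            row.append(plucker_ray(origin, direction))
        rays.append(row)
    return rays

rays = pixel_rays(width=4, height=4, focal=4.0, origin=(0.5, 0.0, 0.0))
```

In this sketch every pixel receives a vector determined only by its grid position and the camera parameters, independent of image content. That regularity is one plausible reading of the "lattice-like" structure the paper describes, not a statement of the authors' analysis.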

Core claim

The central claim is that separating the representation into distinct semantic and spatial token branches, while keeping cross-branch interaction via shared attention routing, eliminates the interference that occurs when both types of information share a single feature space. Adding categorized supervision and bidirectional modulation further strengthens the branches without compromising the interaction, and the resulting models show consistent gains on both decoder-only and encoder-decoder feedforward NVS architectures.

What carries the argument

Semantic-spatial decoupling through separate token branches connected by shared attention routing.
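A single-head sketch of this mechanism follows, based on the Figure 3 caption (shared query-key interactions, independent value projections). How the shared queries and keys are formed from the two streams is not stated on this page; building them from the concatenated tokens is an assumption made here, and all weight names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def decoupled_attention(sem, spa, w):
    """One attention step over paired semantic (sem) and spatial (spa) tokens."""
    # ASSUMPTION: shared Q/K are projected from the concatenation of both streams.
    joint = [s + p for s, p in zip(sem, spa)]          # [n][2 * dim]
    q = matmul(joint, w["q"])
    k = matmul(joint, w["k"])
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    routing = softmax_rows(scores)                     # one routing matrix for both branches
    # Independent value projections keep the two streams in separate feature spaces.
    sem_out = matmul(routing, matmul(sem, w["v_sem"]))
    spa_out = matmul(routing, matmul(spa, w["v_spa"]))
    return sem_out, spa_out, routing

rng = random.Random(0)
n, dim = 5, 4
sem = rand_matrix(n, dim, rng)
spa = rand_matrix(n, dim, rng)
weights = {
    "q": rand_matrix(2 * dim, dim, rng),
    "k": rand_matrix(2 * dim, dim, rng),
    "v_sem": rand_matrix(dim, dim, rng),
    "v_spa": rand_matrix(dim, dim, rng),
}
sem_out, spa_out, routing = decoupled_attention(sem, spa, weights)
```

In this sketch the spatial value projection never touches the semantic output, so any spatial influence on appearance features has to pass through the routing weights. That is exactly the path the referee's leakage concern targets.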

If this is right

Consistent quality gains appear across both decoder-only and encoder-decoder feedforward NVS models.
Categorized supervision supplies branch-specific training signals that keep semantic and spatial learning distinct.
Bidirectional modulation strengthens information exchange between the two branches.
The architectural change introduces virtually zero extra inference latency.
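The page does not say how bidirectional modulation is implemented. One plausible form, assumed here purely for illustration, is FiLM-style scale-and-shift conditioning (the paper's reference list includes FiLM) applied in both directions. The parameter names and the `1 + gamma` parameterisation are hypothetical.

```python
def linear(x, weight, bias):
    """Apply y = x W + b to one token; weight is [in_dim][out_dim]."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + b
            for col, b in zip(zip(*weight), bias)]

def film(target, source, p):
    """Scale and shift `target` with parameters predicted from `source`."""
    gamma = linear(source, p["w_gamma"], p["b_gamma"])
    beta = linear(source, p["w_beta"], p["b_beta"])
    return [t * (1.0 + g) + b for t, g, b in zip(target, gamma, beta)]

def bidirectional_modulation(sem, spa, p_sem_from_spa, p_spa_from_sem):
    """Each branch conditions the other; both updates read the pre-update tokens."""
    new_sem = [film(s, p, p_sem_from_spa) for s, p in zip(sem, spa)]
    new_spa = [film(p, s, p_spa_from_sem) for s, p in zip(sem, spa)]
    return new_sem, new_spa

def zero_params(dim):
    """All-zero modulation parameters, so the layer starts as an identity."""
    return {"w_gamma": [[0.0] * dim for _ in range(dim)], "b_gamma": [0.0] * dim,
            "w_beta": [[0.0] * dim for _ in range(dim)], "b_beta": [0.0] * dim}

sem = [[1.0, 2.0], [3.0, 4.0]]
spa = [[0.5, -0.5], [1.0, 0.0]]
same_sem, same_spa = bidirectional_modulation(sem, spa, zero_params(2), zero_params(2))
```

With all-zero parameters the sketch returns both streams unchanged, which makes the extra cost of the modulation path easy to isolate when comparing against the decouple-only variant.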

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling pattern could be tested in other vision transformers that combine positional and content features, such as those used for 3D scene reconstruction.
Adjusting the strength of the shared attention links might allow the model to adapt to scenes with very different spatial complexity.
The method opens a route to study whether spatial bias appears in other multimodal vision tasks beyond novel view synthesis.

Load-bearing premise

Mixing semantic and spatial information into a shared feature space causes spatial bias to interfere with appearance representation and degrade rendering fidelity, and explicit decoupling plus shared attention resolves this without losing necessary cross-information.

What would settle it

Running the decoupled model against its mixed-feature baseline on a standard benchmark such as DTU or LLFF and measuring no gain or a drop in PSNR or SSIM would falsify the claim that decoupling improves fidelity.
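For reference, PSNR is computed as 10 · log10(MAX² / MSE). The sketch below applies it to two toy renderings of the same target. The pixel values are invented for illustration and are not results from the paper; `render_a` and `render_b` are hypothetical names.

```python
import math

def mse(pred, target):
    """Mean squared error between two equally sized pixel lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for pixel values in [0, max_value]."""
    err = mse(pred, target)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

# Toy data only: every pixel of render_a is off by 0.1, of render_b by 0.05.
target = [0.0, 0.5, 1.0, 0.25]
render_a = [0.1, 0.4, 0.9, 0.35]
render_b = [0.05, 0.45, 0.95, 0.30]

gain_db = psnr(render_b, target) - psnr(render_a, target)
```

Halving the per-pixel error raises PSNR by about 6 dB in this toy case, which gives a sense of scale when reading reported gains. The falsification test described above would be a `gain_db` at or below zero on real benchmark renderings.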

Figures

Figures reproduced from arXiv: 2605.18599 by Junchi Yan, Shaofeng Zhang, Xiaosong Jia, Yihang Sun, Yihang Wu, Yu-Gang Jiang, Zuxuan Wu.

**Figure 1.** Overview. In this work, we identify that mixing RGB and Plücker-ray information in a shared feature space can make spatial bias interfere with appearance representation. We decouple the two information streams while preserving cross-branch interaction to enhance rendering. Panel labels: Semantic, Spatial, Concatenated Feature, Input Image, Artifacts.

**Figure 2.** Feature Coupling and Cosine Similarity. Left: Semantic and spatial features are structured separately, while their concatenation exposes grid-like Plücker-related artifacts. Right: We sample vector pairs to estimate cosine similarity distributions for I–I (semantic), P–P (spatial), and I–P (cross-branch). Dashed lines indicate the mean of each group.
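The pair-sampling procedure described in the Figure 2 caption can be sketched as follows. The feature vectors here are random placeholders rather than model activations, and the sample counts and function names are assumptions; only the three groupings (I–I, P–P, I–P) come from the caption.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_pair_similarity(group_a, group_b, n_pairs, rng):
    """Mean cosine similarity over randomly sampled (a, b) pairs."""
    total, count = 0.0, 0
    while count < n_pairs:
        i = rng.randrange(len(group_a))
        j = rng.randrange(len(group_b))
        if group_a is group_b and i == j:
            continue  # skip trivial self-pairs within one group
        total += cosine(group_a[i], group_b[j])
        count += 1
    return total / n_pairs

rng = random.Random(0)
dim = 16
# Placeholder features, NOT taken from the paper.
sem_feats = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(32)]
spa_feats = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(32)]

means = {
    "I-I": mean_pair_similarity(sem_feats, sem_feats, 500, rng),
    "P-P": mean_pair_similarity(spa_feats, spa_feats, 500, rng),
    "I-P": mean_pair_similarity(sem_feats, spa_feats, 500, rng),
}
```

Run on real intermediate tokens, the three means would correspond to the dashed lines in the right-hand panel.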

**Figure 3.** Semantic–Spatial Decoupled Architecture for Feedforward NVS. Semantic tokens (I) from RGB and spatial tokens (P) from Plücker rays pass through decoupled-attention blocks. Attention shares query–key interactions while using independent value projections to preserve heterogeneous representations. Bidirectional modulation further enables cross-stream conditioning.

**Figure 4.** Feature Map Comparison across Model Variants. We visualize intermediate feature maps from middle Transformer layers: decoupling produces more structured representations, while supervision and modulation are most effective when applied on top of the decoupled token design.

Geometric Consistency of the Spatial Branch. For the P-branch, we use DA3-derived geometry [19] to construct visible cross-view corresp…

**Figure 5.** Novel View Synthesis Visual Comparison. Our decoupled model produces more coherent structures and sharper details.

**Figure 6.** Supervision Difference. Categorized supervision benefits the decoupled model but harms the entangled baseline.

4.4 Ablation Study. Table 2 summarizes the component and control ablations. Unless otherwise specified, feature visualizations are taken from middle Transformer layers. Decoupling and Categorized Supervision. In the entangled baseline, RGB and Plücker information share the same feature channels, s…

**Figure 7.** Shared vs. Independent QK. Shared Q/K with independent V yields better performance. Residual panel labels: layer 1, layer 3, layer 5, layer 7, layer 9, layer 11.

**Figure 10.** Entangled vs. Decoupled Modulation and Bidirectional Modulation. Left: Modulation improves performance in the decoupled structure but degrades performance in the entangled architecture. Right: Bidirectional modulation yields more structured representations.

Adjacent panel captions: (a) Intensity of Modulation. (b) Input-View Generalization. (c) Inference Latency…

Model latency (ms, lower is better): Baseline 12.2; Decouple 12.3; Decouple + Mod 13.2.
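The latency figures quoted in the Figure 10 caption can be turned into relative overheads with a few lines of arithmetic. The three numbers are from the caption; the function name is hypothetical.

```python
# Latency values (ms) as listed in the Figure 10 caption.
latency_ms = {"Baseline": 12.2, "Decouple": 12.3, "Decouple + Mod": 13.2}

def relative_overhead(variant, reference="Baseline"):
    """Extra latency of `variant` over `reference`, in percent."""
    base = latency_ms[reference]
    return 100.0 * (latency_ms[variant] - base) / base

for name in latency_ms:
    print(f"{name}: {latency_ms[name]:.1f} ms ({relative_overhead(name):+.2f}%)")
# Baseline: 12.2 ms (+0.00%)
# Decouple: 12.3 ms (+0.82%)
# Decouple + Mod: 13.2 ms (+8.20%)
```

On these numbers the "virtually zero" claim applies to the base decoupled design (under 1% overhead), while adding modulation costs roughly 8%.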

**Figure 12.** Layer-wise Feature Visualization of Decouple-Only. PCA visualizations of intermediate input-view and target-view features from the decouple-only model. Layer-wise Feature Evolution.

read the original abstract

Transformer-based models have advanced feedforward novel view synthesis (NVS). Current architectures such as GS-LRM and LVSM mix semantic information (e.g., RGB) and spatial information (e.g., Plücker rays) into a shared feature space. Since Plücker rays naturally carry lattice-like spatial structure, these designs can make the spatial bias interfere with appearance representation and degrade rendering fidelity. To this end, we propose to decouple the representation of feedforward NVS transformers into separate semantic and spatial tokens. The decoupled design keeps semantic and spatial information explicit in their branches while preserving cross-branch interaction through shared attention routing. Built on this design, we introduce optional categorized supervision and bidirectional modulation: the former provides branch-specific training signals, while the latter improves interaction between the two branches. Notably, the base decoupled design introduces virtually zero additional inference latency due to its architectural design. The proposed designs achieve consistent improvements, demonstrating effectiveness across decoder-only and encoder-decoder feedforward NVS models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The decoupling of semantic and spatial tokens via shared attention is a reasonable architectural response to mixing issues in NVS transformers, but the abstract gives almost no data to judge whether it actually delivers.

The paper identifies that mixing RGB-like semantic features with Plücker ray spatial features in the same token space lets the lattice structure from the rays interfere with appearance modeling. Their fix is to split into separate semantic and spatial branches while routing through shared attention, plus optional categorized supervision and bidirectional modulation to strengthen the branches and their interaction. The base version adds essentially no inference cost, which is a practical plus, and they say it improves both decoder-only and encoder-decoder feedforward NVS models over mixed baselines like GS-LRM and LVSM.

Referee Report

1 major / 1 minor

Summary. The paper claims that mixing semantic (RGB) and spatial (Plücker ray) information in shared feature spaces of feedforward NVS transformers introduces lattice-like spatial bias that interferes with appearance representation and degrades fidelity. It proposes decoupling into separate semantic and spatial token branches that interact via shared attention routing, augmented by optional categorized supervision and bidirectional modulation. The base decoupled architecture adds virtually zero inference latency and yields consistent empirical improvements across decoder-only and encoder-decoder NVS models.

Significance. If the decoupling demonstrably maintains separation while enabling useful cross-interaction, the approach supplies a low-overhead architectural principle that could improve rendering quality in feedforward NVS without sacrificing efficiency. The near-zero latency claim and cross-architecture validation would make the contribution practically relevant for real-time novel-view synthesis.

major comments (1)

§3.2 (Shared Attention Routing): the claim that explicit decoupling plus shared attention routing resolves spatial-to-semantic interference rests on the unverified assumption that attention weights do not permit substantial leakage of Plücker-ray lattice structure into semantic tokens. Without attention-map analysis or controlled ablations isolating the base decoupling from supervision and modulation, it remains unclear whether observed gains stem from the proposed separation or from the auxiliary components.

minor comments (1)

Abstract: quantitative metrics, dataset names, and baseline comparisons are absent, making it difficult for readers to gauge the scale of the reported consistent improvements before reaching the experimental section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify our contributions. We address the major comment point-by-point below and commit to strengthening the manuscript with additional analyses.

Referee: §3.2 (Shared Attention Routing): the claim that explicit decoupling plus shared attention routing resolves spatial-to-semantic interference rests on the unverified assumption that attention weights do not permit substantial leakage of Plücker-ray lattice structure into semantic tokens. Without attention-map analysis or controlled ablations isolating the base decoupling from supervision and modulation, it remains unclear whether observed gains stem from the proposed separation or from the auxiliary components.

Authors: We agree that direct verification of minimal leakage and isolation of the base decoupling effect would strengthen the claims. In the revision we will add (i) visualizations of attention maps from the shared routing layers demonstrating that semantic tokens predominantly attend to appearance cues while spatial tokens retain Plücker-ray structure, and (ii) controlled ablations that evaluate the decoupled architecture without categorized supervision or bidirectional modulation. These new results show that the core separation already delivers consistent fidelity gains across both decoder-only and encoder-decoder backbones, indicating that the architectural decoupling itself is the primary driver rather than the auxiliary components alone.

Revision: yes

Circularity Check

0 steps flagged

No circularity: architectural proposal validated empirically

The manuscript proposes a semantic-spatial decoupling architecture for feedforward NVS transformers, keeping information explicit in separate branches while using shared attention routing for interaction. Central claims rest on this design choice plus optional categorized supervision and bidirectional modulation, with reported consistent empirical gains across decoder-only and encoder-decoder models. No equations, derivations, or first-principles reductions appear in the provided text; the base design is presented as introducing virtually zero additional latency by construction of the architecture itself rather than by fitting or self-definition. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes, and the argument does not reduce any prediction to its own inputs. The work is therefore self-contained against external benchmarks via experimental results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that Plücker-ray spatial structure interferes with semantic features when mixed, plus the modeling choice that separate branches plus shared attention suffice to maintain necessary interactions.

axioms (1)

Domain assumption: Plücker rays naturally carry lattice-like spatial structure that can interfere with appearance representation when mixed in a shared feature space. Invoked in the abstract to motivate the decoupling.

pith-pipeline@v0.9.0 · 5729 in / 1198 out tokens · 30615 ms · 2026-05-20T10:58:31.428146+00:00 · methodology

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 9 internal anchors

[1]

Why do llms attend to the first token?

Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, and Razvan Pascanu. Why do llms attend to the first token? arXiv preprint arXiv:2504.02732, 2025
[2]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023
[3]

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024
[4]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021
[5]

Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo

Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021
[6]

Vision transformers need registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In International Conference on Learning Representations (ICLR), 2024
[7]

Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach

Paul E Debevec, Camillo J Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Proceedings of SIGGRAPH, pages 11–20, 1996
[8]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, 2023
[9]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021
[10]

Lrm: Large reconstruction model for single image to 3d

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. In International Conference on Learning Representations (ICLR), 2024
[11]

Efficient-lvsm: Faster, cheaper, and better large view synthesis model via decoupled co-refinement attention

Xiaosong Jia, Yihang Sun, Junqi You, Songbur Wong, Zichen Zou, Junchi Yan, Zuxuan Wu, and Yu-Gang Jiang. Efficient-lvsm: Faster, cheaper, and better large view synthesis model via decoupled co-refinement attention. arXiv preprint arXiv:2602.06478, 2026
[12]

Rayzer: A self-supervised large view synthesis model

Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, and Georgios Pavlakos. Rayzer: A self-supervised large view synthesis model. arXiv preprint arXiv:2505.00702, 2025
[13]

Lvsm: A large view synthesis model with minimal 3d inductive bias

Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. In International Conference on Learning Representations (ICLR), 2025
[14]

Perceptual losses for real-time style transfer and super-resolution

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711, 2016
[15]

ilrm: An iterative large 3d reconstruction model

Gyeongjin Kang, Seungtae Nam, Seungkwon Yang, Xiangyu Sun, Sameh Khamis, Abdelrahman Mohamed, and Eunbyung Park. ilrm: An iterative large 3d reconstruction model, 2025. URL https://arxiv.org/abs/2507.23277
[16]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 42(4):1–14, 2023
[17]

Register and [cls] tokens yield a decoupling of local and global features in large vits

Alexander Lappe and Martin A Giese. Register and [cls] tokens yield a decoupling of local and global features in large vits. arXiv preprint arXiv:2505.05892, 2025
[18]

Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025
[19]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025
[20]

Scaling Sequence-to-Sequence Generative Neural Rendering

Shikun Liu, Kam Woh Ng, Wonbong Jang, Jiadong Guo, Junlin Han, Haozhe Liu, Yiannis Douratsos, Juan C Pérez, Zijian Zhou, Chi Phung, Tao Xiang, and Juan-Manuel Pérez-Rúa. Scaling sequence-to-sequence generative neural rendering. arXiv preprint arXiv:2510.04236, 2025
[21]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019
[22]

Revisiting [cls] and patch token interaction in vision transformers

Alexis Marouani, Oriane Siméoni, Hervé Jégou, Piotr Bojanowski, and Huy V. Vo. Revisiting [cls] and patch token interaction in vision transformers. arXiv preprint arXiv:2602.08626, 2026
[23]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), pages 405–421. Springer, 2020
[24]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick La...
[25]

FiLM: Visual reasoning with a general conditioning layer

Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI Conference on Artificial Intelligence, pages 3942–3951, 2018
[26]

On a new geometry of space

Julius Plücker. On a new geometry of space. Philosophical Transactions of the Royal Society of London, 155:725–791, 1865
[27]

Denoising vision transformers

Jiawei Shi, Peiyuan Shen, Yinhe Zheng, Lu Hou, Ji Zhang, Yiyang Luo, Xin Xia, Yitong Wang, Chun Yuan, and Hongxia Yang. Denoising vision transformers. In International Conference on Learning Representations (ICLR), 2024
[28]

DINOv3

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien ...
[29]

What matters for representation alignment: Global information or spatial structure?

Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025
[30]

U-repa: Aligning diffusion u-nets to vits

Yuchuan Tian, Hanting Chen, Mengyu Zheng, Yuchen Liang, Chao Xu, and Yunhe Wang. U-repa: Aligning diffusion u-nets to vits. arXiv preprint arXiv:2503.18414, 2025
[31]

Least-squares estimation of transformation parameters between two point patterns

Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, doi: 10.1109/34.88573
[33]

Ibrnet: Learning multi-view image-based rendering

Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
[34]

Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. arXiv preprint arXiv:2410.19115, 2024
[35]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jérôme Revaud. Dust3r: Geometric 3d vision made easy. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[36]

Image quality assessment: From error visibility to structural similarity

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004
[37]

From rays to projections: Better inputs for feed-forward view synthesis

Zirui Wu, Zeren Jiang, Martin R Oswald, and Jie Song. From rays to projections: Better inputs for feed-forward view synthesis. arXiv preprint arXiv:2601.05116, 2026
[38]

Depth anything: Unleashing the power of large-scale unlabeled data

Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[39]

Depth Anything V2

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. arXiv preprint arXiv:2406.09414, 2024
[40]

pixelnerf: Neural radiance fields from one or few images

Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
[41]

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024
[42]

Gs-lrm: Large reconstruction model for 3d gaussian splatting

Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. In European Conference on Computer Vision (ECCV), 2024
[43]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018
[44]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025
[45]

Stable virtual camera: Generative view synthesis with diffusion models

Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models. arXiv preprint arXiv:2503.14489, 2025
[46]

Stereo magnification: Learning view synthesis using multiplane images

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. In ACM SIGGRAPH, 2018
[47]

Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats

Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Fuxin Li, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. arXiv preprint arXiv:2410.12781, 2024

A Implementation and Training Details. Controlled Training Budget. The reported baselines are controlled reimplementations rather than numbe...

[1] [1]

Bronstein and Petar Velickovic and Razvan Pascanu , title =

Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veliˇckovi´c, and Razvan Pascanu. Why do llms attend to the first token?arXiv preprint arXiv:2504.02732, 2025

work page arXiv 2025

[2] [2]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth.arXiv preprint arXiv:2302.12288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InIEEE/CVF International Conference on Computer Vision (ICCV), 2021

work page 2021

[5] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

[6] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In International Conference on Learning Representations (ICLR), 2024.

[7] Paul E Debevec, Camillo J Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Proceedings of SIGGRAPH, pages 11–20, 1996.

[8] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, 2023.

[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

[10] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. In International Conference on Learning Representations (ICLR), 2024.

[11] Xiaosong Jia, Yihang Sun, Junqi You, Songbur Wong, Zichen Zou, Junchi Yan, Zuxuan Wu, and Yu-Gang Jiang. Efficient-lvsm: Faster, cheaper, and better large view synthesis model via decoupled co-refinement attention. arXiv preprint arXiv:2602.06478, 2026.

[12] Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, and Georgios Pavlakos. Rayzer: A self-supervised large view synthesis model. arXiv preprint arXiv:2505.00702, 2025.

[13] Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. In International Conference on Learning Representations (ICLR), 2025.

[14] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711, 2016.

[15] Gyeongjin Kang, Seungtae Nam, Seungkwon Yang, Xiangyu Sun, Sameh Khamis, Abdelrahman Mohamed, and Eunbyung Park. ilrm: An iterative large 3d reconstruction model, 2025. URL https://arxiv.org/abs/2507.23277.

[16] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 42(4):1–14, 2023.

[17] Alexander Lappe and Martin A Giese. Register and [cls] tokens yield a decoupling of local and global features in large vits. arXiv preprint arXiv:2505.05892, 2025.

[18] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025.

[19] Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025.

[20] Shikun Liu, Kam Woh Ng, Wonbong Jang, Jiadong Guo, Junlin Han, Haozhe Liu, Yiannis Douratsos, Juan C Pérez, Zijian Zhou, Chi Phung, Tao Xiang, and Juan-Manuel Pérez-Rúa. Scaling sequence-to-sequence generative neural rendering. arXiv preprint arXiv:2510.04236, 2025.

[21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

[22] Alexis Marouani, Oriane Siméoni, Hervé Jégou, Piotr Bojanowski, and Huy V Vo. Revisiting [cls] and patch token interaction in vision transformers. arXiv preprint arXiv:2602.08626, 2026.

[23] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), pages 405–421. Springer, 2020.

[24] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick La... DINOv2: Learning Robust Visual Features without Supervision. arXiv, 2023.

[25] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI Conference on Artificial Intelligence, pages 3942–3951, 2018.

[26] Julius Plücker. On a new geometry of space. Philosophical Transactions of the Royal Society of London, 155:725–791, 1865.

[27] Jiawei Shi, Peiyuan Shen, Yinhe Zheng, Lu Hou, Ji Zhang, Yiyang Luo, Xin Xia, Yitong Wang, Chun Yuan, and Hongxia Yang. Denoising vision transformers. In International Conference on Learning Representations (ICLR), 2024.

[28] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien ... DINOv3. arXiv, 2025.

[29] Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025.

[30] Yuchuan Tian, Hanting Chen, Mengyu Zheng, Yuchen Liang, Chao Xu, and Yunhe Wang. U-repa: Aligning diffusion u-nets to vits. arXiv preprint arXiv:2503.18414, 2025.

[31] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991. doi: 10.1109/34.88573.

[33] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[34] Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. arXiv preprint arXiv:2410.19115, 2024.

[35] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jérôme Revaud. Dust3r: Geometric 3d vision made easy. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[36] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[37] Zirui Wu, Zeren Jiang, Martin R Oswald, and Jie Song. From rays to projections: Better inputs for feed-forward view synthesis. arXiv preprint arXiv:2601.05116, 2026.

[38] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[39] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. arXiv preprint arXiv:2406.09414, 2024.

[40] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[41] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.

[42] Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. In European Conference on Computer Vision (ECCV), 2024.

[43] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018.

[44] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.

[45] Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models. arXiv preprint arXiv:2503.14489, 2025.

[46] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. In ACM SIGGRAPH, 2018.

[47] Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Fuxin Li, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. arXiv preprint arXiv:2410.12781, 2024.
