RayFormer: Modeling Inter- and Intra-Ray Similarity for NeRF-Based Video Snapshot Compressive Imaging
Pith reviewed 2026-05-07 07:20 UTC · model grok-4.3
The pith
A transformer that models similarities between neighboring rays and along each ray improves NeRF reconstruction of videos from single-shot compressive measurements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We first propose a patch-level ray sampling strategy to enable the modeling of content structure. Then, we propose an Inter- and Intra-Ray Transformer (RayFormer) to capture the structural similarities, modeling both inter-ray similarities among spatially neighboring points at the same depth and intra-ray correlations between adjacent points along the viewing ray. Finally, benefiting from the patch-level sampling strategy, the total variation prior is incorporated into the objective function to enhance spatial smoothness and suppress artifacts.
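As a rough illustration of how patch-level sampling differs from random sampling, the sketch below draws pixel coordinates both ways. The function names, the patch size, and the uniform choice of patch origin are assumptions made for illustration. This page does not describe the paper's actual sampling procedure in that detail.

```python
import random

def sample_random_pixels(height, width, n, rng):
    """Baseline: n pixel coordinates drawn independently and uniformly."""
    return [(rng.randrange(height), rng.randrange(width)) for _ in range(n)]

def sample_patch_pixels(height, width, patch, rng):
    """Hypothetical patch-level sampling: one contiguous patch x patch block.

    Rays cast through these pixels are spatial neighbours, so a model can
    compare them. The uniform choice of the top-left corner is an assumption.
    """
    top = rng.randrange(height - patch + 1)
    left = rng.randrange(width - patch + 1)
    return [(top + dy, left + dx) for dy in range(patch) for dx in range(patch)]

rng = random.Random(0)
random_pixels = sample_random_pixels(64, 64, 16, rng)  # scattered pixels
patch_pixels = sample_patch_pixels(64, 64, 4, rng)     # one 4 x 4 block
```

Both calls return 16 pixels, but only the patch variant guarantees that each sampled pixel has sampled neighbours, which is what the attention and total variation terms rely on.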
What carries the argument
RayFormer, a transformer that jointly attends to inter-ray similarities among spatially neighboring points at the same depth and intra-ray correlations along individual viewing rays, made possible by patch-level rather than random ray sampling.
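The two attention directions can be pictured with a minimal sketch, assuming features are stored as `feats[ray][depth]`. Everything here is illustrative rather than the paper's architecture: the attention is single-head with identity query/key/value projections, the intra-ray pass attends over all samples on a ray rather than only adjacent ones, and the inter-then-intra ordering is an arbitrary choice.

```python
import math

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity Q/K/V.

    tokens: list of equal-length feature vectors. Learned projections are
    omitted to keep the sketch short (an assumption, not the paper's design).
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[j] for w, v in zip(weights, tokens))
            for j in range(dim)
        ])
    return out

def inter_ray_attention(feats):
    """Attend across rays, separately at each depth index."""
    n_rays, n_depth = len(feats), len(feats[0])
    out = [[None] * n_depth for _ in range(n_rays)]
    for d in range(n_depth):
        mixed = self_attention([feats[r][d] for r in range(n_rays)])
        for r in range(n_rays):
            out[r][d] = mixed[r]
    return out

def intra_ray_attention(feats):
    """Attend along each ray, across its depth samples."""
    return [self_attention(ray) for ray in feats]

# 4 rays from one patch, 3 depth samples each, 2 feature channels.
feats = [[[float(r + d), float(r - d)] for d in range(3)] for r in range(4)]
out = intra_ray_attention(inter_ray_attention(feats))
```

The output keeps the `[ray][depth][channel]` layout of the input, so the two passes could be stacked or interleaved in either order.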
If this is right
- Patch-level sampling makes local structural patterns available for attention, enabling the model to exploit content correlations that random sampling ignores.
- Modeling both inter-ray and intra-ray relations together produces higher-fidelity reconstructions of dynamic scenes than methods that treat rays independently.
- Incorporating the total variation prior on the sampled patches reduces spatial artifacts while preserving motion detail.
- The resulting pipeline reaches state-of-the-art performance on both simulated and real-world video snapshot compressive imaging benchmarks.
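The link between patch sampling and the total variation prior can be made concrete: TV compares adjacent pixels, so it can only be evaluated when the sampled pixels form a contiguous patch. The sketch below assumes the anisotropic (L1) form of TV and a hypothetical weight `tv_weight`. This page does not state which form or weight the paper uses.

```python
def total_variation(patch):
    """Anisotropic TV: sum of absolute differences between adjacent pixels.

    patch: 2D list (rows x cols) of rendered intensities for one patch.
    """
    rows, cols = len(patch), len(patch[0])
    vertical = sum(abs(patch[i + 1][j] - patch[i][j])
                   for i in range(rows - 1) for j in range(cols))
    horizontal = sum(abs(patch[i][j + 1] - patch[i][j])
                     for i in range(rows) for j in range(cols - 1))
    return vertical + horizontal

def objective(reconstruction_loss, patch, tv_weight=0.01):
    """Hypothetical combined objective: data term plus weighted TV."""
    return reconstruction_loss + tv_weight * total_variation(patch)

flat = [[0.5, 0.5], [0.5, 0.5]]    # perfectly smooth patch, TV is zero
noisy = [[0.0, 1.0], [1.0, 0.0]]   # checkerboard patch, TV is large
```

A larger `tv_weight` penalizes the checkerboard patch more heavily, which is the same mechanism that could over-smooth genuine high-frequency detail.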
Where Pith is reading between the lines
- The same inter- and intra-ray attention pattern could be applied to other ray-based rendering tasks, such as light-field or plenoptic video reconstruction, where neighboring rays share similar scene content.
- Structured patch sampling may prove beneficial in additional compressive-sensing settings beyond SCI, suggesting that random sampling is often suboptimal when scene geometry is locally coherent.
- Because the method is built on top of existing NeRF pipelines, it can be combined with future improvements in radiance-field representations without redesigning the core sampling and attention logic.
Load-bearing premise
That patch-level ray sampling combined with the specific inter- and intra-ray attention in RayFormer will reliably extract scene structure more effectively than random sampling, and that adding the total variation term will improve quality without introducing bias or new artifacts.
What would settle it
An ablation experiment in which random sampling plus a standard transformer replaces the patch sampling and RayFormer, yet still matches or exceeds the reported PSNR and SSIM on the same simulated and real test sets, would show the proposed similarity modeling is not required for the claimed gains.
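For reference, PSNR, one of the two metrics named above, can be computed as in the following sketch. It assumes images are 2D lists with intensities in [0, 1]. The function name and the `peak` parameter are illustrative, and SSIM is not shown.

```python
import math

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized 2D images."""
    squared = [(r - e) ** 2
               for ref_row, est_row in zip(reference, estimate)
               for r, e in zip(ref_row, est_row)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy images for illustration only, not data from the paper.
reference = [[0.5, 0.5], [0.5, 0.5]]
estimate = [[0.6, 0.6], [0.6, 0.6]]
score = psnr(reference, estimate)
```

In the ablation described above, the same function would be applied to the outputs of both configurations on identical test scenes.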
Video snapshot compressive imaging (SCI) enables the reconstruction of dynamic scenes from a single snapshot measurement. Recently, NeRF-based methods have shown promising reconstruction performance. However, such methods typically adopt random ray sampling strategies and fail to capture content structural similarities, resulting in limited reconstruction quality. To address these issues, we first propose a patch-level ray sampling strategy to enable the modeling of content structure. Then, we propose an Inter- and Intra-Ray Transformer (RayFormer) to capture the structural similarities, modeling both inter-ray similarities among spatially neighboring points at the same depth and intra-ray correlations between adjacent points along the viewing ray. Finally, benefiting from the patch-level sampling strategy, the total variation prior is incorporated into the objective function to enhance spatial smoothness and suppress artifacts. Experiments in both simulated and real-world scenes demonstrate that the proposed method achieves state-of-the-art (SOTA) reconstruction performance.
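The single snapshot measurement mentioned in the abstract is commonly modeled as a sum of mask-modulated frames. The sketch below implements that standard noise-free forward model with plain lists. The function name and the toy frames and masks are assumptions for illustration, not data or code from the paper.

```python
def sci_measure(frames, masks):
    """Noise-free video SCI forward model: Y = sum_t M_t * X_t (elementwise).

    frames, masks: lists of T images, each a 2D list of the same size.
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    snapshot = [[0.0] * cols for _ in range(rows)]
    for frame, mask in zip(frames, masks):
        for i in range(rows):
            for j in range(cols):
                snapshot[i][j] += mask[i][j] * frame[i][j]
    return snapshot

# Two 2 x 2 frames compressed into one 2 x 2 snapshot.
frames = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
masks = [[[1, 0], [0, 1]], [[0, 1], [1, 1]]]
snapshot = sci_measure(frames, masks)
```

Reconstruction is the inverse problem: recovering all T frames from `snapshot` and the known masks, which is under-determined and therefore depends on priors such as the scene representation and the smoothness term.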
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RayFormer for NeRF-based video snapshot compressive imaging (SCI). It introduces a patch-level ray sampling strategy to enable modeling of content structural similarities, an Inter- and Intra-Ray Transformer (RayFormer) that captures inter-ray similarities among spatially neighboring points at the same depth and intra-ray correlations between adjacent points along each viewing ray, and incorporates a total variation prior into the objective function to enhance spatial smoothness and suppress artifacts. Experiments on simulated and real-world scenes are claimed to achieve state-of-the-art reconstruction performance.
Significance. If the empirical results hold, the work could advance NeRF-based SCI by replacing random ray sampling with structured patch sampling and geometry-aware attention, addressing a recognized limitation in prior methods that under-exploit ray similarities. The combination of transformer modeling with TV regularization is a plausible extension, but significance depends on whether gains are attributable to the proposed components rather than capacity or optimization details.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The SOTA claim is central to the paper but the abstract supplies no quantitative metrics, error bars, ablation studies, or dataset details. The experiments section must include direct comparisons (e.g., PSNR/SSIM tables) against recent NeRF-SCI baselines with statistical significance to substantiate the claim.
- [§3.2] §3.2 (Patch-level ray sampling): The strategy is presented as necessary to capture structural similarities, yet the manuscript must include an ablation comparing patch-level sampling directly to standard random sampling (the unbiased baseline in NeRF). Without this, it remains unclear whether the restriction improves structure capture or introduces spatial correlation that slows convergence or leaves regions under-sampled; this is load-bearing for the central claim.
- [§4.3] §4.3 (Real-world experiments): The total variation prior is added to suppress compressive artifacts, but in real dynamic scenes lacking ground truth the paper should quantify risks of over-smoothing high-frequency or temporally varying detail. Visual inspection alone is insufficient to rule out bias introduced by the prior, which could undermine the reported SOTA gains.
minor comments (3)
- [§2] §2 (Related work): Additional citations to recent transformer-based NeRF variants and SCI reconstruction methods would better position the contribution and avoid potential gaps in the literature review.
- [Figure 1] Figure 1 (Architecture diagram): The inter-ray and intra-ray attention blocks would benefit from explicit labeling of query/key/value definitions and how patch sampling feeds into the transformer to improve clarity.
- [§3.1] Notation in §3.1: The definitions of ray points, depth sampling, and the combined loss (including the TV term) could be made more precise with an additional equation or table summarizing the symbols.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
- Referee: [Abstract and §4] Abstract and §4 (Experiments): The SOTA claim is central to the paper but the abstract supplies no quantitative metrics, error bars, ablation studies, or dataset details. The experiments section must include direct comparisons (e.g., PSNR/SSIM tables) against recent NeRF-SCI baselines with statistical significance to substantiate the claim.
  Authors: We agree that the abstract should explicitly report key quantitative results to support the SOTA claim. We will revise the abstract to include average PSNR and SSIM improvements over the compared NeRF-SCI baselines. In §4, direct PSNR/SSIM tables against recent baselines are already present for both simulated and real scenes; we will add per-scene standard deviations (error bars) across multiple runs and include a statistical significance analysis (e.g., paired t-tests) to substantiate the reported gains. Dataset details appear in §4.1 but will be cross-referenced more clearly in the abstract and tables.
  Revision: yes
- Referee: [§3.2] §3.2 (Patch-level ray sampling): The strategy is presented as necessary to capture structural similarities, yet the manuscript must include an ablation comparing patch-level sampling directly to standard random sampling (the unbiased baseline in NeRF). Without this, it remains unclear whether the restriction improves structure capture or introduces spatial correlation that slows convergence or leaves regions under-sampled; this is load-bearing for the central claim.
  Authors: We acknowledge that a direct ablation against the standard random ray sampling baseline is necessary to isolate the benefit of patch-level sampling. The current manuscript demonstrates the end-to-end gains of RayFormer but does not isolate this component. We will add a dedicated ablation study in the revised §4 that compares patch-level sampling versus random sampling under otherwise identical conditions, reporting reconstruction PSNR/SSIM, convergence behavior, and qualitative sampling coverage to address concerns about spatial correlation or under-sampling.
  Revision: yes
- Referee: [§4.3] §4.3 (Real-world experiments): The total variation prior is added to suppress compressive artifacts, but in real dynamic scenes lacking ground truth the paper should quantify risks of over-smoothing high-frequency or temporally varying detail. Visual inspection alone is insufficient to rule out bias introduced by the prior, which could undermine the reported SOTA gains.
  Authors: We agree that visual inspection alone is limited for real scenes without ground truth. We will expand §4.3 with an ablation varying the TV weight, presenting side-by-side reconstructions with and without the prior to illustrate preservation of high-frequency and temporal details. We will also add a discussion of how the regularization strength is selected to balance artifact suppression against over-smoothing. While fully quantitative metrics for over-smoothing are not possible without ground truth, these controlled ablations and qualitative evidence will provide stronger substantiation that the prior does not introduce systematic bias.
  Revision: partial
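The paired t-test mentioned in the first response compares per-scene scores of two methods on the same scenes. The sketch below computes only the t statistic and degrees of freedom using the standard library. The function name is hypothetical, and the PSNR values are made up for illustration and are not results from the paper.

```python
import math
import statistics

def paired_t_statistic(ours, baseline):
    """t statistic and degrees of freedom for paired per-scene scores."""
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    spread = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean / (spread / math.sqrt(n)), n - 1

# Made-up PSNR values (dB) for four scenes; illustration only.
ours = [31.0, 33.0, 30.0, 34.0]
baseline = [30.0, 31.0, 29.5, 32.5]
t_stat, dof = paired_t_statistic(ours, baseline)
```

Turning `t_stat` into a p-value requires the CDF of Student's t distribution with `dof` degrees of freedom, which a statistics library would normally supply. With only six synthetic scenes, such a test would have limited power.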
Circularity Check
No circularity: architectural proposal with independent empirical claims
The paper introduces a patch-level ray sampling strategy, a RayFormer transformer module for inter- and intra-ray attention, and a total-variation term enabled by the sampling choice. These are presented as design decisions whose value is assessed via reconstruction experiments on simulated and real scenes. No equations, uniqueness theorems, or first-principles derivations appear that reduce the claimed performance gain to a fitted parameter, self-definition, or self-citation chain. The central modeling claim (that the proposed attention captures structural similarities better than random sampling) is an empirical hypothesis, not a tautological restatement of inputs. Self-citations are absent from the provided text, and the method remains falsifiable against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: NeRF-based methods can represent dynamic scenes from single snapshot compressive measurements
invented entities (1)
- RayFormer: no independent evidence
Reference graph
Works this paper leans on
- [1] INTRODUCTION. Video Snapshot Compressive Imaging (SCI) [1, 2] has emerged as a promising computational imaging paradigm that enables the acquisition of high-speed video through a single 2D measurement. By encoding temporal information into spatially multiplexed patterns via designed coded masks, such as modulated patterns across time, video SCI effectively...
- [2] METHODOLOGY. 2.1. Preliminaries. 2.1.1. Imaging Model of Video SCI. In video SCI, the high-dimensional spatio-temporal information of a scene is compressed into a single two-dimensional measurement. This acquisition process is physically realized by employing programmable optical devices, most commonly Digital Micromirror Devices (DMD) or liquid crystal-b...
- [3] EXPERIMENTS. 3.1. Experimental Settings. Datasets. Following [8], we evaluate on six synthetic scenes: Airplants [16], Hotdog [17], Cozy2room, Tanabata, Factory, and Vendor [18]. To assess generalization, we further test on real-world SCI data captured by the setup in [8]. Compared methods and evaluation metrics. We compare with several SOTA SCI reconstru...
- [4] CONCLUSION. In this paper, we proposed patch-level ray sampling and the Inter- and Intra-Ray Transformer (RayFormer) to capture content structural similarities for NeRF-based Video SCI. Additionally, benefiting from the patch-level sampling strategy, we incorporated the total variation prior into the objective function to enhance spatial smoothness and r...
- [5] Patrick Llull, Xuejun Liao, Xin Yuan, Jianbo Yang, David Kittle, Lawrence Carin, Guillermo Sapiro, and David J. Brady, “Coded aperture compressive temporal imaging,” Opt. Express, vol. 21, no. 9, pp. 10526–10545, May 2013.
- [6] Xin Yuan, David J Brady, and Aggelos K Katsaggelos, “Snapshot compressive imaging: Theory, algorithms, and applications,” IEEE Signal Processing Magazine, vol. 38, no. 2, pp. 65–88, 2021.
- [7] Xin Yuan, “Generalized alternating projection based total variation minimization for compressive sensing,” in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 2539–2543.
- [8] Yang Liu, Xin Yuan, Jinli Suo, David J Brady, and Qionghai Dai, “Rank minimization for snapshot compressive imaging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 12, pp. 2990–3006, 2018.
- [9] Xin Yuan, Yang Liu, Jinli Suo, and Qionghai Dai, “Plug-and-play algorithms for large-scale snapshot compressive imaging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1447–1457.
- [10] Lishun Wang, Miao Cao, and Xin Yuan, “EfficientSCI: Densely connected network with space-time factorization for large-scale video snapshot compressive imaging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18477–18486.
- [11] Lishun Wang, Miao Cao, Yong Zhong, and Xin Yuan, “Spatial-temporal transformer for video snapshot compressive imaging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 7, pp. 9072–9089, 2022.
- [12] Yunhao Li, Xiaodong Wang, Ping Wang, Xin Yuan, and Peidong Liu, “SCINeRF: Neural radiance fields from a snapshot compressive image,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10542–10552.
- [13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [14] Yubo Dong, Dahua Gao, Tian Qiu, Yuyan Li, Minxi Yang, and Guangming Shi, “Residual degradation learning unfolding framework with mixing priors across spectral and spatial for compressive spectral imaging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22262–22271.
- [15] Tao Huang, Xin Yuan, Weisheng Dong, Jinjian Wu, and Guangming Shi, “Deep Gaussian scale mixture prior for image reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10778–10794, 2023.
- [16] Yubo Dong, Dahua Gao, Danhua Liu, Yanli Liu, and Guangming Shi, “Alternating direction unfolding with a cross spectral attention prior for dual-camera compressive hyperspectral imaging,” IEEE Transactions on Image Processing, vol. 34, pp. 5325–5340, 2025.
- [17] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey, “BARF: Bundle-adjusting neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5741–5751.
- [18] Peng Wang, Lingzhe Zhao, Ruijie Ma, and Peidong Liu, “BAD-NeRF: Bundle adjusted deblur neural radiance fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4170–4179.
- [19] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu, “NeRF--: Neural radiance fields without known camera parameters,” arXiv preprint arXiv:2102.07064, 2021.
- [20] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar, “Local light field fusion: Practical view synthesis with prescriptive sampling guidelines,” ACM Transactions on Graphics (ToG), vol. 38, no. 4, pp. 1–14, 2019.
- [21] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng, “NeRF: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
- [22] Li Ma, Xiaoyu Li, Jing Liao, Qi Zhang, Xuan Wang, Jue Wang, and Pedro V Sander, “Deblur-NeRF: Neural radiance fields from blurry images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12861–12870.