
arxiv: 2605.00052 · v1 · submitted 2026-04-29 · 💻 cs.CV

Two-View Accumulation as the Primary Training Lever for Hybrid-Capture Gaussian Splatting: A Variance-Decomposition View of When Gradient Surgery Helps

Pith reviewed 2026-05-09 19:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords hybrid capture · 3D Gaussian Splatting · novel view synthesis · gradient accumulation · variance decomposition · training dynamics · bimodal cameras · gradient surgery

The pith

Rendering two views per optimizer step closes the 1-3 dB gap in hybrid-capture 3D Gaussian Splatting while gradient surgery adds nothing beyond seed variance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks why 3D Gaussian Splatting under-fits scenes that mix distant aerial and close ground-level cameras. It compares several compute-matched training changes and finds that accumulating gradients from two rendered views per optimizer step recovers most of the lost performance. A variance-decomposition argument explains the result: the dominant effect is the halving of overall gradient variance, not any particular pairing of near and far views. The same structural change improves two other Gaussian splatting backbones without further tuning.
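
The variance-halving effect can be illustrated with a toy simulation. This is a minimal sketch, not the paper's training loop: `per_view_gradient` is a hypothetical stand-in that draws a scalar "gradient" from a unit Gaussian, and averaging the two views (rather than summing them) is an assumption made here so that the step size stays comparable.

```python
import random
import statistics


def per_view_gradient(rng):
    """Hypothetical stand-in for one rendered view's gradient on one parameter."""
    return rng.gauss(0.0, 1.0)


def one_view_step(rng):
    # Baseline: one rendered view per optimizer step.
    return per_view_gradient(rng)


def two_view_step(rng):
    # Two-view accumulation (assumed to average): combine the gradients of
    # two rendered views before a single optimizer update.
    return 0.5 * (per_view_gradient(rng) + per_view_gradient(rng))


def step_variance(step_fn, n_steps=20000, seed=0):
    """Empirical variance of the per-step gradient estimate."""
    rng = random.Random(seed)
    return statistics.pvariance([step_fn(rng) for _ in range(n_steps)])


var_one = step_variance(one_view_step)
var_two = step_variance(two_view_step)
ratio = var_two / var_one  # close to 0.5 for independent views
```

For independent views the ratio sits near one half regardless of how the two views are chosen, which is the baseline against which any structured pairing has to show an additional gain.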

Core claim

Standard 3DGS training with one view per step under-fits the minority camera regime by 1-3 dB on five hybrid-capture benchmarks. Among compute-matched alternatives, two-view accumulation per step outperforms vanilla 60K-iteration training, GradNorm, direction-aware near/far surgery, projective preconditioning, and confidence-gated surgery. Random, geometry-defined, and loss-disparity pairings all produce PSNR values within seed variance of one another. The variance-decomposition framework attributes this equivalence to between-regime gradient variance being small relative to within-regime variance, so the variance reduction from two-view accumulation is the dominant lever.
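
The equivalence argument can be written out with the law of total variance. The following is a scalar sketch under stated assumptions, not the paper's own derivation: the symbols $p$, $\mu_r$, $\sigma_r^2$, $\sigma_w^2$, and $\sigma_b^2$ are introduced here for illustration, views are assumed independent, and the structured-pairing line assumes equal regime proportions ($p = 1/2$).

```latex
% g: one gradient component from a sampled view; r \in \{\text{near}, \text{far}\} its regime.
% p = \Pr(r = \text{near}), \quad \mu_r = \mathbb{E}[g \mid r], \quad \sigma_r^2 = \operatorname{Var}(g \mid r).
\begin{align}
  \sigma_w^2 &= p\,\sigma_{\text{near}}^2 + (1-p)\,\sigma_{\text{far}}^2
      && \text{(within-regime)} \\
  \sigma_b^2 &= p(1-p)\,(\mu_{\text{near}} - \mu_{\text{far}})^2
      && \text{(between-regime)} \\
  \operatorname{Var}(g) &= \sigma_w^2 + \sigma_b^2
      && \text{(one view per step)} \\
  \operatorname{Var}\!\left(\tfrac{1}{2}(g_1 + g_2)\right) &= \tfrac{1}{2}\left(\sigma_w^2 + \sigma_b^2\right)
      && \text{(random two-view pairing)} \\
  \operatorname{Var}\!\left(\tfrac{1}{2}(g_{\text{near}} + g_{\text{far}})\right)
      &= \tfrac{1}{4}\left(\sigma_{\text{near}}^2 + \sigma_{\text{far}}^2\right)
       = \tfrac{1}{2}\,\sigma_w^2
      && \text{(structured pairing, } p = \tfrac{1}{2}\text{)}
\end{align}
```

Under these assumptions the structured pairing improves on random pairing only by $\sigma_b^2 / 2$, so when $\sigma_b^2 \ll \sigma_w^2$ both pairings deliver essentially the same halving relative to one view per step.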

What carries the argument

Two-view accumulation per optimizer step, whose benefit is isolated by a variance-decomposition framework that compares within-regime and between-regime gradient variance components.

Load-bearing premise

Between-regime gradient variance remains small relative to within-regime variance under bimodal camera distributions in 3DGS.
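
One way such a premise could be checked is to split logged per-view gradient samples by regime and compare the two variance components. The sketch below is an assumption about how that measurement might look, not the paper's procedure: `variance_components` is a hypothetical helper, it works on a single scalar gradient component, and the sample values are placeholders chosen only to exercise the code.

```python
import statistics


def variance_components(grads_by_regime):
    """Split the total variance of scalar gradient samples into
    within-regime and between-regime parts (law of total variance).

    grads_by_regime maps a regime label to a list of scalar samples.
    """
    all_samples = [g for samples in grads_by_regime.values() for g in samples]
    total_n = len(all_samples)
    grand_mean = statistics.fmean(all_samples)
    within = 0.0
    between = 0.0
    for samples in grads_by_regime.values():
        weight = len(samples) / total_n  # empirical regime proportion
        within += weight * statistics.pvariance(samples)
        between += weight * (statistics.fmean(samples) - grand_mean) ** 2
    return within, between


# Placeholder samples for illustration only, not measurements from the paper.
samples = {
    "near": [0.9, 1.1, -1.0, -0.8, 0.2],
    "far": [1.0, -1.2, 0.1, 0.3, -0.4],
}
within, between = variance_components(samples)
total = statistics.pvariance(samples["near"] + samples["far"])
```

The two components sum to the total variance by construction, so the premise reduces to the ratio `between / within` being small.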

What would settle it

A compute-matched controlled experiment on a new hybrid scene in which structured near/far pairing beats random two-view pairing in PSNR by more than seed variance would contradict the pairing-equivalence claim.
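
A comparison "beyond seed variance" needs an explicit decision rule. The sketch below shows one possible rule, assuming equal numbers of seeds per method: the helper name `exceeds_seed_variance` and the threshold `k = 2.0` are assumptions made for illustration, and the per-seed PSNR lists are made-up numbers, not results from the paper.

```python
import math
import statistics


def exceeds_seed_variance(psnr_a, psnr_b, k=2.0):
    """Return True if the mean PSNR gap exceeds k pooled seed standard deviations.

    k is an illustrative threshold (an assumption), not a value from the paper.
    """
    gap = abs(statistics.fmean(psnr_a) - statistics.fmean(psnr_b))
    pooled_std = math.sqrt(
        0.5 * (statistics.variance(psnr_a) + statistics.variance(psnr_b))
    )
    return gap > k * pooled_std


# Made-up per-seed PSNR values in dB, for illustration only.
structured_pairing = [27.1, 27.3, 26.9]
random_pairing = [27.0, 27.2, 27.1]
one_view = [25.0, 25.2, 24.9]

pairing_matters = exceeds_seed_variance(structured_pairing, random_pairing)
two_views_matter = exceeds_seed_variance(random_pairing, one_view)
```

With only a handful of seeds such a rule is coarse; reporting the per-method standard deviations alongside the means lets readers apply their own threshold.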

Figures

Figures reproduced from arXiv: 2605.00052 by Sungjun Cho.

Figure 1: Hybrid-capture scenes require one Gaussian representation to support near views demanding …
Figure 2: Overview of CrossGrad-GS. Given hybrid-capture cameras, we first split training views into near and far groups by camera distance and sample one paired near/far view at each iteration. Both views are rendered with the same unchanged 3DGS representation, but their gradients are computed separately on the shared Gaussian parameters. When the near and far gradients conflict (g_near^⊤ g_far < 0), CrossGrad-GS ap…
Figure 3: Training-only CrossGrad-GS recovers near/far reconstruction on the headline scenes. Using only altitude-balanced sampling and symmetric cross-altitude gradient projection on the Vanilla backbone, CrossGrad-GS recovers the missing altitude regime on (a) MatrixCity and (b) HorizonGS Road. (c) UC-GS SF is included as a limitation case: when the default Euclidean grouping does not align with the visual-scale r…
Figure 4: Diagnostic prediction versus measured gradient ratios across hybrid-capture benchmarks …
Figure 5: Gradient conflict rate during training on HorizonGS Road. A substantial fraction of shared …
Figure 6: Gradient magnitude ratio trajectories on HorizonGS Road. Vanilla training can move far …
Figure 7: Evidence for the empirical observation that larger camera-distance variance is associated …
Figure 8: Qualitative comparison on HorizonGS Road using the primary training-only CrossGrad-GS …
Figure 9: Backbone-agnostic qualitative comparison on Scaffold-GS. Applying the same near/far …
Figure 10: Optional distance-conditioned extension. This figure uses CrossGrad-GS with a distance …
Figure 11: Additional rendering views from the optional distance-conditioned extension on Hori…
Original abstract

Hybrid-capture novel view synthesis combines images at substantially different camera distances (e.g., aerial drone and ground-level views). Standard 3D Gaussian Splatting (3DGS), trained for 30K iterations with one rendered view per optimizer step, under-fits the minority regime by 1-3 dB on five hybrid-capture benchmarks. We isolate the lever that closes this gap. Among compute-matched alternatives -- vanilla 60K iterations, magnitude corrections (GradNorm), direction-aware near/far gradient surgery, projective preconditioning, confidence-gated sample-level surgery, and a random two-view-per-step control -- the simplest structural change wins: rendering two views per optimizer step. The pairing rule (geometry-defined near/far, random, or active loss-disparity) does not change PSNR beyond seed variance on any of the five scenes; the structural change of having two views per step does. We propose a variance-decomposition framework that predicts and explains this finding: under bimodal camera regimes, between-regime gradient variance turns out to be small relative to within-regime variance in 3DGS, so structured and random pairings are variance-equivalent in expectation, and the variance halving from two-view accumulation itself is the dominant effect. We verify the framework on five scenes whose camera-altitude bimodality coefficients span [0.55, 1.00], and we report the negative result that direction-aware projection, magnitude correction, confidence gating, and an active loss-disparity pairing all fall within seed variance of random two-view pairing. The two-view structural lever transfers cleanly to the Scaffold-GS and Pixel-GS backbones. We position this work as an honest characterization of which training-side axes do and do not move PSNR for hybrid-capture 3DGS, together with the framework that explains why.
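
For readers unfamiliar with direction-aware gradient surgery, the sketch below contrasts a PCGrad-style symmetric projection with plain two-view accumulation. It is an illustration under assumptions: the conflict test (negative inner product of near and far gradients) follows the Figure 2 caption, but the exact projection used by CrossGrad-GS is truncated on this page, so `project_out` and `symmetric_surgery` are hypothetical helpers, and gradients are shown as short Python lists instead of tensors.

```python
def dot(u, v):
    """Inner product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(g, ref):
    """Remove from g its component along ref, but only when they conflict."""
    d = dot(g, ref)
    if d >= 0.0:
        return list(g)  # no conflict: leave the gradient unchanged
    scale = d / dot(ref, ref)
    return [gi - scale * ri for gi, ri in zip(g, ref)]


def symmetric_surgery(g_near, g_far):
    """PCGrad-style symmetric projection, then average the two results."""
    near_fixed = project_out(g_near, g_far)
    far_fixed = project_out(g_far, g_near)
    return [0.5 * (a + b) for a, b in zip(near_fixed, far_fixed)]


def plain_accumulation(g_near, g_far):
    """Two-view accumulation without any surgery (assumed to average)."""
    return [0.5 * (a + b) for a, b in zip(g_near, g_far)]


conflicting = symmetric_surgery([1.0, 0.0], [-1.0, 1.0])
agreeing = symmetric_surgery([1.0, 0.0], [1.0, 1.0])
```

When the two gradients do not conflict, surgery reduces to plain accumulation, so the two methods differ only on the conflicting steps.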

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that for hybrid-capture novel view synthesis with 3D Gaussian Splatting, the dominant training improvement (closing a 1-3 dB gap on minority camera regimes) is achieved simply by rendering and accumulating two views per optimizer step. Among compute-matched alternatives including longer training, GradNorm, direction-aware gradient surgery, projective preconditioning, confidence-gated surgery, and active loss-disparity pairing, only the two-view structural change matters; pairing rules (geometry-defined, random, or active) produce PSNR differences within seed variance across five scenes with bimodality coefficients spanning [0.55, 1.00]. A variance-decomposition framework is proposed to explain the result by showing that between-regime gradient variance is small relative to within-regime variance, rendering structured and random pairings equivalent in expectation. The two-view lever transfers to Scaffold-GS and Pixel-GS backbones, and the work emphasizes negative results on more complex interventions.

Significance. If the empirical findings hold, the paper delivers a high-impact practical insight: a minimal structural change in the training loop suffices where more elaborate gradient-manipulation techniques do not. The explicit negative results on multiple alternatives are valuable for the field, as they discourage over-engineering of training procedures for bimodal capture. The variance-decomposition view supplies a principled explanation that could generalize to other multi-regime optimization settings in computer vision and graphics, while the clean transfer to two additional backbones strengthens the claim that the lever is not 3DGS-specific.

major comments (1)
  1. Variance-decomposition framework (methods/experiments): the claim that between-regime gradient variance is small relative to within-regime variance (and therefore predicts pairing equivalence) is load-bearing for the explanatory contribution; the manuscript should explicitly state whether these variance terms are estimated from held-out independent runs or derived from the same training trajectories that produce the reported PSNR tables, as post-hoc measurement on the result runs would weaken the predictive status of the framework.
minor comments (2)
  1. Abstract and §4 (results): the statement that 'pairing rule does not change PSNR beyond seed variance' is repeated for all five scenes; adding a short table or inline report of per-method standard deviations across the N seeds used would allow readers to directly verify the equivalence claim without needing to assume the magnitude of seed variance.
  2. The bimodality coefficient range [0.55, 1.00] is given without a reference or short derivation; a one-sentence definition or citation to the measure used would improve reproducibility for readers wishing to apply the same scene selection criterion.
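
As an illustration of the kind of definition the second minor comment asks for, one widely used measure is Sarle's bimodality coefficient, (skewness² + 1) / kurtosis. Whether the paper uses this measure is not stated on this page, so the sketch below is an assumption; it uses plain population moments without finite-sample correction, and the altitude values are hypothetical.

```python
import statistics


def bimodality_coefficient(values):
    """Sarle's bimodality coefficient from population moments:
    (skewness^2 + 1) / kurtosis, with non-excess kurtosis."""
    mean = statistics.fmean(values)
    centered = [v - mean for v in values]
    m2 = statistics.fmean([c ** 2 for c in centered])
    m3 = statistics.fmean([c ** 3 for c in centered])
    m4 = statistics.fmean([c ** 4 for c in centered])
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return (skewness ** 2 + 1.0) / kurtosis


# Hypothetical camera altitudes (metres): two tight clusters vs. an even spread.
two_clusters = [10.0, 10.0, 10.0, 100.0, 100.0, 100.0]
even_spread = [float(h) for h in range(0, 101)]

bc_two_clusters = bimodality_coefficient(two_clusters)
bc_even_spread = bimodality_coefficient(even_spread)
```

Under this definition a perfectly two-valued distribution scores 1 and a uniform spread scores about 5/9, which is the conventional cutoff above which a distribution is read as bimodal.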

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the positive overall assessment of our work. We are pleased that the significance of the two-view accumulation finding and the negative results on alternative interventions are recognized. Below we provide a point-by-point response to the major comment.

  1. Referee: Variance-decomposition framework (methods/experiments): the claim that between-regime gradient variance is small relative to within-regime variance (and therefore predicts pairing equivalence) is load-bearing for the explanatory contribution; the manuscript should explicitly state whether these variance terms are estimated from held-out independent runs or derived from the same training trajectories that produce the reported PSNR tables, as post-hoc measurement on the result runs would weaken the predictive status of the framework.

    Authors: We agree with the referee that the source of the variance estimates should be made explicit for full transparency. The between- and within-regime gradient variances are computed from the gradient statistics collected during the same training runs that yield the reported PSNR values. This is because the framework is designed as an explanatory tool to account for the empirical observation that pairing strategy does not affect performance beyond seed variance. To strengthen the manuscript, we will revise the methods section to clearly state this and include a short justification that the consistency of the variance ratio across independent seeds and scenes supports the framework's validity even when measured on the optimization trajectories. We believe this addresses the concern without weakening the contribution, as the framework is verified by its ability to explain the observed results across multiple scenes. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claims rest on direct experimental comparisons


The paper's primary result—that two-view accumulation per optimizer step closes the hybrid-capture PSNR gap while compute-matched alternatives (including pairing rules, gradient surgery variants, and preconditioning) fall within seed variance—is supported by explicit negative results across five scenes and transfer to Scaffold-GS/Pixel-GS. The variance-decomposition framework is offered as a post-experiment explanation for why between-regime gradient variance is small relative to within-regime variance, thereby accounting for the observed pairing equivalence. This constitutes an analysis of the same experimental runs rather than a first-principles derivation or fitted parameter whose output is forced to match the input by construction. No self-citations, uniqueness theorems, ansatzes smuggled via citation, or renamings of known results appear as load-bearing steps. The derivation chain is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central explanatory claim rests on one domain assumption about the relative sizes of within-regime and between-regime gradient variances in 3DGS under bimodal capture; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: under bimodal camera regimes, between-regime gradient variance is small relative to within-regime variance in 3DGS.
    This assumption is invoked to conclude that random and structured pairings are variance-equivalent and that two-view accumulation itself is the dominant effect.

pith-pipeline@v0.9.0 · 5646 in / 1410 out tokens · 61430 ms · 2026-05-09T19:55:17.331437+00:00 · methodology

