arxiv: 2607.02372 · v1 · pith:2AAF7Y3Snew · submitted 2026-07-02 · 💻 cs.CV

Learning Spectral and Polarimetric Clues for One-to-Multimodal Novel View Synthesis

Pith reviewed 2026-07-03 15:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords neural rendering · novel view synthesis · multimodal · infrared · polarimetric · multispectral · implicit representation

The pith

Pre-training on multimodal scenes enables RGB-only fine-tuning to render infrared, polarimetric, and multispectral views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to synthesize novel views in non-RGB modalities without needing to capture those modalities for each new scene. It does this by first pre-training a neural renderer on scenes that supply multiple types of images so that the model learns how the modalities relate to one another. Those learned relations then allow the model, when adapted to a fresh scene using only its RGB photographs, to output consistent images in the other modalities as well. A reader would care because this removes the requirement for costly specialized cameras on every target scene.

Core claim

SPoILeR performs a multimodal pre-training phase in which the model learns the mutual correlation between different modalities. This correlation knowledge then supports accurate prediction of unconventional modalities during fine-tuning that is supervised solely by RGB images, yielding multi-view consistent renderings of infrared, polarimetric, and multispectral frames even when no samples from those sensors are available for the scene.

What carries the argument

The Spectral and Polarimetric Implicit Learned Representation (SPoILeR), which encodes learned correlations across imaging modalities to enable transfer from pre-training to RGB-supervised fine-tuning.

If this is right

  • The approach produces accurate renderings of infrared, polarimetric, and multispectral data without any input samples from those sensors.
  • Renderings remain multi-view consistent across the new scene.
  • Fine-tuning requires supervision from RGB frames only.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If modality correlations prove stable across scenes, the method could be applied to additional imaging types such as thermal or depth data.
  • Applications in fields that use multiple sensors might reduce their dependence on full multimodal capture setups.

Load-bearing premise

Correlations between modalities discovered during pre-training on some scenes will transfer reliably to new scenes that provide only RGB supervision.
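
The premise can be illustrated with a deliberately tiny toy model. This is a sketch under strong assumptions, not the SPoILeR architecture: it assumes near-infrared intensity is a fixed scalar multiple of RGB luma within a scene, and the names `fit_gain`, `make_scene`, and the gain values are hypothetical. It shows that a correlation fitted on "pre-training" pixels predicts the unseen modality well on a new scene only when that scene shares the same correlation.

```python
# Toy illustration only: NOT the paper's model or data.
# Assumption: in each scene, NIR intensity = gain * luma(RGB).

def luma(rgb):
    """Rec. 601 luma of an (r, g, b) triple with components in [0, 1]."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def make_scene(rgb_pixels, gain):
    """Hypothetical scene: list of (luma, NIR) pairs with NIR = gain * luma."""
    return [(luma(p), gain * luma(p)) for p in rgb_pixels]

def fit_gain(pairs):
    """Least-squares gain k minimising sum((k * x - y) ** 2)."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den

def mean_abs_error(k, pairs):
    """Mean absolute error of the prediction k * x against y."""
    return sum(abs(k * x - y) for x, y in pairs) / len(pairs)

pretrain_rgb = [(0.2, 0.4, 0.1), (0.8, 0.7, 0.6), (0.5, 0.5, 0.5), (0.1, 0.9, 0.3)]
new_rgb = [(0.3, 0.3, 0.9), (0.6, 0.2, 0.4), (0.9, 0.9, 0.1)]

# "Pre-training": learn the RGB-to-NIR correlation where NIR is available.
k = fit_gain(make_scene(pretrain_rgb, gain=1.5))

# New scene, RGB only at training time; NIR is used here purely for evaluation.
err_same = mean_abs_error(k, make_scene(new_rgb, gain=1.5))   # premise holds
err_shift = mean_abs_error(k, make_scene(new_rgb, gain=0.7))  # premise violated
```

In this toy setting `err_same` is essentially zero while `err_shift` is large, which is the failure mode the premise rules out by assumption.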

What would settle it

Acquire real infrared or polarimetric images of a previously unseen scene and measure whether the model's RGB-only renderings match those ground-truth captures within expected error bounds; systematic mismatch would falsify the transfer claim.
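
A minimal sketch of such a check, assuming images are given as flat lists of floats in [0, 1] and that error is measured as PSNR over a foreground mask (one common choice). The function names and the threshold `min_psnr_db` are hypothetical; what counts as an acceptable error bound is left to the evaluator.

```python
import math

def masked_psnr(rendered, reference, mask, peak=1.0):
    """PSNR in dB computed only over pixels where mask is True."""
    sq_errors = [(a - b) ** 2 for a, b, m in zip(rendered, reference, mask) if m]
    if not sq_errors:
        raise ValueError("mask selects no pixels")
    mse = sum(sq_errors) / len(sq_errors)
    if mse == 0.0:
        return math.inf  # identical images on the masked region
    return 10.0 * math.log10(peak ** 2 / mse)

def transfer_holds(rendered, reference, mask, min_psnr_db):
    """True if the RGB-only rendering matches the real capture closely enough."""
    return masked_psnr(rendered, reference, mask) >= min_psnr_db

# Hypothetical 4-pixel example; the third pixel is background and is ignored.
rendered = [0.5, 0.6, 0.9, 0.1]
reference = [0.5, 0.5, 0.0, 0.1]
mask = [True, True, False, True]

score = masked_psnr(rendered, reference, mask)
```

Repeating this over several unseen scenes and modalities, rather than a single image, would be needed before reading a low score as systematic mismatch.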

Figures

Figures reproduced from arXiv: 2607.02372 by Federico Lincetto, Gianluca Agresti, Mattia Rossi, Piergiorgio Sartor, Pietro Zanuttigh.

Figure 1: The proposed approach firstly learns the correlation across multiple modalities on a collection of scenes. Then, a single-scene fine-tuning on RGB data alone produces a model able to render views of arbitrary modalities for the considered scene.
Figure 2: Radiance module architecture scheme: features from basis and coefficients are combined and decoded into different radiance modalities.
Figure 5: Scheme of the inverse function module I. It predicts the estimated latent vector $\hat{z}_e$ from multimodal radiance values $\{\hat{m}_1, \ldots, \hat{m}_k\}$.
Figure 6: Scheme of the modality-to-luma module M. It encourages consistency between the reference RGB luminance $g$ and the estimated luminance $g_e$ predicted from one random modality.
Figure 7: Qualitative samples from the “Fruits” scene FT performed with only RGB data.
Figure 8: Tests with an unbalanced combination of modalities. X axis refers to the number of additional modality (MS, Pol, or NIR) frames. Comparison between SPoILeR (Ours) and MMS-FW.
Figure 9: MS, Pol, and NIR renderings with different loss ablations.
Figure 10: Tests with an unbalanced combination of modalities. The X axis corresponds to the number of additional modality (MS or NIR) frames. Comparison between SPoILeR (Ours) and MMS-FW. Results averaged on all 16 X-NeRF scenes.
Figure 11: Qualitative renderings of the “Teddybear” scene from the FT step supervised with only RGB data. All frames are mosaicked, except RGB frames. RGB is demosaicked only for visualization purposes.
Figure 12: Qualitative renderings of the “Toys” scene from the FT step supervised with only RGB data. All frames are mosaicked, except RGB frames. RGB is demosaicked only for visualization purposes.
Figure 13: Qualitative renderings of the “Birdhouse” scene from the FT step supervised with only RGB data. All frames are mosaicked, except RGB frames. RGB is demosaicked only for visualization purposes.
Figure 14: Qualitative renderings of the “Bouquet” scene from the FT step supervised with only RGB data. All frames are mosaicked, except RGB frames. RGB is demosaicked only for visualization purposes.
Figure 15: Qualitative comparison between SPoILeR and MMS-FW + MST++ in terms of recovered multispectral radiance on the “Bouquet” scene. All frames are mosaicked. “before” and “after” refer to whether the RGB-to-MS conversion is performed before or after the training of MMS-FW.
Figure 16: Qualitative comparison between SPoILeR and MMS-FW + MST++ in terms of recovered multispectral radiance on the “Teddybear” scene. All frames are mosaicked. “before” and “after” refer to whether the RGB-to-MS conversion is performed before or after the training of MMS-FW.
Figure 17: Qualitative comparison between SPoILeR and MMS-FW + MST++ in terms of recovered multispectral radiance on the “Toys” scene. All frames are mosaicked. “before” and “after” refer to whether the RGB-to-MS conversion is performed before or after the training of MMS-FW.
Figure 18: Qualitative comparison between SPoILeR and MMS-FW + MST++ in terms of recovered multispectral radiance on the “Birdhouse” scene. All frames are mosaicked. “before” and “after” refer to whether the RGB-to-MS conversion is performed before or after the training of MMS-FW.
Figure 19: Qualitative comparison between SPoILeR and MMS-FW + MST++ in terms of recovered multispectral radiance on the “Bouquet” scene. All frames are mosaicked. “before” and “after” refer to whether the RGB-to-MS conversion is performed before or after the training of MMS-FW.
Figure 20: Qualitative comparison between SPoILeR and MMS-FW + PolarAnything (PA) in terms of recovered polarization on the “Fruits” scene. “before” and “after” refer to whether the RGB-to-Pol conversion is performed before or after the training of MMS-FW.
Figure 21: Qualitative comparison between SPoILeR and MMS-FW + PolarAnything (PA) in terms of recovered polarization on the “Teddybear” scene. “before” and “after” refer to whether the RGB-to-Pol conversion is performed before or after the training of MMS-FW.
Figure 22: Qualitative comparison between SPoILeR and MMS-FW + PolarAnything (PA) in terms of recovered polarization on the “Toys” scene. “before” and “after” refer to whether the RGB-to-Pol conversion is performed before or after the training of MMS-FW.
Figure 23: Qualitative comparison between SPoILeR and MMS-FW + PolarAnything (PA) in terms of recovered polarization on the “Birdhouse” scene. “before” and “after” refer to whether the RGB-to-Pol conversion is performed before or after the training of MMS-FW.
Figure 24: Qualitative comparison between SPoILeR and MMS-FW + PolarAnything (PA) in terms of recovered polarization on the “Bouquet” scene. “before” and “after” refer to whether the RGB-to-Pol conversion is performed before or after the training of MMS-FW.
Figure 25: Qualitative renderings of the “Chess” scene from the pre-training step with all-modality supervision. All frames are mosaicked, except RGB frames. RGB is demosaicked only for visualization purposes.
Figure 26: Qualitative renderings of the “Forestgang 1” scene from the pre-training step with all-modality supervision. All frames are mosaicked, except RGB frames. RGB is demosaicked only for visualization purposes.
Figure 27: Qualitative renderings of the “Laurelwreath” scene from the pre-training step with all-modality supervision. All frames are mosaicked, except RGB frames. RGB is demosaicked only for visualization purposes.
Figure 28: Qualitative renderings of the “Truck” scene from the pre-training step with all-modality supervision. All frames are mosaicked, except RGB frames. RGB is demosaicked only for visualization purposes.
Figure 29: Qualitative renderings of the “Aloe” scene from the pre-training step with all-modality supervision. All frames are mosaicked, except RGB frames. RGB is demosaicked only for visualization purposes.
Original abstract

Neural rendering techniques allow for accurate reconstruction of the geometry and color appearance of 3D scenes. Some methods have extended their use to additional imaging modalities, such as multispectral, infrared, or polarimetric data. However, all of these approaches require expensive sensors and calibrated setups to capture new multimodal frames for each new scene. We propose Spectral and Polarimetric Implicit Learned Representation (SPoILeR), a novel method to obtain multi-view consistent renderings of unconventional modalities for scenes where either only RGB frames or very few of the additional modalities are available. Thanks to a multimodal pre-training phase, the model learns the mutual correlation between different modalities. This step allows predicting accurate renderings of unconventional modalities during a fine-tuning phase supervised only by RGB images. Experimental results show that the approach can accurately render infrared, polarimetric, and multispectral frames for scenes where no input sample captured by these types of sensors is provided.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces SPoILeR, a neural implicit representation method for one-to-multimodal novel view synthesis. It performs multimodal pre-training across RGB, infrared, polarimetric, and multispectral data to learn cross-modal correlations, followed by fine-tuning on RGB-only supervision for new scenes. This enables rendering of the unconventional modalities without any target-modality input samples for those scenes. The central claim is that the pre-trained correlations transfer sufficiently to produce accurate multi-view consistent renderings of IR, polarimetric, and multispectral frames.

Significance. If the transfer of pre-trained modality correlations holds for novel scenes, the approach would meaningfully lower the barrier to multimodal 3D reconstruction by eliminating the need for expensive calibrated sensors on every target scene. The paper reports experimental results demonstrating accurate renderings under RGB-only fine-tuning, which would be a practical advance if the generalization is robust.

major comments (1)
  1. [Abstract and Experimental Results section] The load-bearing assumption that cross-modal correlations learned during pre-training are sufficiently scene-independent to enable accurate rendering of non-RGB modalities with zero multimodal supervision on entirely new scenes is not adequately secured by the reported experiments. The abstract and described pipeline provide no evidence (e.g., cross-dataset testing or ablation on material/illumination variation) that the pre-training distribution covers the statistics of the test scenes.
minor comments (1)
  1. [Abstract] The abstract supplies no quantitative metrics, error analysis, or baseline comparisons, making it difficult to assess the strength of the reported results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the opportunity to clarify the generalization properties of SPoILeR. Below we address the major comment point by point.

point-by-point responses
  1. Referee: [Abstract and Experimental Results section] The load-bearing assumption that cross-modal correlations learned during pre-training are sufficiently scene-independent to enable accurate rendering of non-RGB modalities with zero multimodal supervision on entirely new scenes is not adequately secured by the reported experiments. The abstract and described pipeline provide no evidence (e.g., cross-dataset testing or ablation on material/illumination variation) that the pre-training distribution covers the statistics of the test scenes.

    Authors: We respectfully disagree that the reported experiments fail to secure the central assumption. The pre-training corpus comprises multiple distinct scenes spanning varied materials, surface properties, and illumination conditions across all four modalities. Fine-tuning and quantitative evaluation are performed exclusively on held-out scenes that were never observed during pre-training; these test scenes were captured under different viewpoints, lighting, and material configurations from the pre-training set. The consistent accuracy of the rendered IR, polarimetric, and multispectral outputs on these unseen scenes constitutes direct empirical evidence that the learned cross-modal correlations transfer beyond the exact training scenes. While we do not conduct an explicit cross-dataset evaluation (owing to the scarcity of publicly available calibrated multimodal 3D datasets), the intra-dataset scene diversity and the zero-shot transfer results already address the core concern of scene independence. We are prepared to expand the experimental section with additional qualitative examples highlighting material and illumination variation if the editor deems it necessary.

    revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper presents a standard two-stage neural rendering pipeline: multimodal pre-training to capture cross-modal correlations, followed by RGB-supervised fine-tuning on novel scenes. No equations, parameter-fitting steps, or self-citations are described that would make any claimed prediction equivalent to its inputs by construction. The transfer of learned correlations to unseen scenes is an empirical claim resting on the pre-training distribution, not a definitional or fitted tautology. The method is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract, the central claim rests on the domain assumption that modality correlations are stable across scenes and can be learned from pre-training data.

axioms (1)
  • domain assumption: Multimodal correlations learned on pre-training scenes generalize to new scenes without any multimodal input.
    This transfer is required for the fine-tuning phase to produce accurate unconventional modality renderings from RGB alone.
invented entities (1)
  • SPoILeR model (no independent evidence)
    purpose: Implicit representation that encodes cross-modality correlations for novel view synthesis.
    New architecture introduced to perform the described pre-training and fine-tuning.

pith-pipeline@v0.9.1-grok · 5704 in / 1270 out tokens · 20902 ms · 2026-07-03T15:18:30.231293+00:00 · methodology


Reference graph

Works this paper leans on

80 extracted references · 61 canonical work pages

  1. [1]

    In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

    Arad, B., Timofte, R., Yahel, R., Morag, N., Bernat, A., Cai, Y., Lin, J., Lin, Z., Wang, H., Zhang, Y., et al.: Ntire 2022 spectral recovery challenge and data set. In: IEEE Conference on Computer Vision and Pattern Recognition Workshop. pp. 862–880. IEEE (2022).https://doi.org/10.1109/CVPRW56347.2022.001024

  2. [3]

    In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C

    Bachmann, R., Kar, O.F., Mizrahi, D., Garjani, A., Gao, M., Griffiths, D., Hu, J., Dehghan, A., Zamir, A.: 4m-21: An any-to-any vision model for tens of tasks and modalities. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems. vol. 37, pp. 61872–61911. Curran As...

  3. [4]

    In: European Conference on Computer Vision

    Bachmann, R., Mizrahi, D., Atanov, A., Zamir, A.: Multimae: Multi-modal multi- task masked autoencoders. In: European Conference on Computer Vision. pp. 348–

  4. [5]

    Springer (2022).https://doi.org/10.1007/978-3-031-19836-6_203

  5. [6]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Boss, M., Braun, R., Jampani, V., Barron, J.T., Liu, C., Lensch, H.: Nerd: Neu- ral reflectance decomposition from image collections. In: International Confer- ence on Computer Vision. pp. 12684–12694 (2021).https://doi.org/10.1109/ ICCV48922.2021.012451

  6. [7]

    In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

    Cai,Y.,Lin,J.,Lin,Z.,Wang,H.,Zhang,Y.,Pfister,H.,Timofte,R.,VanGool,L.: Mst++: Multi-stage spectral-wise transformer for efficient spectral reconstruction. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 745–755 (2022).https://doi.org/10.1109/CVPRW56347.2022.000902, 3, 4, 10, 13, 23

  7. [8]

    In: European Conference on Computer Vision (2026) 8

    Camuffo, E., Barbato, F., Ozay, M., Milani, S., Michieli, U.: Mocha: Multi-modal objects-aware cross-architecture alignment. In: European Conference on Computer Vision (2026) 8

  8. [9]

    In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Charatan,D.,Li,S.L.,Tagliasacchi,A.,Sitzmann,V.:pixelsplat:3dgaussiansplats from image pairs for scalable generalizable 3d reconstruction. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 19457–19467 (2024).https: //doi.org/10.1109/CVPR52733.2024.018404

  9. [10]

    ACM Transactions on Graphics (2023),https:// doi.org/10.1145/35921354, 5

    Chen, A., Xu, Z., Wei, X., Tang, S., Su, H., Geiger, A.: Dictionary fields: Learning a neural basis decomposition. ACM Transactions on Graphics (2023),https:// doi.org/10.1145/35921354, 5

  10. [11]

    In: European Conference on Computer Vision

    Chen, Q., Shu, S., Bai, X.: Thermal3d-gs: Physics-induced 3d gaussians for thermal infrared novel-view synthesis. In: European Conference on Computer Vision. pp. 253–269. Springer (2024).https://doi.org/10.1007/978-3-031-73383-3_152

  11. [13]

    In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chou, Z.T., Huang, S.Y., Liu, I., Wang, Y.C.F., et al.: Gsnerf: Generalizable se- mantic neural radiance fields with enhanced 3d scene understanding. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 20806–20815 (2024). https://doi.org/10.1109/CVPR52733.2024.019661

  12. [15]

    In: Color Imaging Conference

    Darling, B.A., Ferwerda, J.A., Berns, R.S., Chen, T.: Real-time multispectral ren- dering with complex illumination. In: Color Imaging Conference. vol. 19, pp. 345–

  13. [16]

    1145/3721250.37430352

    Society of Imaging Science and Technology (2011).https://doi.org/10. 1145/3721250.37430352

  14. [17]

    In: European Conference on Computer Vision

    Dave, A., Zhao, Y., Veeraraghavan, A.: Pandora: Polarization-aided neural decom- position of radiance. In: European Conference on Computer Vision. pp. 538–556. Springer (2022).https://doi.org/10.1007/978-3-031-20071-7_322

  15. [18]

    In: International Con- ference on Learning Representations (2021) 4

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Con- ference on Learning Representations (2021) 4

  16. [19]

    In: IEEE Symposium on Volume Visualization and Graphics

    Gibson, S.F.: Using distance maps for accurate surface representation in sampled volumes. In: IEEE Symposium on Volume Visualization and Graphics. pp. 23–30 (1998) 6

  17. [20]

    In: International Conference on Learning Representa- tions (2026),https://openreview.net/forum?id=BR2ItBcqOo3

    Griffiths, R., Dansereau, D.G.: RoRE: Rotary ray embedding for generalised multi- modal scene understanding. In: International Conference on Learning Representa- tions (2026),https://openreview.net/forum?id=BR2ItBcqOo3

  18. [21]

    In: International Conference on Machine Learning (2020),https://dl.acm.org/doi/abs/10.5555/3524938.35252936

    Gropp, A., Yariv, L., Haim, N., Atzmon, M., Lipman, Y.: Implicit geometric reg- ularization for learning shapes. In: International Conference on Machine Learning (2020),https://dl.acm.org/doi/abs/10.5555/3524938.35252936

  19. [22]

    Scientific Reports12(1), 17288 (2022).https://doi

    Großmann, W., Horn, H., Niggemann, O.: Improving remote material classification ability with thermal imagery. Scientific Reports12(1), 17288 (2022).https://doi. org/10.1038/s41598-022-21588-42

  20. [23]

    In: AAAI Conference on Artificial Intelligence

    Guo, H., Liu, H., Wen, J., Li, J.: Cross-spectral gaussian splatting with spatial occupancy consistency. In: AAAI Conference on Artificial Intelligence. vol. 39, pp. 3229–3237 (2025).https://doi.org/10.1609/aaai.v39i3.323332, 3

  21. [24]

    In: International Conference on Computer Vision

    Han, Y., Tie, B., Guo, H., Lyu, Y., Li, S., Shi, B., Jia, Y., Ma, Z.: Polgs: Polari- metric gaussian splatting for fast reflective surface reconstruction. In: International Conference on Computer Vision. pp. 28073–28082 (2025) 2

  22. [25]

    Optics Express19(10), 9315–9329 (2011).https://doi.org/10.1364/OE

    Hashimoto, N., Murakami, Y., Bautista, P.A., Yamaguchi, M., Obi, T., Ohyama, N., Uto, K., Kosugi, Y.: Multispectral image enhancement for effective visualiza- tion. Optics Express19(10), 9315–9329 (2011).https://doi.org/10.1364/OE. 19.0093152

  23. [26]

    acha.2010.07.001

    Hassan, M., Forest, F., Fink, O., Mielle, M.: Thermonerf: A multimodal neural radiance field for joint rgb-thermal novel view synthesis of building facades. Ad- vanced Engineering Informatics65, 103345 (2025).https://doi.org/10.1016/j. aei.2025.1033452

  24. [27]

    Computer Graphics Forum42(2023).https://doi.org/10.1111/cgf.149404

    He, H., Liang, Y., Xiao, S., Chen, J., Chen, Y.: Cp-nerf: Conditionally parameter- ized neural radiance fields for cross-scene novel view synthesis. Computer Graphics Forum42(2023).https://doi.org/10.1111/cgf.149404

  25. [28]

    In: ACM International Conference on Computer Graphics and Interactive Techniques

    Huang, B., Yu, Z., Chen, A., Geiger, A., Gao, S.: 2d gaussian splatting for geomet- rically accurate radiance fields. In: ACM International Conference on Computer Graphics and Interactive Techniques. pp. 1–11 (2024).https://doi.org/10.1145/ 3641519.36574281

  26. [29]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Huang, Y.H., He, Y., Yuan, Y.J., Lai, Y.K., Gao, L.: Stylizednerf: Consistent 3d scene stylization as stylized nerf via 2d-3d mutual learning. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 18321–18331 (2022).https: //doi.org/10.1109/CVPR52688.2022.017804 18 F. Lincetto et al

  27. [30]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Jain, A., Tancik, M., Abbeel, P.: Putting nerf on a diet: Semantically consistent few-shot view synthesis. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 5885–5894 (2021).https://doi.org/10.1109/ICCV48922.2021. 005834

  28. [31]

    Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, Anton Schwaighofer, Maria Wetscherek, Matthew P

    Jin, H., Liu, I., Xu, P., Zhang, X., Han, S., Bi, S., Zhou, X., Xu, Z., Su, H.: Tensoir: Tensorial inverse rendering. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 165–174 (2023).https://doi.org/10.1109/CVPR52729.2023. 000241

  29. [32]

    ACM Transactions on Graphics42(4), 1–14 (2023).https://doi.org/10.1145/35924331

    Kerbl, B., Kopanas, G., Leimkuehler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics42(4), 1–14 (2023).https://doi.org/10.1145/35924331

  30. [33]

    In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language embedded radiance fields. In: International Conference on Computer Vision. pp. 19729–19739 (2023).https://doi.org/10.1109/ICCV51070.2023.018071

  31. [34]

    In: ACM International Conference on Computer Graphics and Interactive Techniques - Asia

    Kim, Y., Jin, W., Cho, S., Baek, S.H.: Neural spectro-polarimetric fields. In: ACM International Conference on Computer Graphics and Interactive Techniques - Asia. pp. 1–11 (2023).https://doi.org/10.1145/3610548.36181722, 3

  32. [35]

    In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR)

    Lei, C., Huang, X., Zhang, M., Yan, Q., Sun, W., Chen, Q.: Polarized reflection removal with perfect alignment in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1750–1758 (2020).https://doi.org/10. 1109/CVPR42600.2020.001822

  33. [36]

    In: IEEE Conference on Computer Vision and Pat- ternRecognition.pp.12632–12641(2022).https://doi.org/10.1109/CVPR52688

    Lei, C., Qi, C., Xie, J., Fan, N., Koltun, V., Chen, Q.: Shape from polarization for complex scenes in the wild. In: IEEE Conference on Computer Vision and Pat- ternRecognition.pp.12632–12641(2022).https://doi.org/10.1109/CVPR52688. 2022.012302

  34. [37]

    In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Li, C., Ono, T., Uemori, T., Mihara, H., Gatto, A., Nagahara, H., Moriuchi, Y.: Neisf: Neural incident stokes field for geometry and material estimation. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 21434–21445 (2024). https://doi.org/10.1109/CVPR52733.2024.020252

  35. [39]

    In: International Conference on Acoustics, Speech, and Signal Process- ing

    Li, J., Li, Y., Sun, C., Wang, C., Xiang, J.: Spec-nerf: Multi-spectral neural radi- ance fields. In: International Conference on Acoustics, Speech, and Signal Process- ing. pp. 2485–2489. IEEE (2024).https://doi.org/10.1109/ICASSP48485.2024. 104460152

  36. [40]

    In: AAAI Conference on Artificial In- telligence

    Li, R., Liu, J., Liu, G., Zhang, S., Zeng, B., Liu, S.: Spectralnerf: Physically based spectral rendering with neural radiance field. In: AAAI Conference on Artificial In- telligence. vol. 38, pp. 3154–3162 (2024).https://doi.org/10.1609/aaai.v38i4. 280992

  37. [41]

    Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, Anton Schwaighofer, Maria Wetscherek, Matthew P

    Li, Z., Müller, T., Evans, A., Taylor, R.H., Unberath, M., Liu, M.Y., Lin, C.H.: Neuralangelo: High-fidelity neural surface reconstruction. In: IEEE Conference on Computer Vision and Pattern Recognition (2023).https://doi.org/10.1109/ CVPR52729.2023.008171, 6, 10

  38. [42]

    In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Liang, Z., Zhang, Q., Feng, Y., Shan, Y., Jia, K.: Gs-ir: 3d gaussian splatting for inverse rendering. In: IEEE Conference on Computer Vision and Pattern Recogni- tion. pp. 21644–21653 (2024).https://doi.org/10.1109/CVPR52733.2024.02045 1 Learning Spect. and Pol. Clues for One-to-Multimodal Novel View Synthesis 19

  39. [43]

    Lincetto, F., Agresti, G., Rossi, M., Zanuttigh, P.: Exploiting multiple priors for neural 3d indoor reconstruction. In: British Machine Vision Conference. BMVA (2023)

  40. [44]

    Lincetto, F., Agresti, G., Rossi, M., Zanuttigh, P.: Multimodalstudio: A heterogeneous sensor dataset and framework for neural rendering across multiple imaging modalities. In: IEEE Conference on Computer Vision and Pattern Recognition (2025). https://doi.org/10.1109/CVPR52734.2025.01024

  41. [45]

    Liu, Y., Hu, B., Huang, J., Tai, Y.W., Tang, C.K.: Instance neural radiance field. In: International Conference on Computer Vision. pp. 787–796 (October 2023). https://doi.org/10.1109/ICCV51070.2023.00079

  42. [46]

    Lu, R., Chen, H., Zhu, Z., Qin, Y., Lu, M., Yan, C., et al.: Thermalgaussian: Thermal 3d gaussian splatting. In: International Conference on Learning Representations (2025)

  43. [47]

    Ma, R., Ma, T., Guo, D., He, S.: Novel view synthesis and dataset augmentation for hyperspectral data using nerf. IEEE Access 12, 45331–45341 (2024). https://doi.org/10.1109/ACCESS.2024.3381531

  44. [48]

    Meng, G., Cai, Z., Chen, R., Tu, J., Wang, Y., Huang, Y., Ding, X.: Frn: Fractal-based recursive spectral reconstruction network. In: Advances in Neural Information Processing Systems (2025)

  45. [49]

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: European Conference on Computer Vision (2020), https://doi.org/10.1007/978-3-030-58452-8_24

  46. [50]

    Mizrahi, D., Bachmann, R., Kar, O., Yeo, T., Gao, M., Dehghan, A., Zamir, A.: 4m: Massively multimodal masked modeling. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems. vol. 36, pp. 58363–58408. Curran Associates, Inc. (2023), https://dl.acm.org/doi/10.5555/3666122.3668666

  47. [51]

    Osher, S., Fedkiw, R., Piechor, K.: Level set methods and dynamic implicit surfaces. Applied Mechanics Reviews 57(3), B15–B15 (2004), https://doi.org/10.1007/b98879

  48. [52]

    Özer, M., Weiherer, M., Hundhausen, M., Egger, B.: Exploring multi-modal neural scene representations with applications on thermal imaging. In: European Conference on Computer Vision. pp. 82–98. Springer (2024). https://doi.org/10.1007/978-3-031-92805-5_6

  49. [53]

    Perez, F., Rojas, S., Hinojosa, C., Rueda-Chacón, H., Ghanem, B.: Unmix-nerf: Spectral unmixing meets neural radiance fields. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 26284–26293 (2025)

  50. [54]

    Poggi, M., Ramirez, P.Z., Tosi, F., Salti, S., Mattoccia, S., Di Stefano, L.: Cross-spectral neural radiance fields. In: International Conference on 3D Vision. pp. 606–616. IEEE (2022). https://doi.org/10.1109/3DV57658.2022.00071

  51. [55]

    Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: Langsplat: 3d language gaussian splatting. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 20051–20060 (2024). https://doi.org/10.1109/CVPR52733.2024.01895

  52. [56]

    Qu, Y., Dai, S., Li, X., Lin, J., Cao, L., Zhang, S., Ji, R.: Goi: Find 3d gaussians of interest with an optimizable open-vocabulary semantic-space hyperplane. In: ACM International Conference on Multimedia. pp. 5328–5337 (2024). https://doi.org/10.1145/3664647.3680852

  53. [57]

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022). https://doi.org/10.1109/CVPR52688.2022.01042

  54. [58]

    Saponaro, P., Sorensen, S., Kolagunda, A., Kambhamettu, C.: Material classification with thermal imagery. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 4649–4656 (2015). https://doi.org/10.1109/CVPR.2015.7299096

  55. [59]

    Shi, Z., Chen, C., Xiong, Z., Liu, D., Wu, F.: Hscnn+: Advanced cnn-based hyperspectral recovery from rgb images. In: IEEE Conference on Computer Vision and Pattern Recognition Workshop. pp. 939–947 (2018). https://doi.org/10.1109/CVPRW.2018.00139

  56. [60]

    Thirgood, C., Mendez, O., Ling, E., Storey, J., Hadfield, S.: Hypergs: Hyperspectral 3d gaussian splatting. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 5970–5979 (2025). https://doi.org/10.1109/CVPR52734.2025.00560

  57. [61]

    Tsagaris, V., Anastassopoulos, V.: Multispectral image fusion for improved rgb representation based on perceptual attributes. International Journal of Remote Sensing 26(15), 3241–3254 (2005). https://doi.org/10.1080/01431160500127609

  58. [62]

    Varma, M., Wang, P., Chen, X., Chen, T., Venugopalan, S., Wang, Z.: Is attention all that nerf needs? In: International Conference on Learning Representations (2023)

  59. [63]

    Wang, H., Wen, S., Guo, B.: Polarimetric monocular gaussian splatting slam for dense surface reconstruction. In: ACM International Conference on Multimedia. pp. 7519–7528 (2025). https://doi.org/10.1145/3746027.3754925

  60. [64]

    Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., Wang, W.: Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. Advances in Neural Information Processing Systems (2021), https://dl.acm.org/doi/10.5555/3540261.3542342

  61. [65]

    Wang, W., Zhang, J., Shen, C.: Improved human detection and classification in thermal images. In: IEEE International Conference on Image Processing. pp. 2313–. IEEE (2010). https://doi.org/10.1109/ICIP.2010.5649946

  63. [67]

    Wang, Y., Huang, D., Ye, W., Zhang, G., Ouyang, W., He, T.: Neurodin: A two-stage framework for high-fidelity neural surface reconstruction. Advances in Neural Information Processing Systems 37, 103168–103197 (2024), https://dl.acm.org/doi/10.5555/3737916.3741194

  64. [68]

    Wang, Y., Han, Q., Habermann, M., Daniilidis, K., Theobalt, C., Liu, L.: Neus2: Fast learning of neural implicit surfaces for multi-view reconstruction. In: International Conference on Computer Vision. pp. 3295–3306 (2023). https://doi.org/10.1109/ICCV51070.2023.00305

  65. [69]

    Wang, Y., Fang, L., Zhu, H., Hu, F., Ye, L., Ma, Z.: Golf-nrt: Integrating global context and local geometry for few-shot view synthesis. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 21349–21359 (2025). https://doi.org/10.1109/CVPR52734.2025.01989

  66. [71]

    Wu, Y., Dian, R., Li, S.: Multistage spatial-spectral fusion network for spectral super-resolution. IEEE Transactions on Neural Networks and Learning Systems 36(7), 12736–12746 (2024). https://doi.org/10.1109/TNNLS.2024.3460190

  67. [72]

    Xu, J., Liao, M., Kathirvel, R.P., Patel, V.M.: Leveraging thermal modality to enhance reconstruction in low-light conditions. In: European Conference on Computer Vision. pp. 321–339. Springer (2024). https://doi.org/10.1007/978-3-031-72913-3_18

  68. [73]

    Xu, Y., Zoss, G., Chandran, P., Gross, M., Bradley, D., Gotardo, P.: Renerf: Relightable neural radiance fields with nearfield lighting. In: International Conference on Computer Vision. pp. 22581–22591 (2023). https://doi.org/10.1109/ICCV51070.2023.02064

  69. [74]

    Yao, M., Wang, M., Tam, K.M., Li, L., Xue, T., Gu, J.: Polarfree: Polarization-based reflection-free imaging. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 10890–10899 (2025). https://doi.org/10.1109/CVPR52734.2025.01017

  70. [75]

    Yariv, L., Gu, J., Kasten, Y., Lipman, Y.: Volume rendering of neural implicit surfaces. In: Advances in Neural Information Processing Systems (2021), https://dl.acm.org/doi/10.5555/3540261.3540628

  71. [76]

    Ye, T., Wu, Q., Deng, J., Liu, G., Liu, L., Xia, S., Pang, L., Yu, W., Pei, L.: Thermal-nerf: Neural radiance fields from an infrared camera. In: International Conference on Intelligent Robots and Systems. pp. 1046–1053. IEEE (2024)

  72. [77]

    Yin, Q., Guo, P.: Multispectral remote sensing image classification with multiple features. In: International Conference on Machine Learning and Cybernetics. vol. 1, pp. 360–365. IEEE (2007). https://doi.org/10.1109/ICMLC.2007.4370170

  73. [78]

    Zhang, D., Yuan, Y.J., Chen, Z., Zhang, F.L., He, Z., Shan, S., Gao, L.: Stylizedgs: Controllable stylization for 3d gaussian splatting. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(12), 11961–11973 (2025). https://doi.org/10.1109/TPAMI.2025.3604010

  74. [79]

    Zhang, K., Luan, F., Li, Z., Snavely, N.: Iron: Inverse rendering by optimizing neural sdfs and materials from photometric images. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 5565–5574 (2022). https://doi.org/10.1109/CVPR52688.2022.00548

  75. [80]

    Zhang, K., Lyu, Y., Guo, H., Li, S., Ma, Z., Shi, B.: Polaranything: Diffusion-based polarimetric image synthesis. In: International Conference on Computer Vision. pp. 26466–26476 (2025)

  76. [81]

    Zhang, Q., Wang, B.H., Yang, M.C., Zou, H.: Mmnerf: multi-modal and multi-view optimized cross-scene neural radiance fields. IEEE Access 11, 27401–27413 (2023). https://doi.org/10.1109/ACCESS.2023.3254548

  77. [82]

    Zhang, Y., Chen, A., Wan, Y., Song, Z., Yu, J., Luo, Y., Yang, W.: Ref-gs: Directional factorization for 2d gaussian splatting. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 26483–26492 (2025). https://doi.org/10.1109/CVPR52734.2025.02466

  78. [83]

    Zhao, C., Huang, X., Yang, K., Wang, X., Wang, Q.: Generalizable 3d gaussian splatting for novel view synthesis. Pattern Recognition 161, 111271 (2025)

  79. [84]

    Zhou, Z., Dong, M., Xie, X., Gao, Z.: Fusion of infrared and visible images for night-vision context enhancement. Applied Optics 55(23), 6480–6490 (2016). https://doi.org/10.1364/AO.55.006480

  80. [85]

    Zhu, H., Ding, T., Chen, T., Zharkov, I., Nevatia, R., Liang, L.: Caesarnerf: Calibrated semantic representation for few-shot generalizable neural rendering. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) European Conference on Computer Vision. pp. 71–89. Springer Nature Switzerland, Cham (2025). https://doi.org/...