arxiv: 2606.11894 · v2 · pith:PZVEFC7Enew · submitted 2026-06-10 · 💻 cs.CV

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

Pith reviewed 2026-06-27 10:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D Gaussian Splatting · feed-forward reconstruction · sparse photo collections · transient removal · appearance consistency · WildCity dataset · unconstrained 3D reconstruction

The pith

Wild3R produces 3D Gaussian splats from unconstrained sparse photos without per-scene optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Wild3R, a feed-forward model for 3D Gaussian Splatting that processes real-world photo collections containing varying illumination and transient objects. Standard 3DGS requires slow per-scene optimization, and prior feed-forward methods break down under these conditions. The authors address the data shortage by releasing the WildCity dataset of 200 scenes, 170 lighting conditions, and transient elements, for a total of 337,500 images. Training on this data lets the model enforce appearance consistency across reference views while suppressing transients, yielding better results than other feed-forward baselines and performance comparable to optimized per-scene methods.

Core claim

Wild3R is a feed-forward network for 3D Gaussian Splatting that, when trained on the WildCity dataset, learns to generate consistent scene representations from sparse unconstrained photos by conditioning on reference views and removing transient content.

What carries the argument

The WildCity dataset of 200 scenes with 170 lighting conditions and transients, used to train a feed-forward model that enforces appearance consistency across viewpoints while removing transient objects.

If this is right

  • The method outperforms existing feed-forward 3DGS approaches on real-world sparse collections.
  • Results are competitive with traditional per-scene optimization methods.
  • The approach removes the need for time-consuming optimization when reconstructing from casual photos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Casual smartphone photos could become sufficient input for usable 3D models without expert capture or long compute.
  • The same training strategy of conditioning on references while suppressing transients may transfer to other 3D representations.
  • Larger or more diverse scene collections could further reduce remaining gaps with per-scene methods.

Load-bearing premise

The WildCity dataset supplies enough variety in viewpoints, illuminations, and transients at sufficient scale to train a model that generalizes to other real-world photo collections.

What would settle it

A controlled test on a new collection of photos with lighting or transient patterns absent from WildCity where the model produces view-inconsistent appearances or retains moving objects.

Figures

Figures reproduced from arXiv: 2606.11894 by Kaede Shiohara, Takashi Otonari, Toshihiko Yamasaki, Yuto Furutani.

Figure 1: Given unconstrained photo collections and reference views, Wild3R reconstructs 3D scenes
Figure 2: PSNR vs. Reconstruction Time. The horizontal axis represents the reconstruction time in log scale to generate 3D Gaussians from input images. The vertical axis indicates the corresponding PSNR of each model on the Photo Tourism dataset [31] using 16 context views. Star markers denote methods that operate without ground truth camera parameters, while circle markers represent those that require them. 3D Gau…
Figure 3: WildCity Dataset Creation Pipeline. (i) First, we collect 3D assets from the SceneCity, a Blender add-on, and Sketchfab (Sec. 3.1). (ii) Then, we generate scenes using these assets (Sec. 3.2). (iii) Next, we render the scenes from multiple viewpoints and multiple lighting conditions using the HDRI maps (Sec. 3.3). (iv) Finally, we add transient objects to the rendered images using Gemini (Sec. 3.4).
Figure 4: Scene Examples from WildCity Dataset. WildCity dataset contains images of various 3D assets captured under different viewpoints and illuminations. The added transient objects are highlighted with red dashed boxes. We also provide corresponding depth maps and masks indicating the sky regions.
… where I_D, I_G, I_T, I_B, and I_E denote the diffuse, glossy, transmission, background, and emission components, respect…
Figure 5: Learning Appearance Consistency and Transient-free Geometry. During training, Wild3R takes multi-view images with random view-dependent illuminations and transient objects, and is constrained to predict transient-free 3D Gaussian primitives, depth maps, and camera poses. The illumination of the reconstructed scene is conditioned on the reference view.
Figure 6: Qualitative Comparison with 16 Context Views on the Photo Tourism Dataset.
… demonstrating its robust capability to synthesize perceptually high-quality reconstructions from highly limited observations. Qualitative Comparison. Qualitative comparisons on the Photo Tourism dataset [31] are shown in …
Figure 7: Qualitative Ablation Study with 16 Context Views. (a) Our full model enables transient object removal while the variant fails. (b) Our full model is more consistent on fine-grained geometries than the model trained without Sketchfab assets.
6 Limitations. While Wild3R demonstrates impressive results in unconstrained 3D reconstruction, it has several limitations inherent to its design. First, our method does…
Figure 8: Qualitative Validation of Adding Transient Objects. We show randomly selected examples from the WildCity dataset before and after adding transient objects via a text-driven image editing model [35], alongside their corresponding pixel-wise difference heatmaps.
SSIM evaluates the perceptual similarity by comparing luminance, contrast, and structural information between two images. It is defined as SSIM(I, ˆ…
Figure 9: Additional Qualitative Comparison on the Photo Tourism Dataset.
… since its depth maps are not provided, we generated pseudo ground truth depth maps from the ground truth camera parameters and images using COLMAP [29]. These datasets inherently limit model generalization: first, both lack transient objects; second, LightCity provides insufficient scene diversity; finally, DTU is object-centric, lacking scene…
Figure 10: Qualitative Comparison on the NeRF-OSR Dataset.

Training Dataset   4 Context Views         16 Context Views        64 Context Views
                   PSNR↑  SSIM↑  LPIPS↓    PSNR↑  SSIM↑  LPIPS↓    PSNR↑  SSIM↑  LPIPS↓
None               11.25  0.320  0.593     13.72  0.377  0.546     14.88  0.417  0.512
DTU [11]           11.29  0.299  0.629     12.79  0.352  0.599     13.51  0.391  0.585
LightCity [38]      9.69  0.288  0.667     11.17  0.309  0.676     12.33  0.340  0.690
WildCity (Ours)    13.04  0.370  0.556     15.87  0.435  …
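
The PSNR columns above, and the vertical axis of Figure 2, refer to the standard peak signal-to-noise ratio. As a minimal sketch, assuming images are stored as floating-point arrays in [0, 1], it can be computed as below. The function name `psnr` and the `max_val` parameter are illustrative, and the paper's exact evaluation protocol (resolution, masking, averaging) is not given in this excerpt.

```python
import numpy as np


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE).

    Assumption: both images share the same shape and value range [0, max_val].
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((pred - target) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images have unbounded PSNR
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Because PSNR is logarithmic, a gap such as the 2.15 dB between the "None" row (13.72) and the WildCity row (15.87) at 16 context views would correspond to a mean squared error roughly 1.6 times lower if it held for a single image pair; averaged scores only approximate that relation.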

Original abstract

Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents Wild3R, a feed-forward method for 3D Gaussian Splatting from unconstrained sparse photo collections containing diverse lighting and transient objects. It introduces the WildCity dataset (200 scenes, 170 lighting conditions, transients, 337,500 images) to enable learning of viewpoint-conditioned appearance consistency and transient removal, claiming outperformance over existing feed-forward approaches and competitiveness with per-scene optimization methods.

Significance. If the central claims hold with rigorous validation, the work would advance practical feed-forward 3D reconstruction by addressing real-world variations without per-scene optimization, potentially broadening applicability of 3DGS to casual photo collections.

major comments (1)
  1. [Abstract] Abstract: The claim that WildCity supplies independent multi-view, lighting, and transient variations sufficient for isolating appearance consistency and transient removal in a feed-forward objective is load-bearing for the outperformance and generalization assertions, yet the provided text gives no details on data collection protocol, independence of factors, or supervision signals for transient removal; without these, the training cannot be shown to disentangle the factors as required.
minor comments (1)
  1. The abstract references 'extensive experiments' demonstrating outperformance but provides no metrics, tables, baselines, or ablation details to allow assessment of whether reported gains are supported by data or affected by post-hoc choices.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for identifying the need for greater clarity in the abstract regarding the WildCity dataset. We address the comment point by point below.

  1. Referee: [Abstract] Abstract: The claim that WildCity supplies independent multi-view, lighting, and transient variations sufficient for isolating appearance consistency and transient removal in a feed-forward objective is load-bearing for the outperformance and generalization assertions, yet the provided text gives no details on data collection protocol, independence of factors, or supervision signals for transient removal; without these, the training cannot be shown to disentangle the factors as required.

    Authors: We agree that the abstract is concise and does not itself supply the requested protocol details. The full manuscript (Section 3) describes the WildCity construction: the 200 scenes are rendered from multiple viewpoints under 170 distinct lighting conditions, with and without transient objects, using a protocol that isolates the three factors through repeated renders of the same geometry. Transient supervision is obtained from paired images of identical viewpoint and lighting that differ only in the presence of transients. We will revise the abstract to add one sentence summarizing this design, thereby making the load-bearing claim more self-contained while remaining within length constraints. revision: yes
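
The paired-image idea in this response, and the pixel-wise difference heatmaps mentioned in the Figure 8 caption, can be illustrated with a short sketch. It assumes two aligned RGB images as float arrays in [0, 1], one before and one after transient insertion. The function name `transient_heatmap` and the threshold value are hypothetical and are not taken from the paper.

```python
import numpy as np


def transient_heatmap(clean, edited, threshold=0.1):
    """Return (heatmap, mask) for two aligned H x W x 3 images.

    heatmap: per-pixel mean absolute difference over the color channels.
    mask:    True where the difference exceeds `threshold`
             (a hypothetical cutoff chosen only for illustration).
    """
    clean = np.asarray(clean, dtype=np.float64)
    edited = np.asarray(edited, dtype=np.float64)
    if clean.shape != edited.shape:
        raise ValueError("paired images must have the same shape")
    heatmap = np.abs(edited - clean).mean(axis=-1)
    mask = heatmap > threshold
    return heatmap, mask
```

A simple threshold like this only marks where the pair differs; an image editing model may also alter pixels slightly outside the inserted object, so a real pipeline would likely need a more tolerant comparison.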

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external dataset and standard training

The paper introduces the WildCity dataset as newly collected external data (200 scenes, 170 lighting conditions, transients) to train the feed-forward model for appearance consistency and transient removal. Performance claims rest on experiments comparing against other methods, with no quoted equations or steps showing a result defined in terms of itself, a fitted parameter renamed as prediction, or a load-bearing self-citation chain. The central premise (learning from multi-view/illumination/transient variations) is supported by the dataset's independent collection rather than reducing to internal definitions or prior self-work by construction. This matches the default case of a self-contained paper with no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or training details; therefore no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5706 in / 1148 out tokens · 22465 ms · 2026-06-27T10:11:12.771759+00:00 · methodology

Reference graph

Works this paper leans on

54 extracted references · 5 canonical work pages · 4 internal anchors

  1. [1] M. L. Antequera, P. Gargallo, M. Hofinger, S. R. Bulo, Y. Kuang, and P. Kontschieder. Mapillary planet-scale depth dataset. In ECCV, 2020.

  2. [2] Y. Cabon, N. Murray, and M. Humenberger. Virtual kitti 2. arXiv preprint arXiv:2001.10773, 2020.

  3. [3] X. Chen, Q. Zhang, X. Li, Y. Chen, Y. Feng, X. Wang, and J. Wang. Hallucinated neural radiance fields in the wild. In CVPR, 2022.

  4. [4] A. Couturier. SceneCity. https://www.cgchan.com/

  5. [5] H. Dahmani, M. Bennehar, N. Piasco, L. Roldao, and D. Tsishkou. Swag: Splatting in the wild images with appearance-conditioned gaussians. In ECCV, 2024.

  6. [6] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.

  7. [7] M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, 2023.

  8. [8] S. Fridovich-Keil, G. Meanti, F. R. Warburg, B. Recht, and A. Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In CVPR, 2023.

  9. [9] K. Greff, F. Belletti, L. Beyer, C. Doersch, Y. Du, D. Duckworth, D. J. Fleet, D. Gnanapragasam, F. Golemo, C. Herrmann, T. Kipf, A. Kundu, D. Lagun, I. Laradji, H.-T. (Derek) Liu, H. Meyer, Y. Miao, D. Nowrouzezahrai, C. Oztireli, E. Pot, N. Radwan, D. Rebain, S. Sabour, M. S. M. Sajjadi, M. Sela, V. Sitzmann, A. Stone, D. Sun, S. Vora, Z. Wang, T. Wu...

  10. [10] P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang. Deepmvs: Learning multi-view stereopsis. In CVPR, 2018.

  11. [11] R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs. Large scale multi-view stereopsis evaluation. In CVPR, 2014.

  12. [12] L. Jiang, Y. Mao, L. Xu, T. Lu, K. Ren, Y. Jin, X. Xu, M. Yu, J. Pang, F. Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. TOG, 2025.

  13. [13] K. Kassab, A. Schnepf, J.-Y. Franceschi, L. Caraffa, J. Mary, and V. Gouet-Brunet. Refinedfields: Radiance fields refinement for planar scene representations. TMLR, 2025.

  14. [14] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 2023.

  15. [15] J. Kulhanek, S. Peng, Z. Kukelova, M. Pollefeys, and T. Sattler. WildGaussians: 3D gaussian splatting in the wild. In NeurIPS, 2024.

  16. [16] C. Li, Z. Shi, Y. Lu, W. He, and X. Xu. Robust neural rendering in the wild with asymmetric dual 3d gaussian splatting. In NeurIPS, 2025.

  17. [17] Z. Li and N. Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018.

  18. [18] H. Lin, S. Chen, J. H. Liew, D. Y. Chen, Z. Li, Y. Zhao, S. Peng, H. Guo, X. Zhou, G. Shi, J. Feng, and B. Kang. Depth anything 3: Recovering the visual space from any views. In ICLR, 2026.

  19. [19] L. Ling, Y. Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y. Lu, X. Li, X. Sun, R. Ashok, A. Mukherjee, H. Kang, X. Kong, G. Hua, T. Zhang, B. Benes, and A. Bera. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In CVPR, 2024.

  20. [20] I. Liu, L. Chen, Z. Fu, L. Wu, H. Jin, Z. Li, C. M. R. Wong, Y. Xu, R. Ramamoorthi, Z. Xu, and H. Su. Openillumination: A multi-illumination dataset for inverse rendering evaluation on real objects. In NeurIPS, 2023.

  21. [21] R. Martin-Brualla, N. Radwan, M. S. M. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In CVPR, 2021.

  22. [22] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

  23. [23] A. Moreau, R. Shaw, M. Nazarczuk, J. Shin, T. Tanay, Z. Zhang, S. Xu, and E. Pérez-Pellitero. Off the grid: Detection of primitives for feed-forward 3d gaussian splatting. In CVPR, 2026.

  24. [24] X. Pan, N. Charron, Y. Yang, S. Peters, T. Whelan, C. Kong, O. Parkhi, R. Newcombe, and C. Y. Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In ICCV, 2023.

  25. [25] F. Pei, J. Bai, X. Feng, Z. Bi, K. Zhou, and H. Wu. Opensubstance: A high-quality measured dataset of multi-view and -lighting images and shapes. In ICCV, 2025.

  26. [26] J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, and D. Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, 2021.

  27. [27] M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021.

  28. [28] V. Rudnev, M. Elgharib, W. Smith, L. Liu, V. Golyanik, and C. Theobalt. Nerf for outdoor scene relighting. In ECCV, 2022.

  29. [29] J. L. Schönberger and J.-M. Frahm. Structure-from-Motion Revisited. In CVPR, 2016.

  30. [30] B. Smart, C. Zheng, I. Laina, and V. A. Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024.

  31. [31] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3d. TOG, 2006.

  32. [32] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe. The replica dataset:...

  33. [33] J. Sun, X. Chen, Q. Wang, Z. Li, H. Averbuch-Elor, X. Zhou, and N. Snavely. Neural 3d reconstruction in the wild. In SIGGRAPH, 2022.

  34. [34] A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V. Vondrus, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V. Koltun, J. Malik, M. Savva, and D. Batra. Habitat 2.0: Training home assistants to rearrange their habitat. In NeurIPS, 2021.

  35. [35] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  36. [36] M. Toschi, R. De Matteo, R. Spezialetti, D. De Gregorio, L. Di Stefano, and S. Salti. Relight my nerf: A dataset for novel view synthesis and relighting of real world objects. In CVPR, 2023.

  37. [37] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025.

  38. [38] J. Wang, Q. Hu, C. Bao, Y. Zhu, H. Bao, Z. Cui, and G. Zhang. Lightcity: An urban dataset for outdoor inverse rendering and reconstruction under multi-illumination conditions. In ICCV, 2025.

  39. [39] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.

  40. [40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.

  41. [41] L. Wu and T. Zhang. Wildsplatting: Unposed incremental 3d gaussian splatting reconstruction in the wild. In VRCAI, 2025.

  42. [42] H. Xia, Y. Fu, S. Liu, and X. Wang. Rgbd objects in the wild: Scaling real-world 3d object learning from rgb-d videos. In CVPR, 2024.

  43. [43] J. Xu, Y. Mei, and V. M. Patel. Wild-gs: Real-time novel view synthesis from unconstrained photo collections. In NeurIPS, 2024.

  44. [44] Y. Yang, S. Zhang, Z. Huang, Y. Zhang, and M. Tan. Cross-ray neural radiance fields for novel-view synthesis from unconstrained image collections. In ICCV, 2023.

  45. [45] Y. Yao, Z. Luo, S. Li, J. Zhang, Y. Ren, L. Zhou, T. Fang, and L. Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020.

  46. [46] B. Ye, B. Chen, H. Xu, D. Barath, and M. Pollefeys. Yonosplat: You only need one model for feedforward 3d gaussian splatting. In ICLR, 2026.

  47. [47] D. Zhang, C. Wang, W. Wang, P. Li, M. Qin, and H. Wang. Gaussian in the wild: 3d gaussian splatting for unconstrained image collections. In ECCV, 2024.

  48. [48] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

  49. [49] S. Zhang, J. Wang, Y. Xu, N. Xue, C. Rupprecht, X. Zhou, Y. Shen, and G. Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In CVPR, 2025.

  50. [50] X. Zhang, X. Zheng, Y. Yin, T. Zhao, K. Tang, M. B. Mi, Z. Xu, and D. Z. Chen. Anchorsplat: Feed-forward 3d gaussian splatting with 3d geometric priors. In CVPR, 2026.

  51. [51] Y. Zheng, A. W. Harley, B. Shen, G. Wetzstein, and L. J. Guibas. Pointodyssey: A large-scale synthetic dataset for long-term point tracking. In ICCV, 2023.

  52. [52] C. Ziwen, H. Tan, K. Zhang, S. Bi, F. Luan, Y. Hong, L. Fuxin, and Z. Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. In ICCV, 2025.

    A Implementation Details. Scene and Camera Setup. During the manual selection of buildings for scene generation, we generally maintained a distance of at least 20 meters between the...

  53. [53] If there are roads where people and cars can be placed, we use: “Please add t_p, t_v, t_s, t_b, while maintaining the geometry and lightness/darkness.”

  54. [54] If not, we use: “Please add t̃_b, while maintaining the geometry and lightness/darkness.” where the words t_p, t_v, t_s, t_b, and t̃_b are sampled from T_p, T_v, T_s, T_b, and T̃_b as in Table 4, respectively. We classify each image into the two types above, using another variant of Gemini (gemini-3-flash-preview) with a predefined prompt q, as shown in Table 4. Vali...
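
The two prompt templates quoted in entries 53 and 54 can be sketched as a small sampling routine. This is only an illustration under stated assumptions: the actual word sets are in the paper's Table 4, which is not reproduced here, so every list entry below is a hypothetical placeholder, as are the function and variable names. Reading the first two sets as people and vehicles is a guess based on the phrase "people and cars"; the meaning of the other sets is not stated in this excerpt.

```python
import random

# Hypothetical placeholder vocabularies; the real sets are in the paper's Table 4.
T_P = ["<p-word-1>", "<p-word-2>"]          # guessed: people-related words
T_V = ["<v-word-1>", "<v-word-2>"]          # guessed: vehicle-related words
T_S = ["<s-word-1>", "<s-word-2>"]          # meaning not stated in the excerpt
T_B = ["<b-word-1>", "<b-word-2>"]          # meaning not stated in the excerpt
T_B_TILDE = ["<b~-word-1>", "<b~-word-2>"]  # variant set used when no roads exist

SUFFIX = "while maintaining the geometry and lightness/darkness."


def build_prompt(has_roads: bool, rng: random.Random) -> str:
    """Fill one of the two quoted templates with randomly sampled words."""
    if has_roads:
        # One word from each of the four sets, joined as in the quoted template.
        words = [rng.choice(vocab) for vocab in (T_P, T_V, T_S, T_B)]
        return f"Please add {', '.join(words)}, {SUFFIX}"
    return f"Please add {rng.choice(T_B_TILDE)}, {SUFFIX}"
```

In the quoted description, the `has_roads` decision comes from a separate image classification step; here it is simply passed in as a flag.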