pith. machine review for the scientific record.

arxiv: 2603.28287 · v2 · submitted 2026-03-30 · 💻 cs.CV

Recognition: no theorem link

TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 21:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D reconstruction · multi-view stereo · dataset · European landmarks · 4K images · depth maps · camera poses · aerial imagery

The pith

TerraSky3D supplies 50,000 high-resolution images across 150 European landmark scenes with calibrated poses and depth maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a shortage of large, high-quality public datasets suitable for training modern 3D reconstruction systems. To address this gap, the authors captured and curated TerraSky3D, a collection of 50,000 4K images organized into 150 scenes that combine ground-level, aerial, and mixed viewpoints of European landmarks. Each scene comes with curated intrinsic and extrinsic calibration, camera poses, and depth maps. The dataset is positioned as a resource for training and benchmarking multi-view reconstruction pipelines under more realistic and varied conditions than earlier collections.
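With calibrated intrinsics, a camera pose, and a depth map, every scene can be lifted to a metric point cloud, which is the supervision most multi-view pipelines consume. Below is a minimal back-projection sketch; the paper does not specify an on-disk format or pose convention, so the array shapes, the world-to-camera convention x_cam = R · x_world + t, and the use of depth == 0 for invalid pixels are all assumptions.

```python
# Hedged sketch: lift one TerraSky3D-style depth map to world-space points.
# Shapes and conventions here are assumptions, not the authors' format.
import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray,
                R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Lift an (H, W) depth map to an (N, 3) world-space point cloud."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                 # pixel grid (row, col)
    valid = depth > 0                         # assume 0 marks missing depth
    pix = np.stack([u[valid], v[valid], np.ones(int(valid.sum()))])
    rays = np.linalg.inv(K) @ pix             # camera-space rays at unit depth
    pts_cam = rays * depth[valid]             # scale rays by metric depth
    return (R.T @ (pts_cam - t[:, None])).T   # undo x_cam = R @ x_world + t
```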

Core claim

The authors present TerraSky3D, a dataset of 50,000 high-resolution 4K images divided into 150 ground, aerial, and mixed scenes of European landmarks. Each scene is accompanied by curated calibration data, camera poses, and depth maps, and the collection was created specifically to support training and evaluation of advanced 3D reconstruction pipelines.

What carries the argument

The TerraSky3D multi-view dataset itself: high-resolution images with associated geometric annotations spanning ground, aerial, and combined capture altitudes.

Load-bearing premise

The collected images, poses, and depth maps maintain consistent quality, accuracy, and diversity across scenes.

What would settle it

An experiment in which state-of-the-art reconstruction networks trained on TerraSky3D show no measurable improvement in accuracy or completeness when tested on standard benchmarks compared with networks trained only on prior datasets.
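A minimal sketch of how that settling experiment could be scored, assuming per-scene accuracy numbers for two otherwise-identical models, one trained with TerraSky3D and one without. The paired Wilcoxon test and the 0.05 threshold are placeholder choices, not anything the paper specifies.

```python
# Hedged sketch: does adding TerraSky3D measurably improve per-scene accuracy?
from scipy.stats import wilcoxon

def no_measurable_gain(acc_with: list[float], acc_without: list[float],
                       alpha: float = 0.05) -> bool:
    """True if the model trained with TerraSky3D shows no significant gain."""
    _, p = wilcoxon(acc_with, acc_without, alternative="greater")
    return p >= alpha  # failing to reject 'no gain' would undercut the claim
```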

Figures

Figures reproduced from arXiv: 2603.28287 by Christian Sormann, Friedrich Fraundorfer, Mattia D'Urso, Mattia Rossi, Yuxi Hu.

Figure 1
Figure 1. Example scene from TerraSky3D. Left: Sparse reconstruction of the Villalta Castle, Italy. Right: Representative images collected from aerial and ground perspectives, shown with their corresponding semantically filtered depth maps.
Figure 2
Figure 3
Figure 3. Example scene from TerraSky3D. Left: Sparse reconstruction of the Natural History Museum, Vienna, Austria. Right: The first row shows example images, and the second row shows the corresponding semantically filtered depth maps.
Figure 4
Figure 5
Figure 5. Example scene from TerraSky3D. Left: Sparse reconstruction of the Italian Charnel House, Kobarid, Slovenia. Right: Representative images collected from aerial and ground perspectives, shown with their corresponding semantically filtered depth maps.
Figure 6
Figure 6. Example scene from TerraSky3D. Left: Sparse reconstruction of Erto e Casso, Pordenone, Italy. Right: Representative images collected from aerial and ground perspectives, shown with their corresponding semantically filtered depth maps.
Figure 7
Figure 7. Example scene from TerraSky3D. Left: Sparse reconstruction of the Barcis Dam, Pordenone, Italy. Right: Representative images collected from aerial and ground perspectives, shown with their corresponding semantically filtered depth maps.
read the original abstract

Despite the growing need for data of more and more sophisticated 3D reconstruction pipelines, we can still observe a scarcity of suitable public datasets. Existing 3D datasets are either low resolution, limited to a small amount of scenes, based on images of varying quality because retrieved from the internet, or limited to specific capturing scenarios. Motivated by this lack of suitable 3D datasets, we captured TerraSky3D, a high-resolution large-scale 3D reconstruction dataset comprising 50,000 images divided into 150 ground, aerial, and mixed scenes. The dataset focuses on European landmarks and comes with curated calibration data, camera poses, and depth maps. TerraSky3D tries to answer the need for challenging dataset that can be used to train and evaluate 3D reconstruction-related pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents TerraSky3D, a new high-resolution 3D reconstruction dataset comprising 50,000 images across 150 ground, aerial, and mixed scenes of European landmarks, supplied with curated calibration data, camera poses, and depth maps to address the scarcity of suitable public datasets for training and evaluating sophisticated 3D pipelines.

Significance. A validated dataset of this scale and multi-view diversity could meaningfully advance 3D reconstruction research by providing challenging, landmark-focused captures that exceed the resolution or consistency limits of existing benchmarks. However, without demonstrated accuracy or comparative utility, the significance remains potential rather than established.

major comments (3)
  1. [Abstract] The central claim that the dataset is 'curated' and suitable for training and evaluation rests on the unverified assertion of high-quality calibration, poses, and depth maps, yet the text supplies no capture protocols, reprojection errors, pose RMSE against independent SfM, depth RMSE against LiDAR or stereo, or any other quantitative validation metric (a reprojection-error sketch follows the minor comments).
  2. [Dataset description] No ablation or histogram is provided on scene diversity (texture, lighting, viewpoint variation) or image-quality statistics, which directly undermines the claim that TerraSky3D improves upon existing datasets limited by low resolution or internet-sourced variability.
  3. [Evaluation] The manuscript reports no baseline 3D reconstruction results (e.g., with COLMAP or learned methods) and no comparisons against Tanks & Temples or ETH3D, leaving the asserted utility for reconstruction pipelines unsupported by evidence.
minor comments (2)
  1. [Dataset] Add a table breaking down the 150 scenes by category (ground/aerial/mixed) with exact image counts per scene to improve transparency.
  2. [Capture protocol] Clarify the 4K resolution claim with explicit pixel dimensions and any downsampling applied during capture or processing.
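The validation metric requested in major comment 1, reprojection error, reduces to a few lines once a pose convention is fixed. The sketch below assumes x_cam = R · X + t; the paper does not state its pose parameterization, so this is illustrative only.

```python
# Hedged sketch of a per-observation reprojection-error check (major comment 1).
import numpy as np

def reprojection_error(X: np.ndarray, uv: np.ndarray, K: np.ndarray,
                       R: np.ndarray, t: np.ndarray) -> float:
    """Pixel distance between observed keypoint uv and the projection of X."""
    x_cam = R @ X + t                  # world -> camera (assumed convention)
    x_img = K @ (x_cam / x_cam[2])     # perspective divide, then intrinsics
    return float(np.linalg.norm(x_img[:2] - uv))
```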

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing the TerraSky3D dataset. We address each major comment below and will incorporate revisions to strengthen the description and validation of the data.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the dataset is 'curated' and suitable for training and evaluation rests on the unverified assertion of high-quality calibration, poses, and depth maps, yet the text supplies no capture protocols, reprojection errors, pose RMSE against independent SfM, depth RMSE against LiDAR or stereo, or any other quantitative validation metric.

    Authors: We acknowledge that the original abstract and text did not include these quantitative details. In the revised manuscript we will expand the methods section with capture protocols (camera models, acquisition setup, and synchronization) and report specific metrics including average reprojection error from bundle adjustment, pose RMSE from cross-validation against independent SfM runs on overlapping subsets, and depth consistency RMSE measured against multi-view stereo reconstructions. LiDAR ground truth was not collected during acquisition, so direct LiDAR RMSE is unavailable; we will instead emphasize the stereo-based validation. revision: yes

  2. Referee: [Dataset description] No ablation or histogram is provided on scene diversity (texture, lighting, viewpoint variation) or image-quality statistics, which directly undermines the claim that TerraSky3D improves upon existing datasets limited by low resolution or internet-sourced variability.

    Authors: We agree that explicit statistics would better substantiate the advantages over prior datasets. The revision will add a new subsection with histograms and summary tables quantifying scene diversity: texture complexity via gradient-magnitude distributions (a minimal sketch of this statistic follows these responses), lighting variation across capture times and conditions, viewpoint coverage (altitude, azimuth, and elevation ranges), and image-quality metrics such as average sharpness scores and resolution uniformity across the 50,000 images. revision: yes

  3. Referee: [Evaluation] The manuscript reports no baseline 3D reconstruction results (e.g., with COLMAP or learned methods) and no comparisons against Tanks & Temples or ETH3D, leaving the asserted utility for reconstruction pipelines unsupported by evidence.

    Authors: The manuscript is structured as a dataset release paper rather than a methods benchmark. To directly address the concern, we will add a concise evaluation subsection demonstrating baseline usability: COLMAP reconstructions on a representative subset of scenes with reported completeness and accuracy metrics (a minimal pycolmap sketch follows these responses), plus a side-by-side comparison table highlighting TerraSky3D's higher resolution, landmark focus, and multi-view (ground/aerial) diversity relative to Tanks & Temples and ETH3D. revision: yes
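For response 2, a minimal sketch of one of the promised diversity statistics: a gradient-magnitude histogram as a texture-complexity proxy. The bin count and the grayscale conversion are our choices, not the authors'.

```python
# Hedged sketch: per-image texture-complexity histogram (rebuttal response 2).
import numpy as np

def gradient_magnitude_hist(img: np.ndarray, bins: int = 64) -> np.ndarray:
    """Normalized histogram of finite-difference gradient magnitudes."""
    gray = img.mean(axis=2) if img.ndim == 3 else img.astype(float)
    gy, gx = np.gradient(gray.astype(float))   # simple finite differences
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(mag, bins=bins, range=(0.0, float(mag.max()) + 1e-8))
    return hist / hist.sum()
```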
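For response 3, a hedged sketch of the proposed COLMAP baseline via the pycolmap bindings. Paths are placeholders, and the exact signatures assume a recent pycolmap release; this sketches the proposed evaluation, not the authors' pipeline.

```python
# Hedged sketch: COLMAP baseline on one scene via pycolmap (rebuttal response 3).
import pycolmap

def colmap_baseline(image_dir: str, workspace: str) -> None:
    database = f"{workspace}/database.db"
    pycolmap.extract_features(database, image_dir)  # SIFT features into the DB
    pycolmap.match_exhaustive(database)             # exhaustive pairwise matching
    recs = pycolmap.incremental_mapping(database, image_dir, workspace)
    for idx, rec in recs.items():                   # one entry per sub-model
        print(idx, rec.summary())
```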

Circularity Check

0 steps flagged

Dataset release paper exhibits no circularity

full rationale

The manuscript is a dataset release paper that describes the capture and curation of 50,000 images across 150 scenes, along with provided calibration data, camera poses, and depth maps. There are no derivations, equations, predictions, fitted parameters, or load-bearing claims that reduce by construction to the paper's own inputs. The central assertion rests solely on the existence of the collected data rather than any self-referential reasoning, self-citation chains, or ansatz smuggling. No steps qualify as circular under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset introduction paper. No mathematical derivations, free parameters, axioms, or new postulated entities are involved in the central claim.

pith-pipeline@v0.9.0 · 5452 in / 1082 out tokens · 53420 ms · 2026-05-14T21:45:32.301739+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 2 internal anchors

  1. [1]

    HPatches: A benchmark and evaluation of handcrafted and learned local descriptors

    Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5173–5182. IEEE, 2017.

  2. [2]

    OpenCV

    Gary Bradski, Adrian Kaehler, et al. OpenCV. Dr. Dobb's Journal of Software Tools, 3(2), 2000.

  3. [3]

    RDD: Robust feature detector and descriptor using deformable transformer

    Gonglin Chen, Tianwen Fu, Haiwei Chen, Wenbin Teng, Hanyuan Xiao, and Yajie Zhao. RDD: Robust feature detector and descriptor using deformable transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6394–6403, 2025.

  4. [4]

    Masked-attention mask transformer for universal image segmentation

    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1290–1300, 2022.

  5. [5]

    SuperPoint: Self-supervised interest point detection and description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.

  6. [6]

    A streamlined attention-based network for descriptor extraction

    Mattia D'Urso, Emanuele Santellani, Christian Sormann, Mattia Rossi, Andreas Kuhn, and Friedrich Fraundorfer. A streamlined attention-based network for descriptor extraction. In 2026 International Conference on 3D Vision (3DV). IEEE Computer Society, 2026.

  7. [7]

    DeDoDe: Detect, don't describe – describe, don't detect for local feature matching

    Johan Edstedt, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. DeDoDe: Detect, don't describe – describe, don't detect for local feature matching. arXiv preprint arXiv:2308.08479, 2023.

  8. [8]

    RoMa: Robust dense feature matching

    Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. RoMa: Robust dense feature matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19790–19800, 2024.

  9. [9]

    RoMa v2: Harder better faster denser feature matching

    Johan Edstedt, David Nordström, Yushan Zhang, Georg Bökman, Jonathan Astermark, Viktor Larsson, Anders Heyden, Fredrik Kahl, Mårten Wadenbäck, and Michael Felsberg. RoMa v2: Harder better faster denser feature matching. arXiv preprint arXiv:2511.15706, 2025.

  10. [10]

    Image matching across wide baselines: From paper to practice

    Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2):517–547, 2021.

  11. [11]

    MapAnything: Universal feed-forward metric 3D reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. MapAnything: Universal feed-forward metric 3D reconstruction. arXiv preprint arXiv:2509.13414, 2025.

  12. [12]

    3D Gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.

  13. [13]

    Tanks and Temples: Benchmarking large-scale scene reconstruction

    Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and Temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017.

  14. [14]

    Grounding image matching in 3D with MASt3R

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision, pages 71–91. Springer, 2024.

  15. [15]

    FastMap: Revisiting dense and scalable structure from motion

    Jiahao Li, Haochen Wang, Muhammad Zubair Irshad, Igor Vasiljevic, Matthew R Walter, Vitor Campagnolo Guizilini, and Greg Shakhnarovich. FastMap: Revisiting dense and scalable structure from motion. arXiv preprint arXiv:2505.04612, 2025.

  16. [16]

    CVD-SfM: A cross-view deep front-end structure-from-motion system for sparse localization in multi-altitude scenes

    Yaxuan Li, Yewei Huang, Bijay Gaudel, Hamidreza Jafarnejadsani, and Brendan Englot. CVD-SfM: A cross-view deep front-end structure-from-motion system for sparse localization in multi-altitude scenes. arXiv preprint arXiv:2508.01936, 2025.

  17. [17]

    MegaDepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.

  18. [18]

    LightGlue: Local feature matching at light speed

    Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17627–17638, 2023.

  19. [19]

    Global structure-from-motion revisited

    Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes L Schönberger. Global structure-from-motion revisited. In European Conference on Computer Vision, pages 58–77. Springer, 2024.

  20. [20]

    Revisiting Oxford and Paris: Large-scale image retrieval benchmarking

    Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5706–5715, 2018.

  21. [21]

    S-TREK: Sequential translation and rotation equivariant keypoints for local feature extraction

    Emanuele Santellani, Christian Sormann, Mattia Rossi, Andreas Kuhn, and Friedrich Fraundorfer. S-TREK: Sequential translation and rotation equivariant keypoints for local feature extraction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9728–9737, 2023.

  22. [22]

    Benchmarking 6DOF outdoor visual localization in changing conditions

    Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, et al. Benchmarking 6DOF outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8601–8610, 2018.

  23. [23]

    Structure-from-motion revisited

    Johannes L Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.

  24. [24]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schöps, Johannes L Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3260–3269, 2017.

  25. [25]

    LoFTR: Detector-free local feature matching with transformers

    Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8922–8931, 2021.

  26. [26]

    24/7 place recognition by view synthesis

    Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1808–1817, 2015.

  27. [27]

    MegaScenes: Scene-level view synthesis at scale

    Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. MegaScenes: Scene-level view synthesis at scale. In ECCV, 2024.

  28. [28]

    DISK: Learning local features with policy gradient

    Michał Tyszkiewicz, Pascal Fua, and Eduard Trulls. DISK: Learning local features with policy gradient. Advances in Neural Information Processing Systems, 33:14254–14265, 2020.

  29. [29]

    AerialMegaDepth: Learning aerial-ground reconstruction and view synthesis

    Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. AerialMegaDepth: Learning aerial-ground reconstruction and view synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21674–21684, 2025.

  30. [30]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

  31. [31]

    Adaptive patch deformation for textureless-resilient multi-view stereo

    Yuesong Wang, Zhaojie Zeng, Tao Guan, Wei Yang, Zhuo Chen, Wenkai Liu, Luoyuan Xu, and Yawei Luo. Adaptive patch deformation for textureless-resilient multi-view stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1621–1630, 2025.

  32. [32]

    π³: Permutation-equivariant visual geometry learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025.

  33. [33]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

  34. [34]

    ALIKED: A lighter keypoint and descriptor extraction network via deformable transformation

    Xiaoming Zhao, Xingming Wu, Weihai Chen, Peter CY Chen, Qingsong Xu, and Zhengguo Li. ALIKED: A lighter keypoint and descriptor extraction network via deformable transformation. IEEE Transactions on Instrumentation and Measurement, 72:1–16, 2023.

  35. [35]

    Culture3D: Cultural landmarks and terrain dataset for 3D applications

    Xinyi Zheng, Steve Zhang, Weizhe Lin, Aaron Zhang, Walterio W Mayol-Cuevas, and Junxiao Shen. Culture3D: Cultural landmarks and terrain dataset for 3D applications. arXiv preprint arXiv:2501.06927, 2025.

  36. [36]

    University-1652: A multi-view multi-source benchmark for drone-based geo-localization

    Zhedong Zheng, Yunchao Wei, and Yi Yang. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1395–1403, 2020.