pith. machine review for the scientific record.

arxiv: 2604.09478 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.RO

Recognition: no theorem link

Incremental Semantics-Aided Meshing from LiDAR-Inertial Odometry and RGB Direct Label Transfer

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:51 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords semantic mesh reconstruction · LiDAR-inertial odometry · TSDF fusion · RGB label transfer · indoor 3D reconstruction · vision foundation model · incremental mapping

The pith

Direct transfer of semantic labels from RGB frames to LiDAR maps resolves boundary ambiguities and yields higher-quality 3D meshes than geometry-only methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that semantic labels extracted from RGB images can be projected and fused directly onto a LiDAR-inertial odometry map to guide mesh reconstruction in large indoor spaces. Pure geometric fusion often produces holes, over-smoothing, or spurious surfaces where LiDAR points are sparse or small positioning errors accumulate. By labeling each incoming RGB frame with a vision foundation model and incorporating those labels into an incremental TSDF fusion step, the method uses visual semantics to decide how to handle structural boundaries while preserving LiDAR geometric fidelity. This frame-level approach allows continuous updating of the mesh and produces results that outperform standard geometric baselines when measured by geometric metrics.
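The semantics-aware fusion step described above can be sketched in a few lines. The weighting scheme, truncation values, and array layout below are illustrative assumptions for this reading, not the paper's actual implementation; the idea is only that semantic boundary labels modulate how observations enter the TSDF.

```python
import numpy as np

def fuse_frame(tsdf, weights, sdf_obs, boundary, trunc=0.10, boundary_trunc=0.05):
    """Hypothetical semantics-aware TSDF update for one scan frame.

    tsdf, weights : flat per-voxel arrays (running signed distance, fusion weight)
    sdf_obs       : per-voxel signed distances observed in this frame (meters)
    boundary      : boolean mask, True where the projected RGB label marks
                    a structural boundary
    """
    # Tighter truncation band at semantically labeled boundaries: this is
    # the illustrative "semantics-aware" ingredient.
    t = np.where(boundary, boundary_trunc, trunc)
    sdf = np.clip(sdf_obs, -t, t) / t          # normalized truncated SDF in [-1, 1]
    # Emphasize boundary observations with a larger fusion weight (assumed).
    w_new = np.where(boundary, 2.0, 1.0)
    # Standard weighted running-average TSDF update.
    tsdf[:] = (weights * tsdf + w_new * sdf) / (weights + w_new)
    weights[:] += w_new
    return tsdf, weights
```

A mesh would then be extracted from `tsdf` by marching cubes, as in the paper's pipeline.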

Core claim

An incremental pipeline performs direct label transfer from vision foundation models on RGB frames onto a LiDAR-inertial odometry map, then applies semantics-aware TSDF fusion to generate meshes whose geometric quality exceeds that of purely geometric reconstruction methods.

What carries the argument

Semantics-aware TSDF fusion driven by direct projection of per-frame RGB labels onto the LiDAR map

If this is right

  • Semantic guidance reduces holes and spurious surfaces at structural boundaries caused by point-cloud sparsity and geometric drift.
  • The resulting meshes carry semantic labels that support direct use in creating USD assets for XR and digital modeling applications.
  • Incremental frame-by-frame processing supports continuous mesh updates as new scans arrive without restarting the reconstruction.
  • The modular design keeps the geometric precision of LiDAR while adding visual information only where geometry is ambiguous.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Replacing the vision foundation model with a newer or domain-adapted one could further reduce label noise at boundaries without changing the LiDAR pipeline.
  • The same label-transfer mechanism might allow lower-density LiDAR scans to achieve comparable mesh quality by relying more on semantics.
  • Extending the approach to sequences with varying lighting or moving objects would test whether the projection step remains stable over longer times.

Load-bearing premise

Projecting semantic labels from 2D RGB images onto the 3D LiDAR map can be done without introducing new errors at structural boundaries or from misalignment between the sensors.
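The premise can be made concrete with a standard pinhole projection of a map point into the labeled RGB frame. The intrinsics, extrinsics, and occlusion tolerance below are generic assumptions, not values from the paper; the point is that every one of these quantities is a place where the premise can fail.

```python
import numpy as np

def transfer_label(X_map, label_img, depth_img, K, T_cam_map, occ_thresh=0.05):
    """Project one 3D map point into the RGB frame and fetch its semantic label.

    X_map      : (3,) point in map coordinates
    label_img  : (H, W) per-pixel semantic labels from the VFM
    depth_img  : (H, W) depth used for the occlusion check
    K          : (3, 3) camera intrinsics
    T_cam_map  : (4, 4) map-to-camera extrinsics
    Returns the label, or None if the point is out of view or occluded.
    """
    X_cam = (T_cam_map @ np.append(X_map, 1.0))[:3]
    if X_cam[2] <= 0:                          # behind the camera
        return None
    u, v, _ = K @ (X_cam / X_cam[2])           # pinhole projection
    ui, vi = int(round(u)), int(round(v))
    H, W = label_img.shape
    if not (0 <= ui < W and 0 <= vi < H):      # outside the image
        return None
    if abs(depth_img[vi, ui] - X_cam[2]) > occ_thresh:
        return None                            # another surface is in front
    return int(label_img[vi, ui])
```

Any error in `T_cam_map` (drift, miscalibration) or in the depth test shifts labels across exactly the structural boundaries the method relies on.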

What would settle it

A controlled test that disables the semantic label fusion step entirely and measures whether the resulting geometric metrics become equal to or better than those obtained with the full pipeline.
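Such a controlled test could be scripted as below. The `reconstruct`, `accuracy`, and `completeness` callables are placeholders for whichever pipeline entry point and geometric metrics are actually used; only the toggle matters.

```python
def ablation_study(frames, reconstruct, accuracy, completeness):
    """Run the same pipeline with and without semantic fusion on identical input.

    reconstruct(frames, use_semantics) -> mesh is the assumed pipeline entry
    point; accuracy/completeness are geometric metrics against ground truth.
    If the semantics-off variant matches or beats the semantics-on variant,
    the semantic step carries no geometric benefit.
    """
    results = {}
    for use_semantics in (True, False):
        mesh = reconstruct(frames, use_semantics=use_semantics)
        results[use_semantics] = (accuracy(mesh), completeness(mesh))
    return results
```

Everything else (odometry, TSDF parameters, meshing) stays fixed across the two runs, so any metric gap is attributable to the semantic fusion step alone.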

Figures

Figures reproduced from arXiv: 2604.09478 by George Vosselman, Muhammad Affan, Ville Lehtola.

Figure 1. Overview of the proposed pipeline: (a) direct label transfer from VFM (OneFormer) segmented RGB images onto (b) …
Figure 2. The flowchart of the proposed method.
Figure 3. 3D reconstruction of the Oxford Spires (Christ Church College scene) dataset.
Figure 4. 3D reconstruction of the NTU VIRAL (NYA01 scene) dataset.
Figure 5. Analyzing the boundary uncertainty in the 3D reconstruction of the NTU VIRAL (NYA01) dataset.
read the original abstract

Geometric high-fidelity mesh reconstruction from LiDAR-inertial scans remains challenging in large, complex indoor environments -- such as cultural buildings -- where point cloud sparsity, geometric drift, and fixed fusion parameters produce holes, over-smoothing, and spurious surfaces at structural boundaries. We propose a modular, incremental RGB+LiDAR pipeline that generates incremental semantics-aided high-quality meshes from indoor scans through scan frame-based direct label transfer. A vision foundation model labels each incoming RGB frame; labels are incrementally projected and fused onto a LiDAR-inertial odometry map; and an incremental semantics-aware Truncated Signed Distance Function (TSDF) fusion step produces the final mesh via marching cubes. This frame-level fusion strategy preserves the geometric fidelity of LiDAR while leveraging rich visual semantics to resolve geometric ambiguities at reconstruction boundaries caused by LiDAR point-cloud sparsity and geometric drift. We demonstrate that semantic guidance improves geometric reconstruction quality; quantitative evaluation is therefore performed using geometric metrics on the Oxford Spires dataset, while results from the NTU VIRAL dataset are analyzed qualitatively. The proposed method outperforms state-of-the-art geometric baselines ImMesh and Voxblox, demonstrating the benefit of semantics-aided fusion for geometric mesh quality. The resulting semantically labelled meshes are of value when reconstructing Universal Scene Description (USD) assets, offering a path from indoor LiDAR scanning to XR and digital modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a modular incremental pipeline for high-fidelity mesh reconstruction that fuses LiDAR-inertial odometry with semantic labels obtained via direct transfer from RGB frames labeled by a vision foundation model. Labels are projected onto the LiDAR map and incorporated into an incremental semantics-aware TSDF fusion process, followed by marching cubes meshing. The central claim is that semantic guidance resolves ambiguities from point-cloud sparsity and drift, yielding superior geometric mesh quality compared to pure geometric baselines (ImMesh, Voxblox) as measured by geometric metrics on the Oxford Spires dataset, with qualitative results on NTU VIRAL; the output meshes are also semantically labeled for downstream USD/XR use.

Significance. If the claimed geometric improvements are substantiated with detailed quantitative evidence, the approach would offer a practical way to enhance indoor reconstruction fidelity by leveraging readily available visual semantics without sacrificing LiDAR geometric accuracy, with direct value for cultural-heritage modeling and XR asset pipelines.

major comments (2)
  1. [Abstract / Evaluation] Abstract and evaluation section: the claim that the method 'outperforms' ImMesh and Voxblox on geometric metrics is stated without any numerical values, tables, error bars, or statistical tests, which is load-bearing for the central claim that semantics-aided fusion improves geometric quality.
  2. [Methodology / §4] Methodology and fusion description: no ablation studies isolate the contribution of the semantics-aware TSDF step versus the underlying LiDAR-inertial odometry or label-projection accuracy, and no details are given on how TSDF truncation distances or fusion weights (listed as free parameters) were selected or held constant across baselines.
minor comments (1)
  1. [Pipeline overview] The description of 'direct label transfer' would benefit from an explicit equation or diagram showing the projection from RGB pixel to LiDAR map point, including handling of occlusions or depth discontinuities.
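One plausible shape for the requested equation, assuming a standard pinhole model with intrinsics $K$ and map-to-camera extrinsics $T$ (generic notation, not the paper's):

```latex
\begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
\sim K \,\Pi\, T_{\mathrm{cam}\leftarrow\mathrm{map}}
\begin{pmatrix} \mathbf{X} \\ 1 \end{pmatrix},
\qquad
\ell(\mathbf{X}) =
\begin{cases}
L(u,v), & \bigl| z_{\mathrm{cam}}(\mathbf{X}) - D(u,v) \bigr| < \tau,\\[2pt]
\text{rejected}, & \text{otherwise (occluded or depth discontinuity)},
\end{cases}
```

where $\Pi = [\,I_3 \mid 0\,]$, $L$ is the VFM label image, $D$ the depth used for the occlusion test, and $\tau$ a tolerance; points failing the test would be left unlabeled.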

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the quantitative support for our claims and the clarity of our methodological choices. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and evaluation section: the claim that the method 'outperforms' ImMesh and Voxblox on geometric metrics is stated without any numerical values, tables, error bars, or statistical tests, which is load-bearing for the central claim that semantics-aided fusion improves geometric quality.

    Authors: We agree that the central claim requires explicit numerical backing. The evaluation section of the manuscript does report geometric metrics on the Oxford Spires dataset, but we acknowledge that the abstract and the presentation lack concrete values, tables, and error bars. In the revision we will (i) insert key numerical results (e.g., mean and standard-deviation improvements in surface accuracy and completeness) directly into the abstract and (ii) add a dedicated results table in Section 5 that compares our method against ImMesh and Voxblox with error bars where multiple runs or cross-validation folds are available. Formal statistical significance tests were not performed in the original submission; we will add them if the dataset permits, or at minimum report confidence intervals. revision: yes

  2. Referee: [Methodology / §4] Methodology and fusion description: no ablation studies isolate the contribution of the semantics-aware TSDF step versus the underlying LiDAR-inertial odometry or label-projection accuracy, and no details are given on how TSDF truncation distances or fusion weights (listed as free parameters) were selected or held constant across baselines.

    Authors: We accept that the current manuscript does not isolate the semantics-aware TSDF contribution via ablation and provides insufficient detail on parameter selection. In the revised version we will add an ablation experiment that runs the identical LiDAR-inertial odometry pipeline with and without the semantics-aware TSDF weighting, thereby isolating the effect of semantic guidance. We will also document the exact truncation distances and fusion weights used, state that they were selected on a held-out validation subset of Oxford Spires, and confirm that the same values were applied to all baselines to ensure fair comparison. A short note on the sensitivity of label-projection accuracy will be included as well. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes a modular incremental pipeline that transfers labels from a vision foundation model to a LiDAR-inertial map and performs semantics-aware TSDF fusion, with the central claim evaluated empirically via geometric metrics against independent external baselines (ImMesh, Voxblox) on the Oxford Spires dataset. No equations, fitted parameters, or self-citations are shown to reduce any prediction or uniqueness claim to the inputs by construction; the method is presented as an engineering composition whose benefit is tested externally rather than derived internally from its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from LiDAR-inertial odometry and TSDF fusion plus the reliability of an external vision foundation model for accurate label transfer; no new entities are postulated.

free parameters (1)
  • TSDF truncation and fusion weights
    These are typically tuned in semantics-aware fusion but not specified as fitted values here.
axioms (1)
  • domain assumption: The vision foundation model produces reliable semantic labels transferable to LiDAR geometry without distortion from drift or calibration errors.
    Invoked to justify that labels resolve geometric ambiguities at boundaries.

pith-pipeline@v0.9.0 · 5545 in / 1324 out tokens · 89359 ms · 2026-05-10T17:51:23.627857+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

4 extracted references · 1 canonical work page

  1. [1] OpenHeritage3D: Building an Open Visual Archive for Site Scale Giga-Resolution LiDAR and Photogrammetry Data. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, X-M-1-2023, 215–222. https://isprs-annals.copernicus.org/articles/X-M-1-2023/215/2023/

  2. [2] Nakajima, Y., Sucar, E., James, S., Davison, A. J., Tateno, K., …

  3. [3] Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A., 2011. KinectFusion: Real-time dense surface mapping …

  4. [4] Pixar Animation Studios, 2023. Universal Scene Description (USD). https://openusd.org/release/index.html
    V oxblox: Incremental 3d euclidean signed distance fields for on-board mav planning.2017 IEEE/RSJ International Con- ference on Intelligent Robots and Systems (IROS), 1366–1373. Pixar Animation Studios, 2023. Universal Scene Description (USD).https://openusd.org/release/index.html. Reijgwart, V ., Millane, A., Oleynikova, H., Siegwart, R., Ca- dena, C., N...