M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

Francesco Nex; George Vosselman; U.V.B.L Udugama

arxiv: 2510.17363 · v1 · pith:UNYCFHA4new · submitted 2025-10-20 · 💻 cs.CV · cs.LG · cs.RO

U.V.B.L Udugama, George Vosselman, Francesco Nex

Pith reviewed 2026-05-21 21:06 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · cs.RO

keywords multi-task learning · cross-task attention · monocular depth estimation · semantic segmentation · surface normal estimation · edge detection · real-time spatial perception

The pith

M2H uses window-based cross-task attention to combine semantic segmentation, depth, edge, and normal estimation from single images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces M2H as a multi-task model that processes monocular images for semantic segmentation, depth estimation, edge detection, and surface normal prediction. The core idea is a Window-Based Cross-Task Attention Module that allows tasks to share information in local windows while keeping their specific features intact. This setup is meant to make predictions more consistent across tasks and run efficiently on hardware like laptops. A reader would care if it enables better real-time spatial awareness for applications like building 3D scene graphs in changing scenes.

Core claim

M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments.

What carries the argument

Window-Based Cross-Task Attention Module for structured yet efficient feature exchange between tasks.
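
A minimal NumPy sketch of the idea may help make this concrete. It follows the description quoted elsewhere on this page (task feature maps partitioned into non-overlapping p×p windows, tokens concatenated across tasks, multi-head cross-attention), but it is not the authors' implementation. The function names, the identity Q/K/V projections (no learned weights), and the residual connection used to keep task-specific features are all assumptions made for illustration.

```python
import numpy as np

def window_partition(x, p):
    """Split an (H, W, C) feature map into (num_windows, p*p, C) token groups."""
    H, W, C = x.shape
    assert H % p == 0 and W % p == 0, "H and W must be divisible by p"
    x = x.reshape(H // p, p, W // p, p, C)
    # Group the two window-grid axes first, then flatten each p x p window.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, C)

def window_merge(windows, p, H, W):
    """Inverse of window_partition: (num_windows, p*p, C) -> (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // p, W // p, p, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(tokens, num_heads):
    """(nW, L, C) -> (nW, num_heads, L, C // num_heads)."""
    nW, L, C = tokens.shape
    return tokens.reshape(nW, L, num_heads, C // num_heads).transpose(0, 2, 1, 3)

def windowed_cross_task_attention(features, p, num_heads=2):
    """features: dict mapping task name -> (H, W, C) array of equal shape.

    Assumption: identity Q/K/V projections and a residual connection;
    a trained module would use learned projections instead.
    """
    tasks = list(features)
    H, W, C = features[tasks[0]].shape
    assert C % num_heads == 0, "C must be divisible by num_heads"
    d = C // num_heads
    wins = {t: window_partition(features[t], p) for t in tasks}
    # Keys/values: tokens of ALL tasks that fall in the same window.
    kv = split_heads(np.concatenate([wins[t] for t in tasks], axis=1), num_heads)
    out = {}
    for t in tasks:
        q = split_heads(wins[t], num_heads)                  # (nW, h, p*p, d)
        attn = softmax(q @ kv.transpose(0, 1, 3, 2) / np.sqrt(d))
        mixed = (attn @ kv).transpose(0, 2, 1, 3).reshape(wins[t].shape)
        # Residual keeps each task's own features in its output.
        out[t] = features[t] + window_merge(mixed, p, H, W)
    return out

# Usage with random stand-in features for the four tasks.
rng = np.random.default_rng(0)
feats = {t: rng.standard_normal((8, 8, 4))
         for t in ["semantic", "depth", "edge", "normal"]}
out = windowed_cross_task_attention(feats, p=4)
print(out["depth"].shape)  # (8, 8, 4)
```

Because attention is computed per window, a change to one task's features inside a window can only affect outputs inside that same window, which is the sense in which the exchange is local rather than global.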

If this is right

M2H outperforms state-of-the-art multi-task models on NYUDv2.
It surpasses single-task depth and semantic baselines on Hypersim.
M2H shows superior performance on Cityscapes while remaining computationally efficient.
Validation on real-world data supports its use in practical spatial perception.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the module works as claimed, similar attention designs could help other multi-task setups in vision without custom tuning for each combination of tasks.
Consistent outputs across tasks may simplify fusion into 3D models for robotics or AR applications.
Further tests on video sequences could check if the frame-by-frame consistency holds in dynamic environments.

Load-bearing premise

The window-based cross-task attention module exchanges complementary information across tasks without introducing substantial interference or needing heavy tuning.

What would settle it

Disabling the cross-task attention and measuring whether consistency metrics on NYUDv2 fall to levels seen in independent task models would test the claim.

Original abstract

Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

M2H adds a window-based cross-task attention module to a DINOv2 backbone for joint semantic, depth, edge, and normal prediction, but the reported gains are not clearly separated from the backbone itself.

The main takeaway is that this paper builds a multi-task model on DINOv2 that uses windowed attention to let semantic segmentation, depth, edges, and normals exchange features without full global mixing. It reports stronger numbers than prior multi-task baselines on NYUDv2, Hypersim, and Cityscapes while claiming laptop-level speed and some real-world checks. That combination of tasks and the efficiency focus is the practical angle for robotics or mapping work.

The architecture description and the link to downstream 3D scene graphs are the parts that feel most grounded. The real-world validation also helps show the setup is not just benchmark-tuned.

The soft spot is the missing isolation of the new module. The stress-test concern holds up on the evidence given: if the gains come mostly from the strong pre-trained ViT backbone plus standard heads, then the claim that the window attention specifically improves consistency and avoids interference is not yet demonstrated. The abstract gives no ablations against the same backbone with independent heads or simple fusion, and no error bars or split details appear. That leaves the central efficiency and consistency story resting on comparisons that could be driven by training protocol or backbone choice rather than the attention design.

This paper is for groups already working on efficient multi-task vision for edge devices or monocular 3D perception. A reader who needs a concrete starting point for four-task models on DINOv2 would find the implementation choices and dataset results useful. It is coherent enough and has enough experimental coverage to deserve referee time, though any review should press for the missing controls on the attention module.

Referee Report

2 major / 2 minor

Summary. The paper introduces M2H, a multi-task learning framework for monocular spatial perception performing semantic segmentation, depth estimation, edge detection, and surface normal estimation from a single image. It proposes a Window-Based Cross-Task Attention Module on a lightweight DINOv2 ViT backbone to enable efficient structured feature exchange across tasks while preserving task-specific details, claiming improved prediction consistency. Evaluations report outperformance over SOTA multi-task models on NYUDv2, superiority to single-task baselines on Hypersim, strong results on Cityscapes, real-world validation, and suitability for real-time deployment supporting downstream 3D scene graph construction.

Significance. If the central claims hold, particularly that the Window-Based Cross-Task Attention Module provides the structured exchange and consistency gains without substantial interference, this could meaningfully advance efficient multi-task models for real-time monocular perception on edge devices. Credit is due for the practical focus on laptop hardware deployment and real-world testing, which strengthens applicability to dynamic environments. The lightweight DINOv2 backbone choice is a sensible efficiency anchor, but the overall significance hinges on confirming the module's specific contribution beyond the backbone's features.

major comments (2)

[§4 and §5] §4 (Model Architecture) and §5 (Experiments): The load-bearing claim that the Window-Based Cross-Task Attention Module enables structured cross-task feature exchange improving consistency (without substantial interference) lacks isolating ablations, such as M2H versus the identical DINOv2 backbone with independent heads or simple concatenation fusion. Without these, outperformance on NYUDv2, Hypersim, and Cityscapes cannot be confidently attributed to the novel module rather than the pre-trained backbone or training protocol.
[§5.1] §5.1 (Quantitative Results): No error bars, standard deviations across runs, or dataset split details are provided for the reported metrics on NYUDv2, Hypersim, and Cityscapes. This undermines assessment of whether the claimed superiority over SOTA multi-task models and single-task baselines is statistically reliable.

minor comments (2)

[Abstract] Abstract: The phrase 'comprehensive evaluations' would benefit from briefly naming the primary metrics (e.g., mIoU, RMSE) and the magnitude of improvements to give readers an immediate quantitative sense of the gains.
[Tables/Figures] Figure captions and tables: Ensure all tables reporting benchmark comparisons include the exact backbone and training settings for each compared method to facilitate direct reproducibility.
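
For readers unfamiliar with the two metrics the referee names as examples, the following sketch shows their standard definitions on toy inputs. The function names and the toy values are illustrative assumptions, not figures or code from the paper, and the paper may report additional or different metrics.

```python
import math

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

def rmse(pred, target):
    """Root-mean-square error between predicted and reference values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

# Toy inputs only: flattened per-pixel labels and depths, not real data.
toy_pred_labels = [0, 0, 1, 1, 2, 2]
toy_true_labels = [0, 1, 1, 1, 2, 0]
print(mean_iou(toy_pred_labels, toy_true_labels, num_classes=3))  # 0.5

toy_pred_depth = [1.0, 2.0, 4.0]
toy_true_depth = [1.0, 3.0, 2.0]
print(rmse(toy_pred_depth, toy_true_depth))
```

Higher mIoU is better for segmentation, while lower RMSE is better for depth, so stating both the metric and the direction of improvement in the abstract would address the comment.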

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's feedback, which highlights important aspects for improving the clarity and rigor of our work. Below, we provide point-by-point responses to the major comments and describe the changes we will implement in the revised manuscript.

Referee: [§4 and §5] §4 (Model Architecture) and §5 (Experiments): The load-bearing claim that the Window-Based Cross-Task Attention Module enables structured cross-task feature exchange improving consistency (without substantial interference) lacks isolating ablations, such as M2H versus the identical DINOv2 backbone with independent heads or simple concatenation fusion. Without these, outperformance on NYUDv2, Hypersim, and Cityscapes cannot be confidently attributed to the novel module rather than the pre-trained backbone or training protocol.

Authors: We concur that isolating the effect of the Window-Based Cross-Task Attention Module through targeted ablations is necessary to substantiate our claims. Accordingly, we will include in the revised manuscript new ablation experiments. Specifically, we will evaluate (i) the DINOv2 backbone equipped with independent heads for each task (no cross-task attention), and (ii) a fusion variant relying on simple concatenation of features from different task branches. These comparisons will allow us to attribute performance differences more confidently to the proposed module's structured exchange mechanism rather than the backbone or training setup alone.
Revision: yes

Referee: [§5.1] §5.1 (Quantitative Results): No error bars, standard deviations across runs, or dataset split details are provided for the reported metrics on NYUDv2, Hypersim, and Cityscapes. This undermines assessment of whether the claimed superiority over SOTA multi-task models and single-task baselines is statistically reliable.

Authors: We recognize the importance of reporting variability in results for assessing statistical reliability. In the updated version of the paper, we will perform the training and evaluation multiple times using different random seeds and report the average metrics with standard deviations for NYUDv2, Hypersim, and Cityscapes. Furthermore, we will add a detailed description of the data splits employed in our experiments to facilitate reproducibility and enable readers to better evaluate the significance of the reported improvements.
Revision: yes
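
The reporting promised in the second response could look roughly like the sketch below, which aggregates one metric over several seeds into a mean and a sample standard deviation. The helper names are hypothetical and the scores are placeholders chosen for easy arithmetic; they are not results from the paper.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) for one metric across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)  # stdev uses n - 1

def format_mean_std(scores, digits=2):
    """Format scores as 'mean ± std' for a results table."""
    mean, std = summarize_runs(scores)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"

# Placeholder scores for three seeds (NOT numbers from the paper).
placeholder_scores = [1.0, 2.0, 3.0]
print(format_mean_std(placeholder_scores))  # 2.00 ± 1.00
```

The sample standard deviation (dividing by n - 1) is used here because a handful of seeds is a small sample; whichever convention is chosen, the table caption should state it along with the number of runs.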

Circularity Check

0 steps flagged

No significant circularity; empirical architecture evaluation

The paper introduces the M2H multi-task framework and Window-Based Cross-Task Attention Module as an architectural proposal, with performance claims resting entirely on direct empirical comparisons against baselines on NYUDv2, Hypersim, and Cityscapes. No equations, derivations, or first-principles predictions appear in the abstract or described content that could reduce by construction to fitted parameters, self-definitions, or self-citation chains. The work is therefore self-contained against external benchmarks, with no load-bearing steps that equate outputs to inputs via the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the introduced attention module and the suitability of the DINOv2 backbone for feature sharing; no explicit free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption: Pre-trained ViT models such as DINOv2 provide transferable features suitable for multiple dense prediction tasks
The architecture is built directly on this backbone without further justification in the abstract.

invented entities (1)

Window-Based Cross-Task Attention Module (no independent evidence)
purpose: Structured feature exchange across tasks while preserving task-specific details
Presented as the key novel component enabling the multi-task performance.

pith-pipeline@v0.9.0 · 5748 in / 1506 out tokens · 50034 ms · 2026-05-21T21:06:20.230074+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Windowed Multi-Task Cross-Attention block (WMCA) ... each task-specific feature map ... partitioned into small non-overlapping windows of size p×p ... concatenated across tasks ... multi-head cross-attention

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details
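
The windowed design in the first quoted passage also suggests why the module can be described as efficient. The rough count below compares attention-score cost for joint attention over all tokens of all tasks against attention restricted to p×p windows. The symbols T (number of tasks), N (tokens per task), and d (channel width) are assumptions introduced here, projection costs are ignored, and the paper's own complexity analysis, if any, may differ.

```latex
% Assumed symbols (not from the paper): T tasks, N tokens per task,
% channel width d, window size p x p. Projection costs are ignored.
\begin{align*}
\text{global joint attention:} \quad & (TN)^2 \, d = T^2 N^2 d \\
\text{windowed attention:} \quad & \frac{N}{p^2} \cdot (T p^2)^2 \, d = T^2 N p^2 d \\
\text{ratio (global / windowed):} \quad & \frac{T^2 N^2 d}{T^2 N p^2 d} = \frac{N}{p^2}
\end{align*}
```

Under these assumptions the saving equals the number of windows, N / p^2, so the windowed cost grows linearly rather than quadratically with the number of tokens per task.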

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping
cs.RO · 2026-05 · unverdicted · novelty 5.0

Mono-Hydra++ is a monocular RGB-IMU pipeline that constructs hierarchical 3D scene graphs in real time while reporting lower trajectory error than some RGB-D baselines on indoor datasets.