pith. sign in

arxiv: 2506.08013 · v2 · submitted 2025-06-09 · 💻 cs.CV · cs.AI· cs.LG

StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets

Pith reviewed 2026-05-19 10:07 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords multi-task learninglatent diffusion modelspartial annotationssynthetic datasetsdense predictiontask attentionlatent regressioncomputer vision
0
0 comments X

The pith

Repurposing latent diffusion models enables multi-task learning from synthetic datasets labeled for only subsets of tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that the generalization built into pre-trained latent diffusion models can be harnessed for multi-task dense prediction even when every training dataset carries labels for only some of the tasks. It achieves this by converting the denoising process into latent regression through task encoding, per-task conditioning, and one unified latent loss that avoids balancing separate per-task terms. A multi-stream architecture with task-attention replaces full N-to-N interactions with efficient 1-to-N attention to let tasks share useful features. A reader would care because dense labels for segmentation, depth, normals and similar tasks are expensive on real images, while synthetic data can be produced at scale. If the claim holds, multi-task models become practical to train on larger numbers of tasks without demanding complete annotations everywhere.

Core claim

StableMTL repurposes image generators for latent regression by adapting a denoising framework with task encoding, per-task conditioning and a tailored training scheme. Instead of per-task losses, a unified latent loss is used. A multi-stream model with task-attention converts N-to-N task interactions into efficient 1-to-N attention to promote cross-task synergy. The resulting model is trained on multiple synthetic datasets each supplying labels for only a subset of tasks and outperforms baselines on seven tasks across eight benchmarks.

What carries the argument

The multi-stream model with task-attention that turns expensive N-to-N cross-task interactions into efficient 1-to-N attention for inter-task synergy.

If this is right

  • Adding more tasks requires no extra loss-balancing effort because a single unified latent loss is used.
  • Multiple synthetic datasets can be combined even when no dataset labels all tasks at once.
  • Task-attention lets each task benefit from features learned for the others without explicit pairing.
  • The zero-shot partial-label setup removes the need for any single dataset to carry complete annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar conditioning and attention changes might let other generative models handle partial-label multi-task training.
  • Evaluating the trained model directly on real images with partial labels would test transfer beyond synthetic data.
  • The 1-to-N attention pattern could be reused in other multi-task settings that involve many output heads.

Load-bearing premise

The generalization power of pre-trained latent diffusion models is sufficient to support zero-shot extension of partial-label training when each synthetic dataset supplies labels for only a subset of tasks.

What would settle it

Train an identical architecture from random weights rather than from pre-trained diffusion weights and check whether performance on the eight benchmarks still exceeds the reported baselines.

Figures

Figures reproduced from arXiv: 2506.08013 by Anh-Quan Cao, Ivan Lopes, Raoul de Charette.

Figure 1
Figure 1. Figure 1: StableMTL output on unseen real-world data. StableMTL demonstrates robust general￾ization to real-world data, despite being trained on partially labeled synthetic datasets. * Note that semantic is trained on driving classes and is not expected to generalize to unseen classes. In practice, our method StableMTL, extends deterministic single-step LDMs [25] to the partially labeled multi-task setting with task… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of StableMTL. Our pipeline comprises two training stages. In the first stage (Sec. 3.1), we fine-tune a UNet (Uθ,τ ) to predict target annotation latents from input image latents, conditioned on multi-task tokens sampled via our training scheme to isolate task gradient (see [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Proposed task attention. In addition to standard spatial and cross-attention mechanisms, our transformer blocks in the main UNet incorporate multi-stream information from auxiliary tasks. This is achieved by connecting the dedicated frozen single-stream UNet (Uθ,τ ) to the main UNet (Uϕ,T ), providing the latter with auxiliary features. Uθ,τ is kept frozen. 5 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison. We compare against original baseline versions as these are better performing (cf . Tab. 2) than the adapted full setting variants on the three tasks displayed. StableMTL demonstrates superior qualitative results. To accommodate two-frame inputs, we triple the number of input channels in the first convolution of Uϕ,T . We initialize the expanded weights and dividing them by three as … view at source ↗
Figure 5
Figure 5. Figure 5: Task-gradient isolation strategy. In (a), we report the performance w/ and w/o our isolation strategy, showing that it drastically benefits some tasks (e.g., semantic) and improves the overall ∆m metric. (b) shows that when removing gradient isolation, tasks with smaller gradient magnitudes are overwhelmed by those with larger ones, leading to a significant performance drop. 8 [PITH_FULL_IMAGE:figures/ful… view at source ↗
Figure 7
Figure 7. Figure 7: Task attention scores. Strong interactions are observed not only among tasks with known mutual benefits but also one-way interactions, as detailed in the text. Strategy Prob. (ρ) Semantic Normal Depth Opt. Flow Scene Flow Shading Albedo MTL Perf. mIoU %↑ mAE °↓ AbsRel %↓ AbsRel %↓ EPE-2D px↓ EPE-3D m↓ RMSE↓ RMSE↓ ∆m% ↑ Cityscapes DIODE KITTI DIODE KITTI KITTI MID MID Avg Sample(πT ) 0.0 54.90 22.88 14.90 3… view at source ↗
Figure 8
Figure 8. Figure 8: Single-stream architecture. During stage 1 (Sec. 3.1), we fine-tune a UNet (Uθ,τ ) to perform latent regression. It is then used during stage 2 (Sec. 3.2) as an auxiliary stream to provide task features. We use arbitrary text prompts to identify each task, [prompt]τ ∈ {"normal", "depth", . . . }. Task prompts are passed through a CLIP text encoder to retrieve their corresponding task tokens: cτ = CLIP([pro… view at source ↗
Figure 9
Figure 9. Figure 9: Task attention scores in the U-Net (Uϕ,T ) (last layer of each encoder/decoder block shown). Attention becomes more peaky in deeper layers and highlights beneficial cross-task relationships. A.3 Training details For our method, the single stream UNet Uθ is initialized with weights from Stable Diffusion v2 [50] and trained for 20,000 steps (8 hours). The main stream UNet Uϕ trains for another 10,000 steps (… view at source ↗
Figure 10
Figure 10. Figure 10: highlights that sharing projection layers across tasks results in highly repetitive attention score patterns. Such patterns may contribute to a decline in multi-task performance, a shown by the row "w/o separate (qt, kt, vt)" in Tab. 4. Semantic Normal Depth Opt. Flow Sc. Flow Shading Albedo [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Additional qualitative results on real-world data. Despite being trained on partially an￾notated synthetic datasets, StableMTL demonstrates generalization to multi-task real-world scenarios. * Note that semantic is trained on driving classes and is not expected to generalize to unseen classes. both the magnitude and vz. For normal visualization, surface normal XYZ coordinates are directly mapped to RGB sp… view at source ↗
Figure 12
Figure 12. Figure 12: Flow color mappings. We visualize the mapping used to visualize (a) optical flow and (b) scene flow. VKITTI 2 ours Cityscapes ours Color terrain ignore road road ■ sky sky sidewalk ignore ■ tree vegetation building building ■ vegetation vegetation wall vegetation ■ building building fence ignore ■ road road pole pole ■ guardrail ignore light light ■ sign sign sign sign ■ light light vegetation vegetation … view at source ↗
read the original abstract

Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, though recent works have explored training with partial task labels. Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. Our method, StableMTL, repurposes image generators for latent regression. Adapting a denoising framework with task encoding, per-task conditioning and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient 1-to-N attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces StableMTL, a method that repurposes pre-trained latent diffusion models for multi-task dense prediction by training on multiple synthetic datasets, each providing labels for only a subset of tasks. It adapts the denoising framework via task encoding and per-task conditioning, replaces per-task losses with a unified latent loss, and employs a multi-stream architecture with a task-attention mechanism that converts N-to-N interactions into efficient 1-to-N attention to promote cross-task synergy. The approach is evaluated on 7 tasks across 8 benchmarks and reported to outperform baselines.

Significance. If the results hold, the work is significant because it shows how the generalization properties of pre-trained latent diffusion models can be leveraged for zero-shot partial-label multi-task regression on synthetic data, removing the need for explicit per-task loss balancing and enabling more scalable task addition. Credit is given for the extensive empirical evaluation across 8 benchmarks, which provides concrete, falsifiable support for the claimed outperformance and for the design choices in the unified loss and task-attention components.

major comments (2)
  1. [§3.3] §3.3 (Unified Latent Loss): the claim that the single latent loss removes the need for per-task balancing is central, yet the formulation appears to retain task-specific conditioning weights; an explicit derivation or ablation showing invariance to task weighting would be required to substantiate the scaling advantage.
  2. [§4.2] §4.2 (Ablation on partial labels): the zero-shot partial-label premise is load-bearing for the entire setup, but the reported gains on missing-task subsets are not isolated from the contribution of the pre-trained LDM features; a controlled ablation that freezes the latent encoder while varying the fraction of missing labels per dataset would directly test whether cross-task attention recovers the signals or whether performance relies on already-encoded features.
minor comments (2)
  1. [Figure 3] Figure 3: the task-attention diagram would benefit from explicit notation for the 1-to-N reduction (e.g., query/key/value dimensions) to clarify computational savings relative to standard multi-head attention.
  2. [Table 1] Table 1: baseline descriptions should include the exact loss-balancing strategy used for each competing method so that the advantage of the unified latent loss can be directly compared.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. Below we address each major comment point by point and describe the revisions we will make.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (Unified Latent Loss): the claim that the single latent loss removes the need for per-task balancing is central, yet the formulation appears to retain task-specific conditioning weights; an explicit derivation or ablation showing invariance to task weighting would be required to substantiate the scaling advantage.

    Authors: We appreciate the referee's observation. Task-specific conditioning weights are used solely to inject task identity into the diffusion conditioning mechanism. The training objective itself remains a single unified loss computed directly in latent space and does not involve any per-task loss terms or explicit weighting coefficients that would require balancing. We will add a short derivation of the gradient of this unified loss with respect to the network parameters to show that no task-specific loss weights appear. We will also include an ablation that varies the magnitude of the conditioning weights while keeping the loss formulation fixed, demonstrating that performance is largely invariant and thereby supporting the claimed scaling advantage. revision: yes

  2. Referee: [§4.2] §4.2 (Ablation on partial labels): the zero-shot partial-label premise is load-bearing for the entire setup, but the reported gains on missing-task subsets are not isolated from the contribution of the pre-trained LDM features; a controlled ablation that freezes the latent encoder while varying the fraction of missing labels per dataset would directly test whether cross-task attention recovers the signals or whether performance relies on already-encoded features.

    Authors: We agree that isolating the contribution of the cross-task attention from the pre-trained latent features is important for validating the zero-shot partial-label premise. We will add a controlled ablation in which the latent encoder is frozen and the fraction of missing labels per dataset is systematically varied. The results of this experiment will be reported to clarify whether the task-attention mechanism enables recovery of signals for missing tasks beyond what is already present in the frozen pre-trained representations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pre-trained model generalization

full rationale

The paper's method description in the abstract and skeptic summary introduces task encoding, per-task conditioning, unified latent loss, and task-attention as adaptations of a denoising framework for partial-label multi-task regression. No equations, fitting procedures, or self-citations are exhibited that reduce any claimed prediction or result to an input defined by the same claim. The load-bearing premise is the generalization power of pre-trained latent diffusion models, treated as an external property rather than derived internally. This qualifies as a self-contained engineering contribution without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Report based solely on abstract; no explicit free parameters, mathematical axioms, or invented entities are stated. The central premise relies on an unelaborated domain assumption about diffusion-model generalization.

axioms (1)
  • domain assumption Leveraging the generalization power of diffusion models allows extension of partial learning to zero-shot setting with synthetic datasets each labeled for only a subset of tasks.
    Invoked in the abstract as the foundation for the entire StableMTL approach.

pith-pipeline@v0.9.0 · 5693 in / 1357 out tokens · 50124 ms · 2026-05-19T10:07:19.896466+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages

  1. [1]

    In: ICCV (2023)

    Aich, A., Schulter, S., Roy-Chowdhury, A.K., Chandraker, M., Suh, Y .: Efficient controllable multi-task architectures. In: ICCV (2023)

  2. [2]

    In: CVPR (2021)

    Aleotti, F., Poggi, M., Mattoccia, S.: Learning optical flow from still images. In: CVPR (2021)

  3. [3]

    In: AAAI (2021)

    Argaw, D.M., Kim, J., Rameau, F., Cho, J.W., Kweon, I.S.: Optical flow estimation from a single motion-blurred image. In: AAAI (2021)

  4. [4]

    In: CVPR (2024)

    Bae, G., Davison, A.J.: Rethinking inductive biases for surface normal estimation. In: CVPR (2024)

  5. [5]

    In: ICCV (2007)

    Baker, S., Roth, S., Scharstein, D., Black, M.J., Lewis, J., Szeliski, R.: A database and evaluation methodology for optical flow. In: ICCV (2007)

  6. [6]

    In: CVPR (2023)

    Borse, S., Das, D., Park, H., Cai, H., Garrepalli, R., Porikli, F.: Dejavu: Conditional regenerative learning to enhance dense prediction. In: CVPR (2023)

  7. [7]

    In: ICCV (2021)

    Brüggemann, D., Kanakis, M., Obukhov, A., Georgoulis, S., Van Gool, L.: Exploring relational context for multi-task dense prediction. In: ICCV (2021)

  8. [8]

    In: arXiv (2020)

    Cabon, Y ., Murray, N., Humenberger, M.: Virtual kitti 2. In: arXiv (2020)

  9. [9]

    ACM TOG (2023)

    Careaga, C., Aksoy, Y .: Intrinsic image decomposition via ordinal shading. ACM TOG (2023)

  10. [10]

    ACM TOG (2024)

    Careaga, C., Aksoy, Y .: Colorful diffuse intrinsic image decomposition in the wild. ACM TOG (2024)

  11. [11]

    In: ICCV (2023)

    Chen, T., Chen, X., Du, X., Rashwan, A., Yang, F., Chen, H., Wang, Z., Li, Y .: AdaMV-MoE: Adaptive multi-task vision mixture-of-experts. In: ICCV (2023)

  12. [12]

    In: ICML (2018)

    Chen, Z., Badrinarayanan, V ., Lee, C.Y ., Rabinovich, A.: Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In: ICML (2018)

  13. [13]

    In: NeurIPS (2020)

    Chen, Z., Ngiam, J., Huang, Y ., Luong, T., Kretzschmar, H., Chai, Y ., Anguelov, D.: Just pick a sign: Optimizing deep multitask models with gradient sign dropout. In: NeurIPS (2020)

  14. [14]

    In: CVPR (2023)

    Choi, W., Im, S.: Dynamic neural network for multi-task learning searching across diverse network topologies. In: CVPR (2023)

  15. [15]

    In: CVPR (2016)

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)

  16. [16]

    In: NeurIPS (2014)

    Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NeurIPS (2014)

  17. [17]

    In: NeurIPS (2022)

    Fan, Z., Sarkar, R., Jiang, Z., Chen, T., Zou, K., Cheng, Y ., Hao, C., Wang, Z., et al.: M3VIT: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design. In: NeurIPS (2022)

  18. [18]

    Proceedings of the IEEE (2024)

    Fontana, M., Spratling, M., Shi, M.: When multitask learning meets partial supervision: A computer vision review. Proceedings of the IEEE (2024)

  19. [19]

    In: ECCV (2024)

    Fu, X., Yin, W., Hu, M., Wang, K., Ma, Y ., Tan, P., Shen, S., Lin, D., Long, X.: Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In: ECCV (2024)

  20. [20]

    In: CVPR (2012)

    Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: CVPR (2012)

  21. [21]

    In: ICCV (2021) 10

    Ghiasi, G., Zoph, B., Cubuk, E.D., Le, Q.V ., Lin, T.Y .: Multi-task self-training for learning general representations. In: ICCV (2021) 10

  22. [22]

    In: AAAI (2025)

    Gui, M., Schusterbauer, J., Prestel, U., Ma, P., Kotovenko, D., Grebenkova, O., Baumann, S.A., Hu, V .T., Ommer, B.: DepthFM: Fast monocular depth estimation with flow matching. In: AAAI (2025)

  23. [23]

    In: CVPR (2020)

    Guizilini, V ., Ambrus, R., Pillai, S., Raventos, A., Gaidon, A.: 3d packing for self-supervised monocular depth estimation. In: CVPR (2020)

  24. [24]

    In: ICML (2020)

    Guo, P., Lee, C.Y ., Ulbricht, D.: Learning to branch for multi-task learning. In: ICML (2020)

  25. [25]

    In: ICLR (2025)

    He, J., Li, H., Yin, W., Liang, Y ., Li, L., Zhou, K., Liu, H., Liu, B., Chen, Y .C.: Lotus: Diffusion-based visual foundation model for high-quality dense prediction. In: ICLR (2025)

  26. [26]

    In: CVPR (2016)

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

  27. [27]

    In: CVPR (2024)

    Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Repurposing diffusion- based image generators for monocular depth estimation. In: CVPR (2024)

  28. [28]

    T-PAMI (2025)

    Ke, B., Qu, K., Wang, T., Metzger, N., Huang, S., Li, B., Obukhov, A., Schindler, K.: Marigold: Affordable adaptation of diffusion-based image generators for image analysis. T-PAMI (2025)

  29. [29]

    In: CVPR (2018)

    Kendall, A., Gal, Y ., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: CVPR (2018)

  30. [30]

    In: ICLR (2015)

    Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)

  31. [31]

    In: CVPR (2025)

    Le, D.H., Pham, T., Lee, S., Clark, C., Kembhavi, A., Mandt, S., Krishna, R., Lu, J.: One diffusion to generate them all. In: CVPR (2025)

  32. [32]

    In: CVPR (2022)

    Li, W.H., Liu, X., Bilen, H.: Learning multiple dense prediction tasks from partially annotated data. In: CVPR (2022)

  33. [33]

    In: NeurIPS (2022)

    Liang, X., Wu, Y ., Han, J., Xu, H., Xu, C., Liang, X.: Effective adaptation in multi-task co-training for unified autonomous driving. In: NeurIPS (2022)

  34. [34]

    In: NeurIPS (2019)

    Lin, X., Zhen, H.L., Li, Z., Zhang, Q.F., Kwong, S.: Pareto multi-task learning. In: NeurIPS (2019)

  35. [35]

    In: NeurIPS (2007)

    Liu, Q., Liao, X., Carin, L.: Semi-supervised multitask learning. In: NeurIPS (2007)

  36. [36]

    In: NeurIPS (2022)

    Liu, Y .C., Ma, C.Y ., Tian, J., He, Z., Kira, Z.: Polyhistor: Parameter-efficient multi-task adaptation for dense vision tasks. In: NeurIPS (2022)

  37. [37]

    In: W ACV (2023)

    Lopes, I., Vu, T.H., de Charette, R.: DenseMTL: Cross-task attention mechanism for dense multi-task learning. In: W ACV (2023)

  38. [38]

    In: CVPR (2021)

    Lu, Y ., Pirk, S., Dlabal, J., Brohan, A., Pasad, A., Chen, Z., Casser, V ., Angelova, A., Gordon, A.: Taskology: Utilizing task relations at scale. In: CVPR (2021)

  39. [39]

    In: CVPR (2017)

    Lu, Y ., Kumar, A., Zhai, S., Cheng, Y ., Javidi, T., Feris, R.: Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In: CVPR (2017)

  40. [40]

    In: W ACV (2025)

    Martin Garcia, G., Abou Zeid, K., Schmidt, C., de Geus, D., Hermans, A., Leibe, B.: Fine-tuning image-conditional diffusion models is easier than you think. In: W ACV (2025)

  41. [41]

    In: CVPR (2016)

    Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: CVPR (2016)

  42. [42]

    In: CVPR (2015)

    Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: CVPR (2015)

  43. [43]

    In: CVPR (2016)

    Misra, I., Shrivastava, A., Gupta, A., Hebert, M.: Cross-stitch networks for multi-task learning. In: CVPR (2016)

  44. [44]

    In: ICML (2022) 11

    Momma, M., Dong, C., Liu, J.: A multi-objective/multi-task learning framework induced by pareto stationarity. In: ICML (2022) 11

  45. [45]

    In: ICCV (2019)

    Murmann, L., Gharbi, M., Aittala, M., Durand, F.: A multi-illumination dataset of indoor object appearance. In: ICCV (2019)

  46. [46]

    In: CVPR (2024)

    Nishi, K., Kim, J., Li, W., Pfister, H.: Joint-task regularization for partially labeled multi-task learning. In: CVPR (2024)

  47. [47]

    In: CVPR (2020)

    Ouali, Y ., Hudelot, C., Tami, M.: Semi-supervised semantic segmentation with cross-consistency training. In: CVPR (2020)

  48. [48]

    In: CVPR (2016)

    Perazzi, F., Pont-Tuset, J., McWilliams, B., Gool, L.V ., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: CVPR (2016)

  49. [49]

    In: ICCV (2021)

    Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In: ICCV (2021)

  50. [50]

    In: CVPR (2022)

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

  51. [51]

    In: AAAI (2019)

    Ruder, S., Bingel, J., Augenstein, I., Søgaard, A.: Latent multi-task architecture learning. In: AAAI (2019)

  52. [52]

    In: CVPR (2021)

    Saha, S., Obukhov, A., Paudel, D.P., Kanakis, M., Chen, Y ., Georgoulis, S., Van Gool, L.: Learning to relate depth and semantics for unsupervised domain adaptation. In: CVPR (2021)

  53. [53]

    In: CVPR (2023)

    Senushkin, D., Patakin, N., Kuznetsov, A., Konushin, A.: Independent component alignment for multi-task learning. In: CVPR (2023)

  54. [54]

    Standley, T., Zamir, A.R., Chen, D., Guibas, L., Malik, J., Savarese, S.: Which tasks should be learned together in multi-task learning? In: ICML (2020)

  55. [55]

    In: CVPR (2019)

    Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: CVPR (2019)

  56. [56]

    In: CVPR (2020)

    Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V ., Tsui, P., Guo, J., Zhou, Y ., Chai, Y ., Caine, B., Vasudevan, V ., Han, W., Ngiam, J., Zhao, H., Timofeev, A., Ettinger, S., Krivokon, M., Gao, A., Joshi, A., Zhang, Y ., Shlens, J., Chen, Z., Anguelov, D.: Scalability in perception for autonomous driving: Waymo open dataset. In: CVPR (2020)

  57. [57]

    T-PAMI (2022)

    Vandenhende, S., Georgoulis, S., Van Gansbeke, W., Proesmans, M., Dai, D., Van Gool, L.: Multi-task learning for dense prediction tasks: A survey. T-PAMI (2022)

  58. [58]

    CoRR (2019)

    Vasiljevic, I., Kolkin, N., Zhang, S., Luo, R., Wang, H., Dai, F.Z., Daniele, A.F., Mostajabi, M., Basart, S., Walter, M.R., Shakhnarovich, G.: DIODE: A Dense Indoor and Outdoor DEpth Dataset. CoRR (2019)

  59. [59]

    In: NeurIPS (2017)

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)

  60. [60]

    In: ICDM (2009)

    Wang, F., Wang, X., Li, T.: Semi-supervised multi-task learning with task regularizations. In: ICDM (2009)

  61. [61]

    In: W ACV (2022)

    Wang, Y ., Tsai, Y .H., Hung, W.C., Ding, W., Liu, S., Yang, M.H.: Semi-supervised multi-task learning for semantics and depth. In: W ACV (2022)

  62. [62]

    TIP (2004)

    Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. TIP (2004)

  63. [63]

    In: arXiv (2025)

    Wang, Z., Li, H., Sui, L., Zhou, T., Jiang, H., Nie, L., Liu, S.: StableMotion: Repurposing diffusion-based image priors for motion estimation. In: arXiv (2025)

  64. [64]

    In: NeurIPS (2021) 12

    Wilson, B., Qi, W., Agarwal, T., Lambert, J., Singh, J., Khandelwal, S., Pan, B., Kumar, R., Hartnett, A., Pontes, J.K., Ramanan, D., Carr, P., Hays, J.: Argoverse 2: Next generation datasets for self-driving perception and forecasting. In: NeurIPS (2021) 12

  65. [65]

    In: ITSC (2021)

    Xiao, P., Shao, Z., Hao, S., Zhang, Z., Chai, X., Jiao, J., Li, Z., Wu, J., Sun, K., Jiang, K., Wang, Y ., Yang, D.: PandaSet: Advanced sensor suite dataset for autonomous driving. In: ITSC (2021)

  66. [66]

    Xu, G., Ge, Y ., Liu, M., Fan, C., Xie, K., Zhao, Z., Chen, H., Shen, C.: What matters when repurposing diffusion models for general dense perception tasks? In: ICLR (2025)

  67. [67]

    In: ECCV (2018)

    Xu, N., Yang, L., Fan, Y ., Yang, J., Yue, D., Liang, Y ., Price, B., Cohen, S., Huang, T.: Youtube-vos: Sequence-to-sequence video object segmentation. In: ECCV (2018)

  68. [68]

    In: ECCV (2022)

    Xu, X., Zhao, H., Vineet, V ., Lim, S.N., Torralba, A.: Mtformer: Multi-task learning via transformer and cross-task reasoning. In: ECCV (2022)

  69. [69]

    ACM TOG (2024)

    Ye, C., Qiu, L., Gu, X., Zuo, Q., Wu, Y ., Dong, Z., Bo, L., Xiu, Y ., Han, X.: StableNormal: Reducing diffusion variance for stable and sharp normal. ACM TOG (2024)

  70. [70]

    In: ECCV (2022)

    Ye, H., Xu, D.: Inverted pyramid multi-task transformer for dense scene understanding. In: ECCV (2022)

  71. [71]

    In: CVPR (2024)

    Ye, H., Xu, D.: DiffusionMTL: Learning multi-task denoising diffusion model from partially annotated data. In: CVPR (2024)

  72. [72]

    In: NeurIPS (2020)

    Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., Finn, C.: Gradient surgery for multi-task learning. In: NeurIPS (2020)

  73. [73]

    In: CVPR (2020)

    Zamir, A.R., Sax, A., Cheerla, N., Suri, R., Cao, Z., Malik, J., Guibas, L.J.: Robust learning through cross-task consistency. In: CVPR (2020)

  74. [74]

    In: CVPR (2018)

    Zamir, A.R., Sax, A., Shen, W.B., Guibas, L., Malik, J., Savarese, S.: Taskonomy: Disentangling task transfer learning. In: CVPR (2018)

  75. [75]

    In: SIGGRAPH (2024)

    Zeng, Z., Deschaintre, V ., Georgiev, I., Hold-Geoffroy, Y ., Hu, Y ., Luan, F., Yan, L.Q., Hašan, M.: RGB↔X: Image decomposition and synthesis using material- and lighting-aware diffusion models. In: SIGGRAPH (2024)

  76. [76]

    In: Buntine, W., Grobelnik, M., Mladeni´c, D., Shawe-Taylor, J

    Zhang, Y ., Yeung, D.Y .: Semi-supervised multi-task regression. In: Buntine, W., Grobelnik, M., Mladeni´c, D., Shawe-Taylor, J. (eds.) ECML PKDD (2009)

  77. [77]

    In: ECCV (2018)

    Zhang, Z., Cui, Z., Xu, C., Jie, Z., Li, X., Yang, J.: Joint task-recursive learning for semantic segmentation and depth estimation. In: ECCV (2018)

  78. [78]

    normal",

    Zhao, C., Liu, M., Zheng, H., Zhu, M., Zhao, Z., Chen, H., He, T., Shen, C.: DICEPTION: A generalist diffusion model for visual perceptual tasks. In: arXiv (2025) 13 Acknowledgments. This work was funded by the French Agence Nationale de la Recherche (ANR) with project SIGHT (ANR-20-CE23-0016) and performed with HPC resources from GENCI-IDRIS (Grants AD01...