arxiv: 2604.05629 · v1 · submitted 2026-04-07 · 💻 cs.CV

A Unified Foundation Model for All-in-One Multi-Modal Remote Sensing Image Restoration and Fusion with Language Prompting

Pith reviewed 2026-05-10 18:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords: remote sensing · image restoration · foundation model · mixture of experts · language prompting · multi-task learning · optimal transport · image fusion

The pith

LLaRS provides a single foundation model for handling eleven remote sensing restoration and fusion tasks using language prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Remote sensing images are degraded by clouds, haze, noise, limited resolution, and sensor heterogeneity, and each degradation has typically required its own model. This paper presents LLaRS as a unified model that handles multiple modalities and tasks in one framework: it aligns heterogeneous image bands into semantically matched slots and routes features through specialized expert networks, guided by text prompts. Training relies on LLaRS1M, a new million-scale dataset that combines real paired observations with controlled synthetic degradations. The reported experiments show it outperforming seven competitive models and adapting to unseen data through parameter-efficient finetuning. This matters because it could streamline the processing of large satellite imagery archives without maintaining a separate tool for each degradation.

Core claim

LLaRS is presented as the first unified foundation model for multi-modal and multi-task remote sensing low-level vision. It aligns heterogeneous bands into semantically matched slots using Sinkhorn-Knopp optimal transport, routes features through three complementary mixture-of-experts layers (convolutional experts for spatial patterns, channel-mixing experts for spectral fidelity, and attention experts with low-rank adapters for global context), and stabilizes joint training with step-level dynamic weight adjustment. Trained on the LLaRS1M dataset, which spans eleven tasks paired with language prompts, it is reported to consistently outperform seven competitive models and to transfer well to unseen data through parameter-efficient finetuning.
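To make the band alignment step concrete, the following is a minimal sketch of entropy-regularized optimal transport solved with Sinkhorn-Knopp iterations. It is not the authors' implementation: the cost matrix, the uniform marginals, the `epsilon` value, and the function name `sinkhorn` are all assumptions chosen for illustration.

```python
import math

def sinkhorn(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals.

    cost[i][j] is a hypothetical dissimilarity between input band i and slot j.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel K = exp(-cost / epsilon).
    K = [[math.exp(-c / epsilon) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform mass over bands (assumption)
    b = [1.0 / m] * m  # uniform mass over slots (assumption)
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternate row and column rescaling until the marginals match.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example: three bands, three slots, low cost on the diagonal.
cost = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]]
plan = sinkhorn(cost)
```

Each row of `plan` says how one band's mass is spread over the slots. A smaller `epsilon` gives a sharper, closer-to-permutation assignment, at the cost of slower and less numerically stable iterations.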

What carries the argument

The LLaRS architecture, which uses Sinkhorn-Knopp optimal transport for band alignment combined with three complementary mixture-of-experts layers and dynamic weighting for joint multi-task optimization.
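The expert routing can be pictured with a generic sparse mixture-of-experts layer. The sketch below shows one such layer with top-k gating. How LLaRS actually injects the prompt into its gates is not stated on this page, so concatenating a prompt embedding with the features is an assumption, as are the toy experts, the gate weights, and the name `route`.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(features, prompt_emb, gate_weights, experts, top_k=2):
    """Sparse mixture-of-experts: pick top_k experts and blend their outputs."""
    # Gate input: features concatenated with the prompt embedding (assumption).
    gate_in = features + prompt_emb
    logits = [sum(w * x for w, x in zip(row, gate_in)) for row in gate_weights]
    # Keep the top_k experts by logit and renormalize over them only.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    probs = softmax([logits[i] for i in chosen])
    out = [0.0] * len(features)
    for p, i in zip(probs, chosen):
        y = experts[i](features)
        out = [o + p * yi for o, yi in zip(out, y)]
    return out, dict(zip(chosen, probs))

# Toy experts for a single layer. In the paper each of the three layers has
# its own expert family (convolutional, channel-mixing, attention).
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
gate_weights = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.5],
    [-1.0, -1.0, 0.0, 0.0],
]
out, weights = route([1.0, 2.0], [0.2, 0.4], gate_weights, experts)
```

Only the selected experts are evaluated, which is what keeps the cost of a sparse layer well below that of running every expert on every input.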

If this is right

  • LLaRS can replace multiple task-specific models for remote sensing image restoration and fusion.
  • It achieves better performance than seven existing competitive models across the tasks.
  • Parameter-efficient finetuning enables effective adaptation to unseen data.
  • The use of language prompts allows flexible control over the restoration process.
  • Joint training on the LLaRS1M dataset supports consistent performance without major trade-offs between tasks.
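For a sense of scale on the parameter-efficient finetuning point, here is a small sketch that counts trainable parameters when only low-rank (LoRA-style) factors are updated. The layer shapes and rank are hypothetical and are not taken from the paper.

```python
def lora_trainable_ratio(layer_shapes, rank):
    """Fraction of parameters trained when only low-rank factors are updated.

    layer_shapes: list of (d_out, d_in) for each adapted weight matrix.
    Each frozen W (d_out x d_in) gets a trainable update B @ A with
    B of shape d_out x rank and A of shape rank x d_in.
    """
    frozen = sum(d_out * d_in for d_out, d_in in layer_shapes)
    trainable = sum(rank * (d_out + d_in) for d_out, d_in in layer_shapes)
    return trainable, frozen, trainable / (frozen + trainable)

# Hypothetical shapes: four 512 x 512 projections adapted at rank 8.
trainable, frozen, ratio = lora_trainable_ratio([(512, 512)] * 4, rank=8)
```

With these made-up shapes the adapters hold 32,768 parameters against 1,048,576 frozen ones, so about 3% of the total is trained.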

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Operational remote sensing systems could integrate this model to reduce the complexity of handling diverse degradation types in a single pipeline.
  • Natural language interfaces might enable users without deep technical expertise to request specific image enhancements directly.
  • The band alignment technique could be tested for applicability in other multi-spectral domains such as hyperspectral medical imaging.
  • Further scaling of the model size or dataset might lead to even broader generalization across sensors and conditions.

Load-bearing premise

The combination of Sinkhorn-Knopp band alignment, three complementary MoE layers, and step-level dynamic weighting can be jointly optimized across eleven heterogeneous tasks without performance trade-offs severe enough to require separate models.

What would settle it

If models trained separately on each of the eleven tasks outperformed LLaRS on a shared benchmark test set, or if LLaRS fell measurably behind specialized approaches on some tasks, the claimed advantage of the unified model would be undermined.
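The test described here amounts to a per-task comparison. A minimal sketch of that check follows. The scores are placeholder numbers invented only to exercise the function; they are not results from the paper, and the name `negative_transfer` is hypothetical.

```python
def negative_transfer(unified, specialist, higher_is_better=True, tol=0.0):
    """Return the tasks where the unified model scores worse than the specialist."""
    worse = {}
    for task, spec_score in specialist.items():
        delta = unified[task] - spec_score
        if not higher_is_better:
            delta = -delta  # flip sign for metrics where lower is better
        if delta < -tol:
            worse[task] = delta
    return worse

# Placeholder PSNR values in dB, invented for illustration only.
unified = {"denoise": 34.0, "dehaze": 29.5, "destripe": 38.0}
specialist = {"denoise": 33.2, "dehaze": 30.1, "destripe": 38.0}
flagged = negative_transfer(unified, specialist)
```

An empty result across all eleven tasks, with specialists trained on the same data, would support the unified claim. Any flagged task marks where it needs qualification.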

Figures

Figures reproduced from arXiv: 2604.05629 by Peng Liu, Yongchuan Cui.

Figure 1. LLaRS takes a degraded remote sensing image and a …
Figure 2. Overall architecture of LLaRS. … models often struggle with input misalignment across different spectral channel configurations and spatial sampling intervals, leading to semantic ambiguity. Developing unified architectures that inherently align heterogeneous multi-modal inputs and designing novel training paradigms specifically for pixel-level dense prediction remain unresolved challenges in this domain. …
Figure 3. Entropy-regularized channel-to-slot matching.
Figure 4. Geographic distribution of LLaRS1M sampling locations.
Figure 5. (Left) lists per-task sample totals and prompt counts. The magnitudes follow how each corpus is built: large cropped synthetic dehazing sets and multi-site super-resolution archives contribute high counts, whereas paired Landsat–MODIS series for spatiotemporal fusion are comparatively few; six simulation pipelines each draw a fixed budget from shared clean references. Prompt pool sizes track how richly a task can be verbalized: for instance, cloud removal spans thin versus thick clouds and auxiliary cues, whereas dehaze wording stays closer to a shared lexical core. (Right) We encode …
Figure 6. Word cloud of all prompts in LLaRS1M.
Figure 7. Model predictions and error maps for denoising.
Figure 8. Model predictions and error maps for destriping.
Figure 11. Relationship between trainable parameter ratio and aver …
Figure 9. Model predictions and error maps for spatiotemporal …
Figure 10. Comparison of t-SNE task feature separability across …
Figure 12. Evolution of channel-to-slot transport for eleven tasks. (…
Figure 13. Evolution of per-task weight changing. … the number of experts, multi-task optimization strategies, and model efficiency are provided in Sec. C. 5. Conclusion. This work presents LLaRS, a multi-task foundation model for remote sensing low-level vision. We built LLaRS1M, a large-scale dataset with real pairs and synthetic degradations across eleven restoration tasks, paired with diverse language prompts. Expe …
Figure 14. LLaRS1M examples. [Plot residue captured with this caption; legend: LLaRS, w/o channel align, w/o text prompt. Values as extracted: PSNR 37.60, 35.76, 32.52; SSIM 0.9172, 0.9046, 0.7562; SAM 0.0644, 0.0726, 0.1463; ERGAS 4.41, 4.66, 17.08.]
Figure 15. Contribution analysis of text prompt and OT-based chan …
Figure 16. Model predictions and error maps for deblurring.
Figure 19. Model predictions and error maps for histogram equal …
Figure 20. Model predictions and error maps for brightness en …
Figure 21. Fine-tuning qualitative comparison for dehazing.
Figure 22. Fine-tuning qualitative comparison for super-resolution.
Figure 23. Fine-tuning qualitative comparison for SAR despeckling.
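The ablation plot in this list reports PSNR, SSIM, SAM, and ERGAS. As a reminder of what three of these measure, here is a sketch using their standard textbook definitions; the paper's exact evaluation code is not shown on this page. The data layout (flat lists, pixel-major for SAM, band-major for ERGAS) and the `scale_ratio` default of 1.0 are assumptions.

```python
import math

def psnr(ref, est, max_val=1.0):
    """Peak signal-to-noise ratio in dB over flat lists of pixel values."""
    mse = sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

def sam(ref_pixels, est_pixels):
    """Mean spectral angle in radians; each pixel is a list of band values.

    Assumes no pixel is an all-zero vector.
    """
    angles = []
    for r, e in zip(ref_pixels, est_pixels):
        dot = sum(a * b for a, b in zip(r, e))
        norm = math.sqrt(sum(a * a for a in r)) * math.sqrt(sum(b * b for b in e))
        # Clamp to guard against rounding slightly outside [-1, 1].
        angles.append(math.acos(max(-1.0, min(1.0, dot / norm))))
    return sum(angles) / len(angles)

def ergas(ref_bands, est_bands, scale_ratio=1.0):
    """ERGAS; each band is a flat list of pixel values with nonzero mean."""
    terms = []
    for r, e in zip(ref_bands, est_bands):
        rmse_sq = sum((a - b) ** 2 for a, b in zip(r, e)) / len(r)
        mean = sum(r) / len(r)
        terms.append(rmse_sq / mean ** 2)
    return 100.0 * scale_ratio * math.sqrt(sum(terms) / len(terms))

example_psnr = psnr([0.0, 0.5, 1.0], [0.1, 0.5, 0.9])
```

Higher PSNR is better, while lower SAM and ERGAS are better. That is consistent with the full model having the highest PSNR and the lowest SAM and ERGAS among the extracted values.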
Original abstract

Remote sensing imagery suffers from clouds, haze, noise, resolution limits, and sensor heterogeneity. Existing restoration and fusion approaches train separate models per degradation type. In this work, we present Language-conditioned Large-scale Remote Sensing restoration model (LLaRS), the first unified foundation model for multi-modal and multi-task remote sensing low-level vision. LLaRS employs Sinkhorn-Knopp optimal transport to align heterogeneous bands into semantically matched slots, routes features through three complementary mixture-of-experts layers (convolutional experts for spatial patterns, channel-mixing experts for spectral fidelity, and attention experts with low-rank adapters for global context), and stabilizes joint training via step-level dynamic weight adjustment. To train LLaRS, we construct LLaRS1M, a million-scale multi-task dataset spanning eleven restoration and enhancement tasks, integrating real paired observations and controlled synthetic degradations with diverse natural language prompts. Experiments show LLaRS consistently outperforms seven competitive models, and parameter-efficient finetuning experiments demonstrate strong transfer capability and adaptation efficiency on unseen data. Repo: https://github.com/yc-cui/LLaRS

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces LLaRS, which the authors present as the first unified foundation model for multi-modal and multi-task remote sensing low-level vision, covering both restoration and fusion. It employs Sinkhorn-Knopp optimal transport to align heterogeneous bands, routes features through three complementary mixture-of-experts layers (convolutional for spatial patterns, channel-mixing for spectral fidelity, and attention with low-rank adapters for global context), and uses step-level dynamic weight adjustment for stable joint training. A new million-scale dataset, LLaRS1M, is constructed covering eleven tasks with real and synthetic degradations plus language prompts. The experiments claim consistent outperformance over seven competitive models and strong transfer via parameter-efficient finetuning on unseen data.

Significance. If the empirical results hold, the work is significant for establishing a single model capable of handling eleven heterogeneous remote sensing restoration and fusion tasks without task-specific retraining, supported by a large-scale multi-task dataset and an architecture designed for joint optimization. This could reduce the proliferation of separate models in the field and enable more efficient adaptation through language prompting and PEFT, advancing foundation-model approaches in remote sensing low-level vision.

major comments (2)
  1. [§4] §4 (Experiments) and associated tables: the central claim of consistent outperformance and absence of task-specific trade-offs relies on quantitative comparisons across all eleven tasks, but the reported results must include per-task metrics, ablation on the three MoE branches plus dynamic weighting, and direct comparison to task-specific baselines trained on the same LLaRS1M data to confirm no negative transfer occurs.
  2. [§3.2] §3.2 (Architecture): the step-level dynamic weight adjustment is presented as stabilizing joint training, but the paper should provide the exact formulation of the weighting parameters and demonstrate via ablation that they are not merely fitting to the training distribution in a way that reduces the claimed generality.
minor comments (3)
  1. [Figure 1] Figure 1 and §3: the diagram of the three MoE layers and Sinkhorn-Knopp alignment would benefit from clearer annotation of input/output dimensions and how language prompts are injected at each stage.
  2. [§5] §5 (Transfer experiments): the parameter-efficient finetuning results on unseen data should report the number of trainable parameters and adaptation steps for transparency.
  3. [References] References: several recent works on multi-task remote sensing restoration and MoE in vision are missing; add citations to ensure the positioning against prior unified models is complete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. The points raised strengthen the empirical support for our claims of unified multi-task performance and the role of the dynamic weighting mechanism. We address each major comment below.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated tables: the central claim of consistent outperformance and absence of task-specific trade-offs relies on quantitative comparisons across all eleven tasks, but the reported results must include per-task metrics, ablation on the three MoE branches plus dynamic weighting, and direct comparison to task-specific baselines trained on the same LLaRS1M data to confirm no negative transfer occurs.

    Authors: We agree that per-task metrics and targeted ablations are necessary to fully substantiate the absence of task-specific trade-offs. The submitted manuscript reported aggregated metrics to highlight overall trends; in the revision we will add complete per-task tables for all eleven tasks. We will also include ablations isolating each of the three MoE branches (convolutional, channel-mixing, and attention with low-rank adapters) and the dynamic weighting component. In addition, we will train task-specific baselines on the identical LLaRS1M data and report direct comparisons, thereby confirming that joint training yields no negative transfer relative to specialized models. revision: yes

  2. Referee: [§3.2] §3.2 (Architecture): the step-level dynamic weight adjustment is presented as stabilizing joint training, but the paper should provide the exact formulation of the weighting parameters and demonstrate via ablation that they are not merely fitting to the training distribution in a way that reduces the claimed generality.

    Authors: We will insert the exact mathematical formulation of the step-level dynamic weight adjustment, including the update rules for the weighting parameters, into §3.2. To address the concern about potential overfitting, we will add an ablation that trains the model both with and without dynamic weighting. Performance will be reported on held-out validation splits of LLaRS1M as well as on completely unseen tasks and data distributions. These results will show that the mechanism improves training stability while maintaining or improving generalization, rather than trading generality for in-distribution fit. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical architecture for a unified remote sensing restoration model using standard components (Sinkhorn-Knopp alignment, mixture-of-experts layers, dynamic weighting) trained on a newly constructed million-scale dataset LLaRS1M. No equations, derivations, or self-referential definitions are provided that reduce claimed performance or unification to fitted parameters or prior self-citations by construction. Central claims rest on experimental outperformance and transfer results rather than internal circular logic.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the described architecture and dataset. No explicit free parameters, axioms, or invented entities are stated in the abstract, but the dynamic weight adjustment and expert routing implicitly introduce tunable components whose values are learned from data.

free parameters (1)
  • step-level dynamic weight adjustment parameters
    Used to stabilize joint training across tasks; values are learned or scheduled during optimization.
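The paper's exact weighting rule is not reproduced on this page. As a stand-in, the sketch below implements a different, well-known scheme, Dynamic Weight Average, which sets task weights from the ratio of each task's two most recent loss values. The temperature, the loss values, and the name `dwa_weights` are assumptions. It shows the kind of quantity such a mechanism adjusts, not the authors' formulation.

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic Weight Average-style task weights from two past loss readings."""
    # A ratio near or above 1 means the task's loss is stalling or rising,
    # so that task receives more weight on the next step.
    ratios = [l1 / l2 for l1, l2 in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    k = len(ratios)
    # Weights are scaled so that they sum to the number of tasks.
    return [k * e / total for e in exps]

# Three hypothetical tasks: the second task's loss has stopped improving.
weights = dwa_weights([0.8, 1.0, 0.5], [1.0, 1.0, 1.0])
```

A larger temperature flattens the weights toward uniform. Whether the weights are refreshed every step or every epoch is a design choice, and the page only says that LLaRS adjusts them at step level.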


Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages

  1. [1]

    SatlasPretrain: A large-scale dataset for remote sensing image understanding

    Favyen Bastani, Piper Wolters, Ritwik Gupta, Joe Ferdinando, and Aniruddha Kembhavi. SatlasPretrain: A large-scale dataset for remote sensing image understanding. InICCV, pages 16726–16736, 2023. 2

  2. [2]

    BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

    Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. InACL, pages 1–9, Dublin, Ireland,

  3. [3]

    Association for Computational Linguistics. 7

  4. [4]

    Unsupervised learn- ing of visual features by contrasting cluster assignments

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learn- ing of visual features by contrasting cluster assignments. In NeurIPS, pages 9912–9924, 2020. 2

  5. [5]

    Pre-trained image processing transformer

    Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In CVPR, pages 12294–12305, 2021. 2

  6. [6]

    Dynamic convolution: Attention over convolution kernels

    Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic convolution: Attention over convolution kernels. InCVPR, pages 11030–11039,

  7. [7]

    GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks

    Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. InICML, pages 794–803. PMLR, 2018. 15

  8. [8]

    Trinity-Net: Gradient- guided swin transformer-based remote sensing image dehaz- ing and beyond.IEEE Trans

    Kaichen Chi, Yuan Yuan, and Qi Wang. Trinity-Net: Gradient- guided swin transformer-based remote sensing image dehaz- ing and beyond.IEEE Trans. Geosci. Remote Sens., 61:1–14,

  9. [9]

    Conde, Gregor Geigle, and Radu Timofte

    Marcos V . Conde, Gregor Geigle, and Radu Timofte. In- structIR: High-quality image restoration following human instructions. InECCV, page 1–21, Berlin, Heidelberg, 2024. Springer-Verlag. 1, 2

  10. [10]

    Lobell, and Stefano Ermon

    Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David B. Lobell, and Stefano Ermon. SatMAE: Pre-training transformers for tem- poral and multi-spectral satellite imagery. InNeurIPS, Red Hook, NY , USA, 2022. Curran Associates Inc. 2

  11. [11]

    Enpowering your pansharpening models with generalizability: Unified distri- bution is all you need

    Yongchuan Cui, Peng Liu, and Hui Zhang. Enpowering your pansharpening models with generalizability: Unified distri- bution is all you need. InICCV, pages 11850–11860, 2025. 1

  12. [12]

    Sinkhorn distances: Lightspeed computation of optimal transport

    Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. InNeurIPS, pages 2292–2300, 2013. 2, 3, 4, 8, 14

  13. [13]

    TerraFM: A scalable foundation model for unified multisensor earth observation.arXiv, 2025

    Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Muhammad Haris Khan, Rao Muhammad Anwer, Jorma Laaksonen, Fahad Shahbaz Khan, and Salman Khan. TerraFM: A scalable foundation model for unified multisensor earth observation.arXiv, 2025. 2

  14. [14]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database . InCVPR, pages 248–255, Los Alamitos, CA, USA, 2009. IEEE Computer Society. 2

  15. [15]

    Machine learning in pansharpening: A benchmark, from shallow to deep networks.IEEE Geosci

    Liang-Jian Deng, Gemine Vivone, Mercedes E Paoletti, Giuseppe Scarpa, Jiang He, Yongjun Zhang, Jocelyn Chanus- sot, and Antonio Plaza. Machine learning in pansharpening: A benchmark, from shallow to deep networks.IEEE Geosci. Remote Sens. Mag., 10(3):279–315, 2022. 12, 13, 14

  16. [16]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2021. 2

  17. [17]

    Multisensor Data Fusion for Cloud Removal in Global and All-Season Sentinel-2 Imagery.IEEE Trans

    Patrick Ebel, Andrea Meraner, Michael Schmitt, and Xiao Xi- ang Zhu. Multisensor Data Fusion for Cloud Removal in Global and All-Season Sentinel-2 Imagery.IEEE Trans. Geosci. Remote Sens., 59(7):5866–5878, 2021. 12, 13, 14

  18. [18]

    Emelyanova, Tim R

    Irina V . Emelyanova, Tim R. McVicar, Thomas G. Van Niel, Ling Tao Li, and Albert I.J.M. van Dijk. Assessing the accu- racy of blending landsat–modis surface reflectances in two landscapes with contrasting spatial and temporal dynamics: A framework for algorithm selection.Remote Sens. Environ., 133:193–209, 2013. 12, 13, 14

  19. [19]

    Ro- bust SAR image despeckling by deep learning from near-real datasets.IEEE J

    Jianjun Guan, Ping Zhong, Fan Zhang, and Yuhan Liu. Ro- bust SAR image despeckling by deep learning from near-real datasets.IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., 17:3475–3487, 2024. 12, 13, 14

  20. [20]

    SkySense: A multi-modal remote sensing foundation model towards universal inter- pretation for earth observation imagery

    Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingx- iang Hu, Huimei He, Jian Wang, Jingdong Chen, Ming Yang, Yongjun Zhang, and Yansheng Li. SkySense: A multi-modal remote sensing foundation model towards universal inter- pretation for earth observation imagery. InCVPR, pages 27662–27673, 2024. 2

  21. [21]

    Wasserstein wormhole: Scalable optimal transport distance with transformer

    Doron Haviv, Russell Zhang Kunes, Thomas Dougherty, Cas- sandra Burdziak, Tal Nawy, Anna Gilbert, and Dana Pe’er. Wasserstein wormhole: Scalable optimal transport distance with transformer. InICML, pages 17697–17718. PMLR, 2024. 2

  22. [22]

    Diffusion models in low-level vision: A survey, 2024

    Chunming He, Yuqi Shen, Chengyu Fang, Fengyang Xiao, Longxiang Tang, Yulun Zhang, Wangmeng Zuo, Zhenhua Guo, and Xiu Li. Diffusion models in low-level vision: A survey, 2024. 1

  23. [23]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR, pages 770–778, 2016. 13

  24. [24]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. InICML, pages 2790–2799. PMLR, 2019. 7

  25. [25]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InICLR,

  26. [26]

    Single satellite optical imagery dehazing using sar image prior based on conditional generative adversarial net- works

    Binghui Huang, Li Zhi, Chao Yang, Fuchun Sun, and Yixu Song. Single satellite optical imagery dehazing using sar image prior based on conditional generative adversarial net- works. InWACV, pages 1806–1813, 2020. 12, 13, 14

  27. [27]

    Transformer fusion with optimal transport

    Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hof- mann, Sotiris Anagnostidis, and Sidak Pal Singh. Transformer fusion with optimal transport. InICLR, 2024. 2 9

  28. [28]

    Optimal transport aggre- gation for visual place recognition

    Sergio Izquierdo and Javier Civera. Optimal transport aggre- gation for visual place recognition. InCVPR, pages 17658– 17668, 2024. 2

  29. [29]

    Jacobs, Michael I

    Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts.Neu- ral Comput., 3(1):79–87, 1991. 3, 5

  30. [30]

    All-In-One Image Restoration for Unknown Corruption

    Boyun Li, Xiao Liu, Peng Hu, Zhongqin Wu, Jiancheng Lv, and Xi Peng. All-In-One Image Restoration for Unknown Corruption . InCVPR, pages 17431–17441, Los Alamitos, CA, USA, 2022. IEEE Computer Society. 2

  31. [31]

    Spatio-temporal fusion for remote sensing data: An overview and new benchmark.Sci

    Jun Li, Yunfei Li, Lin He, Jin Chen, and Antonio Plaza. Spatio-temporal fusion for remote sensing data: An overview and new benchmark.Sci. China Inf. Sci., 63(4):140301, 2020. 12, 13, 14

  32. [32]

    Tan, and Loong-Fah Cheong

    Ruoteng Li, Robby T. Tan, and Loong-Fah Cheong. All in one bad weather removal using architectural search. InCVPR, pages 3172–3182, 2020. 2

  33. [33]

    Scaling & shifting your features: a new baseline for efficient model tuning

    Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: a new baseline for efficient model tuning. InNeurIPS, Red Hook, NY , USA, 2022. Curran Associates Inc. 7

  34. [34]

    A remote sensing image dataset for cloud removal, 2019

    Daoyu Lin, Guangluan Xu, Xiaoke Wang, Yang Wang, Xian Sun, and Kun Fu. A remote sensing image dataset for cloud removal, 2019. 12, 13, 14

  35. [35]

    Conflict-averse gradient descent for multi-task learning

    Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. InNeurIPS, Red Hook, NY , USA, 2021. Curran Associates Inc. 4, 15

  36. [36]

    Dora: weight-decomposed low-rank adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: weight-decomposed low-rank adaptation. InICML. JMLR.org, 2024. 7

  37. [37]

    Degae: A new pretraining paradigm for low-level vision

    Yihao Liu, Jingwen He, Jinjin Gu, Xiangtao Kong, Yu Qiao, and Chao Dong. Degae: A new pretraining paradigm for low-level vision. InCVPR, pages 23292–23303, 2023. 2

  38. [38]

    Ai foundation models in remote sensing: A survey, 2024

    Siqi Lu, Junlin Guo, James R Zimmer-Dauphinee, Jordan M Nieusma, Xiao Wang, Parker VanValkenburgh, Steven A Wernke, and Yuankai Huo. Ai foundation models in remote sensing: A survey, 2024. 1

  39. [39]

    Gustafsson, Zheng Zhao, Jens Sj¨olund, and Thomas B

    Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sj¨olund, and Thomas B. Sch¨on. Controlling vision-language models for multi-task image restoration. InICLR, 2024. 2

  40. [40]

    Visualizing data using t-SNE.J

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE.J. Mach. Learn. Res., 9(86):2579–2605,

  41. [41]

    Ardakani, and Angel D

    Armin Mehri, Parichehr B. Ardakani, and Angel D. Sappa. MPRNet: Multi-path residual network for lightweight image super resolution. InWACV, pages 2703–2712, 2021. 6, 13, 16

  42. [42]

    A large-scale benchmark data set for evalu- ating pansharpening performance: Overview and implementa- tion.IEEE Geosci

    Xiangchao Meng, Yiming Xiong, Feng Shao, Huanfeng Shen, Weiwei Sun, Gang Yang, Qiangqiang Yuan, Randi Fu, and Hongyan Zhang. A large-scale benchmark data set for evalu- ating pansharpening performance: Overview and implementa- tion.IEEE Geosci. Remote Sens. Mag., 9(1):18–52, 2021. 12, 13, 14

  43. [43]

    Sen2ven µs, a dataset for the training of sentinel-2 super-resolution algorithms.Data, 7(7):96, 2022

    Julien Michel, Juan Vinasco-Salinas, Jordi Inglada, and Olivier Hagolle. Sen2ven µs, a dataset for the training of sentinel-2 super-resolution algorithms.Data, 7(7):96, 2022. 12, 13, 14

  44. [44]

    Multi-task learning as a bargaining game

    Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. Multi-task learning as a bargaining game. InICML, pages 16428–16446. PMLR, 2022. 4, 15

  45. [45]

    Learning dual convolutional neural networks for low-level vision

    Jinshan Pan, Sifei Liu, Deqing Sun, Jiawei Zhang, Yang Liu, Jimmy Ren, Zechao Li, Jinhui Tang, Huchuan Lu, Yu-Wing Tai, and Ming-Hsuan Yang. Learning dual convolutional neural networks for low-level vision. InCVPR, pages 3070– 3079, 2018. 2

  46. [46]

    PromptIR: prompting for all-in-one blind image restoration

    Vaishnav Potlapalli, Syed Waqas Zamir, Salman Khan, and Fahad Shahbaz Khan. PromptIR: prompting for all-in-one blind image restoration. InNeurIPS, Red Hook, NY , USA,

  47. [47]

    2, 6, 7, 13, 16

    Curran Associates Inc. 2, 6, 7, 13, 16

  48. [48]

    U-Net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. InMICCAI, pages 234–241, Cham, 2015. Springer Interna- tional Publishing. 3, 13

  49. [49]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019. 13

  50. [50]

    SuperGlue: Learning feature match- ing with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature match- ing with graph neural networks. InCVPR, pages 4938–4947,

  51. [51]

    Multi-task learning as multi- objective optimization

    Ozan Sener and Vladlen Koltun. Multi-task learning as multi- objective optimization. InNeurIPS, page 525–536, Red Hook, NY , USA, 2018. Curran Associates Inc. 15

  52. [52]

    Concerning nonnegative matrices and doubly stochastic matrices.Pac

    Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices.Pac. J. Math., 21(2): 343–348, 1967. 2, 3, 4, 8, 14

  53. [53]

    Diffusion enhancement for cloud removal in ultra-resolution remote sensing imagery.IEEE Trans

    Jialu Sui, Yiyang Ma, Wenhan Yang, Xiaokang Zhang, Man- On Pun, and Jiaying Liu. Diffusion enhancement for cloud removal in ultra-resolution remote sensing imagery.IEEE Trans. Geosci. Remote Sens., 62:1–14, 2024. 12, 13, 14

  54. [54]

    RingMo: A remote sensing foundation model with masked image modeling.IEEE Trans

    Xian Sun, Peijin Wang, Wanxuan Lu, Zicong Zhu, Xiao- nan Lu, Qibin He, Junxi Li, Xuee Rong, Zhujun Yang, Hao Chang, Qinglin He, Guang Yang, Ruiping Wang, Jiwen Lu, and Kun Fu. RingMo: A remote sensing foundation model with masked image modeling.IEEE Trans. Geosci. Remote Sens., 61:1–22, 2023. 2

  55. [55]

    Jeya Maria Jose Valanarasu, Rajeev Yasarla, and Vishal M. Patel. Transweather: Transformer-based restoration of images degraded by adverse weather conditions. InCVPR, pages 2353–2363, 2022. 2

  56. [56]

    Labeled dataset for training despeckling filters for SAR imagery

    Rubén Darío Vásquez-Salazar, Ahmed Alejandro Cardona-Mesa, Luis Gómez, Carlos M Travieso-González, Andrés F Garavito-González, and Esteban Vásquez-Cano. Labeled dataset for training despeckling filters for SAR imagery. Data Brief., 53:110065, 2024. 12, 13, 14

  57. [57]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. 2

  58. [58]

    Multisensor remote sensing imagery super-resolution with conditional gan

    Junwei Wang, Kun Gao, Zhenzhou Zhang, Chong Ni, Zibo Hu, Dayu Chen, and Qiong Wu. Multisensor remote sensing imagery super-resolution with conditional gan. J. Remote Sens., 2021, 2021. 12, 13, 14

  59. [59]

    GridFormer: Residual dense transformer with grid structure for image restoration in adverse weather conditions

    Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tong Lu, Tae-Kyun Kim, Wei Liu, and Hongdong Li. GridFormer: Residual dense transformer with grid structure for image restoration in adverse weather conditions. IJCV, 132(10):4541–4563, 2024. 6, 13, 16

  60. [60]

    Gradient as conditions: Rethinking HOG for all-in-one image restoration

    Jiawei Wu, Zhifei Yang, Zhe Wang, and Zhi Jin. Gradient as conditions: Rethinking HOG for all-in-one image restoration. AAAI, 40(13):10682–10690, 2026. 6, 13, 16

  61. [61]

    mHC: Manifold-constrained hyper-connections

    Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Kuai Yu, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, and Wenfeng Liang. mHC: Manifold-constrained hyper-connections. arXiv, 2025. 2

  62. [62]

    Condconv: Conditionally parameterized convolutions for efficient inference

    Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. Condconv: Conditionally parameterized convolutions for efficient inference. In NeurIPS, 2019. 4

  63. [63]

    mHC-lite: You Don’t Need 20 Sinkhorn-Knopp Iterations

    Yongyi Yang and Jianyang Gao. mHC-lite: You Don’t Need 20 Sinkhorn-Knopp Iterations. arXiv, 2026. 2

  64. [64]

    All-In-One Medical Image Restoration via Task-Adaptive Routing

    Zhiwen Yang, Haowei Chen, Ziniu Qian, Yang Yi, Hui Zhang, Dan Zhao, Bingzheng Wei, and Yan Xu. All-In-One Medical Image Restoration via Task-Adaptive Routing. In MICCAI. Springer Nature Switzerland, 2024. 6, 7, 13, 16

  65. [65]

    Gradient surgery for multi-task learning

    Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. In NeurIPS, Red Hook, NY, USA, 2020. Curran Associates Inc. 4, 15

  66. [66]

    Z. Yuan, Z. Xiong, L. Mou, and X. X. Zhu. Chatearthnet: a global-scale image–text dataset empowering vision–language geo-foundation models. Earth Syst. Sci. Data, 17(3):1245–1263, 2025. 2

  67. [67]

    Complexity experts are task-discriminative learners for any image restoration

    Eduard Zamfir, Zongwei Wu, Nancy Mehta, Yuedong Tan, Danda Pani Paudel, Yulun Zhang, and Radu Timofte. Complexity experts are task-discriminative learners for any image restoration. In CVPR, pages 12753–12763, 2025. 6, 7, 13, 16

  68. [68]

    Restormer: Efficient transformer for high-resolution image restoration

    Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In CVPR, pages 5728–5739, 2022. 6, 7, 13, 16

  69. [69]

    Dense haze removal based on dynamic collaborative inference learning for remote sensing images

    Libao Zhang and Shan Wang. Dense haze removal based on dynamic collaborative inference learning for remote sensing images. IEEE Trans. Geosci. Remote Sens., 60:1–16, 2022. 12, 13, 14

  70. [70]

    RS5M and GeoRSCLIP: A large scale vision-language dataset and a large vision-language model for remote sensing

    Zilun Zhang, Tiancheng Zhao, Yulong Guo, and Jianwei Yin. RS5M and GeoRSCLIP: A large scale vision-language dataset and a large vision-language model for remote sensing. IEEE Trans. Geosci. Remote Sens., 62:1–23, 2024. 2

  71. [71]

    Towards vision-language geo-foundation model: A survey

    Yue Zhou, Litong Feng, Yiping Ke, Xue Jiang, Junchi Yan, Xue Yang, and Wayne Zhang. Towards vision-language geo-foundation model: A survey, 2024. 1

  72. [72]

    Zeng-Hui Zhu, Wei Lu, Si-Bao Chen, Chris H. Q. Ding, Jin Tang, and Bin Luo. Real-world remote sensing image dehazing: Benchmark and baseline. IEEE Trans. Geosci. Remote Sens., 63:1–14, 2025. 12, 13, 14

    A. MoRA and softmax mixture approximation

    This section gives the full tensor definitions behind the compact MoT/MoRA update in the main paper. With r...

    Prompt examples

    Remove the cloud layer to improve visibility of the surface

    Apply SAR technology to mitigate cloud interference

    The dense cloud cover is obstructing the view; remove it for clarity

    Can you enhance the clarity of this image by removing the clouds?

    HR

    Apply haze removal techniques to reveal the landscape below

    The hazes are blocking the view; please remove them

    Remove the haze from this remote sensing image to improve visibility

    Apply dehazing to this remote sensing image for better interpretation.

    SR
