pith. machine review for the scientific record.

arxiv: 2605.07562 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links


Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords remote sensing, vision-language models, ground sampling distance, LoRA adaptation, scale conditioning, parameter-efficient fine-tuning, Earth observation, multimodal learning

The pith

Remote sensing VLMs can route their computations dynamically by treating ground sampling distance as a continuous conditioning variable rather than as a discrete token or an ignored input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the same geographic feature produces entirely different visual patterns across ground sampling distances that span orders of magnitude. Existing RS-VLMs either drop scale information or encode it as a static text token, forcing one fixed set of weights to cover every scale. ScaleEarth instead uses a hyper-LoRA adapter whose low-rank subspace is modulated by a continuous GSD signal, so the model activates different computation paths according to physical scale. A lightweight visual head predicts the required GSD at inference time, and a new 1.5-million-sample dataset supplies matching scale-conditioned question-answer pairs. The resulting 8B model reaches state-of-the-art accuracy on multiple remote-sensing benchmarks without needing sensor metadata at deployment.

Core claim

ScaleEarth replaces discrete GSD tokens with CS-HLoRA, a hyper-LoRA whose low-rank updates are gated by a continuous ground-sampling-distance variable. This allows the 8B Qwen3-VL backbone, fine-tuned with QLoRA, to route computation according to physical scale. SSE-U, a heteroscedastic sub-head, infers GSD and its uncertainty directly from image features, removing the need for metadata at test time. The system is closed by the GeoScale-VQA corpus of 1.5M scale-layered VQA pairs whose generation is conditioned on the same scalar used to drive CS-HLoRA. The trained model sets new state-of-the-art results on XLRS-Bench and OmniEarth-Bench.
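The heteroscedastic piece lends itself to a compact sketch. The paper says SSE-U predicts GSD and its uncertainty as (µ, log σ²) but does not spell out the loss in the text shown here; the standard choice for such a head is a Gaussian negative log-likelihood in log-GSD space. The function below is that assumed form, with illustrative numbers rather than anything taken from the paper:

```python
import math

def sse_u_nll(mu, log_var, gsd_m):
    """Gaussian negative log-likelihood in log10-GSD space.

    Assumed form of the SSE-U heteroscedastic loss: the head predicts a
    mean mu and log-variance log_var for s = log10(GSD in metres), and
    the NLL down-weights errors where predicted uncertainty is large.
    This is a sketch, not the authors' implementation.
    """
    s = math.log10(gsd_m)
    var = math.exp(log_var)
    return 0.5 * (log_var + (s - mu) ** 2 / var)

# A confident, accurate prediction is penalised less than a confident,
# wrong one; inflating log_var softens the penalty for the wrong case.
good = sse_u_nll(mu=-1.0, log_var=-4.0, gsd_m=0.1)   # s = -1, spot on
bad = sse_u_nll(mu=-1.0, log_var=-4.0, gsd_m=10.0)   # s = +1, off by 2
hedged = sse_u_nll(mu=-1.0, log_var=2.0, gsd_m=10.0)
```

The third call shows the heteroscedastic effect: when the prediction is wrong, a larger predicted log-variance reduces the penalty, which is what would let downstream gating degrade gracefully on scale-ambiguous inputs.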

What carries the argument

CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA), which modulates the low-rank adaptation matrices through a GSD-driven gate so that scale determines the active computation subspace.
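Figure 5 gives the gate explicitly: h_k(s) = σ(α(τ_k − s)) with s = log10(GSD), where each rank dimension k carries a learnable threshold τ_k. A minimal sketch of that rule; the tier thresholds and the sharpness α below are invented for illustration, not the trained values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cs_hlora_gates(gsd_m, taus, alpha=8.0):
    """Per-rank gates h_k(s) = sigmoid(alpha * (tau_k - s)), s = log10(GSD).

    Sketch of the gating rule quoted in Figure 5. The thresholds tau_k
    are learnable in the paper; these values and alpha are illustrative.
    """
    s = math.log10(gsd_m)
    return [sigmoid(alpha * (tau - s)) for tau in taus]

# Hypothetical tier thresholds for object-, structure-, and
# semantic-level rank groups (coarser tiers have larger tau).
taus = [-0.5, 0.5, 1.5]

fine = cs_hlora_gates(0.1, taus)     # s = -1: all three tiers open
coarse = cs_hlora_gates(20.0, taus)  # s ≈ 1.3: mostly semantic tier left
```

With these toy thresholds, 0.1 m imagery opens all three tier gates while 20 m imagery leaves mainly the semantic-tier gate high, matching the qualitative behaviour the Figure 5 caption describes.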

If this is right

  • The model maintains performance even on images whose scale is not supplied at inference because SSE-U supplies the missing value.
  • Question-answer pairs generated at the same physical scale as the image provide consistent supervision that matches the conditioning signal.
  • Parameter-efficient adaptation can embed physical priors such as scale without expanding the full parameter count.
  • Diverse Earth-system tasks benefit simultaneously once scale-dependent routing is available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same continuous-conditioning idea could be applied to other physical variables such as acquisition time or sensor type by replacing the GSD gate with an analogous scalar.
  • Because the gate operates inside the low-rank update, the approach may transfer to other parameter-efficient methods beyond LoRA.
  • The visual GSD predictor opens the possibility of scale-aware models for imagery that lacks any metadata, such as historical or crowdsourced remote-sensing data.

Load-bearing premise

Modulating the LoRA subspace with a continuous GSD gate will produce stable, beneficial routing across all scales without harming performance when scale is ambiguous or missing.

What would settle it

Train an otherwise identical model that receives only a discrete GSD token or no scale input at all; if its accuracy on XLRS-Bench and OmniEarth-Bench equals or exceeds the CS-HLoRA version, the continuous-conditioning claim is falsified.

Figures

Figures reproduced from arXiv: 2605.07562 by Song Zhang, Xiaowei Zhang, Yanlong Chen, Yawei Li, Yilin Li, Yining Chen, Zili Yi.

Figure 1
Figure 1: Three paradigms for handling resolution in RS-VLMs. General RS-VLMs ignore GSD; GSD-as-token approaches inject it as text but elicit no parameter-level response and fail without metadata. ScaleEarth (ours) treats GSD as a continuous conditioning variable s that drives tier-structured gating, with SSE-U predicting (µ, log σ²) when GSD is unknown.
Figure 2
Figure 2: GeoScale-VQA at a glance. (a) Source-dataset landscape on the GSD × image-size plane, where marker area encodes QA count. (b) QA composition across GSD tiers, coloured by source dataset. (c) Global spatial coverage of the resulting corpus.
Figure 3
Figure 3: GeoScale-VQA construction pipeline. (A) GSD injection recovers per-sample GSD for existing corpora through a metadata-parse, sensor-lookup, and GSD-assignment resolver. (B) Scale-aware VQA generation prompts Qwen3-VL-32B with tier-conditioned instructions on PatternNet and MtSCCD, producing identification, scale-specific, and discriminative QA pairs before syntactic filtering. (C) Claude performs a four-a…
Figure 4
Figure 4: ScaleEarth two-stage training pipeline. Stage 1 conducts RS-domain supervised fine-tuning on the Qwen3-VL-8B backbone. Stage 2 freezes the SFT-warmed backbone in 4-bit quantization and inserts CS-HLoRA modules into the LLM blocks. An SSE-U head predicts GSD and uncertainty (µ, log σ²) from ViT features, while a unified resolver provides the effective GSD to the CS-HLoRA gate. Training optimizes L = L_VLM + λ_gsd^eff · L_SSE-U.
Figure 5
Figure 5: CS-HLoRA gating algorithm. Each rank dimension is assigned to a scale tier through a learnable threshold τ_k. The sigmoid gate h_k(s) = σ(α(τ_k − s)), with s = log10(GSD), progressively deactivates fine-scale ranks as GSD becomes coarser. Fine-resolution imagery (GSD = 0.1 m) activates object-, structure-, and semantic-level ranks, whereas coarse imagery (GSD ≈ 20 m) primarily retains semantic ranks.
Figure 6
Figure 6: Leave-one-out ablation on XLRS-Bench-lite, a lightweight subset of XLRS-Bench for efficient evaluation in ultra-high-resolution remote-sensing settings. Bars report absolute average accuracy, with parenthesized in-bar values indicating the relative drop from the full model.
Figure 7
Figure 7: Stage 1 full-parameter RS SFT training dynamics. (a) SFT cross-entropy loss L_SFT with raw and EMA curves. (b) Per-step gradient norm ‖∇‖₂ on a logarithmic scale. (c) Cosine learning-rate schedule with linear warmup.
Figure 8
Figure 8: Stage 2 CS-HLoRA and SSE-U joint-training overview. (a) Total objective L_total = L_VLM + λ_gsd^eff · L_SSE-U. (b) VLM cross-entropy loss L_VLM. (c) GSD negative log-likelihood −log p(g | x). (d) Learned scale-conditioned gates h_k(s), s = log10(GSD/m). (e) Calibration ratio E|ĝ − g| / σ̄. (f) Per-module LoRA effective update magnitude ᾱ = (α/r)·√tr(BᵀBAAᵀ).
Figure 9
Figure 9: Stage 2 optimisation knobs. Left: cosine learning-rate schedule. Middle: gradient norm ‖∇‖₂ on a logarithmic scale. Right: effective GSD-loss weight λ_gsd^eff.
Figure 10
Figure 10: GSD-spoofing test on RSVQA-HR. The true GSD is 0.15 m. Ours (CS-HLoRA) peaks near the true GSD and degrades smoothly under scale mismatch, producing a clean bell on log-GSD. B4 (Bucketed MoE-LoRA) is parameter-matched and also peaks in the correct (high) bucket, confirming that conditional capacity helps; however its response is piecewise-constant within buckets and discontinuous at bucket edges, falling …
Figure 11
Figure 11: τ-decoupling analysis. Per-rank linear-probe R² on N = 5,000 probe images. Rows are CS-HLoRA rank dimensions grouped into object, structure, and semantic levels. Columns correspond to texture, geometry, and semantic factors. The block-diagonal pattern indicates factor-specific specialisation.
Figure 12
Figure 12: Qualitative comparison at GSD = 0.4 m (high-resolution case). The scale-blind RS-VLM gives a generic urban-and-natural contrast. The GSD-as-token baseline mentions several mid-scale entities. ScaleEarth produces a multi-level description containing object-level details, structure-level spatial patterns, and semantic-level functional interpretation.
Figure 13
Figure 13: Qualitative comparison at GSD = 10 m (low-resolution case). The scale-blind RS-VLM over-specifies sub-pixel details. The GSD-as-token baseline acknowledges the nominal resolution but still produces object-level claims. ScaleEarth restricts its response to coarse, scale-appropriate semantic descriptions.
Figure 14
Figure 14: Per-task qualitative comparison on XLRS-Bench. (a) Object spatial relationship. ScaleEarth correctly localizes the largest white vessel relative to the largest parking lot; the four proprietary baselines fail. (b) Object classification in a small bounding box. ScaleEarth, Claude Opus 4.7, and Gemini 3.1 Pro correctly identify the building category, while ChatGPT 5.5 and Qwen3.6 Plus rely on the surroundin…
Figure 15
Figure 15: Per-task qualitative comparison on XLRS-Bench. (c) Object color in a small, partially shadowed bounding box. ScaleEarth, Claude Opus 4.7, Gemini 3.1 Pro, and ChatGPT 5.5 correctly identify the target vehicle as black, while Qwen3.6 Plus misclassifies the hue under shadow. (d) Complex reasoning: based on the satellite images, is the area depicted suitable for economic development? (A) It is sui…
Figure 16
Figure 16: Per-task qualitative comparison on XLRS-Bench. (d) Complex reasoning. ScaleEarth identifies the visible barge fleet and port infrastructure as evidence of economic-trade suitability. (e) Anomaly detection and interpretation. ScaleEarth identifies the joint cue of roof-color uniformity and building density.
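Figure 8(f) tracks a per-module effective update magnitude ᾱ = (α/r)·√tr(BᵀBAAᵀ). Since tr(BᵀBAAᵀ) = ‖BA‖²_F by cyclicity of the trace, that quantity is just a scaled Frobenius norm of the LoRA update ΔW = BA. A pure-Python sketch on toy matrices (the paper's r = 64 and α = 32 appear only as defaults; the code is illustrative, not the authors'):

```python
import math

def frob_norm_BA(B, A):
    """Frobenius norm of the LoRA update Delta W = B @ A.

    tr(B^T B A A^T) = ||B A||_F^2 by cyclicity of the trace, so the
    magnitude in Figure 8(f), (alpha/r) * sqrt(tr(B^T B A A^T)), reduces
    to (alpha/r) * ||B A||_F. Dense toy matmul; B is d x r, A is r x k.
    """
    d, r, k = len(B), len(A), len(A[0])
    total = 0.0
    for i in range(d):
        for j in range(k):
            entry = sum(B[i][p] * A[p][j] for p in range(r))
            total += entry * entry
    return math.sqrt(total)

def effective_update(B, A, alpha=32, r=64):
    # Defaults mirror the reported hyperparameters (r = 64, alpha = 32).
    return (alpha / r) * frob_norm_BA(B, A)

# Tiny example with r = 1, so Delta W = B A is an outer product.
B = [[1.0], [2.0]]   # d x r = 2 x 1
A = [[3.0, 4.0]]     # r x k = 1 x 2
mag = effective_update(B, A, alpha=32, r=1)
```

Here ΔW = [[3, 4], [6, 8]], so ‖ΔW‖²_F = 9 + 16 + 36 + 64 = 125 and the magnitude is 32·√125.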
read the original abstract

Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs) spanning multiple orders of magnitude. Yet existing RS-VLMs often discard GSD or inject it as a discrete text token, forcing a single static parameter set to absorb the entire scale spectrum. We introduce ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable governing the model's computation path. At its core, CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA) modulates the LoRA low-rank subspace through a GSD-driven gate, enabling the model to dynamically route computation by physical scale. To remove reliance on sensor metadata at deployment, we pair CS-HLoRA with SSE-U, a lightweight heteroscedastic sub-head that predicts GSD and its uncertainty from visual features. To provide matching supervision, we construct GeoScale-VQA, a 1.5M-sample scale-layered RS-VQA corpus whose question-answer generation is conditioned on the same physical scalar that drives CS-HLoRA, forming a closed method-data loop. Trained with QLoRA on an 8B backbone, ScaleEarth achieves state-of-the-art results on remote-sensing benchmarks covering diverse Earth-system tasks, including XLRS-Bench and OmniEarth-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ScaleEarth, a parameter-efficient fine-tuning framework for remote sensing VLMs built on Qwen3-VL. It proposes CS-HLoRA to treat GSD as a continuous conditioning variable that modulates the LoRA subspace via a GSD-driven gate for dynamic scale-aware routing, paired with SSE-U (a heteroscedastic predictor) to estimate GSD and uncertainty from visual features alone. A 1.5M-sample GeoScale-VQA dataset is constructed with QA pairs generated conditioned on the same GSD scalar, and the method is trained via QLoRA on an 8B backbone, claiming SOTA results on remote-sensing benchmarks including XLRS-Bench and OmniEarth-Bench.

Significance. If the central claims hold, the work meaningfully advances RS-VLMs by replacing discrete GSD tokens or metadata ignorance with continuous, learned scale conditioning in a PEFT setting. The closed-loop design linking data generation to model conditioning and the addition of uncertainty-aware GSD prediction are technically interesting and address a real domain mismatch. Parameter efficiency on an 8B model is a practical strength. However, without quantitative results or robustness checks, the practical impact on diverse Earth-system tasks remains unverified.

major comments (3)
  1. [Abstract] The SOTA claim on XLRS-Bench and OmniEarth-Bench is asserted without performance metrics, baseline comparisons, ablation tables, or statistical tests; this claim is load-bearing for the central contribution, and its omission prevents verification of whether CS-HLoRA + SSE-U actually outperforms static QLoRA.
  2. [Method] CS-HLoRA and SSE-U integration: the assumption that the GSD-driven gate will deliver stable dynamic routing when SSE-U supplies predicted (rather than oracle) GSD is unsupported; no ablation with noisy GSD inputs, no MAE for SSE-U, and no comparison of oracle vs. predicted GSD performance is provided, directly undermining the no-metadata deployment claim.
  3. [Dataset construction] GeoScale-VQA question-answer pairs are generated conditioned on the identical physical GSD scalar that drives the CS-HLoRA gate, creating a closed loop; this risks optimistic bias and requires explicit testing with SSE-U predictions at both training and evaluation time to confirm generalization.
minor comments (2)
  1. The abstract introduces CS-HLoRA and SSE-U without first-use expansion or a one-sentence overview of the gate mechanism, reducing immediate clarity for readers unfamiliar with the acronyms.
  2. Implementation details such as the exact form of the GSD-driven gate (e.g., functional form, initialization) and the heteroscedastic loss for SSE-U are referenced but not shown in the provided text, complicating reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment point by point below. Where the comments identify gaps in supporting evidence or potential biases, we have incorporated revisions and additional experiments to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The SOTA claim on XLRS-Bench and OmniEarth-Bench is asserted without performance metrics, baseline comparisons, ablation tables, or statistical tests; this claim is load-bearing for the central contribution, and its omission prevents verification of whether CS-HLoRA + SSE-U actually outperforms static QLoRA.

    Authors: We agree that the abstract should include concrete quantitative support for the SOTA claims to allow immediate verification. In the revised manuscript we have updated the abstract to report key metrics, including absolute performance and relative gains over static QLoRA and other baselines on XLRS-Bench and OmniEarth-Bench. The full experimental section already contains the corresponding tables, ablations, and statistical details; the abstract revision simply surfaces the most salient numbers. revision: yes

  2. Referee: [Method] CS-HLoRA and SSE-U integration: the assumption that the GSD-driven gate will deliver stable dynamic routing when SSE-U supplies predicted (rather than oracle) GSD is unsupported; no ablation with noisy GSD inputs, no MAE for SSE-U, and no comparison of oracle vs. predicted GSD performance is provided, directly undermining the no-metadata deployment claim.

    Authors: This is a fair critique of the robustness evidence. We have added a dedicated ablation subsection that reports the MAE of SSE-U on held-out data, directly compares end-to-end performance when the CS-HLoRA gate receives oracle versus predicted GSD, and includes controlled noise-injection experiments on the predicted GSD values. The new results show only marginal degradation and confirm that dynamic routing remains stable, thereby supporting the no-metadata deployment scenario. revision: yes

  3. Referee: [Dataset construction] GeoScale-VQA question-answer pairs are generated conditioned on the identical physical GSD scalar that drives the CS-HLoRA gate, creating a closed loop; this risks optimistic bias and requires explicit testing with SSE-U predictions at both training and evaluation time to confirm generalization.

    Authors: We acknowledge the risk of optimistic bias inherent in the closed-loop construction. The revised manuscript now contains two additional experimental protocols: (1) training with oracle GSD but evaluating exclusively with SSE-U predictions, and (2) using SSE-U predictions for both training and evaluation. Performance remains consistent across these settings, indicating that the model does not overfit to the oracle conditioning and generalizes when the predictor is used end-to-end. revision: yes

Circularity Check

0 steps flagged

No significant circularity in claimed method or results

full rationale

The paper presents an empirical PEFT framework (CS-HLoRA + SSE-U) trained on a custom dataset (GeoScale-VQA) whose generation uses the same GSD scalar for supervision. This is explicitly described as providing 'matching supervision' rather than a mathematical derivation or first-principles prediction. No equations, uniqueness theorems, or self-cited load-bearing premises are shown that reduce the SOTA claims on XLRS-Bench/OmniEarth-Bench to inputs by construction. The inference-time use of predicted GSD is a separate deployment assumption, not a definitional loop. The work is self-contained against external benchmarks with no identified self-definitional, fitted-prediction, or ansatz-smuggling patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The framework introduces new modules (CS-HLoRA, SSE-U) and a custom dataset whose generation depends on the same scalar used at training time; without the full paper, specific learned parameters such as LoRA rank or gate network size remain unspecified.

free parameters (1)
  • CS-HLoRA gate parameters
    Learned weights that map GSD to modulation of the low-rank subspace; fitted during QLoRA fine-tuning.
axioms (1)
  • domain assumption: GSD can be treated as a continuous scalar that meaningfully governs visual feature computation across orders of magnitude
    Invoked when stating that the same object exhibits radically different visual evidence across GSDs.
invented entities (2)
  • CS-HLoRA (no independent evidence)
    purpose: Modulates LoRA low-rank subspace via GSD-driven gate for dynamic scale routing
    New module proposed to replace discrete token injection
  • SSE-U (no independent evidence)
    purpose: Predicts GSD and uncertainty from visual features to remove metadata dependence
    Lightweight heteroscedastic sub-head introduced for inference without sensor data

pith-pipeline@v0.9.0 · 5571 in / 1368 out tokens · 44988 ms · 2026-05-11T01:49:08.601164+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 9 internal anchors

  1. [1]

    Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631.

  2. [2]

    Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

Yanlong Chen, Amirhossein Habibian, Luca Benini, and Yawei Li. Gated relational alignment via confidence-based distillation for efficient VLMs. arXiv preprint arXiv:2601.22709.

  3. [3]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271.

  4. [4]

LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin

    Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran Fan, et al. LoRAMoE: Alleviate world knowledge forgetting in large language models via MoE-style plugin. arXiv preprint arXiv:2312.09979.

  5. [5]

RSGPT: A Remote Sensing Vision Language Model and Benchmark

    Y. Hu, J. Yuan, C. Wen, X. Lu, and X. Li. RSGPT: A remote sensing vision language model and benchmark. arXiv preprint arXiv:2307.15266, 2023.

  6. [6]

    GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276.

  7. [7]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Peiyuan Zhang, Kaichen Zhang, Fanyi Pu, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Accelerating the development of large multimodal models, March 2024a. URL https://github.com/EvolvingLMMs-Lab/ lmms-eval. Equal contribution: Bo Li, Peiyuan Zhang, Kaichen Zhang, and Fanyi Pu. Bo Li, Yuanhan Zh...

  8. [8]

Jialin Luo, Yuanzhi Wang, Ziqi Gu, Yide Qiu, Shuaizhen Yao, Fuyun Wang, Chunyan Xu, Wenhua Zhang, Dan Wang, and Zhen Cui. MMM-RS: A multi-modal, multi-GSD, multi-scene remote sensing dataset and benchmark for text-to-image generation. Advances in Neural Information Processing Systems, 37:12151–12163, 2024.

  9. [9]

    OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.

  10. [10]

    Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

  11. [11]

OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data

    Fengxiang Wang, Mingshuo Chen, Xuming He, Yi-Fan Zhang, Yueying Li, Feng Liu, Zijie Guo, Zhenghao Hu, Jiong Wang, Jingyi Xu, et al. OmniEarth-Bench: Towards holistic evaluation of Earth's six spheres and cross-spheres interactions with multimodal observational Earth data. arXiv preprint arXiv:2505.23522, 2025.

  12. [12]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.

  13. [13]

Mixture of LoRA Experts

    Xun Wu, Shaohan Huang, and Furu Wei. Mixture of LoRA experts. arXiv preprint arXiv:2404.13628.

  14. [14]

RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding

    Linrui Xu, Ling Zhao, Wang Guo, Qiujun Li, Kewang Long, Kaiqi Zou, Yuhan Wang, and Haifeng Li. RS-GPT4V: A unified multimodal instruction-following dataset for remote sensing image understanding. arXiv preprint arXiv:2406.12479.

  15. [15]

    Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output

Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, et al. InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320, 2024.

  16. [16]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479.

  17. [17]

Internal anchor into the paper's benchmark table: results row for InternVL3-7B (Zhu et al.).

  18. [18]

Internal anchor into the paper's ablation table: rows for Qwen3-VL-8B (backbone, zero-shot; Bai et al. [2025]), Stage-1 only (RS-GPT4V SFT), Stage-1 + Standard LoRA (r=64) [B2], and further LoRA variants adapted from the same Qwen3-VL-8B backbone.

  19. [19]

Internal anchor, headline result: Table 2 reports per-sphere VQA accuracy on OmniEarth-Bench. Under the protocol-matched multiple-choice setting, ScaleEarth attains an average accuracy of 40.71%, exceeding the strongest zero-shot open-source baseline, InternVL3-72B (33.26%), by +7.45 percentage points. It also outperforms th…

  20. [20]

Internal anchor: Stage 2 dynamics and optimisation knobs. Figures 8 and 9 jointly summarise the behaviour of the CS-HLoRA and SSE-U joint-training stage. The multi-task objective is not a zero-sum trade-off between language modelling and scale estimation: the VLM cross-entropy decreases monotonically from approximately 0.6 to 0.18, while the GSD negative log-likelihood als…

  21. [21]

Protocol. We use the RSVQA-HR test split, whose ground-truth ground sample distance is g⋆ = 0.15 m.

    Internal anchor into the GSD-spoofing test described in Figure 10.

  22. [22]

    small vehicles

Internal anchor: Stage 2 configuration. The vision tower, multimodal projector, and LLM backbone remain frozen at their Stage 1 values. CS-HLoRA uses rank r = 64, scaling factor α = 32, and zero dropout, and is attached to all linear projections in the LLM. Source mix: RSVQA family (∼1.14 M samples, precise GSD annotation, dominant VQA supervision from LR, HR, and xBEN subsets); MtSCCD-VQA (∼194 K, pre…