pith. machine review for the scientific record.

arxiv: 2605.07273 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: no theorem link

From Clouds to Hallucinations: Atmospheric Retrieval Hijacking in Remote Sensing Vision-Language RAG

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:16 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords remote sensing · multimodal RAG · vision-language retrieval · adversarial attack · cloud patterns · evidence hijacking · weather hallucination · input perturbation

The pith

Adding optimized cloud-like patterns to remote sensing images hijacks vision-language retrieval in multimodal RAG systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CloudWeb, an attack that modifies only the input image by overlaying parameterized cloud- and haze-like patterns. These patterns are optimized with a retrieval-oriented objective to pull the image embedding toward target atmospheric evidence, suppress the original scene, enforce rank separation, and preserve natural appearance. Evaluation across a seven-dataset remote sensing benchmark and five CLIP-style retrievers shows consistent outperformance over clean retrieval and baselines, with large gains in weather-related evidence ranking that then affect downstream generation. A sympathetic reader would care because it identifies a vulnerability at the evidence retrieval stage in systems that ground visual queries in external knowledge, where subtle input changes can lead to incorrect grounding before any response is generated.

Core claim

CloudWeb is an atmospheric retrieval hijacking attack that overlays optimized cloud- and haze-like patterns on remote sensing images, modifying only the input while keeping the retriever, generator, and knowledge base fixed. Its retrieval-oriented objective pulls adversarial image embeddings toward target atmospheric evidence, suppresses source-scene evidence, enforces rank separation, and regularizes for naturalness and coverage. On a seven-dataset benchmark with five CLIP-style retrievers including GeoRSCLIP, it consistently outperforms clean retrieval, handcrafted atmospheric baselines, random perturbations, and fixed variants, raising Weather@5 from 0.71% to 43.29% on GeoRSCLIP ViT-B/32.

What carries the argument

CloudWeb, an optimization procedure that parameterizes cloud- and haze-like patterns on input images and refines them to achieve specific retrieval ranking objectives while preserving apparent naturalness.
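The objective described above can be sketched as a composite loss over cosine similarities in the retriever's embedding space. The term names, the hinge form of the rank-separation loss, and the L2 naturalness penalty below are hypothetical reconstructions, not the paper's implementation.

```python
import numpy as np

def cloudweb_style_loss(img_emb, target_embs, source_embs,
                        margin=0.1, lam=0.01, delta=None):
    """Composite retrieval objective in the spirit of CloudWeb (sketch only).

    img_emb:     (d,) adversarial image embedding
    target_embs: (k, d) embeddings of target atmospheric evidence
    source_embs: (m, d) embeddings of source-scene evidence
    delta:       optional perturbation parameters, penalized for naturalness
    """
    def cos_sims(a, B):
        a = a / np.linalg.norm(a)
        B = B / np.linalg.norm(B, axis=1, keepdims=True)
        return B @ a

    s_tar = cos_sims(img_emb, target_embs)   # target atmospheric evidence
    s_src = cos_sims(img_emb, source_embs)   # original scene evidence
    l_pull = -s_tar.mean()                   # pull toward target evidence
    l_supp = s_src.mean()                    # suppress source-scene evidence
    # rank separation: worst target item should beat best source item by a margin
    l_rank = max(0.0, margin + s_src.max() - s_tar.min())
    # naturalness: penalize large perturbation parameters (crude stand-in)
    l_nat = lam * float(np.mean(delta ** 2)) if delta is not None else 0.0
    return l_pull + l_supp + l_rank + l_nat
```

An attacker would minimize such a loss over the cloud-pattern parameters by gradient descent through the frozen image encoder; the retriever, generator, and knowledge base are never touched.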

If this is right

  • Natural-looking atmospheric changes to input images can compromise evidence retrieval before generation begins in remote sensing multimodal RAG.
  • The attack requires no changes to the deployed retriever, generator, or knowledge base and works across multiple vision-language retrievers.
  • Retrieval-stage hijacking propagates to downstream vision-language generation, producing measurable weather hallucination and semantic shift.
  • Input-space threats at the retrieval stage represent a practical failure mode distinct from corpus manipulation or end-task attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenses could include input sanitization steps that detect or neutralize subtle atmospheric perturbations before retrieval occurs.
  • The vulnerability might extend to other visual domains where environmental factors influence image interpretation in RAG pipelines.
  • Testing the attack on live deployed remote sensing RAG systems would show whether the optimized patterns remain effective outside benchmark conditions.
  • Similar optimization approaches could be used to target non-weather evidence categories, broadening the scope of retrieval hijacking.
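As a concrete illustration of the first bullet, one crude sanitization signal is an image's high-frequency spectral energy: an optimized overlay may distribute energy differently from natural haze. The cutoff and any decision threshold here are assumptions, not a defense the paper evaluates.

```python
import numpy as np

def highfreq_ratio(img, cutoff=0.25):
    """Fraction of spectral energy above a radial frequency cutoff.

    Hypothetical sanitization signal: an image whose ratio falls outside
    the range typical of clean scenes could be flagged for review before
    retrieval. `cutoff` is a normalized radius (0 = DC, 1 = Nyquist edge).
    """
    f = np.fft.fftshift(np.fft.fft2(img.astype(float)))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    # radial distance from the spectrum center, normalized
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    power = np.abs(f) ** 2
    return power[r > cutoff].sum() / power.sum()
```

A smooth natural gradient concentrates energy near DC, while noise-like perturbations push energy outward; a deployed defense would calibrate the threshold on clean imagery.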

Load-bearing premise

The attack assumes that the parameterized cloud- and haze-like patterns can be optimized to achieve the retrieval objective while maintaining naturalness, and that this holds across different retrievers and datasets without the patterns being filtered or detected in practical remote sensing pipelines.

What would settle it

Two observations would undercut the claim: applying the optimized cloud patterns to standard remote sensing datasets with CLIP-style retrievers and finding no increase in weather-related evidence among the top retrieved results, or finding that practical pipelines automatically filter the patterns out before retrieval.
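Settling the first observation requires computing the metric itself. A plausible reconstruction of a Weather@k-style measure is the fraction of queries whose top-k retrieved evidence contains at least one weather-related item; the paper's exact formulation may differ.

```python
def weather_at_k(ranked_evidence, is_weather, k=5):
    """Hypothetical Weather@k: share of queries with >= 1 weather item
    in the top-k retrieved evidence.

    ranked_evidence: list of per-query evidence lists, best-ranked first
    is_weather:      predicate labeling an evidence item as weather-related
    """
    hits = sum(any(is_weather(e) for e in ranking[:k])
               for ranking in ranked_evidence)
    return hits / len(ranked_evidence)
```

Under this reading, the reported jump from 0.71% to 43.29% means weather evidence almost never reaches the top five for clean queries but does so for nearly half of the perturbed ones.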

Figures

Figures reproduced from arXiv: 2605.07273 by Chao Li, Chengyin Hu, Fengyu Zhang, Jiahuan Long, Jiaju Han, Jiujiang Guo, Qike Zhang, Xiang Chen, Xin Wang, Xuemeng Sun, Yiwei Wei.

Figure 1. Overview of CloudWeb. CloudWeb optimizes cloud- and haze-like perturbations to shift a remote sensing image toward weather-related evidence in a frozen vision-language retrieval space. The hijacked top-k evidence is then passed to the downstream VLM, inducing evidence-grounded hallucinations without modifying the retriever, generator, vector database, or retrieval logic.
Figure 2. Qualitative retrieval-hijacking examples. Clean queries retrieve scene-consistent evidence, while CloudWeb-perturbed queries retrieve weather-related evidence.
Figure 3. Propagation of retrieval hijacking to downstream generation. Clean responses preserve source-scene semantics, while adversarial responses are shifted toward weather-related semantics after CloudWeb redirects retrieved evidence.
Figure 4. Attention shift induced by CloudWeb. Weather-oriented activation becomes stronger around atmospheric perturbation regions in adversarial queries.
Figure 5. Robustness of CloudWeb under post-processing. CloudWeb preserves retrieval disruption and weather-evidence hijacking under JPEG compression, Gaussian blur, and resizing.
Figure 6. Opacity-severity interaction. Weather@5 generally increases with cloud opacity and severity across five retrievers, indicating that stronger atmospheric patterns more readily trigger weather-evidence retrieval.
Figure 7. Loss ablation. Radial values are normalized to the full model and zoomed to 85–105%. Removing L_tar and L_rank most weakens weather hijacking.
Figure 8. Perturbation component ablation. Normalized radar visualization of CloudWeb variants with different rendering components and perturbation strengths. Lower opacity and severity reduce Weather@5 to 8.00% and 11.71%, indicating that visible cloud-like structure is crucial for activating weather evidence.
Original abstract

Multimodal RAG systems increasingly rely on vision-language retrievers to ground visual queries in external textual evidence. Existing adversarial studies on RAG mainly manipulate the retrieval corpus or memory, while attacks on vision-language and remote sensing models typically target end-task predictions. Input-space threats to the evidence retrieval stage of remote sensing multimodal RAG remain underexplored. To address this gap, we introduce CloudWeb, an atmospheric retrieval hijacking attack that modifies only the input image while keeping the retriever, generator, and knowledge base fixed at deployment. CloudWeb overlays parameterized cloud- and haze-like patterns on remote sensing images and optimizes them with a retrieval-oriented objective that pulls adversarial image embeddings toward target atmospheric evidence, suppresses source-scene evidence, enforces rank separation, and regularizes naturalness and coverage. To the best of our knowledge, this is the first study of retrieval-stage atmospheric evidence hijacking in remote sensing multimodal RAG. We evaluate CloudWeb on a seven-dataset remote sensing RAG benchmark with five CLIP-style retrievers, including GeoRSCLIP, RemoteCLIP, OpenAI CLIP, and OpenCLIP, together with downstream vision-language generators. Across retrievers, CloudWeb consistently outperforms clean retrieval, handcrafted atmospheric baselines, random cloud perturbations, and fixed variants in injecting weather-related evidence into top-ranked results. On GeoRSCLIP ViT-B/32, Weather@5 increases from 0.71% to 43.29%. Downstream generation further shows measurable weather hallucination and semantic shift, indicating that retrieval-stage hijacking can propagate to the final RAG response. These findings reveal a practical failure mode: natural-looking atmospheric changes can compromise evidence retrieval before generation begins.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CloudWeb, an input-space attack on remote sensing vision-language RAG systems that overlays optimized, parameterized cloud- and haze-like patterns on query images. The attack optimizes these patterns via a retrieval-oriented loss that pulls embeddings toward target atmospheric evidence, suppresses source evidence, enforces rank separation, and includes a naturalness regularizer, while keeping the retriever, generator, and knowledge base fixed. Evaluations across seven datasets and five CLIP-style retrievers (including GeoRSCLIP, RemoteCLIP, OpenAI CLIP, and OpenCLIP) show consistent outperformance over clean retrieval, handcrafted baselines, random perturbations, and fixed variants, with examples such as Weather@5 rising from 0.71% to 43.29% on GeoRSCLIP ViT-B/32; downstream VL generation exhibits measurable weather hallucination and semantic shift.

Significance. If the naturalness and transferability claims hold, the work is significant for exposing a practical vulnerability at the retrieval stage of multimodal RAG pipelines used in remote sensing, where small atmospheric modifications can inject misleading evidence before generation occurs. The quantitative results across multiple retrievers and datasets, together with explicit comparisons to handcrafted and random baselines, provide a concrete empirical foundation that could guide defenses such as adversarial training or input sanitization for these systems.

major comments (2)
  1. [Optimization objective and evaluation protocol] The central practical claim—that the optimized patterns remain 'natural-looking' and thus evade detection or filtering in real remote-sensing pipelines—rests on the naturalness regularizer alone. No human perceptual study, expert review, or domain-specific metric (cloud texture statistics, spectral consistency, or perceptual similarity scores) is reported to validate that the outputs pass as authentic atmospheric effects rather than artifacts.
  2. [Experimental results and tables] The reported gains (e.g., Weather@5 increases) are presented as evidence of a general attack surface, yet the manuscript provides no statistical significance tests, confidence intervals, or analysis of variance across the seven-dataset benchmark; without these, it is unclear whether the outperformance over baselines is robust or could be explained by dataset-specific biases or retriever quirks.
minor comments (2)
  1. [Abstract] The abstract introduces 'Weather@5' and 'semantic shift' without a brief parenthetical definition or reference to the precise retrieval metric formulation, which reduces immediate accessibility for readers outside the remote-sensing RAG subfield.
  2. [Figures and tables] Figure captions and table headers would benefit from explicit statements of the exact number of runs or seeds used to compute the reported percentages, aiding reproducibility assessment.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve validation of naturalness and statistical robustness of the results.

Point-by-point responses
  1. Referee: [Optimization objective and evaluation protocol] The central practical claim—that the optimized patterns remain 'natural-looking' and thus evade detection or filtering in real remote-sensing pipelines—rests on the naturalness regularizer alone. No human perceptual study, expert review, or domain-specific metric (cloud texture statistics, spectral consistency, or perceptual similarity scores) is reported to validate that the outputs pass as authentic atmospheric effects rather than artifacts.

    Authors: We agree that the naturalness claim would be strengthened by additional validation beyond the regularizer. The regularizer penalizes unrealistic deviations using coverage and smoothness terms to encourage patterns resembling authentic atmospheric effects. However, the original manuscript does not include human studies or domain-specific metrics. In revision, we will add quantitative assessments such as perceptual similarity scores (e.g., LPIPS against real cloud images) and cloud texture statistics (e.g., power spectrum analysis for spatial frequency consistency) to better support that the patterns appear as genuine atmospheric phenomena rather than artifacts. revision: yes
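The proposed power-spectrum check could take the form of a radially averaged spectrum, compared between optimized overlays and real cloud imagery. This is a sketch under assumed definitions, not the authors' code.

```python
import numpy as np

def radial_power_spectrum(img, nbins=16):
    """Radially averaged power spectrum, a simple texture statistic.

    Hypothetical naturalness check: if optimized cloud overlays show a
    spatial-frequency profile unlike real clouds, the patterns may be
    detectable as artifacts. Binning scheme is an assumption.
    """
    f = np.fft.fftshift(np.fft.fft2(img.astype(float)))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    bins = np.linspace(0, r.max(), nbins + 1)
    idx = np.clip(np.digitize(r.ravel(), bins) - 1, 0, nbins - 1)
    spectrum = np.bincount(idx, weights=power.ravel(), minlength=nbins)
    counts = np.bincount(idx, minlength=nbins)
    return spectrum / np.maximum(counts, 1)  # mean power per radial bin
```

Comparing these profiles (e.g., via a distance between log-spectra of perturbed and genuinely cloudy scenes) would give the quantitative texture statistic the rebuttal promises.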

  2. Referee: [Experimental results and tables] The reported gains (e.g., Weather@5 increases) are presented as evidence of a general attack surface, yet the manuscript provides no statistical significance tests, confidence intervals, or analysis of variance across the seven-dataset benchmark; without these, it is unclear whether the outperformance over baselines is robust or could be explained by dataset-specific biases or retriever quirks.

    Authors: We acknowledge the value of statistical analysis to confirm robustness. The manuscript reports consistent gains across seven datasets and five retrievers (e.g., Weather@5 rising from 0.71% to 43.29% on GeoRSCLIP), outperforming clean, handcrafted, random, and fixed baselines. To address this, we will add paired statistical significance tests (e.g., t-tests or Wilcoxon signed-rank) between CloudWeb and baselines, report 95% confidence intervals for key metrics, and include ANOVA to assess variance across datasets and retrievers, thereby demonstrating that improvements are not attributable to dataset-specific biases. revision: yes
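One way to produce the promised interval estimates is a paired bootstrap over per-dataset metric differences. The function below is a sketch of that idea, not the analysis the authors will run; they may prefer Wilcoxon tests or ANOVA as stated.

```python
import numpy as np

def paired_bootstrap_ci(attack_scores, baseline_scores,
                        n_boot=2000, alpha=0.05, seed=0):
    """95% bootstrap CI on the mean paired difference of a retrieval
    metric (e.g. Weather@5) between attack and baseline, one score per
    dataset/retriever pair. Sketch only; resampling scheme is assumed.
    """
    diffs = np.asarray(attack_scores, float) - np.asarray(baseline_scores, float)
    rng = np.random.default_rng(seed)
    # resample paired differences with replacement, keeping the pairing
    boots = rng.choice(diffs, size=(n_boot, diffs.size), replace=True).mean(axis=1)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), (lo, hi)
```

A confidence interval excluding zero would support the claim that the gains are not artifacts of a single dataset or retriever quirk.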

Circularity Check

0 steps flagged

No circularity: empirical attack evaluation is self-contained

full rationale

The paper presents CloudWeb as a parameterized optimization attack on vision-language retrievers in remote sensing RAG, using a composite objective (retrieval pull, suppression, rank separation, plus naturalness regularizer) evaluated directly on seven datasets across five CLIP-style models. Performance gains (e.g., Weather@5 from 0.71% to 43.29%) are measured outcomes against explicit baselines (clean retrieval, handcrafted patterns, random perturbations), not quantities derived by construction from the inputs or fitted parameters renamed as predictions. No self-citation chain, uniqueness theorem, or ansatz smuggling supports the central claims; the method is an empirical demonstration of retrieval hijacking rather than a deductive derivation. The naturalness term is part of the attack formulation but does not create circularity in the reported results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the success of the optimization process, which involves multiple tunable parameters and assumptions about embedding space manipulability.

free parameters (1)
  • pattern parameterization coefficients
    The cloud and haze patterns are parameterized and optimized, implying several free parameters for their generation and the weights in the retrieval-oriented objective.
axioms (1)
  • domain assumption: vision-language retrievers like CLIP map images and text to a shared embedding space where proximity indicates semantic similarity
    The attack relies on shifting image embeddings toward target text embeddings via input perturbations.
invented entities (1)
  • CloudWeb attack framework (no independent evidence)
    purpose: To perform atmospheric retrieval hijacking
    Introduced as a new method in this work.

pith-pipeline@v0.9.0 · 5651 in / 1609 out tokens · 50006 ms · 2026-05-11T01:16:44.179626+00:00 · methodology

discussion (0)

