arxiv: 2601.14044 · v1 · submitted 2026-01-20 · 💻 cs.CV

Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology

Kaiyu Wu , Pucheng Han , Hualong Zhang , Naigeng Wu , Keze Wang

Pith reviewed 2026-05-16 12:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal reasoning · reinforcement fine-tuning · logical consistency · meteorology · vision language models · self-contradictory reasoning · WeatherQA

The pith

A logical consistency reward added to reinforcement fine-tuning produces a weather reasoning model that avoids self-contradictory answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds WeatherQA, a multimodal benchmark for meteorology questions that pairs images with reasoning tasks. It develops LoCo-RFT, which adds a reward term during reinforcement learning to penalize cases where the model's step-by-step reasoning contradicts its final answer. This training produces Weather-R1, a vision-language model that raises accuracy on the benchmark by 9.8 points over the base model and exceeds both standard supervised fine-tuning and ordinary reinforcement fine-tuning. The improvement matters in meteorology because decisions such as severe weather alerts require reasoning that remains internally consistent. The approach demonstrates that faithfulness constraints can be directly optimized rather than left to post-training checks.

Core claim

Weather-R1 is obtained by applying LoCo-RFT to a base vision-language model. The logical consistency reward checks whether the answer implied by the generated reasoning chain matches the predicted final answer. The resulting model achieves a 9.8 percentage point lift on WeatherQA over the baseline, outperforms both supervised fine-tuning and conventional reinforcement fine-tuning, and even surpasses the original Qwen2.5-VL-32B.

What carries the argument

LoCo-RFT, the reinforcement fine-tuning procedure that augments the standard reward with an explicit logical consistency term that scores whether the reasoning steps support the final answer without internal contradiction.
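The rule quoted further down this page (Eq. 2: the consistency reward is 1 iff the answer implied by the reasoning equals the final answer and the format reward is 1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `<think>`/`<answer>` tag layout, the four-option answer set, and the regex used to recover the reasoning-implied answer are all assumptions made for this sketch, and the paper may extract that answer quite differently.

```python
import re
from typing import Optional

# ASSUMPTION: responses look like <think>reasoning</think><answer>LETTER</answer>
# with options A-D. The tag names and option set are not taken from the paper.
RESPONSE_RE = re.compile(
    r"^\s*<think>(?P<reasoning>.*?)</think>\s*"
    r"<answer>\s*(?P<answer>[A-D])\s*</answer>\s*$",
    re.DOTALL,
)

# ASSUMPTION: a hypothetical stand-in for whatever recovers the answer that the
# reasoning itself arrives at (fa_rp). A real system might use a judge model.
IMPLIED_RE = re.compile(r"(?i:answer is)\s*\(?([A-D])\b")


def format_reward(response: str) -> int:
    """R_Format: 1 if the response follows the assumed tag layout, else 0."""
    return 1 if RESPONSE_RE.match(response) else 0


def answer_from_reasoning(reasoning: str) -> Optional[str]:
    """fa_rp: the option the reasoning concludes with (last match wins)."""
    matches = IMPLIED_RE.findall(reasoning)
    return matches[-1] if matches else None


def loco_reward(response: str) -> int:
    """R_LoCo = 1 iff fa_rp == fa and R_Format == 1."""
    match = RESPONSE_RE.match(response)
    if match is None:
        return 0  # malformed output: R_Format is 0, so R_LoCo is 0
    fa = match.group("answer")
    fa_rp = answer_from_reasoning(match.group("reasoning"))
    # Reasoning with no recoverable conclusion (fa_rp is None) earns 0.
    return 1 if fa_rp == fa else 0
```

A response whose reasoning ends in "the answer is B" but whose answer tag holds C passes the format check and still earns a consistency reward of 0, which is the Self-Contra case the paper targets.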

If this is right

Weather-R1 can be used directly in automated weather analysis pipelines where answer consistency is required.
The same consistency reward can be applied to other multimodal reasoning benchmarks beyond meteorology.
Training with LoCo-RFT yields gains larger than those from supervised fine-tuning alone on the same base model.
Larger base models may achieve further absolute gains when fine-tuned with the same logical consistency term.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Logical consistency rewards could become a default component in reinforcement fine-tuning pipelines for any domain that penalizes internal contradictions.
Future meteorology benchmarks may add explicit tests for reasoning faithfulness to measure whether gains generalize to unseen weather phenomena.
The method suggests that faithfulness and accuracy can be improved together rather than traded off.

Load-bearing premise

The added logical consistency reward removes self-contradictory reasoning without creating new failure modes or lowering performance on other meteorology tasks.

What would settle it

A single WeatherQA example on which the trained model still produces a reasoning chain that directly contradicts its own final answer would falsify the claim that the reward reliably eliminates self-contradictory reasoning.

Original abstract

While Vision Language Models (VLMs) show advancing reasoning capabilities, their application in meteorology is constrained by a domain gap and a reasoning faithfulness gap. Specifically, mainstream Reinforcement Fine-Tuning (RFT) can induce Self-Contradictory Reasoning (Self-Contra), where the model's reasoning contradicts its final answer, which is unacceptable in such a high-stakes domain. To address these challenges, we construct WeatherQA, a novel multimodal reasoning benchmark in meteorology. We also propose Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT), which resolves Self-Contra by introducing a logical consistency reward. Furthermore, we introduce Weather-R1, the first reasoning VLM with logical faithfulness in meteorology, to the best of our knowledge. Experiments demonstrate that Weather-R1 improves performance on WeatherQA by 9.8 percentage points over the baseline, outperforming Supervised Fine-Tuning and RFT, and even surpassing the original Qwen2.5-VL-32B. These results highlight the effectiveness of our LoCo-RFT and the superiority of Weather-R1. Our benchmark and code are available at https://github.com/Marcowky/Weather-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Weather-R1 adds a logical consistency reward to RFT and reports a 9.8-point gain on its new WeatherQA benchmark.

The one thing to know is that this paper adds a logical consistency reward to reinforcement fine-tuning and reports a 9.8 percentage point improvement on a new meteorology benchmark called WeatherQA. The authors fine-tune a base vision-language model with their LoCo-RFT method, whose consistency term stops the model's reasoning steps from contradicting its final answer. The resulting Weather-R1 model beats supervised fine-tuning and standard RFT on WeatherQA, and even surpasses the original Qwen2.5-VL-32B. The benchmark and code are public, which is straightforward and useful.

The work does a good job of highlighting a real problem for VLMs in high-stakes domains. Self-contradictory reasoning is unacceptable in meteorology, and the targeted reward addresses it directly. The numerical gains over the comparison methods suggest the approach has merit in their setup.

The soft spots are around verification. All gains are on the newly built benchmark, and the abstract does not include ablations or details on how the questions were selected. If the benchmark happens to favor models trained with consistency rewards, the advantage could shrink on other tests. More error analysis would help show whether the fix introduces other issues.

This paper is for researchers applying VLMs to scientific fields or working on faithfulness in model outputs. A reader looking for a worked example of reward design in RFT would find one here.

It deserves a serious referee. The public artifacts allow direct checks on the claims, and the core contribution is testable. I would recommend sending it for peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces WeatherQA, a new multimodal reasoning benchmark for meteorology, and proposes LoCo-RFT, a reinforcement fine-tuning approach that adds a logical consistency reward to mitigate self-contradictory reasoning in vision-language models. It presents Weather-R1, which reportedly achieves a 9.8 percentage point gain on WeatherQA over the baseline, outperforming both supervised fine-tuning and standard RFT while surpassing the original Qwen2.5-VL-32B model. Code and benchmark are released for verification.

Significance. If the empirical gains and the causal role of the consistency reward are confirmed, the work would be significant for high-stakes multimodal reasoning domains. The release of code and benchmark directly supports reproducibility of the 9.8 pp delta and the logical-consistency metric, addressing a practical faithfulness gap in VLMs applied to meteorology.

major comments (2)

[Experiments] Experiments section: the reported 9.8 pp improvement on WeatherQA requires explicit statistical significance testing (e.g., multiple random seeds or bootstrap intervals) and full ablation tables isolating the consistency reward from prompt engineering or data effects; without these, it remains unclear whether the reward term is the primary driver.
[Benchmark] Benchmark construction (Section 3 or 4): additional details are needed on how WeatherQA items were generated and filtered to exclude any overlap with the training distribution or reward-signal tuning, given that the central claim rests on performance on this new benchmark.
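The bootstrap intervals requested in the first major comment could be computed along the following lines. This is a sketch under stated assumptions, not the authors' evaluation code: it assumes per-item 0/1 correctness flags are available for two models on the same benchmark items, uses a plain percentile bootstrap, and the function name `paired_bootstrap_ci` is hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    base_correct: List[int],
    tuned_correct: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed gain, CI low, CI high) in percentage points.

    Inputs are per-item 0/1 correctness flags for two models evaluated
    on the same items, in the same order.
    """
    if len(base_correct) != len(tuned_correct) or not base_correct:
        raise ValueError("need two non-empty, equally long correctness lists")
    n = len(base_correct)
    # Per-item difference keeps each item's pair of outcomes together.
    diffs = [t - b for b, t in zip(base_correct, tuned_correct)]
    observed = 100.0 * sum(diffs) / n

    rng = random.Random(seed)
    gains = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute the accuracy gain.
        sample = rng.choices(diffs, k=n)
        gains.append(100.0 * sum(sample) / n)
    gains.sort()
    low = gains[int((alpha / 2) * n_resamples)]
    high = gains[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, low, high
```

If the interval for a reported gain excludes zero, the improvement is unlikely to be an artifact of which items happened to land in the test set. Variation across training seeds is a separate source of noise and would need repeated training runs, which this sketch does not cover.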

minor comments (2)

[Method] Notation for the logical consistency reward should be defined once in the methods section with a clear equation rather than repeated descriptively.
[Qualitative results] Figure 2 (or equivalent) comparing reasoning traces would benefit from explicit annotation of the self-contradiction points to make the qualitative improvement immediately visible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will update the manuscript accordingly.

Referee: [Experiments] Experiments section: the reported 9.8 pp improvement on WeatherQA requires explicit statistical significance testing (e.g., multiple random seeds or bootstrap intervals) and full ablation tables isolating the consistency reward from prompt engineering or data effects; without these, it remains unclear whether the reward term is the primary driver.

Authors: We agree that statistical significance testing and more granular ablations are needed to strengthen the claims. In the revised manuscript we will report results over five random seeds with bootstrap confidence intervals for the 9.8 pp gain on WeatherQA. We will also add expanded ablation tables that isolate the logical consistency reward while holding prompt engineering and data composition fixed, to test whether the reward term is the primary driver.

Revision: yes
Referee: [Benchmark] Benchmark construction (Section 3 or 4): additional details are needed on how WeatherQA items were generated and filtered to exclude any overlap with the training distribution or reward-signal tuning, given that the central claim rests on performance on this new benchmark.

Authors: We will expand Section 3 with a detailed description of the WeatherQA generation and filtering pipeline. The revision will explicitly document the steps taken to prevent overlap with the training distribution and to ensure that benchmark items were not influenced by reward-signal tuning, including the specific filtering criteria and verification procedures used.

Revision: yes
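One simple check of the kind the benchmark comment asks for is an exact-duplicate scan between training and evaluation questions. The sketch below is an illustration only and does not describe the WeatherQA pipeline: the function names are hypothetical, it compares question text only, and it assumes ASCII English text. A multimodal benchmark would also need an image-level comparison, which is not shown.

```python
import hashlib
import re
from typing import Iterable, List


def normalize(question: str) -> str:
    """Lowercase, replace non-alphanumeric characters, collapse whitespace."""
    # ASSUMPTION: ASCII English text; other scripts would be stripped here.
    text = re.sub(r"[^a-z0-9\s]", " ", question.lower())
    return " ".join(text.split())


def fingerprint(question: str) -> str:
    """Stable hash of the normalized question text."""
    return hashlib.sha256(normalize(question).encode("utf-8")).hexdigest()


def find_overlap(
    train_questions: Iterable[str], test_questions: Iterable[str]
) -> List[str]:
    """Return test questions whose normalized text also appears in training."""
    train_prints = {fingerprint(q) for q in train_questions}
    return [q for q in test_questions if fingerprint(q) in train_prints]
```

Exact matching after normalization only catches verbatim or near-verbatim repeats. Paraphrased questions or reused images would pass this check, so an empty result is a necessary condition for a clean split, not a sufficient one.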

Circularity Check

0 steps flagged

No significant circularity in empirical claims or benchmark evaluation

The paper reports an empirical performance improvement (9.8 pp on WeatherQA) obtained by adding a logical-consistency reward term to reinforcement fine-tuning. The benchmark is newly constructed by the authors but is released together with code, making the measured delta independently verifiable rather than internally forced. No derivation chain, equation, or self-citation reduces the central result to a tautology or to a fitted parameter renamed as a prediction. The logical-consistency reward is an explicit, additive training signal whose effect is evaluated on held-out benchmark items; nothing in the provided text indicates that benchmark construction itself encodes the reward function. Consequently the reported ordering of methods (Weather-R1 above both SFT and standard RFT) rests on external measurement rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on the assumption that a scalar logical-consistency reward can be defined and optimized without side effects; no free parameters or invented entities are visible in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/LogicAsFunctionalEquation.lean SatisfiesLawsOfLogic echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

$R_{\mathrm{LoCo}} = 1$ iff $fa_{rp} = fa$ and $R_{\mathrm{Format}} = 1$ (Eq. 2); Self-Contra proportion drops from ~30% (RFT) to ~2% (LoCo-RFT)

Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

singular optimization on final-answer correctness conflicts with logical consistency learned in pre-training

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 7 internal anchors

[1]

correctness of the final answer

INTRODUCTION Amid escalating global climate change, weather forecasters must interpret extensive meteorological images and charts, and deliver reliable information [1, 2]. Although deep learning has advanced data-driven weather forecasting [3, 4], open-ended interpretation and reasoning still rely heavily on human experts. Meanwhile, Vision Language Model...

[2]

Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology

METHODOLOGY 2.1. WeatherQA Benchmark The construction of our WeatherQA benchmark comprises 4 stages: Theme and Task Definition. In collaboration with meteorological experts, we define four themes for the benchmark: precipitation, weather phenomena, temperature, and weather systems. These themes correspond to seven specific imaging modality tasks (see F...

[3]

southerly airflow,

EXPERIMENTS Datasets & Tasks. We use WeatherQA as our training dataset and evaluation benchmark, following its defined cross-task protocol (see Section 2.1). Additionally, to measure the model’s OOD generalization capability, we curate a test set from ScienceQA [19], which consists of 324 multiple-choice questions related to weather and climate. Multip...

[4]

Furthermore, we identify the Self-Contra issue in RFT and propose a novel LoCo-RFT paradigm to mitigate it by rewarding faithful reasoning

CONCLUSION In this work, we introduce WeatherQA, a novel multimodal reasoning benchmark for meteorology. Furthermore, we identify the Self-Contra issue in RFT and propose a novel LoCo-RFT paradigm to mitigate it by rewarding faithful reasoning. Our Weather-R1 demonstrates the effectiveness of this paradigm by significantly reducing Self-Contra proport...
[5]

Ana Maria Vicedo-Cabrera, Noah Scovronick, Francesco Sera, Dominic Royé, Rochelle Schneider, Aurelio Tobias, Christofer Astrom, Y Guo, Y Honda, DM Hondula, et al., “The burden of heat-related mortality attributable to recent human-induced climate change,” Nature Climate Change, vol. 11, no. 6, pp. 492–500, 2021

[6]

Ben Clarke, Friederike Otto, Rupert Stuart-Smith, and Luke Harrington, “Extreme weather impacts of climate change: an attribution perspective,” Environmental Research: Climate, vol. 1, no. 1, pp. 012001, 2022

[7]

Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tianning Zhang, Rui Su, et al., “Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead,” arXiv preprint arXiv:2304.02948, 2023

[8]

Yogesh Verma, Markus Heinonen, and Vikas Garg, “Climode: Climate and weather forecasting with physics-informed neural odes,” arXiv preprint arXiv:2404.10024, 2024

[9]

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al., “Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning,” arXiv e-prints, pp. arXiv–2507, 2025

[10]

Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, and Junjie Yan, “One rl to see them all: Visual triple unified reinforcement learning,” 2025

[11]

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang, “Visual-rft: Visual reinforcement fine-tuning,” 2025

[12]

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al., “Kimi-vl technical report,” arXiv preprint arXiv:2504.07491, 2025

[13]

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang, “Reason-rft: Reinforcement fine-tuning for visual reasoning,” 2025

[14]

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao, “Llava-med: Training a large language-and-vision assistant for biomedicine in one day,” Advances in Neural Information Processing Systems, vol. 36, 2024

[15]

Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert, “Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning,” arXiv preprint arXiv:2502.19634, 2025

[16]

Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, and Xiaofeng Yang, “Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models,” arXiv preprint arXiv:2503.13939, 2025

[17]

Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong, “G-llava: Solving geometric problem with multi-modal large language model,” 2025

[18]

Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Linghe Kong, Lichao Sun, and Weiran Huang, “Vision matters: Simple visual perturbations can boost multimodal math reasoning,” 2025

[19]

Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee, “Math-llava: Bootstrapping mathematical reasoning for multimodal large language models,” arXiv preprint arXiv:2406.17294, 2024

[20]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025

[21]

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al., “Dapo: An open-source llm reinforcement learning system at scale,” arXiv preprint arXiv:2503.14476, 2025

[22]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al., “Qwen2.5-vl technical report,” arXiv preprint arXiv:2502.13923, 2025

[23]

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan, “Learn to explain: Multimodal reasoning via thought chains for science question answering,” in The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

[24]

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024

[25]

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al., “Gpt-4o system card,” arXiv preprint arXiv:2410.21276, 2024

[26]

OpenAI, “gpt-oss-120b & gpt-oss-20b model card,” 2025

[27]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” 2024

[28]

Ziyi Liu, Soumya Sanyal, Isabelle Lee, Yongkang Du, Rahul Gupta, Yang Liu, and Jieyu Zhao, “Self-contradictory reasoning evaluation and detection,” 2024

[29]

Niels Mündler, Jingxuan He, Slobodan Jenko, and Martin Vechev, “Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation,” 2024

[30]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[31]

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” January 2024

[32]

Qwen Team, “Qwen3 technical report,” 2025