SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
Pith reviewed 2026-05-12 03:55 UTC · model grok-4.3
The pith
SenseBench shows that vision-language models carry strong biases toward everyday photos and fail to reliably perceive or describe degradations in remote sensing images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Comprehensive evaluation of 29 state-of-the-art VLMs on SenseBench reveals that these models exhibit skewed domain priors favoring natural ground-level images, multi-distortion collapse, a fluency illusion in generated descriptions, and a perception-description inversion effect when applied to remote-sensing degradations.
What carries the argument
SenseBench itself: a benchmark built on a physics-based hierarchical taxonomy that unifies non-reference and reference-based image-quality paradigms, with more than 10,000 curated instances across 6 major and 22 fine-grained remote-sensing degradation categories, evaluated under two complementary protocols for objective perception and subjective diagnostic description.
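To make the construction concrete, here is a minimal sketch of how one SenseBench-style instance might be organized. The field names are illustrative assumptions, not the released dataset's schema.

```python
# Hypothetical layout of a single benchmark instance; field names are
# illustrative, not taken from the released SenseBench schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DegradationInstance:
    image_path: str                # degraded RS image
    reference_path: Optional[str]  # present only for reference-based items
    major_category: str            # one of the 6 major degradation classes
    fine_category: str             # one of the 22 fine-grained classes
    perception_question: str       # multiple-choice item for the objective protocol
    perception_answer: str         # ground-truth option
    reference_description: str     # expert description for the subjective protocol
```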
If this is right
- VLMs will need explicit domain adaptation or remote-sensing-specific pretraining before they can serve as reliable interpreters of image quality in satellite or aerial data.
- Language-based diagnostic descriptions from VLMs could replace or augment scalar IQA scores only after the observed fluency illusion and inversion effects are mitigated.
- Multi-distortion scenarios common in real remote-sensing pipelines will continue to expose weaknesses in current models until the benchmark data are used for targeted training.
- The two-protocol design (objective perception plus subjective description) provides a template for future benchmarks that want to separate detection accuracy from explanatory fluency; a minimal scoring harness is sketched after this list.
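The sketch below shows how the two protocols could be scored independently. It assumes a `model` callable that answers a prompt about an image and a `judge` callable (for example, an LLM-as-judge) that scores a description against an expert reference; both are assumptions, not the paper's released evaluation code.

```python
# Minimal two-protocol harness sketch; `model` and `judge` are assumed
# callables, and `instances` follows the hypothetical layout sketched above.
def evaluate(model, judge, instances):
    correct, desc_scores = 0, []
    for inst in instances:
        # Protocol 1: objective perception, scored by exact option match.
        pred = model(inst.image_path, inst.perception_question)
        correct += int(pred.strip() == inst.perception_answer)
        # Protocol 2: subjective description, scored by a judge model.
        desc = model(inst.image_path, "Describe any degradations in this image.")
        desc_scores.append(judge(desc, inst.reference_description))
    n = len(instances)
    return {"perception_acc": correct / n,
            "description_score": sum(desc_scores) / n}
```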
Where Pith is reading between the lines
- The same taxonomy-driven construction method could be applied to create parallel benchmarks for medical, underwater, or astronomical imagery where VLMs also encounter domain gaps.
- Fine-tuning VLMs on the SenseBench training split might reduce the reported inversion effect and improve both perception and description scores simultaneously.
- If the inversion effect persists across domains, it would suggest a fundamental architectural tension in current VLMs between low-level feature extraction and high-level language generation; an illustrative check for the effect is sketched after this list.
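One hedged way to probe for such an inversion across models is to compute the rank correlation between per-model perception accuracy and description score: a clearly negative correlation would be consistent with the effect. Here `results` is assumed to map model names to the metric dicts returned by a harness like the one sketched above.

```python
# Illustrative inversion check: a negative rank correlation between perception
# and description scores across models would be consistent with the effect.
from scipy.stats import spearmanr

def inversion_check(results):
    models = sorted(results)
    perception = [results[m]["perception_acc"] for m in models]
    description = [results[m]["description_score"] for m in models]
    rho, p_value = spearmanr(perception, description)
    return rho, p_value  # rho < 0 with a small p-value suggests inversion
```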
Load-bearing premise
The physics-based hierarchical taxonomy and the 10K curated instances accurately and representatively capture real-world remote sensing degradations without selection bias or annotation errors.
What would settle it
Re-running the same 29 VLMs on an independently assembled collection of remote-sensing images that matches the degradation taxonomy but uses different sensors and scenes, and finding no statistically significant domain skew, multi-distortion collapse, fluency illusion, or perception-description inversion.
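One way to formalize the domain-skew part of that test, assuming perception items are scored as correct/incorrect: a two-proportion z-test comparing a model's accuracy on ground-level versus remote-sensing images. This is a sketch of one plausible test, not the paper's own analysis.

```python
# Sketch of a domain-skew test: compare perception accuracy on ground-level
# vs. remote-sensing images with a two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest

def domain_skew_test(correct_ground, n_ground, correct_rs, n_rs, alpha=0.05):
    z_stat, p_value = proportions_ztest(
        count=[correct_ground, correct_rs], nobs=[n_ground, n_rs])
    return {"z": z_stat, "p": p_value, "skew_detected": p_value < alpha}

# Example with made-up counts: 900/1000 correct on ground-level images
# vs. 700/1000 on remote-sensing images.
print(domain_skew_test(900, 1000, 700, 1000))
```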
Original abstract
Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS experts. While Vision-Language Models (VLMs) present a compelling alternative by delivering language-grounded IQA, their visual priors are heavily biased toward ground-level natural images. Consequently, whether VLMs can overcome this domain gap to perceive and articulate RS artifacts remains insufficiently studied. To bridge this gap, we propose SenseBench, the first dedicated diagnostic benchmark for RS low-level visual perception and description. Driven by a physics-based hierarchical taxonomy that unifies both non-reference and reference-based paradigms, SenseBench features over 10K meticulously curated instances across 6 major and 22 fine-grained RS degradation categories. Specifically, two complementary protocols are designed for evaluation: objective low-level visual perception and subjective diagnostic description. Comprehensive evaluation of 29 state-of-the-art VLMs reveals not only skewed domain priors and multi-distortion collapse, but also fluency illusion and a perception-description inversion effect. We hope SenseBench provides a robust evaluation testbed and high-quality diagnostic data to advance the development of VLMs in RS low-level perception. Code and datasets are available at https://github.com/Zhong-Chenchen/SenseBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SenseBench, the first dedicated benchmark for remote sensing low-level visual perception and description in VLMs. It is driven by a physics-based hierarchical taxonomy unifying non-reference and reference-based paradigms, featuring over 10K curated instances across 6 major and 22 fine-grained RS degradation categories. Two complementary protocols evaluate objective low-level visual perception and subjective diagnostic description. Comprehensive evaluation of 29 state-of-the-art VLMs reveals skewed domain priors, multi-distortion collapse, fluency illusion, and a perception-description inversion effect. Code and datasets are released publicly.
Significance. If the benchmark's validity holds, this work would be significant for addressing the domain gap in VLMs for remote sensing applications, where standard IQA methods fall short of expert diagnostic needs. It supplies a reproducible testbed and high-quality data to diagnose specific VLM failure modes. The public release of code and datasets on GitHub is a clear strength that supports reproducibility and further research in the field.
Major comments (2)
- Abstract: The central claims rest on evaluation outcomes for 29 VLMs showing domain priors, multi-distortion collapse, fluency illusion, and perception-description inversion, yet the abstract supplies no information on curation methodology, inter-annotator agreement, statistical tests, or controls for these effects; this leaves the results dependent on unverified assertions about the 10K instances.
- Benchmark construction: The physics-based hierarchical taxonomy and 10K instances are presented as accurately capturing real-world RS degradations across categories, but no quantitative evidence (e.g., inter-annotator agreement, comparison to real satellite sensor logs, or external validation) is provided to rule out selection bias or annotation artifacts that could produce benchmark-specific rather than general effects.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We address the major comments point by point, agreeing to make revisions to improve the clarity and validation of our benchmark.
Point-by-point responses
- Referee: Abstract: The central claims rest on evaluation outcomes for 29 VLMs showing domain priors, multi-distortion collapse, fluency illusion, and perception-description inversion, yet the abstract supplies no information on curation methodology, inter-annotator agreement, statistical tests, or controls for these effects; this leaves the results dependent on unverified assertions about the 10K instances.
Authors: The abstract is intentionally concise to highlight the key contributions and findings. Detailed information on the curation methodology, including the physics-based taxonomy, instance selection criteria, inter-annotator agreement for the diagnostic descriptions, and statistical tests used in the VLM evaluations, is provided in the Methods and Experiments sections of the manuscript. We will revise the abstract to include a short statement on the curation process and validation to make the claims more self-contained. Revision: yes.
- Referee: Benchmark construction: The physics-based hierarchical taxonomy and 10K instances are presented as accurately capturing real-world RS degradations across categories, but no quantitative evidence (e.g., inter-annotator agreement, comparison to real satellite sensor logs, or external validation) is provided to rule out selection bias or annotation artifacts that could produce benchmark-specific rather than general effects.
Authors: We thank the referee for this observation. The taxonomy is constructed from physical models of RS image formation and degradation processes documented in the remote sensing literature, and the 10K instances were curated through a multi-stage process involving expert annotation. To strengthen the manuscript, we will add quantitative evidence, including inter-annotator agreement metrics for the annotations (a minimal agreement sketch follows this exchange) and additional external validation steps. Direct comparison to proprietary satellite sensor logs is challenging due to data access restrictions, but we will incorporate comparisons to publicly available RS degradation datasets and expert surveys to mitigate concerns about selection bias. Revision: partial.
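For the promised inter-annotator agreement, a standard choice for categorical degradation labels is Cohen's kappa. Below is a minimal sketch using scikit-learn; the annotator labels are made up for illustration and do not come from the paper.

```python
# Minimal Cohen's kappa sketch for categorical degradation labels from two
# annotators; the label lists here are made up for illustration.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["haze", "motion_blur", "haze", "sensor_noise"]
annotator_b = ["haze", "motion_blur", "compression", "sensor_noise"]
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # ~0.8+ is usually read as strong agreement
```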
Circularity Check
No circularity; empirical benchmark with direct evaluation
Full rationale
The paper constructs SenseBench via a physics-based taxonomy and curation of 10K instances, then reports direct empirical results from evaluating 29 VLMs on perception and description protocols. No equations, fitted parameters, self-referential derivations, or load-bearing self-citations appear in the provided text. Claims about domain priors, collapse, fluency illusion, and inversion effects are observations from the evaluation rather than reductions to inputs by construction, rendering the chain self-contained.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: A physics-based hierarchical taxonomy unifies non-reference and reference-based RS degradation paradigms.