MAgSeg: Segmentation of Agricultural Landscapes in High-Resolution Satellite Imagery using Multimodal Large Language Models
Pith reviewed 2026-05-20 18:47 UTC · model grok-4.3
The pith
A new instruction format lets standard multimodal models segment fragmented smallholder farms in satellite images without extra decoders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAgSeg demonstrates that standard multimodal large language models, when fine-tuned with a novel instruction tuning data format, can segment smallholder agricultural landscapes in high-resolution satellite imagery without auxiliary vision decoders by learning global image context while producing text tokens only for a local patch.
What carries the argument
The novel instruction tuning data format that supplies global image context but restricts token generation to one local patch per output.
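To make the idea concrete, here is a minimal sketch of what one training sample in such a format could look like. It is not the paper's template (the referee report notes that the manuscript does not publish one). The `<image>` placeholder, the prompt wording, the coarse integer label grid, and the helper names `serialize_patch` and `build_sample` are all assumptions made for illustration.

```python
from typing import Dict, List

Grid = List[List[int]]


def serialize_patch(mask: Grid, row0: int, col0: int, size: int) -> str:
    """Render one square patch of a label grid as text, one grid row per line."""
    lines = []
    for r in range(row0, row0 + size):
        lines.append(" ".join(str(mask[r][c]) for c in range(col0, col0 + size)))
    return "\n".join(lines)


def build_sample(image_path: str, mask: Grid, row0: int, col0: int, size: int) -> Dict[str, str]:
    """Pair the full image (global context) with a text target for one patch only."""
    # Hypothetical prompt: the whole scene is attached, but only one patch is requested.
    prompt = (
        "<image>\n"
        f"Segment the {size}x{size} patch whose top-left cell is "
        f"(row={row0}, col={col0}). Output one class id per cell."
    )
    return {
        "image": image_path,  # full scene, not a crop
        "prompt": prompt,
        "response": serialize_patch(mask, row0, col0, size),
    }


if __name__ == "__main__":
    toy_mask = [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [2, 2, 0, 0],
        [2, 2, 0, 1],
    ]
    print(build_sample("scene.png", toy_mask, 2, 2, 2)["response"])
```

In this sketch the model always sees the full scene, while the supervised target covers only the requested patch, so the target length depends on the patch size rather than the scene size.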
If this is right
- Standard multimodal models can now perform segmentation on high-resolution imagery without added vision components.
- The approach scales fine-tuning to larger images by avoiding full-context token generation.
- Evaluations across three countries show consistent gains over existing MLLM segmentation methods.
- The method supplies a practical route to mapping fragmented smallholder environments with limited labeled data.
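A back-of-the-envelope count shows why restricting output to a patch matters for context length. The grid sizes below are illustrative assumptions, not figures from the paper, and the count assumes exactly one output token per label cell with no separators.

```python
def label_tokens(grid_rows: int, grid_cols: int) -> int:
    """Output tokens needed if each grid cell costs exactly one token (assumption)."""
    return grid_rows * grid_cols


full_scene = label_tokens(256, 256)  # whole scene emitted in a single response
one_patch = label_tokens(16, 16)     # a single local patch per response
patches_needed = full_scene // one_patch  # responses needed to cover the scene

print(full_scene, one_patch, patches_needed)
```

The total number of labels across all patches is unchanged. What shrinks is the length of any single generated sequence, which is the quantity bounded by the model's context window.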
Where Pith is reading between the lines
- The same patch-wise output trick could be tested on other remote-sensing tasks such as building detection or land-cover change.
- If the format works for agriculture, it may reduce reliance on specialized decoder architectures in other fragmented-object domains.
- One could check whether the method maintains performance when applied to multi-date image stacks rather than single scenes.
Load-bearing premise
The new instruction tuning format lets the model absorb full-image context while outputting tokens for only a local patch without any drop in segmentation accuracy.
What would settle it
Run MAgSeg on the same high-resolution satellite datasets used in the paper and check whether its segmentation accuracy on smallholder plots falls below that of decoder-equipped MLLM baselines.
Original abstract
Agricultural landscape segmentation in the Global South is challenging as it is characterized by fragmented plots, high intra-class variance, and a scarcity of labeled training data. Recent advances in segmentation have been made by Multimodal Large Language Models (MLLMs). However, current approaches encounter critical context length bottlenecks and a domain alignment gap in understanding satellite features. We address these limitations through MAgSeg, a novel, decoder-free MLLM segmentation approach. MAgSeg is an architecturally efficient approach that enables standard MLLMs to perform segmentation of complex smallholder agricultural landscapes from high-resolution satellite imagery, without requiring auxiliary vision decoders. We introduce a novel instruction tuning data format designed to enable scalable fine-tuning and post-training on high resolution satellite imagery, which enables MAgSeg to learn from the global context of the image while generating text tokens for only a patch within the image. Extensive evaluations on datasets spanning three countries in the Global South demonstrate that MAgSeg significantly outperforms state-of-the-art MLLM baselines, offering a scalable solution to map smallholder agricultural environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MAgSeg, a decoder-free segmentation method that adapts standard Multimodal Large Language Models to high-resolution satellite imagery for mapping fragmented smallholder agricultural landscapes in the Global South. It introduces a novel instruction-tuning data format that purportedly allows the model to internalize global image context while restricting text-token generation to a single local patch, thereby avoiding context-length bottlenecks and eliminating the need for auxiliary vision decoders. The central claim is that this architectural and data-format change yields significant performance gains over existing MLLM baselines on datasets spanning three countries.
Significance. If the core assumption holds, the work would be significant for computer vision and remote-sensing applications: it offers a scalable, decoder-free route to leverage existing MLLMs on high-resolution imagery without custom vision heads, potentially lowering the barrier for accurate mapping of complex, data-scarce agricultural environments. The emphasis on Global South smallholder landscapes also addresses an under-served domain.
Major comments (3)
- [§3.2] §3.2 (Novel Instruction Tuning Data Format): The description of how global context is encoded in the prompt while restricting token generation to a local patch remains high-level; no concrete prompt templates, patch-sampling strategy, or mechanism for injecting global cues (e.g., downsampled overview tokens) are provided. Without these details it is impossible to assess whether the format truly preserves segmentation accuracy on fragmented plots with high intra-class variance.
- [§4] §4 (Experiments): The abstract and evaluation summary claim significant outperformance on three-country datasets, yet the manuscript supplies no quantitative metrics, error bars, ablation studies on the data format, or implementation details for the MLLM baselines. This absence prevents verification of the central claim that the new format delivers global context without accuracy loss.
- [§4.3] §4.3 (Ablation or Component Analysis): No ablation isolating the contribution of the novel data format versus standard instruction tuning is reported. Such an experiment is load-bearing for the claim that the format is the key enabler of decoder-free performance.
Minor comments (2)
- [Figure 2] Figure 2 (architecture diagram) would benefit from explicit annotation of the global-context injection path and the local-patch token generation boundary.
- [§2] The related-work section should include a brief comparison to recent decoder-free MLLM segmentation methods outside the agricultural domain to clarify novelty.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript describing MAgSeg. The comments highlight important areas where additional clarity and evidence would strengthen the presentation. We address each major comment point by point below and outline the revisions we will make to the manuscript.
- Referee: [§3.2] §3.2 (Novel Instruction Tuning Data Format): The description of how global context is encoded in the prompt while restricting token generation to a local patch remains high-level; no concrete prompt templates, patch-sampling strategy, or mechanism for injecting global cues (e.g., downsampled overview tokens) are provided. Without these details it is impossible to assess whether the format truly preserves segmentation accuracy on fragmented plots with high intra-class variance.
  Authors: We agree that the current description in §3.2 is high-level and would benefit from greater specificity. In the revised manuscript we will add concrete prompt templates, a detailed description of the patch-sampling strategy, and an explicit account of how global cues are injected (including the use of downsampled overview tokens). These additions will allow readers to evaluate whether the format maintains segmentation accuracy on fragmented plots with high intra-class variance.
  Revision: yes
- Referee: [§4] §4 (Experiments): The abstract and evaluation summary claim significant outperformance on three-country datasets, yet the manuscript supplies no quantitative metrics, error bars, ablation studies on the data format, or implementation details for the MLLM baselines. This absence prevents verification of the central claim that the new format delivers global context without accuracy loss.
  Authors: We acknowledge that the current version of §4 does not provide sufficient quantitative detail to fully verify the performance claims. We will expand this section to report specific quantitative metrics with error bars, implementation details for all MLLM baselines, and additional ablation results on the data format. These changes will substantiate the reported outperformance across the three-country datasets.
  Revision: yes
- Referee: [§4.3] §4.3 (Ablation or Component Analysis): No ablation isolating the contribution of the novel data format versus standard instruction tuning is reported. Such an experiment is load-bearing for the claim that the format is the key enabler of decoder-free performance.
  Authors: We recognize that an ablation isolating the novel instruction-tuning format is essential to support the central claim. We will add a dedicated ablation study (or expand §4.3) that directly compares the proposed data format against standard instruction tuning, thereby demonstrating its specific contribution to decoder-free performance.
  Revision: yes
Circularity Check
No significant circularity; derivation is self-contained architectural innovation
The paper introduces MAgSeg as a decoder-free MLLM approach relying on a novel instruction tuning data format to handle global context with local patch token generation. This is presented as an empirical architectural and data-format contribution evaluated on multi-country datasets, without any equations, fitted parameters, or derivations that reduce to prior outputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text; the central claims rest on reported outperformance rather than re-expression of inputs.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Standard MLLM architectures can be instruction-tuned to output segmentation masks via text tokens when given appropriately formatted prompts.
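This assumption implies a decoding step that turns generated text back into a mask. A minimal sketch follows, assuming each response is a whitespace-separated grid of integer class ids and that patches tile the scene without overlap. Both are assumptions, as are the names `parse_patch` and `stitch`; the paper's actual output encoding is not shown on this page.

```python
from typing import List, Tuple

Grid = List[List[int]]


def parse_patch(text: str) -> Grid:
    """Parse a text response (one grid row per line) into integer class ids."""
    return [[int(tok) for tok in line.split()] for line in text.strip().splitlines()]


def stitch(rows: int, cols: int, patches: List[Tuple[int, int, Grid]]) -> Grid:
    """Paste decoded patches into a full-scene label grid.

    Each patch is (row0, col0, grid), where (row0, col0) is its top-left cell.
    Cells not covered by any patch keep the background id 0 (assumption).
    """
    canvas = [[0] * cols for _ in range(rows)]
    for row0, col0, grid in patches:
        for dr, patch_row in enumerate(grid):
            for dc, label in enumerate(patch_row):
                canvas[row0 + dr][col0 + dc] = label
    return canvas


if __name__ == "__main__":
    left = parse_patch("1 0\n0 1")
    right = parse_patch("2 2\n2 2")
    print(stitch(2, 4, [(0, 0, left), (0, 2, right)]))
```

A real pipeline would also need to validate malformed generations (wrong row length, non-numeric tokens), which this sketch deliberately omits.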
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
GRPO... mean-DICE score as a direct reward signal
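The quoted passage is truncated, so the paper's exact reward is not visible here. As a rough illustration only, mean Dice over classes and the group-normalized advantage used in GRPO (as described in the DeepSeekMath paper) can be sketched as follows. The flat label lists, the convention that a class absent from both masks scores 1.0, the use of population standard deviation, and all function names are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def dice_for_class(pred: Sequence[int], truth: Sequence[int], label: int) -> float:
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for one class over flattened masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    total = sum(1 for p in pred if p == label) + sum(1 for t in truth if t == label)
    return 1.0 if total == 0 else 2.0 * inter / total


def mean_dice(pred: Sequence[int], truth: Sequence[int], labels: Sequence[int]) -> float:
    """Average the per-class Dice scores; used here as a scalar reward."""
    return mean(dice_for_class(pred, truth, label) for label in labels)


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: each reward standardized within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    samples = [[0, 0, 1, 1], [0, 1, 1, 1], [1, 1, 0, 0]]
    rewards = [mean_dice(s, truth, [0, 1]) for s in samples]
    print(rewards, group_advantages(rewards))
```

Samples whose masks overlap the ground truth better than the group average receive positive advantages, and worse ones receive negative advantages, without needing a separate learned value model.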
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] [Chen et al., 2025] Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, and Pascale Fung. Subobject-level image tokenization. In International Conference on Machine Learning, 2025.
- [2] [Dua et al., 2024] Radhika Dua, Nikita Saxena, Aditi Agarwal, Alex Wilson, Gaurav Singh, Hoang Tran, Ishan Deshpande, Amandeep Kaur, Gaurav Aggarwal, Chandan Nath, Arnab Basu, Vishal Batchu, Sharath Holla, Bindiya Kurle, Olana Missura, Rahul Aggarwal, Shubhika Garg, Nishi Shah, Avneet Singh, Dinesh Tewari, Agata Dondzik, Bharat Adsul, Milind Sohoni, Asi... Agricultural landscape understanding at country-scale. arXiv, 2024. https://arxiv.org/abs/2411.05359.
- [3] [FAO et al., 2023] FAO, IFAD, UNICEF, WFP, and WHO. The state of food security and nutrition in the world. 2023. https://doi.org/10.4060/cc3017en.
- [4] [Hu et al., 2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
- [5] [Kang and Özdoğan, 2019] Y. Kang and M. Özdoğan. Field-level crop yield mapping with landsat using a hierarchical data assimilation approach. Remote Sensing of Environment, 228:144–163, 2019.
- [6] [Kerner et al., 2023] Hannah Kerner, Saketh Sundar, and Mathan Satish. Multi-region transfer learning for segmentation of crop field boundaries in satellite images with limited labels. In Proceedings of the AAAI Workshop on AI to Accelerate Science and Engineering, 2023.
- [7] [Kirillov et al., 2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.
- [8] [Lai et al., 2024] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.
- [9] [Lan et al., 2025] Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Text4seg: Reimagining image segmentation as text generation. In The Thirteenth International Conference on Learning Representations, 2025.
- [10] [Lesiv et al., 2019] M. Lesiv, J.C. Laso Bayas, L. See, M. Duerauer, D. Dahlia, N. Durando, R. Hazarika, P. Kumar Sahariah, M. Vakolyuk, and V. Blyshchyk. Estimating the global distribution of field size using crowdsourcing. Global Change Biology, 25:174–186, 2019.
- [11] [Li et al., 2024] Xiang Li, Congcong Wen, Yuan Hu, Zhenghang Yuan, and Xiao Xiang Zhu. Vision-language models in remote sensing: Current progress and future trends. IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66, 2024.
- [12]
- [13] [Mei et al., 2022] Weiye Mei, Haoyu Wang, David Fouhey, Weiqi Zhou, Isabella Hinks, Josh M Gray, Derek Van Berkel, and Meha Jain. Using deep learning and very-high-resolution imagery to map smallholder field boundaries. Remote Sensing, 14(13):3046, 2022.
- [14] [Ministry of Agriculture and Farmers Welfare, 2024] Ministry of Agriculture and Farmers Welfare. Categorisation of farmers. https://www.pib.gov.in/PressReleasePage.aspx?PRID=2085181, 2024.
- [15] [OECD, 2023] OECD. Agricultural Policy Monitoring and Evaluation 2023: Adapting Agriculture to Climate Change. OECD Publishing, Paris, 2023. https://doi.org/10.1787/b14de474-en.
- [16] [Persello et al., 2023] Claudio Persello, Jeroen Grift, Xinyan Fan, Claudia Paris, Ronny Hänsch, Mila Koeva, and Andrew Nelson. Ai4smallfarms: A dataset for crop field delineation in southeast asian smallholder farms. IEEE Geoscience and Remote Sensing Letters, 20:1–5, 2023.
- [17] [Quenum et al., 2025] Jerome Quenum, Wen-Han Hsieh, Tsung-Han Wu, Ritwik Gupta, Trevor Darrell, and David M. Chan. LISAt: Language-instructed segmentation assistant for satellite imagery. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
- [18] [Rada and Fuglie, 2019] N.E. Rada and K.O. Fuglie. New perspectives on farm size and productivity. Food Policy, 84:147–152, 2019.
- [19] [Radford et al., 2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [20] [Rasheed et al., 2024] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024.
- [21] [Rolf et al., 2024] Esther Rolf, Konstantin Klemmer, Caleb Robinson, and Hannah Kerner. Mission critical–satellite data is a distinct modality in machine learning. In International Conference on Learning Representations, 2024.
- [22] [Rudel et al., 2009] Thomas K Rudel, Laura Schneider, Maria Uriarte, Billie Lee Turner, Ruth DeFries, Deborah Lawrence, Jacqueline Geoghegan, Susanna Hecht, Amy Ickowitz, Eric F Lambin, et al. Agricultural intensification and changes in cultivated areas, 1970–2005. Proceedings of the national academy of sciences, 106(49):20675–20680, 2009.
- [23] [Samberg et al., 2016] L.H. Samberg, J.S. Gerber, N. Ramankutty, M. Herrero, and P.C. West. Subnational distribution of average farm size and smallholder contributions to global food production. Environmental Research Letters, 11(12):124010, 2016.
- [24] [Shao et al., 2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [25] [Sylvester and others, 2015] Gerard Sylvester et al. Success stories on information and communication technologies for agriculture and rural development, 2015.
- [26] [Team et al., 2025] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.
- [27] [Vincent and Soille, 1991] Luc Vincent and Pierre Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis & Machine Intelligence, 13(06):583–598, 1991.
- [28] [Waldner and Diakogiannis, 2020] François Waldner and Foivos I Diakogiannis. Deep learning on edge: Extracting field boundaries from satellite images with a convolutional neural network. Remote sensing of environment, 245:111741, 2020.
- [29] [Waldner et al., 2021] François Waldner, Foivos I Diakogiannis, Kathryn Batchelor, Michael Ciccotosto-Camp, Elizabeth Cooper-Williams, Chris Herrmann, Gonzalo Mata, and Andrew Toovey. Detect, consolidate, delineate: Scalable mapping of field boundaries using satellite images. Remote sensing, 13(11):2197, 2021.
- [30] [Wang et al., 2022] Sherrie Wang, François Waldner, and David B. Lobell. Unlocking large-scale crop field delineation in smallholder farming systems with transfer learning and weak supervision. Remote Sensing, 14(22), 2022.
- [31] [Wang et al., 2023] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural Information Processing Systems, 36:61501–61513, 2023.
- [32] [Wu et al., 2024] Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, et al. Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks. Advances in Neural Information Processing Systems, 37:69925–69975, 2024.
- [33] [Wu et al., 2025] Haiyang Wu, Zhuofei Du, Dandan Zhong, Yuze Wang, and Chao Tao. Fsvlm: A vision-language model for remote sensing farmland segmentation. IEEE Transactions on Geoscience and Remote Sensing, 2025.
- [34] [Xia et al., 2024] Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3858–3869, June 2024.
- [35] [Xie et al., 2021] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021.
- [36] [Yang et al., 2022] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H.S. Torr. Lavt: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18155–18165, June 2022.
- [37] [Zheng et al., 2025] Juepeng Zheng, Zi Ye, Yibin Wen, Jianxi Huang, Zhiwei Zhang, Qingmei Li, Qiong Hu, Baodong Xu, Lingyuan Zhao, and Haohuan Fu. A comprehensive review of agricultural parcel and boundary delineation from remote sensing images: Recent progress and future perspectives. arXiv preprint arXiv:2508.14558, 2025.
- [38] [Zhou et al., 2024] Tianfei Zhou, Wang Xia, Fei Zhang, Boyu Chang, Wenguan Wang, Ye Yuan, Ender Konukoglu, and Daniel Cremers. Image segmentation in foundation model era: A survey. arXiv preprint arXiv:2408.12957, 2024.
- [39] [Zhu et al., 2017] Xiao Xiang Zhu, Devis Tuia, Lichao Mou, Gui-Song Xia, Liangpei Zhang, Feng Xu, and Friedrich Fraundorfer. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE geoscience and remote sensing magazine, 5(4):8–36, 2017.
- [40] The refined predictions are obtained by refining the coarse predictions using SAM ViT-H backbone. C Evaluation Metrics. Following [Dua et al., 2024], for instance-wise metrics, we merge and match multiple predictions that overlap with a ground truth instance. A particular predicted instance is considered to match with a ground truth instance if they belo...