VisText-Mosquito: A Unified Multimodal Dataset for Visual Detection, Segmentation, and Textual Explanation on Mosquito Breeding Sites

Md. Adnanul Islam; Md. Asaduzzaman Shuvo; Md Asiful Islam; Md. Faiyaz Abdullah Sayeedi; Shahanur Rahman Bappy; Swakkhar Shatabda

arxiv: 2506.14629 · v3 · submitted 2025-06-17 · 💻 cs.CV · cs.CL

VisText-Mosquito: A Unified Multimodal Dataset for Visual Detection, Segmentation, and Textual Explanation on Mosquito Breeding Sites

Md. Adnanul Islam , Md. Faiyaz Abdullah Sayeedi , Md. Asaduzzaman Shuvo , Shahanur Rahman Bappy , Md Asiful Islam , Swakkhar Shatabda This is my paper

Pith reviewed 2026-05-19 08:57 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords mosquito breeding sitesobject detectionimage segmentationmultimodal datasetvision-language modelstextual explanationspublic health surveillance

0 comments

The pith

A new multimodal dataset pairs images of mosquito breeding sites with human-written explanations to train AI for detection, segmentation, and description.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VisText-Mosquito, a dataset containing 1,828 annotated images for object detection, 142 images for water surface segmentation, and linked natural language explanations for each image. This combination lets models learn to locate breeding sites visually while also producing readable text that explains what the image shows and why it matters for control. If the approach works at scale, it could shift mosquito surveillance from manual inspection to automated systems that deliver both locations and clear guidance for prevention. The authors evaluate several models on the data and report specific performance numbers for the best ones.

Core claim

The paper presents the VisText-Mosquito dataset that unifies visual detection, segmentation, and textual explanation tasks on mosquito breeding sites. It reports that YOLOv9s reaches a precision of 0.92926 and mAP@50 of 0.92891 on the detection subset, YOLOv11n-Seg reaches a precision of 0.91587 on segmentation, and a fine-tuned Mosquito-LLaMA3-8B model achieves a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.85 on explanation generation.

What carries the argument

The VisText-Mosquito dataset, which links visual annotations for detection and segmentation directly to human-written textual explanations so that vision and language models can be trained together on the same examples.

If this is right

Automated systems can locate breeding sites in new field images without requiring constant human review.
Generated explanations can translate visual findings into guidance that health workers or residents can act on directly.
The joint visual-textual output supports proactive mapping and intervention rather than response after disease cases appear.
Public release of the dataset and code allows other groups to extend the same multimodal setup to related public-health monitoring tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same image-plus-explanation structure could be applied to breeding sites of other disease vectors such as ticks or sandflies.
Mobile-phone versions of the models might let community members photograph potential sites and receive instant explanations and advice.
Repeated use over time could produce density maps that help forecast outbreak risk based on the number and type of sites detected.

Load-bearing premise

The 1,828 detection images, 142 segmentation images, and their human-written explanations are diverse enough and annotated accurately enough to support reliable model performance in real-world conditions.

What would settle it

A sharp drop in detection precision or explanation quality when the trained models are tested on a large collection of breeding-site photos from a geographic region or season absent from the current dataset.

Figures

Figures reproduced from arXiv: 2506.14629 by Md. Adnanul Islam, Md. Asaduzzaman Shuvo, Md Asiful Islam, Md. Faiyaz Abdullah Sayeedi, Shahanur Rahman Bappy, Swakkhar Shatabda.

**Figure 1.** Figure 1: Overview of the VisText-Mosquito pipeline, which involves several key steps, beginning with the collection of image and textual data, annotation, preprocessing, and model training. It supports object detection, water surface segmentation, and reasoning generation for robust mosquito breeding site analysis. This paper is structured as follows: Section II reviews existing related works. Section III describes… view at source ↗

**Figure 2.** Figure 2: Fully tagged and labeled images mosquito habitat detection and surface segmentation. Additionally, a text modality is included in the dataset to enable multimodal analysis. This dataset contains 3,762 instances, each associated with an image and annotated with three text fields: (a) Question: A binary question asking whether the image shows a mosquito breeding site. (b) Response: A ’Yes’ or ’No’ answer (3… view at source ↗

read the original abstract

Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and explanation for mosquito breeding site analysis. The dataset includes 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language explanation texts linked to each image. The YOLOv9s model achieves the highest precision of 0.92926 and mAP@50 of 0.92891 for object detection, while YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795. For textual explanation generation, we tested a range of large vision-language models (LVLMs) in both zero-shot and few-shot settings. Our fine-tuned Mosquito-LLaMA3-8B model achieved the best results, with a final loss of 0.0028, a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.85. This dataset and model framework emphasize the theme "Prevention is Better than Cure", showcasing how AI-based detection can proactively address mosquito-borne disease risks. The dataset and implementation code are publicly available at GitHub: https://github.com/adnanul-islam-jisun/VisText-Mosquito

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A useful new multimodal dataset for mosquito breeding sites, but the 142-image segmentation split and missing annotation details make the performance numbers hard to trust fully.

read the letter

Colleague, the main takeaway is that this paper releases VisText-Mosquito, a dataset that pairs images of mosquito breeding sites with object detection boxes, segmentation masks on water surfaces, and linked natural-language explanations. That combination is new for this public-health application and could support field tools that not only spot sites but also generate explanations for users. They report concrete numbers: YOLOv9s reaches 0.929 precision and 0.929 mAP@50 on the 1,828 detection images, YOLOv11n-Seg hits 0.916 precision and 0.798 mAP@50 on the 142 segmentation images, and their fine-tuned Mosquito-LLaMA3-8B scores BLEU 54.7, BERTScore 0.91, and ROUGE-L 0.85 on explanation generation. The GitHub release of data and code is a clear positive for anyone who wants to build on it. The work is straightforward empirical benchmarking on a curated collection rather than a new algorithm or derivation. The soft spots are real but contained. The segmentation set is tiny at 142 images, so the mAP figure is vulnerable to split choices or label noise, and the paper gives no information on geographic diversity, collection protocol, train-test splits, or inter-annotator agreement for boxes, masks, or free-text explanations. Those gaps make it difficult to judge how well the results would hold up outside the collected images. The textual explanations appear human-written, but that could be stated more explicitly. This paper is for researchers working on applied computer vision for vector control or similar narrow-domain multimodal tasks. Readers who need benchmark data for detection-plus-explanation pipelines will find the release practical. The thinking is clear and the metrics are direct empirical measurements with low circularity. It deserves peer review because the dataset itself is novel for the use case and the benchmarks are reported in usable form, even if the referee will likely ask for more on annotation quality and diversity. I would send it forward with a request to expand the data-collection and validation sections.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the VisText-Mosquito dataset, which combines 1,828 images for mosquito breeding site object detection, 142 images for water surface segmentation, and linked natural language explanations. It reports that YOLOv9s attains the highest detection performance with precision 0.92926 and mAP@50 of 0.92891, YOLOv11n-Seg achieves segmentation precision 0.91587 and mAP@50 0.79795, and the fine-tuned Mosquito-LLaMA3-8B model yields BLEU 54.7, BERTScore 0.91, and ROUGE-L 0.85 for explanation generation. The dataset and code are publicly released.

Significance. If the reported results hold under proper validation, the work contributes a multimodal dataset and baseline models for an important public health application in detecting mosquito breeding sites. The integration of visual detection, segmentation, and textual explanations is novel for this domain, and the public availability of the data and implementation code facilitates reproducibility and extension by the community.

major comments (3)

[Dataset Description] Dataset section: The manuscript provides no details on annotation protocols, inter-annotator agreement, train-test splits, geographic diversity, or collection methods for the 1,828 detection images and 142 segmentation images. These omissions directly affect the interpretability of the reported precision 0.92926 / mAP@50 0.92891 and segmentation mAP@50 0.79795.
[Experiments and Results] Segmentation experiments: The 0.79795 mAP@50 on the 142-image subset lacks any sensitivity analysis, cross-validation, or discussion of label noise, making this metric especially sensitive to sampling artifacts and undermining claims of reliable segmentation performance.
[Textual Explanation Generation] Textual explanation section: It is not stated whether the linked natural language explanations are human-written or model-generated, nor is any quality control described; this is required to properly evaluate the fine-tuned Mosquito-LLaMA3-8B results (BLEU 54.7, BERTScore 0.91, ROUGE-L 0.85).

minor comments (2)

[Abstract] The abstract would be strengthened by explicitly stating the dataset sizes and top-line metrics.
Model names (YOLOv9s, YOLOv11n-Seg, Mosquito-LLaMA3-8B) should be used consistently in tables and text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional documentation and analysis will improve the manuscript. We address each major comment below and commit to revisions that enhance clarity, reproducibility, and rigor without altering the core contributions of the VisText-Mosquito dataset.

read point-by-point responses

Referee: [Dataset Description] Dataset section: The manuscript provides no details on annotation protocols, inter-annotator agreement, train-test splits, geographic diversity, or collection methods for the 1,828 detection images and 142 segmentation images. These omissions directly affect the interpretability of the reported precision 0.92926 / mAP@50 0.92891 and segmentation mAP@50 0.79795.

Authors: We agree that the current manuscript omits critical details on dataset construction. In the revised version, we will expand the Dataset section with explicit descriptions of the annotation protocols (including tools and guidelines used), inter-annotator agreement metrics (e.g., Cohen's kappa or percentage agreement), the precise train-test split ratios and randomization procedure, the geographic sources and diversity of the images, and the collection methodology (e.g., sources, capture conditions, and ethical considerations). These additions will directly support interpretation of the reported detection and segmentation metrics. revision: yes
Referee: [Experiments and Results] Segmentation experiments: The 0.79795 mAP@50 on the 142-image subset lacks any sensitivity analysis, cross-validation, or discussion of label noise, making this metric especially sensitive to sampling artifacts and undermining claims of reliable segmentation performance.

Authors: We concur that the modest size of the segmentation subset renders the mAP@50 metric vulnerable to sampling variability. In revision, we will add a sensitivity analysis (e.g., performance across multiple random splits and bootstrap estimates) and an explicit discussion of potential label noise sources. While exhaustive k-fold cross-validation on only 142 images is statistically limited, we will report these robustness checks and frame the results with appropriate caveats regarding dataset scale. revision: partial
Referee: [Textual Explanation Generation] Textual explanation section: It is not stated whether the linked natural language explanations are human-written or model-generated, nor is any quality control described; this is required to properly evaluate the fine-tuned Mosquito-LLaMA3-8B results (BLEU 54.7, BERTScore 0.91, ROUGE-L 0.85).

Authors: The natural language explanations linked to each image in VisText-Mosquito are human-written by domain experts in entomology and vector control. We will revise the relevant section to state this explicitly and describe the quality-control procedures, including annotation guidelines, consistency checks across annotators, and any iterative review process. This clarification will allow readers to contextualize the reported BLEU, BERTScore, and ROUGE-L scores for the fine-tuned Mosquito-LLaMA3-8B model. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical metrics are direct measurements on held-out data

full rationale

The paper presents a new multimodal dataset and reports standard benchmark metrics (precision, mAP@50, BLEU, BERTScore, ROUGE-L) obtained by training and evaluating off-the-shelf models (YOLOv9s, YOLOv11n-Seg, fine-tuned LLaMA3) on held-out splits of that dataset. No equations, fitted parameters, or theoretical derivations are used to obtain the reported numbers; each metric is computed directly from model outputs versus ground-truth annotations. The central claims therefore rest on external empirical evaluation rather than reducing to the paper's own inputs by construction. No self-citation chains, ansatzes, or uniqueness theorems appear in the provided text that would support the performance figures. This is a standard dataset-plus-benchmark paper whose results are falsifiable against the released data and code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that the collected images and annotations are representative and correctly labeled; no new physical entities or mathematical axioms are introduced.

axioms (1)

domain assumption Standard supervised learning assumptions hold for the YOLO and LLaMA fine-tuning pipelines (i.i.d. train/test splits, accurate ground-truth labels).
Invoked implicitly when reporting precision and BLEU scores as evidence of model quality.

pith-pipeline@v0.9.0 · 5847 in / 1386 out tokens · 44386 ms · 2026-05-19T08:57:09.986837+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

Chapter 2 - mosquito-borne diseases,

A. I. Qureshi, “Chapter 2 - mosquito-borne diseases,” in Zika Virus Disease, A. I. Qureshi, Ed. Academic Press, 2018, pp. 27–45. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ B9780128123652000032

work page 2018
[2]

Mosquito-borne arboviruses of african origin: review of key viruses and vectors,

L. Braack, A. P. Gouveia de Almeida, A. J. Cornel, R. Swanepoel, and C. De Jager, “Mosquito-borne arboviruses of african origin: review of key viruses and vectors,” Parasites & vectors , vol. 11, pp. 1–26, 2018

work page 2018
[3]

Dengue virus infection and neurological manifestations: an update,

S.-L. Fong, K.-T. Wong, and C.-T. Tan, “Dengue virus infection and neurological manifestations: an update,” Brain, vol. 147, no. 3, pp. 830– 838, 2024

work page 2024
[4]

Yolov7 for mosquito breeding grounds detection and tracking,

C. Laranjeira, D. Andrade, and J. A. dos Santos, “Yolov7 for mosquito breeding grounds detection and tracking,” in 2023 IEEE International Conference on Image Processing Challenges and Workshops (ICIPCW) , 2023, pp. 1–5

work page 2023
[5]

Mosquitofusion: A multiclass dataset for real-time detection of mosquitoes, swarms, and breeding sites using deep learning,

M. F. A. Sayeedi, F. Hafiz, and M. A. Rahman, “Mosquitofusion: A multiclass dataset for real-time detection of mosquitoes, swarms, and breeding sites using deep learning,” arXiv preprint arXiv:2404.01501 , 2024

work page arXiv 2024
[7]

Identifying mosquito breeding sites via drone images,

C. Suduwella, A. Amarasinghe, L. Niroshan, C. Elvitigala, K. De Zoysa, and C. Keppetiyagama, “Identifying mosquito breeding sites via drone images,” in proceedings of the 3rd workshop on micro aerial vehicle networks, systems, and applications , 2017, pp. 27–30

work page 2017
[8]

Image analysis for identify- ing mosquito breeding grounds,

M. Mehra, A. Bagri, X. Jiang, and J. Ortiz, “Image analysis for identify- ing mosquito breeding grounds,” in 2016 IEEE International conference on sensing, communication and networking (SECON workshops) . IEEE, 2016, pp. 1–6

work page 2016
[9]

Detection of mosquito breeding areas using semantic segmentation,

P. Mylvaganam and M. B. Dissanayake, “Detection of mosquito breeding areas using semantic segmentation,” in Proceedings of the 3rd Interna- tional Women in Engineering Symposium 2022 , 2022, pp. 34–36

work page 2022
[10]

Automatic detection of potential mosquito breeding sites from aerial images acquired by unmanned aerial vehicles,

D. T. Bravo, G. A. Lima, W. A. L. Alves, V . P. Colombo, L. Djog- benou, S. V . D. Pamboukian, C. C. Quaresma, and S. A. de Araujo, “Automatic detection of potential mosquito breeding sites from aerial images acquired by unmanned aerial vehicles,” Computers, Environment and Urban Systems , vol. 90, p. 101692, 2021

work page 2021
[11]

Synthesis of a multimodal ecological model for scalable, high-resolution arboviral risk prediction in florida,

S. P. Beeman, “Synthesis of a multimodal ecological model for scalable, high-resolution arboviral risk prediction in florida,” Ph.D. dissertation, University of South Florida, 2021

work page 2021
[12]

Mosquitominer: A light weight rover for detecting and eliminating mosquito breeding sites,

M. A. Islam, M. F. Abdullah Sayeedi, J. F. Deepti, S. R. Bappy, S. S. Islam, and F. Hafiz, “Mosquitominer: A light weight rover for detecting and eliminating mosquito breeding sites,” in 2024 IEEE Region 10 Symposium (TENSYMP) , 2024, pp. 1–6

work page 2024
[13]

Roboflow (version 1.0)[soft- ware],

B. Dwyer, J. Nelson, J. Solawetz et al. , “Roboflow (version 1.0)[soft- ware],” 2022

work page 2022
[14]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International conference on machine learning . PMLR, 2022, pp. 12 888–12 900

work page 2022

[1] [1]

Chapter 2 - mosquito-borne diseases,

A. I. Qureshi, “Chapter 2 - mosquito-borne diseases,” in Zika Virus Disease, A. I. Qureshi, Ed. Academic Press, 2018, pp. 27–45. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ B9780128123652000032

work page 2018

[2] [2]

Mosquito-borne arboviruses of african origin: review of key viruses and vectors,

L. Braack, A. P. Gouveia de Almeida, A. J. Cornel, R. Swanepoel, and C. De Jager, “Mosquito-borne arboviruses of african origin: review of key viruses and vectors,” Parasites & vectors , vol. 11, pp. 1–26, 2018

work page 2018

[3] [3]

Dengue virus infection and neurological manifestations: an update,

S.-L. Fong, K.-T. Wong, and C.-T. Tan, “Dengue virus infection and neurological manifestations: an update,” Brain, vol. 147, no. 3, pp. 830– 838, 2024

work page 2024

[4] [4]

Yolov7 for mosquito breeding grounds detection and tracking,

C. Laranjeira, D. Andrade, and J. A. dos Santos, “Yolov7 for mosquito breeding grounds detection and tracking,” in 2023 IEEE International Conference on Image Processing Challenges and Workshops (ICIPCW) , 2023, pp. 1–5

work page 2023

[5] [5]

Mosquitofusion: A multiclass dataset for real-time detection of mosquitoes, swarms, and breeding sites using deep learning,

M. F. A. Sayeedi, F. Hafiz, and M. A. Rahman, “Mosquitofusion: A multiclass dataset for real-time detection of mosquitoes, swarms, and breeding sites using deep learning,” arXiv preprint arXiv:2404.01501 , 2024

work page arXiv 2024

[6] [7]

Identifying mosquito breeding sites via drone images,

C. Suduwella, A. Amarasinghe, L. Niroshan, C. Elvitigala, K. De Zoysa, and C. Keppetiyagama, “Identifying mosquito breeding sites via drone images,” in proceedings of the 3rd workshop on micro aerial vehicle networks, systems, and applications , 2017, pp. 27–30

work page 2017

[7] [8]

Image analysis for identify- ing mosquito breeding grounds,

M. Mehra, A. Bagri, X. Jiang, and J. Ortiz, “Image analysis for identify- ing mosquito breeding grounds,” in 2016 IEEE International conference on sensing, communication and networking (SECON workshops) . IEEE, 2016, pp. 1–6

work page 2016

[8] [9]

Detection of mosquito breeding areas using semantic segmentation,

P. Mylvaganam and M. B. Dissanayake, “Detection of mosquito breeding areas using semantic segmentation,” in Proceedings of the 3rd Interna- tional Women in Engineering Symposium 2022 , 2022, pp. 34–36

work page 2022

[9] [10]

Automatic detection of potential mosquito breeding sites from aerial images acquired by unmanned aerial vehicles,

D. T. Bravo, G. A. Lima, W. A. L. Alves, V . P. Colombo, L. Djog- benou, S. V . D. Pamboukian, C. C. Quaresma, and S. A. de Araujo, “Automatic detection of potential mosquito breeding sites from aerial images acquired by unmanned aerial vehicles,” Computers, Environment and Urban Systems , vol. 90, p. 101692, 2021

work page 2021

[10] [11]

Synthesis of a multimodal ecological model for scalable, high-resolution arboviral risk prediction in florida,

S. P. Beeman, “Synthesis of a multimodal ecological model for scalable, high-resolution arboviral risk prediction in florida,” Ph.D. dissertation, University of South Florida, 2021

work page 2021

[11] [12]

Mosquitominer: A light weight rover for detecting and eliminating mosquito breeding sites,

M. A. Islam, M. F. Abdullah Sayeedi, J. F. Deepti, S. R. Bappy, S. S. Islam, and F. Hafiz, “Mosquitominer: A light weight rover for detecting and eliminating mosquito breeding sites,” in 2024 IEEE Region 10 Symposium (TENSYMP) , 2024, pp. 1–6

work page 2024

[12] [13]

Roboflow (version 1.0)[soft- ware],

B. Dwyer, J. Nelson, J. Solawetz et al. , “Roboflow (version 1.0)[soft- ware],” 2022

work page 2022

[13] [14]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International conference on machine learning . PMLR, 2022, pp. 12 888–12 900

work page 2022