OASIS: A Multilingual and Multimodal Dataset for Culturally Grounded Spoken Visual QA

Ali Ezzat Shahroor; Basel Mousi; Fahim Dalvi; Firoj Alam; Hunzalah Hassan Bhatti; Md. Arid Hasan; Mohamed Bayan Kmainasi; Nadir Durrani; Natasa Milic-Frayling; Shammur Absar Chowdhury; Zien Sheikh Ali

arxiv: 2510.06371 · v3 · submitted 2025-10-07 · 💻 cs.CL · cs.AI

Firoj Alam, Ali Ezzat Shahroor, Md. Arid Hasan, Zien Sheikh Ali, Hunzalah Hassan Bhatti, Mohamed Bayan Kmainasi, Shammur Absar Chowdhury, Basel Mousi, Fahim Dalvi, Nadir Durrani, Natasa Milic-Frayling

Pith reviewed 2026-05-18 08:45 UTC · model grok-4.3

keywords: culturally grounded QA · multimodal dataset · spoken visual QA · Arabic dialects · commonsense reasoning · multilingual VQA · visual question answering · EverydayMMQA framework

The pith

OASIS supplies nearly one million real images and 14.8 million QA pairs to test models on culturally grounded spoken visual reasoning in English and Arabic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OASIS, a large-scale dataset for multimodal question answering that draws on images, text, and speech from everyday settings across 18 countries. It pairs real photographs with questions and answers in both standard and dialectal Arabic as well as English, using four different input combinations that include speech. The work targets evaluation of pragmatic and commonsense reasoning rather than simple object identification. Current multimodal models often fail on queries that depend on local cultural knowledge or low-resource language varieties, so a resource of this scale and design allows direct measurement of those gaps. The accompanying framework uses semi-automatic generation followed by human review to scale the collection while aiming to keep cultural accuracy.

Core claim

We introduce OASIS, a large-scale culturally grounded multimodal QA dataset covering images, text, and speech. OASIS is built with EverydayMMQA, a scalable semi-automatic framework for creating localized spoken and visual QA resources, supported by multi-stage human-in-the-loop validation. OASIS contains approximately 0.92M real images and 14.8M QA pairs, including 3.7M spoken questions, with 383 hours of human-recorded speech and 20K hours of voice-cloned speech from 42 speakers. It supports four input settings: text-only, speech-only, text+image, and speech+image. The dataset focuses on English and Arabic varieties across 18 countries and is designed to evaluate models beyond object recognition, targeting pragmatic, commonsense, and culturally grounded reasoning in real-world scenarios.

What carries the argument

The OASIS dataset itself, generated through the EverydayMMQA framework and multi-stage human validation to produce culturally localized spoken visual QA pairs.

If this is right

Models can be tested across text-only, speech-only, text-plus-image, and speech-plus-image settings on the same culturally situated questions.
Benchmark results on closed-source, open-source, and fine-tuned models reveal performance differences on pragmatic and commonsense tasks.
The public release of both dataset and framework allows other researchers to extend coverage or adapt the collection method.
Focus on Arabic dialects alongside English enables direct comparison of model behavior on high-resource versus dialectal language varieties.
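
To make the first point above concrete, here is a minimal sketch of how one item could be presented to a model under each of the four input settings. The `QAItem` fields and the `build_inputs` helper are hypothetical names chosen for illustration; they are not the released OASIS schema or an API from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class InputSetting(Enum):
    """The four input settings named in the paper."""
    TEXT = "text-only"
    SPEECH = "speech-only"
    TEXT_IMAGE = "text+image"
    SPEECH_IMAGE = "speech+image"


@dataclass
class QAItem:
    # Hypothetical field names, assumed for this sketch only.
    question_text: str
    question_audio_path: Optional[str]
    image_path: str
    answer: str


def build_inputs(item: QAItem, setting: InputSetting) -> dict:
    """Select which modalities a model sees for one item under one setting."""
    use_speech = setting in (InputSetting.SPEECH, InputSetting.SPEECH_IMAGE)
    use_image = setting in (InputSetting.TEXT_IMAGE, InputSetting.SPEECH_IMAGE)
    if use_speech and item.question_audio_path is None:
        raise ValueError("item has no spoken question")
    inputs = {}
    if use_speech:
        inputs["audio"] = item.question_audio_path
    else:
        inputs["text"] = item.question_text
    if use_image:
        inputs["image"] = item.image_path
    return inputs


item = QAItem("What is this dish called?", "q_0001.wav", "img_0001.jpg", "placeholder answer")
for setting in InputSetting:
    print(setting.value, "->", sorted(build_inputs(item, setting)))
```

Holding the item fixed and varying only the setting is what allows a like-for-like comparison across modalities.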

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same collection method could be applied to additional languages to create parallel culturally grounded resources.
Training on OASIS pairs might improve model robustness when deployed in regions where standard VQA data under-represents local practices.
Combining the spoken and visual tracks could support development of systems that handle real-time conversational queries in varied cultural contexts.

Load-bearing premise

The multi-stage human validation process yields questions and answers that genuinely reflect local cultural knowledge and avoid annotation artifacts across the covered countries and dialects.

What would settle it

If independent cultural experts from the represented countries rate a large share of the QA pairs as answerable from generic visual or linguistic patterns without specific local knowledge, the claim of cultural grounding would be undermined.

Figures

Figures reproduced from arXiv: 2510.06371 by Ali Ezzat Shahroor, Basel Mousi, Fahim Dalvi, Firoj Alam, Hunzalah Hassan Bhatti, Md. Arid Hasan, Mohamed Bayan Kmainasi, Nadir Durrani, Natasa Milic-Frayling, Shammur Absar Chowdhury, Zien Sheikh Ali.

**Figure 2.** Proposed EverydayMMQA framework, OASIS dataset construction, and experimental pipeline. Our findings show that visual grounding is the dominant lever, driving systematic performance gains across all models and languages. It narrows cross-lingual and dialectal disparities and acts as a modality equalizer, disproportionately benefiting speech and transcript inputs. Finally, with images and light fine-tuning, com…

**Figure 3.** OASIS dataset overview: geographic coverage across 18 Arab countries, languages and dialects, modality setups (text, image, speech), QA types, audio durations, token counts, and per-(sub)category distributions. Total QA: total number of images (0.92M) × 4 questions × 4 language varieties. where each LLM $l \in L$ generates $n_l$ distinct natural queries $Q^{(l)}_{c,j,t} = \{q_1, q_2, \ldots, q_{n_l}\}$ for a country $c$, a…

**Figure 4.** MSA Judge scores across modalities. Left: Qwen2.5-7B vs Gemini 2.5-pro. Right: Qwen2.5-3B

**Figure 5.** Distribution of commonsense and knowledge-based questions for the whole dataset.

**Figure 6.** LLM-Judge scores across modalities. Top row: Qwen2.5-7B vs Gemini 2.5-pro for English (left)

Original abstract

Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they are often limited when queries require cultural and visual information, everyday knowledge, particularly in low-resource and underrepresented languages. We introduce OASIS, a large-scale culturally grounded multimodal QA dataset covering images, text, and speech. OASIS is built with EverydayMMQA, a scalable semi-automatic framework for creating localized spoken and visual QA resources, supported by multi-stage human-in-the-loop validation. OASIS contains approximately 0.92M real images and 14.8M QA pairs, including 3.7M spoken questions, with 383 hours of human-recorded speech, and 20K hours of voice-cloned speech, from 42 speakers. It supports four input settings: text-only, speech-only, text+image, and speech+image. The dataset focuses on English and Arabic varieties across 18 countries, covering Modern Standard Arabic (MSA) as well as dialectal Arabic. It is designed to evaluate models beyond object recognition, targeting pragmatic, commonsense, and culturally grounded reasoning in real-world scenarios. We benchmark four closed-source models, three open-source models, and one fine-tuned model on OASIS. The framework and dataset will be made publicly available to the community. https://huggingface.co/datasets/QCRI/OASIS
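
As a quick consistency check on the reported sizes, the sketch below applies the formula from the Figure 3 caption (images × 4 questions × 4 language varieties) to the rounded image count. The function name `dataset_size` is hypothetical, and the inputs are the approximate figures quoted on this page, so small gaps against the reported totals are expected.

```python
def dataset_size(images: float, questions_per_image: int, varieties: int):
    """Return (questions per variety, total QA pairs) under the caption formula."""
    per_variety = images * questions_per_image
    total = per_variety * varieties
    return per_variety, total


# Approximate values quoted in the abstract and the Figure 3 caption.
per_variety, total = dataset_size(images=0.92e6, questions_per_image=4, varieties=4)

print(f"questions per variety: {per_variety / 1e6:.2f}M")  # 3.68M
print(f"total QA pairs:        {total / 1e6:.2f}M")        # 14.72M
```

With the rounded 0.92M images the formula gives 14.72M pairs, close to the reported 14.8M, and 3.68M questions per variety, close to the reported 3.7M spoken questions. The source does not state how spoken questions are distributed across varieties, so that second match is only suggestive.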

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

OASIS is a sizable new public dataset for spoken visual QA with Arabic dialect coverage, but the evidence that questions demand cultural knowledge rather than generic cues remains thin.

This paper introduces OASIS, a large multimodal dataset for visual QA that includes speech and targets cultural grounding in English and Arabic varieties from 18 countries. The core contribution is the dataset itself plus the EverydayMMQA framework used to build it at scale. The authors collected about 0.92 million real images and generated 14.8 million QA pairs, 3.7 million of them spoken. There are 383 hours of human-recorded speech and 20K hours of voice-cloned speech. The dataset covers Modern Standard Arabic and dialects, and supports text or speech questions, each with or without images. The authors also benchmarked closed-source and open-source models.

The effort to make this public and cover underrepresented languages is solid. Dataset papers like this can help push work on pragmatic reasoning in multimodal settings, especially where data has been scarce.

The soft spot is how well the questions actually require cultural knowledge. The multi-stage validation is mentioned, but without reported metrics on cultural dependency or comparisons to standard VQA sets, it's possible that many items could be answered with general visual or commonsense cues. That would weaken the main selling point.

This is for researchers in multimodal and low-resource NLP who need new data for training or evaluation. It would be useful for anyone working on pragmatic reasoning in vision-language models. I would send it to peer review. The dataset is substantial and the topic important, so referees can help strengthen the validation claims.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OASIS, a large-scale multilingual and multimodal dataset for culturally grounded spoken visual question answering. It comprises approximately 0.92M real images and 14.8M QA pairs, including 3.7M spoken questions, with 383 hours of human-recorded speech and 20K hours of voice-cloned speech from 42 speakers. It covers English and Arabic varieties across 18 countries, including Modern Standard Arabic and dialects. The dataset is constructed via the EverydayMMQA semi-automatic framework with multi-stage human-in-the-loop validation and is designed to evaluate models on pragmatic, commonsense, and culturally grounded reasoning beyond object recognition. The authors provide benchmarks for several closed-source, open-source, and fine-tuned models.

Significance. If the validation process successfully ensures that the QA pairs require culturally specific knowledge rather than generic visual or commonsense cues, OASIS would represent a significant contribution to the field of multimodal NLP and cultural AI. It fills a gap in existing VQA resources by incorporating spoken language, dialectal variations, and real-world cultural contexts from underrepresented regions. The public availability of the dataset and framework could facilitate the development of more culturally aware multimodal models. The inclusion of both text and speech modalities in multiple input settings adds to its utility for evaluating diverse model capabilities.

major comments (2)

[Section 3] Section 3 (Dataset Construction): The multi-stage human-in-the-loop validation process is described at a high level, but the manuscript provides no quantitative metrics such as inter-annotator agreement on cultural dependency, cultural-relevance scores, or direct comparisons against VQA v2-style baselines to confirm that the 14.8M QA pairs require locale-specific knowledge from the 18 countries rather than universal visual cues or generic commonsense. This evidence is load-bearing for the central claim that OASIS targets culturally grounded reasoning beyond object recognition.
[Section 5] Section 5 (Benchmarking): The reported model performances across the four input settings (text-only, speech-only, text+image, speech+image) do not include an ablation or subset analysis separating culturally specific items from generic ones, making it difficult to quantify the added challenge attributable to cultural grounding versus other factors like dialectal speech or image realism.

minor comments (2)

[Abstract and Introduction] The abstract and introduction should explicitly state the total number of unique speakers per dialect and any quality assurance steps applied to the 20K hours of voice-cloned speech to ensure it does not introduce artifacts that affect cultural grounding evaluation.
[Dataset Statistics] Figure 1 or the dataset statistics table would benefit from a breakdown of QA pair distribution by country and dialect to better illustrate coverage across the 18 countries and Arabic varieties.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive summary of OASIS and for the constructive major comments. We agree that stronger quantitative evidence would better support the central claims about cultural grounding and will revise the manuscript accordingly.

Referee: [Section 3] Section 3 (Dataset Construction): The multi-stage human-in-the-loop validation process is described at a high level, but the manuscript provides no quantitative metrics such as inter-annotator agreement on cultural dependency, cultural-relevance scores, or direct comparisons against VQA v2-style baselines to confirm that the 14.8M QA pairs require locale-specific knowledge from the 18 countries rather than universal visual cues or generic commonsense. This evidence is load-bearing for the central claim that OASIS targets culturally grounded reasoning beyond object recognition.

Authors: We agree that quantitative validation metrics are important to substantiate the cultural specificity claim. In the revision we will report inter-annotator agreement scores computed during the human validation stages for cultural dependency and relevance, along with average cultural-relevance scores. We will also add a direct comparison on a sampled subset against VQA v2-style annotations to show that OASIS items require locale-specific knowledge beyond universal visual or commonsense cues. These additions will be placed in an expanded Section 3.
revision: yes

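
The inter-annotator agreement promised above could be reported in several ways; the rebuttal does not name a statistic. As one illustration, the sketch below computes Cohen's kappa for two annotators who each label items as culturally specific or generic. The choice of Cohen's kappa, the label names, and the toy labels are all assumptions made for this example, not details from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration only.
annotator_1 = ["cultural", "cultural", "generic", "generic"]
annotator_2 = ["cultural", "generic", "generic", "generic"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

With more than two annotators per item, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual substitute.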
Referee: [Section 5] Section 5 (Benchmarking): The reported model performances across the four input settings (text-only, speech-only, text+image, speech+image) do not include an ablation or subset analysis separating culturally specific items from generic ones, making it difficult to quantify the added challenge attributable to cultural grounding versus other factors like dialectal speech or image realism.

Authors: We concur that an ablation isolating cultural grounding would strengthen the benchmarking section. We will add a subset analysis on a representative sample (stratified across countries and dialects) in which items are labeled as culturally specific versus generic based on the existing human validation data. Model results on these subsets will be reported for the four input settings to quantify the incremental difficulty due to cultural factors. Because of the dataset scale, the analysis will be performed on a large but sampled subset rather than the full 14.8M pairs.
revision: partial
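
The subset analysis described in this response can be sketched as two steps: draw a sample stratified by country and dialect, then compare accuracy on culturally specific versus generic items. The dictionary keys (`country`, `dialect`, `label`, `correct`) and both function names are hypothetical, chosen only for this illustration; the rebuttal does not specify a sampling procedure.

```python
import random
from collections import defaultdict


def stratified_sample(items, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each (country, dialect) stratum."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[(item["country"], item["dialect"])].append(item)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


def accuracy_by_label(items):
    """Accuracy per label, e.g. 'cultural' vs 'generic'."""
    totals, hits = defaultdict(int), defaultdict(int)
    for item in items:
        totals[item["label"]] += 1
        hits[item["label"]] += int(item["correct"])
    return {label: hits[label] / totals[label] for label in totals}


# Toy records, invented for illustration only.
records = [
    {"country": "EG", "dialect": "arz", "label": "cultural", "correct": i % 2 == 0}
    for i in range(10)
] + [
    {"country": "QA", "dialect": "gulf", "label": "generic", "correct": True}
    for _ in range(3)
]
subset = stratified_sample(records, per_stratum=5)
print(len(subset), accuracy_by_label(subset))
```

Capping each stratum keeps well-covered countries from dominating the comparison; strata smaller than the cap are taken whole.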

Circularity Check

0 steps flagged

No circularity: dataset construction paper with independent process description

This is a resource-creation paper whose central contribution is the OASIS dataset itself, built via the EverydayMMQA semi-automatic framework plus multi-stage human-in-the-loop validation. No mathematical derivations, equations, predictions, or fitted parameters exist that could reduce to inputs by construction. Claims about cultural grounding and coverage across 18 countries rest on the described annotation pipeline rather than any self-definitional loop, self-citation chain, or renaming of prior results. The process is presented as externally verifiable through the released data and framework, satisfying the criterion for a self-contained contribution against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No mathematical axioms or free parameters are involved. The work rests on the domain assumption that human annotators can reliably identify and validate culturally grounded questions across the targeted countries and dialects.

axioms (1)

domain assumption: Human annotators can consistently judge cultural grounding and naturalness of generated QA pairs across 18 countries and Arabic varieties.
The multi-stage human-in-the-loop validation is presented as the quality guarantee; this assumption is required for the dataset to deliver on its cultural-grounding claim.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We introduce OASIS, a large-scale culturally grounded multimodal QA dataset covering images, text, and speech... targeting pragmatic, commonsense, and culturally grounded reasoning"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "EverydayMMQA framework... multi-stage human-in-the-loop validation... 14.8M QA pairs"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

