OASIS: A Multilingual and Multimodal Dataset for Culturally Grounded Spoken Visual QA
Pith reviewed 2026-05-18 08:45 UTC · model grok-4.3
The pith
OASIS supplies nearly one million real images and 14.8 million QA pairs to test models on culturally grounded spoken visual reasoning in English and Arabic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce OASIS, a large-scale culturally grounded multimodal QA dataset covering images, text, and speech. OASIS is built with EverydayMMQA, a scalable semi-automatic framework for creating localized spoken and visual QA resources, supported by multi-stage human-in-the-loop validation. OASIS contains approximately 0.92M real images and 14.8M QA pairs, including 3.7M spoken questions, with 383 hours of human-recorded speech and 20K hours of voice-cloned speech from 42 speakers. It supports four input settings: text-only, speech-only, text+image, and speech+image. The dataset focuses on English and Arabic varieties across 18 countries and is designed to evaluate models beyond object recog
What carries the argument
The OASIS dataset itself, generated through the EverydayMMQA framework and multi-stage human validation to produce culturally localized spoken visual QA pairs.
If this is right
- Models can be tested across text-only, speech-only, text-plus-image, and speech-plus-image settings on the same culturally situated questions.
- Benchmark results on closed-source, open-source, and fine-tuned models reveal performance differences on pragmatic and commonsense tasks.
- The public release of both dataset and framework allows other researchers to extend coverage or adapt the collection method.
- Focus on Arabic dialects alongside English enables direct comparison of model behavior on high-resource versus dialectal language varieties.
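The four input settings named above can be pictured as views over one QA item. The following is a minimal sketch, assuming a hypothetical record layout (`QAItem` and its field names are illustrative and are not taken from the released dataset):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout; the released dataset's field names may differ.
@dataclass
class QAItem:
    question_text: str
    question_audio: Optional[str]  # path to a spoken version of the question, if any
    image: Optional[str]           # path to the associated image, if any
    answer: str

# The four input settings listed in the paper's abstract.
SETTINGS = ("text-only", "speech-only", "text+image", "speech+image")

def build_inputs(item: QAItem) -> dict:
    """Return the model inputs for every setting this item can support."""
    inputs = {"text-only": {"text": item.question_text}}
    if item.question_audio is not None:
        inputs["speech-only"] = {"audio": item.question_audio}
    if item.image is not None:
        inputs["text+image"] = {"text": item.question_text, "image": item.image}
        if item.question_audio is not None:
            inputs["speech+image"] = {"audio": item.question_audio, "image": item.image}
    return inputs
```

Keeping all settings keyed to the same item is what makes a per-question comparison across modalities possible; items without audio or an image simply contribute to fewer settings.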
Where Pith is reading between the lines
- The same collection method could be applied to additional languages to create parallel culturally grounded resources.
- Training on OASIS pairs might improve model robustness when deployed in regions where standard VQA data under-represents local practices.
- Combining the spoken and visual tracks could support development of systems that handle real-time conversational queries in varied cultural contexts.
Load-bearing premise
The multi-stage human validation process yields questions and answers that genuinely reflect local cultural knowledge and avoid annotation artifacts across the covered countries and dialects.
What would settle it
If independent cultural experts from the represented countries rate a large share of the QA pairs as answerable from generic visual or linguistic patterns without specific local knowledge, the claim of cultural grounding would be undermined.
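One way to make "a large share" concrete is to estimate the proportion from a rated sample and attach an interval to it. The sketch below is an illustration only, not a procedure described in the paper: it assumes each sampled QA pair receives a single yes/no expert rating and uses the standard Wilson score interval.

```python
import math
from typing import Tuple

def wilson_interval(k: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion k/n (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def generic_share(ratings) -> Tuple[float, Tuple[float, float]]:
    """ratings: iterable of booleans, True = answerable without local knowledge."""
    ratings = list(ratings)
    k = sum(bool(r) for r in ratings)
    return k / len(ratings), wilson_interval(k, len(ratings))
```

The threshold at which the share counts as "large" is a judgment call the review leaves open; the interval only indicates how much a given sample size can say about it.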
Original abstract
Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they are often limited when queries require cultural and visual information, everyday knowledge, particularly in low-resource and underrepresented languages. We introduce OASIS, a large-scale culturally grounded multimodal QA dataset covering images, text, and speech. OASIS is built with EverydayMMQA, a scalable semi-automatic framework for creating localized spoken and visual QA resources, supported by multi-stage human-in-the-loop validation. OASIS contains approximately 0.92M real images and 14.8M QA pairs, including 3.7M spoken questions, with 383 hours of human-recorded speech, and 20K hours of voice-cloned speech, from 42 speakers. It supports four input settings: text-only, speech-only, text+image, and speech+image. The dataset focuses on English and Arabic varieties across 18 countries, covering Modern Standard Arabic (MSA) as well as dialectal Arabic. It is designed to evaluate models beyond object recognition, targeting pragmatic, commonsense, and culturally grounded reasoning in real-world scenarios. We benchmark four closed-source models, three open-source models, and one fine-tuned model on OASIS. The framework and dataset will be made publicly available to the community. https://huggingface.co/datasets/QCRI/OASIS
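A few ratios follow directly from the headline figures in the abstract. The short calculation below uses only those stated (approximate) numbers; the derived values are rough and are not reported by the paper itself.

```python
# Figures as stated in the abstract (all approximate).
images = 0.92e6          # real images
qa_pairs = 14.8e6        # QA pairs
spoken_questions = 3.7e6 # spoken questions
human_hours = 383        # human-recorded speech, in hours
cloned_hours = 20_000    # voice-cloned speech, in hours

# Derived ratios (not reported in the paper).
qa_per_image = qa_pairs / images                           # about 16 QA pairs per image
spoken_share = spoken_questions / qa_pairs                 # about one quarter of QA pairs
human_share = human_hours / (human_hours + cloned_hours)   # just under 2% of speech hours

print(f"QA pairs per image: {qa_per_image:.1f}")
print(f"Share of spoken questions: {spoken_share:.0%}")
print(f"Human-recorded share of speech hours: {human_share:.1%}")
```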
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OASIS, a large-scale multilingual and multimodal dataset for culturally grounded spoken visual question answering. It comprises approximately 0.92M real images and 14.8M QA pairs, including 3.7M spoken questions with 383 hours of human-recorded speech and 20K hours of voice-cloned speech from 42 speakers. It covers English and Arabic varieties, including Modern Standard Arabic and dialects, across 18 countries. The dataset is constructed via the EverydayMMQA semi-automatic framework with multi-stage human-in-the-loop validation and is designed to evaluate models on pragmatic, commonsense, and culturally grounded reasoning beyond object recognition. The authors provide benchmarks for four closed-source models, three open-source models, and one fine-tuned model.
Significance. If the validation process successfully ensures that the QA pairs require culturally specific knowledge rather than generic visual or commonsense cues, OASIS would represent a significant contribution to the field of multimodal NLP and cultural AI. It fills a gap in existing VQA resources by incorporating spoken language, dialectal variations, and real-world cultural contexts from underrepresented regions. The public availability of the dataset and framework could facilitate the development of more culturally aware multimodal models. The inclusion of both text and speech modalities in multiple input settings adds to its utility for evaluating diverse model capabilities.
major comments (2)
- [Section 3] Section 3 (Dataset Construction): The multi-stage human-in-the-loop validation process is described at a high level, but the manuscript provides no quantitative metrics such as inter-annotator agreement on cultural dependency, cultural-relevance scores, or direct comparisons against VQA v2-style baselines to confirm that the 14.8M QA pairs require locale-specific knowledge from the 18 countries rather than universal visual cues or generic commonsense. This evidence is load-bearing for the central claim that OASIS targets culturally grounded reasoning beyond object recognition.
- [Section 5] Section 5 (Benchmarking): The reported model performances across the four input settings (text-only, speech-only, text+image, speech+image) do not include an ablation or subset analysis separating culturally specific items from generic ones, making it difficult to quantify the added challenge attributable to cultural grounding versus other factors like dialectal speech or image realism.
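The inter-annotator agreement requested in the first major comment is commonly reported as Cohen's kappa. The sketch below is a generic implementation for two annotators; the label names ("cultural", "generic") are hypothetical and the paper does not state which agreement statistic, if any, was computed.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy usage with placeholder labels (not real annotation data).
annotator_1 = ["cultural", "cultural", "generic", "generic"]
annotator_2 = ["cultural", "generic", "cultural", "generic"]
print(cohens_kappa(annotator_1, annotator_2))
```

With more than two annotators per item, a statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual substitute.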
minor comments (2)
- [Abstract and Introduction] The abstract and introduction should explicitly state the total number of unique speakers per dialect and any quality assurance steps applied to the 20K hours of voice-cloned speech to ensure it does not introduce artifacts that affect cultural grounding evaluation.
- [Dataset Statistics] Figure 1 or the dataset statistics table would benefit from a breakdown of QA pair distribution by country and dialect to better illustrate coverage across the 18 countries and Arabic varieties.
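The per-country and per-dialect breakdown suggested in the last minor comment amounts to a grouped count. A minimal sketch, assuming each item carries hypothetical "country" and "dialect" fields (these names are illustrative, not the dataset's schema):

```python
from collections import Counter

def coverage_table(items):
    """Count items per (country, dialect) and return rows with their share of the total."""
    counts = Counter((it["country"], it["dialect"]) for it in items)
    total = sum(counts.values())
    return [(country, dialect, n, n / total)
            for (country, dialect), n in sorted(counts.items())]

# Toy usage with placeholder values (not real dataset statistics).
toy_items = [
    {"country": "A", "dialect": "x"},
    {"country": "A", "dialect": "x"},
    {"country": "B", "dialect": "y"},
]
for country, dialect, n, share in coverage_table(toy_items):
    print(f"{country}\t{dialect}\t{n}\t{share:.1%}")
```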
Simulated Author's Rebuttal
We thank the referee for their positive summary of OASIS and for the constructive major comments. We agree that stronger quantitative evidence would better support the central claims about cultural grounding and will revise the manuscript accordingly.
- Referee: [Section 3] Section 3 (Dataset Construction): The multi-stage human-in-the-loop validation process is described at a high level, but the manuscript provides no quantitative metrics such as inter-annotator agreement on cultural dependency, cultural-relevance scores, or direct comparisons against VQA v2-style baselines to confirm that the 14.8M QA pairs require locale-specific knowledge from the 18 countries rather than universal visual cues or generic commonsense. This evidence is load-bearing for the central claim that OASIS targets culturally grounded reasoning beyond object recognition.
  Authors: We agree that quantitative validation metrics are important to substantiate the cultural specificity claim. In the revision we will report inter-annotator agreement scores computed during the human validation stages for cultural dependency and relevance, along with average cultural-relevance scores. We will also add a direct comparison on a sampled subset against VQA v2-style annotations to show that OASIS items require locale-specific knowledge beyond universal visual or commonsense cues. These additions will be placed in an expanded Section 3.
  Revision: yes
- Referee: [Section 5] Section 5 (Benchmarking): The reported model performances across the four input settings (text-only, speech-only, text+image, speech+image) do not include an ablation or subset analysis separating culturally specific items from generic ones, making it difficult to quantify the added challenge attributable to cultural grounding versus other factors like dialectal speech or image realism.
  Authors: We concur that an ablation isolating cultural grounding would strengthen the benchmarking section. We will add a subset analysis on a representative sample (stratified across countries and dialects) in which items are labeled as culturally specific versus generic based on the existing human validation data. Model results on these subsets will be reported for the four input settings to quantify the incremental difficulty due to cultural factors. Because of the dataset scale, the analysis will be performed on a large but sampled subset rather than the full 14.8M pairs.
  Revision: partial
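The subset analysis proposed in the second response could be organized as a stratified sample followed by a per-subset accuracy comparison. The sketch below illustrates that idea under stated assumptions: the field names ("country", "dialect", "subset", "correct") and the equal per-stratum sample size are hypothetical and are not described in the paper.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each stratum defined by `key(item)`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for it in items:
        strata[key(it)].append(it)
    sample = []
    for stratum in sorted(strata):
        group = strata[stratum]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

def accuracy_by_subset(items):
    """Accuracy per subset label, e.g. 'cultural' versus 'generic'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for it in items:
        totals[it["subset"]] += 1
        hits[it["subset"]] += int(it["correct"])
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical usage: stratify by (country, dialect), then compare subsets.
# sample = stratified_sample(all_items, lambda it: (it["country"], it["dialect"]), per_stratum=500)
# print(accuracy_by_subset(sample))
```

Running the comparison separately for each of the four input settings would give the gap between culturally specific and generic items per modality, which is the quantity the referee asks for.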
Circularity Check
No circularity: dataset construction paper with independent process description
This is a resource-creation paper whose central contribution is the OASIS dataset itself, built via the EverydayMMQA semi-automatic framework plus multi-stage human-in-the-loop validation. No mathematical derivations, equations, predictions, or fitted parameters exist that could reduce to inputs by construction. Claims about cultural grounding and coverage across 18 countries rest on the described annotation pipeline rather than any self-definitional loop, self-citation chain, or renaming of prior results. The process is presented as externally verifiable through the released data and framework, satisfying the criterion for a self-contained contribution against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human annotators can consistently judge cultural grounding and naturalness of generated QA pairs across 18 countries and Arabic varieties.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We introduce OASIS, a large-scale culturally grounded multimodal QA dataset covering images, text, and speech... targeting pragmatic, commonsense, and culturally grounded reasoning"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "EverydayMMQA framework... multi-stage human-in-the-loop validation... 14.8M QA pairs"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.