Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages

Enamul Hoque Prince; Firoj Alam; Shammur Absar Chowdhury

arxiv: 2605.17152 · v1 · pith:WD2BNU7Anew · submitted 2026-05-16 · 💻 cs.CL

Pith reviewed 2026-05-20 14:29 UTC · model grok-4.3

classification 💻 cs.CL

keywords multilingual LLMs · multimodal models · low-resource languages · adapter alignment · culture-aware evaluation · speech-text LLMs · vision-language models

The pith

Multilingual multimodal LLMs can be built for low-resource languages using low-cost data methods and adapter stacks for text-speech-vision alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The tutorial synthesizes practical approaches for extending multimodal large language models beyond English to handle text, speech, and vision in settings with scarce data and compute. It reviews foundations, specific models like PALO and Maya, techniques for inexpensive dataset creation and curation, adapter-based methods to align the three modalities, and evaluation frameworks that incorporate cultural context. A sympathetic reader would care because these steps aim to make advanced AI capabilities available in languages and regions where standard high-resource pipelines do not apply. The content includes hands-on resources for fine-tuning compact vision-language models and constructing speech-to-text-to-LLM pipelines.

Core claim

The paper argues that tri-modal multilingual systems can be assembled under tight resource constraints by combining low-cost data curation, adapter stacks that align vision, speech, and text representations, and evaluation protocols that move beyond English-centric benchmarks to include cultural awareness. As a tutorial proposal, it synthesizes existing work rather than demonstrating this claim with new experiments.

What carries the argument

Adapter stacks for tri-modal alignment, which stack lightweight modules to connect vision, speech, and text encoders efficiently in multilingual settings.
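The idea of lightweight modules connecting separate encoders to one language model can be illustrated with a minimal sketch. It assumes one small linear projector per non-text modality that maps frozen encoder features into the LLM embedding space, after which all tokens are concatenated into a single input sequence. The dimensions, the function names (`make_projector`, `project`, `build_llm_input`), and the random initialisation are hypothetical choices for illustration and are not taken from the tutorial.

```python
import random

LLM_DIM = 8                                # hypothetical LLM embedding width
ENCODER_DIMS = {"vision": 6, "speech": 4}  # hypothetical encoder output widths


def make_projector(in_dim, out_dim, seed):
    """Return a randomly initialised linear projector as (weights, bias)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(features, projector):
    """Map a sequence of encoder vectors into the shared LLM space."""
    weights, bias = projector
    return [
        [sum(w * x for w, x in zip(row, vec)) + b
         for row, b in zip(weights, bias)]
        for vec in features
    ]


# One small trainable projector per modality; the encoders and the LLM
# would stay frozen in this kind of setup (an assumption of the sketch).
projectors = {
    name: make_projector(dim, LLM_DIM, seed=i)
    for i, (name, dim) in enumerate(ENCODER_DIMS.items())
}


def build_llm_input(vision_feats, speech_feats, text_embeds):
    """Concatenate projected vision and speech tokens with text embeddings."""
    vision_tokens = project(vision_feats, projectors["vision"])
    speech_tokens = project(speech_feats, projectors["speech"])
    return vision_tokens + speech_tokens + text_embeds


# Placeholder inputs standing in for real encoder outputs.
vision_feats = [[1.0] * 6 for _ in range(3)]      # 3 image patches
speech_feats = [[0.5] * 4 for _ in range(2)]      # 2 audio frames
text_embeds = [[0.0] * LLM_DIM for _ in range(5)] # 5 text tokens

sequence = build_llm_input(vision_feats, speech_feats, text_embeds)
print(len(sequence), len(sequence[0]))
```

In this sketch only the projector weights would be trained, which is what keeps the trainable parameter count small relative to the encoders and the language model.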

If this is right

Researchers can create usable tri-modal models without requiring large-scale compute clusters.
Culture-aware evaluation will produce benchmarks that better reflect real-world performance in non-English contexts.
Speech-text pipelines can be wired into existing LLM setups at low additional cost.
Hands-on tutorials will enable more teams to experiment with compact multilingual vision-language models.
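The speech-text pipeline mentioned in the list above can be sketched as two swappable stages wired together. This is a rough illustration under stated assumptions: `SpeechToLLMPipeline`, `fake_asr`, and `fake_llm` are hypothetical names, the prompt template is invented for the example, and the stand-in functions would be replaced by a real speech recogniser and a real language model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechToLLMPipeline:
    """Wire an ASR stage into an LLM stage (hypothetical interface)."""
    transcribe: Callable[[bytes], str]  # speech -> text
    generate: Callable[[str], str]      # prompt -> reply
    language: str = "bn"                # assumed language tag for the prompt

    def build_prompt(self, transcript: str) -> str:
        # Pass the language tag along so the LLM can answer in kind.
        return (f"[lang={self.language}] User said: {transcript}\n"
                "Answer in the same language.")

    def run(self, audio: bytes) -> str:
        transcript = self.transcribe(audio).strip()
        if not transcript:
            # Nothing was recognised, so skip the LLM call entirely.
            return ""
        return self.generate(self.build_prompt(transcript))


def fake_asr(audio: bytes) -> str:
    """Stand-in recogniser: treats the bytes as UTF-8 text."""
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    """Stand-in model: echoes the first line of the prompt."""
    return f"echo: {prompt.splitlines()[0]}"


pipeline = SpeechToLLMPipeline(transcribe=fake_asr, generate=fake_llm)
reply = pipeline.run(b" hello ")
print(reply)
```

Keeping the two stages behind plain callables means either one can be replaced independently, for example when a better recogniser becomes available for a given language.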

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The adapter approach may scale to additional modalities if the alignment layers remain lightweight across languages.
Open release of the curation recipes could allow community-driven expansion to languages not covered in the original tutorial.
Testing the same pipeline on language pairs with different scripts or tonal systems would reveal where further adjustments become necessary.

Load-bearing premise

The described methods and resources will transfer to additional low-resource languages with only minor language-specific adjustments.

What would settle it

An experiment in which a new low-resource language is added to the fine-tuning pipeline described in the tutorial yet shows no improvement over English-only baselines on multimodal tasks.
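One way to frame that comparison is to score the same multimodal test items with the English-only baseline and with the adapted model, then compare accuracy per language. The sketch below assumes each evaluation record is a `(language, is_correct)` pair. The function names and the toy records are hypothetical placeholders and do not represent results from the paper.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Compute accuracy per language from (language, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for lang, is_correct in records:
        totals[lang] += 1
        hits[lang] += int(is_correct)
    return {lang: hits[lang] / totals[lang] for lang in totals}


def improvement(baseline_records, adapted_records, language):
    """Accuracy gain of the adapted model over the baseline for one language."""
    base = accuracy_by_language(baseline_records)[language]
    adapted = accuracy_by_language(adapted_records)[language]
    return adapted - base


# Toy placeholder records, NOT measured results.
baseline = [("sw", True), ("sw", False), ("sw", False), ("sw", False),
            ("en", True), ("en", True)]
adapted = [("sw", True), ("sw", True), ("sw", True), ("sw", False),
           ("en", True), ("en", True)]

delta = improvement(baseline, adapted, "sw")
print(f"accuracy change for 'sw': {delta:+.2f}")
```

A gain at or near zero for the newly added language would be the outcome that counts against the transfer premise. A real study would also need enough test items per language for the difference to be meaningful.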

Original abstract

Multimodal LLMs are evolving from vision-language to tri-modality that see, hear, and read, yet pipelines and benchmarks remain English-centric and compute-heavy. The tutorial offers an overview of this emerging research area for multilingual multimodality across text, speech, and vision under limited data/compute budgets, synthesizing foundations, recent multilingual models (PALO, Maya), speech-text LLMs. We cover low-cost data creation/curation; adapter stacks for tri-modal alignment; culture-aware evaluation beyond English and hands on resources for fine-tuning a compact multilingual VLM and wiring a speech->text->LLM pipeline. The content will be delivered as an interactive half-day tutorial, designed for researchers and practitioners working on multilingual, multimodal AI in low-resource language settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a tutorial proposal synthesizing existing work on multilingual multimodal LLMs for low-resource languages rather than reporting new results or experiments.

This is a tutorial proposal on multilingual and multimodal LLMs for low-resource languages. The core point is that it organizes existing techniques into a teaching format rather than advancing new models or data. It pulls together foundations, models like PALO and Maya, speech-text LLMs, low-cost data methods, adapter stacks for alignment, and culture-aware evaluation into one half-day session with some hands-on resources for fine-tuning a compact VLM and wiring a speech-to-text-to-LLM pipeline.

That structure could help practitioners who need a starting point for work in under-supported languages. The outline addresses real pain points like English-centric pipelines and high compute demands in a direct way.

The main limitation is the lack of any new experiments, datasets, or validation. Transfer to additional low-resource languages is presented as workable under limited budgets, but without case studies or results shown, that remains an untested assumption. The paper does not claim novel algorithms, so the citation pattern to prior models is appropriate and not overstated. Readers looking for original empirical findings or formal derivations will not find them.

This is aimed at researchers and practitioners building inclusive AI systems who want an organized overview and practical pointers. It could be useful in a reading group focused on applied multilingual work. I would send it for peer review in a venue that accepts tutorial proposals because the topic is relevant and the plan looks organized, even though it is not a standard research paper.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes a half-day interactive tutorial on multilingual and multimodal LLMs for low-resource languages. It synthesizes foundations of multimodal models, recent work on models such as PALO and Maya, speech-text LLMs, low-cost data creation/curation techniques, adapter stacks for tri-modal alignment, culture-aware evaluation methods beyond English-centric benchmarks, and hands-on resources for fine-tuning compact multilingual VLMs and constructing speech-to-text-to-LLM pipelines, targeted at researchers and practitioners operating under limited data and compute budgets.

Significance. If delivered as outlined, the tutorial would provide a timely and practical synthesis of an emerging area, helping to address the English-centric and compute-heavy nature of current multimodal LLM pipelines. The emphasis on low-resource settings, low-cost data methods, and culture-aware evaluation could meaningfully support more inclusive model development. The inclusion of hands-on components for fine-tuning and pipeline wiring is a strength that could translate the overview into actionable skills for the target audience.

minor comments (2)

Abstract: the list of covered topics is dense; explicitly indicating the approximate time allocation or session structure for foundations versus hands-on components would help readers evaluate feasibility within a half-day format.
Abstract: 'hands on resources' is mentioned but not illustrated with example datasets, code repositories, or specific tools; adding one or two concrete examples would strengthen the practical appeal without altering the proposal's scope.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the tutorial proposal and for recommending minor revision. The summary accurately reflects the intended scope, including the focus on low-resource settings, tri-modal alignment, and practical hands-on components.

Circularity Check

0 steps flagged

No significant circularity in tutorial overview

This document is a tutorial proposal synthesizing existing external work on multilingual multimodal LLMs for low-resource languages. It covers foundations, models such as PALO and Maya, speech-text LLMs, low-cost data methods, adapter stacks, and culture-aware evaluation without any original derivations, equations, or empirical claims. No load-bearing step reduces a prediction or result to fitted inputs or self-citations by construction; transferability is framed as educational content rather than a tested derivation. The paper is self-contained as an overview referencing prior independent work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a tutorial overview with no new mathematical models, free parameters, axioms, or invented entities; it draws on prior literature for models such as PALO and Maya and standard multimodal techniques.

pith-pipeline@v0.9.0 · 5666 in / 1119 out tokens · 42596 ms · 2026-05-20T14:29:34.055716+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "We cover low-cost data creation/curation; adapter stacks for tri-modal alignment; culture-aware evaluation beyond English and hands on resources for fine-tuning a compact multilingual VLM and wiring a speech→text→LLM pipeline."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Adapter/projector stacks for VLMs (e.g., BLIP-2 Q-Former), early vs. late fusion; PEFT in practice (LoRA/QLoRA)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

6th International Conference on Learning Representations (ICLR 2018), Workshop Track Proceedings , publisher =

FigureQA: An Annotated Figure Dataset for Visual Reasoning , author =. 6th International Conference on Learning Representations (ICLR 2018), Workshop Track Proceedings , publisher =. 2018 , url =

work page 2018

[2] [2]

Advances in Neural Information Processing Systems , volume =

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs , author =. Advances in Neural Information Processing Systems , volume =. 2024 , url =

work page 2024

[3] [3]

Findings of the Association for Computational Linguistics: ACL 2025 , address =

ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering , author =. Findings of the Association for Computational Linguistics: ACL 2025 , address =. 2025 , pages =. doi:10.18653/v1/2025.findings-acl.978 , url =

work page doi:10.18653/v1/2025.findings-acl.978 2025

[4] [4]

Findings of the Association for Computational Linguistics: EACL 2026 , address =

DashboardQA: Benchmarking Multimodal Agents for Question Answering on Interactive Dashboards , author =. Findings of the Association for Computational Linguistics: EACL 2026 , address =. 2026 , pages =. doi:10.18653/v1/2026.findings-eacl.177 , url =

work page doi:10.18653/v1/2026.findings-eacl.177 2026

[5] [5]

Chen, Xi and Wang, Xiao and Changpinyo, Soravit and Padlewski, Piotr and Salz, Daniel and Goodman, Sebastian and Grycner, Adam and Mustafa, Basil and Beyer, Lucas and others , journal=

work page

[6] [6]

Bapna, Ankur and Cherry, Colin and Zhang, Yu and Jia, Ye and Johnson, Melvin and Cheng, Yong and Khanuja, Simran and Riesa, Jason and Conneau, Alexis , journal=

work page

[7] [7]

arXiv preprint arXiv:2502.05568 , year=

Large multimodal models for low-resource languages: a survey , author=. arXiv preprint arXiv:2502.05568 , year=

work page arXiv

[8] [8]

arXiv preprint arXiv:2311.13165 , year =

Multimodal Large Language Models: A Survey , author =. arXiv preprint arXiv:2311.13165 , year =

work page arXiv

[9] [9]

National Science Review , volume =

A survey on multimodal large language models , author =. National Science Review , volume =. 2024 , doi =

work page 2024

[10] [10]

2023 , publisher =

Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven , booktitle =. 2023 , publisher =

work page 2023

[11] [11]

Advances in Neural Information Processing Systems 36 (NeurIPS 2023) , year =

Visual Instruction Tuning , author =. Advances in Neural Information Processing Systems 36 (NeurIPS 2023) , year =

work page 2023

[12] Driess, Danny and Xia, Fei and Sajjadi, Mehdi S. M. and Lynch, Corey and Chowdhery, Aakanksha and Ichter, Brian and Wahid, Ayzaan and Tompson, Jonathan and Vuong, Quan and Yu, Tianhe and Huang, Wenlong and Chebotar, Yevgen and Sermanet, Pierre and Duckworth, Daniel and Levine, Sergey and Vanhoucke, Vincent and Hausman, Karol and Toussaint, Marc and Greff,...

[13] Language Is Not All You Need: Aligning Perception with Language Models. arXiv preprint arXiv:2302.14045.

[14] Rasheed, Hanoona and Maaz, Muhammad and Shaker, Abdelrahman and Khan, Salman and Cholakkal, Hisham and Anwer, Rao M. and Baldwin, Timothy and Felsberg, Michael and Khan, Fahad S. 2024.

[15] Maya: An Instruction Finetuned Multilingual Multimodal Model. arXiv preprint arXiv:2412.07112.

[16] Joint speech and text machine translation for up to 100 languages. Nature, 2025.

[17] Rubenstein, Paul K. and Asawaroengchai, Chulayuth and Nguyen, Duc Dung and others. 2023.

[18] Shen, Leyang and Chen, Gongwei and Shao, Rui and Guan, Weili and Nie, Liqiang. 2024.

[19] Pfeiffer, Jonas and Geigle, Gregor and Kamath, Aishwarya and Steitz, Jan-Martin O. and Roth, Stefan and Vulić, Ivan and Gurevych, Iryna. 2022.

[20] Visually Grounded Reasoning across Languages and Cultures. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.

[21] Parida, Shantipriya and Abdulmumin, Idris and Muhammad, Shamsuddeen Hassan and Bose, Aneesh and Kohli, Guneet Singh and Ahmad, Ibrahim and Kotwal, Ketan and Deb Sarkar, Sayan and Bojar, Ondřej and Kakudi, Habeebah. 2023.

[22] Alam, Firoj and Shahroor, Ali Ezzat and Hasan, Md Arid and Ali, Zien Sheikh and Bhatti, Hunzalah Hassan and Kmainasi, Mohamed Bayan and Chowdhury, Shammur Absar and Mousi, Basel and Dalvi, Fahim and Durrani, Nadir and others.

[23] Driess, Danny and Xia, Fei and Sajjadi, Mehdi S. M. and others. 2023.

[24] Li, Yunxin and Jiang, Shenyuan and Hu, Baotian and Wang, Longyue and Zhong, Wanqi and Luo, Wenhan and Ma, Lin and Zhang, Min. 2025.

[25] Alam, Firoj and Chowdhury, Shammur Absar and Boughorbel, Sabri and Hasanain, Maram.

[26] Holistic Evaluation of Language Models. arXiv preprint arXiv:2211.09110.

[27] A Survey on In-context Learning. arXiv preprint arXiv:2301.00234.

[28] A Survey of Large Language Models. arXiv preprint arXiv:2303.18223.

[29] A systematic survey of prompt engineering on vision-language foundation models. arXiv preprint arXiv:2307.12980.

[30] Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 2023.

[31] Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang... 2023. doi:10.5281/zenodo.10256836

[32] AnyMAL: An efficient and scalable any-modality augmented language model. arXiv preprint arXiv:2309.16058.

[33] Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805.

[34] OpenAI. 2023.

[35] Zhu, Kaijie and Wang, Jindong and Zhou, Jiaheng and Wang, Zichen and Chen, Hao and Wang, Yidong and Yang, Linyi and Ye, Wei and Gong, Neil Zhenqiang and Zhang, Yue and others.

[36] Wu, Zhenyu and Wang, Yaoxiang and Ye, Jiacheng and Wu, Zhiyong and Feng, Jiangtao and Xu, Jingjing and Qiao, Yu. OpenICL: An Open-Source Framework for In-context Learning. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), 2023.

[37] Fahim Dalvi and Maram Hasanain and Sabri Boughorbel and Basel Mousi and Samir Abdaljalil and Nizi Nazar and Ahmed Abdelali and Shammur Absar Chowdhury and Hamdy Mubarak and Ahmed Ali and Majd Hawasly and Nadir Durrani and Firoj Alam. arXiv:2308.04945.

[38] The use of MMR, diversity-based reranking for reordering documents and producing summaries. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

[39] GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP. 2023.

[40] Zero- and Few-Shot Prompting with LLMs: A Comparative Study with Fine-tuned Models for Bangla Sentiment Analysis. 2023. arXiv:2308.10783.

[41] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv preprint arXiv:2301.12597.

[42] Vicuna: An open-source chatbot impressing GPT-4 with 90... See https://vicuna.lmsys.org (accessed 14 April 2023).

[43] Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed.

[44] BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv preprint arXiv:2211.05100.

[45] Thoppilan, Romal and De Freitas, Daniel and Hall, Jamie and Shazeer, Noam and Kulshreshtha, Apoorv and Cheng, Heng-Tze and Jin, Alicia and Bos, Taylor and Baker, Leslie and Du, Yu and others.

[46] Prompt programming for large language models: Beyond the few-shot paradigm. Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems.

[47] Automatic Chain of Thought Prompting in Large Language Models. arXiv preprint arXiv:2210.03493.

[48] GPT-4 Technical Report. 2023.

[49] Lai, Viet Dac and Ngo, Nghia Trung and Veyseh, Amir Pouran Ben and Man, Hieu and Dernoncourt, Franck and Bui, Trung and Nguyen, Thien Huu.

[50] Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv preprint arXiv:2112.11446.

[51] mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.

[52] Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Veselin and Zettlemoyer, Luke.

[53] Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin.

[54] Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae. Visual Instruction Tuning. arXiv:2304.08485.

[55] Language models are unsupervised multitask learners. OpenAI blog.

[56] SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

[57] Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144.

[58] Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[59] Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y... Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP. arXiv preprint arXiv:2112.10508.

[60] Benchmarking Arabic AI with Large Language Models. arXiv preprint arXiv:2305.14982.

[61] Can LLMs facilitate interpretation of pre-trained language models? 2023.

[62] Einea, Omar and Elnagar, Ashraf and Al Debsi, Ridhwan. Sanad: Single-label. 2019.

[63] Darwish, Kareem and Abdelali, Ahmed and Mubarak, Hamdy and Samih, Younes and Attia, Mohammed. Diacritization of Moroccan and Tunisian

[64] PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

[65] Large Language Models Are Human-Level Prompt Engineers. NeurIPS 2022 Foundation Models for Decision Making Workshop.

[66] Nabil, Mahmoud and Aly, Mohamed and Atiya, Amir. Astd:

[67] SemEval-2017 Task 4: Sentiment Analysis in Twitter. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017).

[68] Sarcasm detection: A comparative study. arXiv preprint arXiv:2107.02276.

[69] Attentional Multi-Reading Sarcasm Detection. arXiv preprint arXiv:1809.03051.

[70] SemEval-2018 Task 1: Affect in Tweets. Proceedings of the 12th International Workshop on Semantic Evaluation.

[71] Cross-lingual Emotion Detection. Proceedings of the Thirteenth Language Resources and Evaluation Conference.

[72] Universals and cultural differences in facial expressions of emotion. Nebraska Symposium on Motivation.

[73] Mower, Emily and Matarić, Maja J and Narayanan, Shrikanth. A Framework for Automatic Human Emotion Classification Using Emotion Profiles.

[74] Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2018.

[75] A survey of opinion mining and sentiment analysis. Mining Text Data, 2012.

[76] Etman, A. and Beex, A. A. Louis. Language and Dialect Identification: A survey.

[77] Improving speech translation with automatic boundary prediction. Eighth Annual Conference of the International Speech Communication Association.

[78] Douglas A. Jones and Florian Wolf and Edward Gibson and Elliott Williams and Evelina Fedorenko and Douglas A. Reynolds and Marc A. Zissman. 8th European Conference on Speech Communication and Technology, 2003.

[79] Multilingual spoken language corpus development for communication research. International Journal of Computational Linguistics & Chinese Language Processing, Volume 12, Number 3, September 2007: Special Issue on Invited Papers from ISCSLP 2006.

[80] Contents-Based Spam Detection on Social Networks Using RoBERTa Embedding and Stacked BLSTM. SN Computer Science, 2023.