arxiv: 2606.03304 · v1 · pith:6XFEXPJ4new · submitted 2026-06-02 · 💻 cs.CL · cs.LG

From Script to Semantics: Prompting Strategies for African NLI

Pith reviewed 2026-06-28 10:10 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords prompting strategies · natural language inference · African languages · low-resource languages · multilingual evaluation · zero-shot prompting · class balance

The pith

Contrastive prompting gives the most reliable gains for natural language inference in Swahili, Yoruba, and Hausa.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests five prompt designs on two mid-sized models for natural language inference in three African languages, keeping the setup free of few-shot examples and chain-of-thought steps. It reports that contrastive prompting produces steadier accuracy lifts across languages and models while reducing the tendency to collapse answers into one class. The authors conclude that prompt wording by itself can outperform stronger baselines that rely on extra examples or reasoning steps. This finding matters for low-resource languages because it shows that existing models can be made more usable through prompt structure alone rather than additional training data.

Core claim

Contrastive prompting is the most reliable and steadily improving strategy across languages and models; it keeps predictions better balanced across answer classes while still delivering overall accuracy gains, and well-constructed prompts suffice to surpass baselines supplied with few-shot examples and chain-of-thought reasoning.

What carries the argument

Contrastive prompting, a prompt structure that presents options for explicit comparison during inference.
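
As a rough illustration of what such a structure might look like, the sketch below assembles a contrastive-style prompt from the instruction and interpretation wording quoted in the appendix excerpts in the reference graph on this page. The helper name `build_contrastive_prompt` and the exact line layout are assumptions made for illustration; this is not the authors' released code.

```python
# Hypothetical helper: the wording follows the appendix excerpts,
# but the line-by-line layout is an assumption.
INTERPRETATIONS = {
    "entailment": "the premise makes the hypothesis true.",
    "contradiction": "the premise makes the hypothesis false.",
    "neutral": "the premise neither guarantees nor contradicts the hypothesis.",
}


def build_contrastive_prompt(premise: str, hypothesis: str) -> str:
    """Return a zero-shot prompt that lays the three labels side by side."""
    lines = [
        "Consider each of the three possibilities below.",
        "Decide which one best matches the relationship between the sentences.",
        "Output exactly ONE English word: entailment, contradiction, or neutral.",
        "Do NOT output explanations or any extra text.",
        "Interpretations:",
    ]
    # One bullet per candidate label, so the model compares them explicitly.
    lines += [f"- {label}: {gloss}" for label, gloss in INTERPRETATIONS.items()]
    lines += [
        f'Premise: "{premise}"',
        f'Hypothesis: "{hypothesis}"',
        "Which interpretation fits best?",
        "Answer:",
    ]
    return "\n".join(lines)
```

The point of the layout is that all three candidate readings appear together before the sentence pair, which is what the page means by explicit comparison.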

If this is right

  • Prompt design can substitute for few-shot examples when performing inference on low-resource languages.
  • Language-aware decision structuring improves robustness without requiring model fine-tuning.
  • Neutral-class collapse can be reduced by changing how the prompt organizes possible answers.
  • Performance ordering among prompt strategies holds across at least two different mid-sized models.
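
Neutral-class collapse can be made measurable with per-class F1 and the share of predictions that go to each label. The following is a minimal sketch in plain Python; the helper names and the toy gold/predicted lists are assumptions for illustration and do not come from the paper.

```python
from collections import Counter

LABELS = ("entailment", "contradiction", "neutral")


def per_class_f1(gold, pred):
    """Return {label: F1}, computed one-vs-rest for each NLI label."""
    scores = {}
    for label in LABELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def prediction_share(pred):
    """Return the fraction of predictions assigned to each label."""
    counts = Counter(pred)
    return {label: counts[label] / len(pred) for label in LABELS}


# Toy data (made up): "neutral" is never predicted, so its F1 and share are 0.0.
toy_gold = ["entailment", "contradiction", "neutral", "neutral"]
toy_pred = ["entailment", "contradiction", "entailment", "contradiction"]
toy_f1 = per_class_f1(toy_gold, toy_pred)
toy_macro_f1 = sum(toy_f1.values()) / len(LABELS)
toy_share = prediction_share(toy_pred)
```

On a balanced three-way test set, a prediction share far from one third for any label is a quick signal of skew, independent of accuracy.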

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contrastive structure might transfer to other classification tasks beyond NLI in the same languages.
  • Adding statistical controls or more models could clarify whether prompt effects remain dominant.
  • The approach could be checked on languages outside the three studied to test broader applicability.

Load-bearing premise

Differences in results across the five prompt strategies come mainly from the wording of the prompts rather than from interactions with the two chosen models or the single benchmark dataset.

What would settle it

Repeating the five strategies on a third model or an additional African-language NLI dataset and finding no consistent advantage or better class balance for the contrastive version.

Figures

Figures reproduced from arXiv: 2606.03304 by Anuj Tiwari, Hannah Nwokocha, Terry Oko-odion.

Figure 1. Delta accuracy improvement over the zero-shot baseline for Llama3.2-3B across prompting strategies and languages. Positive values indicate gains over baseline; negative values indicate degradation. The most consistent improvements are brought by contrastive prompting (across languages).
Figure 2. Table of delta accuracy improvement of Gemma3-4B between zero-shot base and prompting strategies in both languages. Contrastive prompting shows stable cross-language gains compared to other structured prompts.
Figure 3. Per-class F1 heatmap of Gemma3-4B across prompting strategies, languages, and labels (E, C, N). Darker cells indicate higher values, i.e. better per-class performance. Under a variety of prompting set-ups, neutral-class instability is apparent.
Figure 4. Per-class F1 heatmap for Llama3.2-3B across prompting strategies, languages, and labels. Compared to Gemma3-4B, Llama exhibits stronger sensitivity to prompt formulation and greater class imbalance in certain configurations.
Figure 5. Neutral-class F1 across prompting strategies for Llama3.2-3B. Several strategies show neutral-class collapse, while Contrastive prompting has more stable neutral performance.
Figure 6. F1 of neutral class with respect to prompt
Original abstract

Large language models (LLMs) are increasingly evaluated in multilingual settings, yet their inference behavior in low-resource African languages remains underexplored, especially under pure prompting without fine-tuning. We present a systematic study of prompting strategies for Natural Language Inference (NLI) in Swahili, Yoruba, and Hausa using the AfriXNLI benchmark. We evaluate five prompting strategies, Baseline (zero-shot), Script-Aware, Language Specific, Contrastive, and Native-Label Self-Translation (NL-STP), across two mid-sized open-weight models (Llama3.2-3B and Gemma3-4B). To isolate the effect of prompt design, the effect of few-shot examples and Chain-of-Thought reasoning is eliminated in our study. We find a significant difference in performance of class wise across strategies with highly neutral class collapse and high prediction skew in some configurations. Contrastive prompting proves to be the most reliable and steadily improving strategy over language and model and has better balance of class behavior and balance of overall accuracy gains. Notably, well-constructed prompts are sufficient to beat more powerful baselines that are provided with few-shot prompts and Chain-of-Thought prompts. We have found that prompt formulation is essential to multilingual NLI with low-resource languages and that language-aware decision structuring can be used to meaningfully enhance robustness in resource-challenged settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper evaluates five prompting strategies (Baseline zero-shot, Script-Aware, Language Specific, Contrastive, and Native-Label Self-Translation) for NLI on the AfriXNLI benchmark in Swahili, Yoruba, and Hausa, using Llama3.2-3B and Gemma3-4B. Few-shot examples and Chain-of-Thought are eliminated internally to isolate prompt effects. It reports class-wise performance differences including neutral-class collapse and prediction skew in some setups, identifies Contrastive prompting as the most reliable strategy with steady improvement across languages/models and better class balance/accuracy gains, and claims that well-constructed prompts suffice to outperform more powerful few-shot/CoT baselines. The conclusion stresses the importance of prompt formulation for robustness in low-resource multilingual NLI.

Significance. If the attribution of gains to prompt design holds under expanded testing, the work would provide concrete evidence that targeted prompting can mitigate class imbalance and improve NLI reliability in African languages without fine-tuning, offering actionable guidance for low-resource multilingual settings and underscoring language-aware decision structuring as a low-cost robustness tool.

major comments (3)
  1. [Abstract] Abstract: the statement of a 'significant difference in performance of class wise across strategies' is unsupported because no quantitative metrics, error bars, sample sizes per language/model, or statistical significance tests are reported, leaving the magnitude and reliability of observed differences (including neutral collapse and skew) unquantified.
  2. [Abstract] Abstract and experimental setup: the central attribution that performance differences (class balance, accuracy gains, reliability across languages) can be attributed primarily to the five prompting strategies is under-supported, as the study uses only two models and one benchmark without additional controls, ablations for model pretraining overlap, label skew, or statistical tests; the claim of 'steadily improving over language and model' therefore extrapolates from a narrow sample.
  3. [Abstract] Abstract: the claim that 'well-constructed prompts are sufficient to beat more powerful baselines that are provided with few-shot prompts and Chain-of-Thought prompts' requires explicit reporting of the external baseline numbers, models, and exact configurations used for comparison, since the internal experiments eliminate few-shot/CoT and the manuscript provides no side-by-side table or quantitative delta.
minor comments (2)
  1. The five strategy names are introduced without a dedicated table or section summarizing their exact prompt templates, making it hard to reproduce the 'well-constructed' variants.
  2. No mention of the total number of test instances per language or any handling of label distribution in AfriXNLI, which is relevant given the reported class-collapse issues.
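
The first major comment asks for error bars. As a rough guide to how wide they might be, the sketch below computes a Wilson score interval for an accuracy estimate. The test-set size of 600 comes from the paper excerpts quoted in the reference graph on this page; the function name and the example count of 300 correct answers are illustrative assumptions, not reported results.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Return (low, high) Wilson score bounds for a proportion correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Placeholder count (not from the paper): 300 correct out of 600 test items.
low, high = wilson_interval(correct=300, total=600)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

At 50% accuracy on 600 items the interval spans roughly four percentage points on either side, so gaps of only a few points between strategies may not be distinguishable without a paired test on the same items.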

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and have revised the manuscript to strengthen the claims where appropriate while maintaining accuracy to our experimental scope.

  1. Referee: [Abstract] Abstract: the statement of a 'significant difference in performance of class wise across strategies' is unsupported because no quantitative metrics, error bars, sample sizes per language/model, or statistical significance tests are reported, leaving the magnitude and reliability of observed differences (including neutral collapse and skew) unquantified.

    Authors: We agree that 'significant' implies statistical testing not performed in the study. In the revised version we have changed the phrasing to 'notable differences in class-wise performance' and added explicit per-class accuracy numbers, sample sizes from AfriXNLI (600 examples per language), and error bars on the relevant figures in the results section. revision: yes

  2. Referee: [Abstract] Abstract and experimental setup: the central attribution that performance differences (class balance, accuracy gains, reliability across languages) can be attributed primarily to the five prompting strategies is under-supported, as the study uses only two models and one benchmark without additional controls, ablations for model pretraining overlap, label skew, or statistical tests; the claim of 'steadily improving over language and model' therefore extrapolates from a narrow sample.

    Authors: We acknowledge the limited scope (two models, three languages, single benchmark) and have added an explicit limitations paragraph noting that broader validation is required. Within the controlled setting that isolates prompt design by removing few-shot and CoT, the observed trends remain consistent; we have revised the abstract wording to 'demonstrates consistent improvement across the three evaluated languages and two models' and included additional discussion of label distribution in AfriXNLI. revision: partial

  3. Referee: [Abstract] Abstract: the claim that 'well-constructed prompts are sufficient to beat more powerful baselines that are provided with few-shot prompts and Chain-of-Thought prompts' requires explicit reporting of the external baseline numbers, models, and exact configurations used for comparison, since the internal experiments eliminate few-shot/CoT and the manuscript provides no side-by-side table or quantitative delta.

    Authors: The manuscript references external few-shot and CoT results from the AfriXNLI benchmark paper, but we accept that a direct side-by-side comparison is missing. We have added a new table in the results section that reports the exact baseline numbers, model sizes, and configurations from the cited works alongside our contrastive-prompt results with quantitative deltas. revision: yes
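
A side-by-side comparison of the kind described in the third response reduces to per-language differences against a reference configuration. The sketch below shows one way to compute them in percentage points; the function name is hypothetical and every number in it is a made-up placeholder, not a result from the paper or from the AfriXNLI benchmark paper.

```python
def accuracy_deltas(baseline: dict, strategy: dict) -> dict:
    """Return strategy minus baseline accuracy, in percentage points, per language."""
    return {
        lang: round((strategy[lang] - baseline[lang]) * 100, 1)
        for lang in baseline
        if lang in strategy  # only compare languages present in both runs
    }


# Placeholder accuracies (fractions of correct answers), invented for illustration.
placeholder_baseline = {"sw": 0.40, "yo": 0.35, "ha": 0.38}
placeholder_contrastive = {"sw": 0.45, "yo": 0.37, "ha": 0.38}

for lang, delta in accuracy_deltas(placeholder_baseline, placeholder_contrastive).items():
    print(f"{lang}: {delta:+.1f} pp")
```

Positive values indicate gains over the reference and negative values indicate degradation, matching the convention in the captions of Figures 1 and 2.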

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of prompting strategies

The paper reports results from running five prompting strategies on two models and one benchmark (AfriXNLI). No equations, fitted parameters, derivations, or self-referential claims appear in the abstract or described methodology. Performance differences are presented as observations from model runs rather than predictions derived from prior results by the same authors. The central claim (contrastive prompting superiority) rests on experimental outcomes, not on any reduction to inputs by construction or load-bearing self-citation. This is a standard empirical evaluation with no detectable circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation study with no mathematical derivations, free parameters, axioms, or invented entities; all claims rest on experimental observations from two models and one benchmark.

pith-pipeline@v0.9.1-grok · 5778 in / 1070 out tokens · 25657 ms · 2026-06-28T10:10:41.041403+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 1 linked inside Pith

  1. [1]

    Introduction Large language models (LLMs) have shown high efficiency in performing most natural language understanding tasks, however, their performance in low-resource multilingual conditions has been under-characterized. Specifically, Natural Language Inference (NLI) is one of the fundamental tasks that can be used to assess reasoning and semantic comprehen...

  2. [2]

    Related Work 2.1. African Language Benchmarks and Multilingual Evaluation Recent large-scale evaluation efforts have highlighted persistent performance gaps between African languages and high-resource languages in large language models. IrokoBench (Adelani et al., 2025) presents assessment suites including AfriXNLI, AfriMGSM and AfriMMLU across 17 Afr...

  3. [3]

    and AfriTeVa (Jude Ogundepo et al., 2023) in the situations where supervised data exists. While these benchmarks provide performance comparison in a broad way, they use a small number of prompt templates and do not systematically design the prompt, such as instruction framing, label semantics, or cultural grounding. Consequently, the role of prompt structure has become an un...

  4. [4]

    All strategies require the model to output exactly one English label (entailment, contradiction, or neutral) without explanations

    Prompting Strategies We evaluate five zero-shot prompting strategies designed to systematically vary linguistic grounding and decision structure while keeping the task formulation constant. All strategies require the model to output exactly one English label (entailment, contradiction, or neutral) without explanations. Full prompt templates are provided in...

  5. [5]

    Dataset We evaluate our prompting strategies on the AfriXNLI benchmark, a multilingual Natural Language Inference (NLI) dataset covering several African languages

    Experimental Setup 4.1. Dataset We evaluate our prompting strategies on the AfriXNLI benchmark, a multilingual Natural Language Inference (NLI) dataset covering several African languages. We are targeting three languages: Swahili, Yoruba, and Hausa. In all languages, the full test set of 600 examples (equally balanced by the three labels entailment,...

  6. [6]

    Metrics are reported on the complete 600-example test split of each language

    Results We evaluate five prompting strategies across three languages (Swahili, Yoruba, Hausa) and two mid-sized open-weight models (Llama3.2-3B and Gemma3-4B). Metrics are reported on the complete 600-example test split of each language. The overall results are summarized in Table 1 of the Appendix, which reports accuracy and macro-F1 score values across all ...

  7. [7]

    Overall Performance In all languages and models, there are significant variations between performance based on prompt formulation

    Analysis 6.1. Overall Performance In all languages and models, there are significant variations between performance based on prompt formulation. The zero-shot prompt baseline has a moderate level of accuracy, but contains Figure 1: Delta accuracy improvement over the zero-shot baseline for Llama3.2-3B across prompting strategies and languages. Posit...

  8. [8]

    Even where the general differences in accuracy are intermediate, there is significant restructuring of the behavior and prediction distributions of classes by prompt structure

    Discussion The results of our study indicate that the level of multilingual NLI in African languages under low resources is extremely delicate as to prompt formulation. Even where the general differences in accuracy are intermediate, there is significant restructuring of the behavior and prediction distributions of classes by prompt structure. This ...

  9. [9]

    Our results show that prompt design significantly shapes class-wise behavior and prediction stability, even when overall accuracy differences are modest

    Conclusion In AfriXNLI benchmark and with two open-weight models with middle size, we performed a systematic survey of zero-shot prompting methods on Natural Language Inference in Swahili, Yoruba, and Hausa. Our results show that prompt design significantly shapes class-wise behavior and prediction stability, even when overall accuracy differences a...

  10. [10]

    The finding might not be generalizable to other low-resource languages

    Ethical Considerations and Limitations • Low coverage of the languages: We only assess three African languages (Swahili, Yoruba, Hausa). The finding might not be generalizable to other low-resource languages. • Single benchmark: Experiments are done on the AfriXNLI only. It would be improved by testing other NLI data. • Model scale constraints: We use...

  11. [11]

    Bibliographical References David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, and Others. 2023. MasakhaNEWS: News topic classification for African languages. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics ...

  12. [12]

    Reason as a native {lang} speaker would in daily conversation, not using formal logic

  13. [13]

    Consider what a typical speaker would naturally infer from the first sentence about the second

  14. [14]

    Decide the relationship based on common-sense and pragmatic understanding in {lang}

  15. [16]

    {premise}

    Do NOT output explanations or any extra text. Decision rules (according to native {lang} usage): - entailment: a typical {lang} speaker would accept the second sentence as true because of the first. - contradiction: a typical {lang} speaker would judge the second sentence as incompatible with the first. - neutral: a typical {lang} speaker would find that ...

  16. [17]

    Consider each of the three possibilities below

  17. [18]

    Decide which one best matches the relationship between the sentences

  18. [19]

    Output exactly ONE English word: entailment, contradiction, or neutral

  19. [20]

    {premise}

    Do NOT output explanations or any extra text. Interpretations: - entailment: the premise makes the hypothesis true. - contradiction: the premise makes the hypothesis false. - neutral: the premise neither guarantees nor contradicts the hypothesis. Premise: "{premise}" Hypothesis: "{hypothesis}" Which interpretation fits best? Answer: """ A.3. Native-Label ...

  20. [21]

    Do NOT output it

    Internally transliterate the text into {lang} written in Latin script. Do NOT output it

  21. [23]

    {premise}

    Output exactly ONE English word: entailment, contradiction, or neutral. Decision rules: - entailment: premise makes hypothesis true. - contradiction: premise makes hypothesis false. - neutral: neither true nor false. Premise (Ajami): "{premise}" Hypothesis (Ajami): "{hypothesis}" Answer: """ Latin Script Variant: PROMPT = """ You are a fluent {lang} speak...

  22. [24]

    Read and reason in {lang}

  23. [25]

    Decide the relationship

  24. [26]

    {premise}

    Output exactly ONE English word: entailment, contradiction, or neutral. Decision rules: - entailment: premise makes hypothesis true. - contradiction: premise makes hypothesis false. - neutral: neither true nor false. Premise: "{premise}" Hypothesis: "{hypothesis}" Answer: """ B. Appendix - Full Results Strategy Lang Model Acc Macro-F1 Script Aware Sw Llam...
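
The templates quoted above all ask for exactly one English label with no extra text. The excerpts on this page do not say how model outputs were mapped to labels, so the following is only a sketch of one strict way to do it; `parse_label` is a hypothetical helper, and treating ambiguous outputs as invalid is an assumption rather than the authors' stated procedure.

```python
import re

# The three labels named in the prompt templates.
LABEL_PATTERN = re.compile(r"\b(entailment|contradiction|neutral)\b", re.IGNORECASE)


def parse_label(raw: str):
    """Return the label if the output names exactly one distinct label, else None."""
    found = {match.lower() for match in LABEL_PATTERN.findall(raw)}
    # Outputs naming zero labels or several different labels are treated as invalid.
    return found.pop() if len(found) == 1 else None


print(parse_label(" Neutral.\n"))
print(parse_label("entailment or neutral"))
```

Counting the `None` cases separately from wrong answers keeps formatting failures from being confused with a genuine skew towards one class.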