arxiv: 2606.20255 · v2 · pith:6245XPQ6new · submitted 2026-06-18 · 💻 cs.CL · cs.AI

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse

Pith reviewed 2026-06-26 17:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Meaning Intelligence Framework · Nigerian public discourse · register classification · pragmatic intent · context failure · AI evaluation · sentiment analysis · code-mixed language

The pith

A nine-dimension schema raises AI register classification accuracy on Nigerian discourse from 33.3 percent to 73.3 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Meaning Intelligence Framework, a nine-dimension schema that scores register, surface sentiment, true intent, irony, coded subtext, risk tier, annotator confidence, speaker emotion, and recommended action to capture pragmatic force in Nigerian public discourse. It argues that AI failures on this material stem from missing context rather than translation issues, and that the same utterance can carry opposite meanings depending on speaker, audience, and situation. Tests on three frontier models show a 40-point jump in register accuracy once the schema is supplied in prompts, yet larger models do not outperform smaller ones and gain nothing from the schema. The work supplies a 30-item calibration set across Standard English, Nigerian English, Pidgin, and code-mixed registers.

Core claim

Zero-shot register classification accuracy sits at 33.3 percent and rises to 73.3 percent when models receive the MIF schema in context. Model capability and cultural competence are decoupled: GPT-5 and Gemini 2.5 Pro score lower than Gemini 2.5 Flash on the meaning intelligence score, and neither larger model benefits from schema-informed prompting.

What carries the argument

The Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema that separates surface sentiment from true communicative intent across register, irony, coded subtext, risk tier, and related dimensions.
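As a rough illustration of what a nine-dimension record could look like, the sketch below models one annotation as a Python dataclass. Only the nine dimension names come from the paper's abstract; the class name `MIFAnnotation`, the helper `dimension_names`, and the choice of plain strings for every field are assumptions made here for illustration, not the released MIF specification.

```python
from dataclasses import dataclass, fields


# Dimension names follow the abstract; all field types are placeholder
# assumptions and do not reflect the actual MIF label sets or scales.
@dataclass
class MIFAnnotation:
    register: str              # e.g. one of the four registers in the calibration set
    surface_sentiment: str     # polarity of the literal wording
    true_intent: str           # communicative intent in context
    irony: str
    coded_subtext: str
    risk_tier: str
    annotator_confidence: str
    speaker_emotion: str
    recommended_action: str


def dimension_names() -> list[str]:
    """Return the dimension names in declaration order."""
    return [f.name for f in fields(MIFAnnotation)]
```

Keeping `surface_sentiment` and `true_intent` as separate fields mirrors the separation the schema is built around; a real implementation would presumably replace the strings with the label sets defined in the annotation guidelines.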

If this is right

  • Schema-informed prompting can close much of the register gap for models operating on Nigerian discourse.
  • General model scale does not guarantee better handling of pragmatic or cultural intent in this domain.
  • The released framework, guidelines, and calibration set enable direct reproducibility checks and further annotation work.
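The two prompting conditions compared in the paper can be pictured with a small prompt builder. The wording of the prompts below is entirely hypothetical (the paper's actual prompts are not reproduced on this page), and `build_prompt`, `REGISTERS`, and `MIF_DIMENSIONS` are names invented for this sketch. Only the register labels and dimension names are taken from the abstract.

```python
REGISTERS = ["Standard English", "Nigerian English", "Nigerian Pidgin", "code-mixed"]
MIF_DIMENSIONS = [
    "register", "surface sentiment", "true intent", "irony", "coded subtext",
    "risk tier", "annotator confidence", "speaker emotion", "recommended action",
]


def build_prompt(utterance: str, schema_informed: bool = False) -> str:
    """Assemble a register-classification prompt (hypothetical wording)."""
    parts = []
    if schema_informed:
        # Schema-informed condition: the nine dimensions are listed in context.
        parts.append(
            "Score the utterance on these dimensions: "
            + ", ".join(MIF_DIMENSIONS) + "."
        )
    # Both conditions ask for the same register decision.
    parts.append("Classify the register as one of: " + ", ".join(REGISTERS) + ".")
    parts.append(f"Utterance: {utterance}")
    return "\n".join(parts)
```

Calling `build_prompt(text)` gives the zero-shot variant and `build_prompt(text, schema_informed=True)` the schema-informed one, so the only difference between conditions is the extra schema line.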

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Comparable dimension-based schemas could be tested on other context-sensitive discourse domains where literal and intended meanings diverge.
  • The observed decoupling suggests that targeted cultural or pragmatic fine-tuning may be more effective than scaling alone for register-sensitive tasks.

Load-bearing premise

The 30-item calibration dataset is representative enough of Nigerian public discourse to support the claims about the Register Gap and model decoupling.

What would settle it

A new test collection of at least 100 utterances drawn from Nigerian public discourse, scored by multiple annotators, would show whether the 40-point accuracy gain from in-context MIF prompting still appears when the models are re-evaluated.
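If such a collection were labelled by two annotators, their agreement on register labels could be summarised with Cohen's kappa. The sketch below is a minimal standard-library implementation for the two-annotator case only; the function name `cohens_kappa` is chosen here, and nothing in the paper says which agreement statistic the authors would use. More than two annotators would call for a different statistic such as Fleiss' kappa.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one register dominates the sample.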

Original abstract

We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for Nigerian languages, including NaijaSenti and AfriSenti, treat sentiment classification as a three-way polarity task. We argue that the dominant failure mode of AI systems on Nigerian discourse is not translation failure but context failure: the same utterance carries opposite pragmatic force depending on speaker, audience, and situation. The MIF operationalises this insight across nine scored dimensions: register, surface sentiment, true intent, irony, coded subtext, risk tier, annotator confidence, speaker emotion, and recommended communications action. We construct a 30-item calibration dataset spanning Standard English, Nigerian English, Nigerian Pidgin, and code-mixed registers, and evaluate three frontier language models (Gemini 2.5 Flash, GPT-5, and Gemini 2.5 Pro) under zero-shot and schema-informed prompting conditions. Two headline findings emerge. First, the Register Gap: zero-shot register classification accuracy is 33.3%, rising to 73.3% (+40 points) when the model receives the MIF schema in-context. Second, model capability and cultural competence are decoupled: GPT-5 (MIS 67.8) and Gemini 2.5 Pro (MIS 65.4) score lower than Flash (MIS 78.6), and neither benefits from schema-informed prompting. We release the framework specification, annotation guidelines, and calibration set to support reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces the Meaning Intelligence Framework (MIF), a nine-dimension schema for annotating Nigerian public discourse that distinguishes surface sentiment from pragmatic intent across registers including Standard English, Nigerian English, Pidgin, and code-mixed forms. It constructs a 30-item calibration dataset and evaluates three LLMs (Gemini 2.5 Flash, GPT-5, Gemini 2.5 Pro) in zero-shot versus schema-informed conditions. It reports a Register Gap, with register classification accuracy rising from 33.3% to 73.3% (+40 points) when MIF is provided in-context, and a decoupling result: Flash achieves a higher Meaning Intelligence Score (78.6) than GPT-5 (67.8) and Gemini 2.5 Pro (65.4), and neither of those two models benefits from the schema. The framework specification, guidelines, and calibration set are released.

Significance. If the central claims hold after validation, the work would identify a practically important failure mode in frontier LLMs—context and register sensitivity rather than translation per se—in Nigerian discourse, and supply a reusable nine-dimension schema plus open calibration materials that could support more targeted evaluation and fine-tuning in low-resource cultural NLP. The explicit release of the dataset and guidelines is a clear strength that enables external checks and extensions.

major comments (3)
  1. [Abstract] Abstract (Register Gap claim): the reported improvement from 10/30 to 22/30 correct register classifications is presented without binomial confidence intervals, stratification by register or language, or any description of how the ground-truth intent labels were established or validated.
  2. [Abstract] Abstract (decoupling claim): the finding that Gemini 2.5 Flash (MIS 78.6) outperforms GPT-5 and Gemini 2.5 Pro is computed on the identical 30-item set; any sampling instability or annotation noise therefore propagates directly to both headline results.
  3. [Abstract] Abstract (evaluation protocol): no inter-annotator agreement is reported for the 30-item set, and the manuscript supplies no sampling frame or external validation that would establish the set as representative of Nigerian public discourse.
minor comments (1)
  1. [Abstract] The abstract references NaijaSenti and AfriSenti as existing benchmarks but does not supply citations for them.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on the abstract and evaluation protocol. We address each point below and will revise the manuscript to incorporate additional statistical reporting, clarifications on annotation, and explicit discussion of limitations.

Point-by-point responses
  1. Referee: [Abstract] Abstract (Register Gap claim): the reported improvement from 10/30 to 22/30 correct register classifications is presented without binomial confidence intervals, stratification by register or language, or any description of how the ground-truth intent labels were established or validated.

    Authors: We will add binomial confidence intervals for the 33.3% and 73.3% figures using the Clopper-Pearson method. The ground-truth labels were established by the lead author (a native Nigerian English speaker with training in pragmatics) via iterative contextual analysis of speaker, audience, and situation for each of the 30 items; we will expand the methods section to describe this process explicitly. A supplementary table will provide stratification by register. These additions will be made in the revision.
    revision: yes

  2. Referee: [Abstract] Abstract (decoupling claim): the finding that Gemini 2.5 Flash (MIS 78.6) outperforms GPT-5 and Gemini 2.5 Pro is computed on the identical 30-item set; any sampling instability or annotation noise therefore propagates directly to both headline results.

    Authors: We agree that both headline results derive from the same 30-item calibration set and therefore share any sampling or annotation effects. We will revise the abstract, results, and discussion sections to state this limitation explicitly and to frame the decoupling observation as preliminary, recommending independent validation sets in future work.
    revision: yes

  3. Referee: [Abstract] Abstract (evaluation protocol): no inter-annotator agreement is reported for the 30-item set, and the manuscript supplies no sampling frame or external validation that would establish the set as representative of Nigerian public discourse.

    Authors: The 30-item collection is explicitly positioned as a calibration set constructed purposively to cover the four registers rather than as a statistically representative sample; no formal sampling frame was used. Annotation was performed by a single expert annotator, so inter-annotator agreement metrics do not apply. We will add a dedicated limitations paragraph clarifying these points and noting the absence of external validation beyond author expertise.
    revision: partial
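The first response proposes Clopper-Pearson intervals for the 10/30 and 22/30 counts. A minimal sketch of that calculation follows, using only the standard library and finding each bound by bisection on the exact binomial tail. The function names are chosen here for illustration, and the 95% level is an assumption, since the rebuttal does not state one.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo: float, hi: float, iters: int = 100) -> float:
    """Root of a decreasing function that is positive at lo, negative at hi."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for a binomial proportion k/n."""
    # Lower bound: the p at which P(X >= k | p) equals alpha/2 (0 if k == 0).
    lower = 0.0 if k == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)), 0.0, 1.0
    )
    # Upper bound: the p at which P(X <= k | p) equals alpha/2 (1 if k == n).
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2, 0.0, 1.0
    )
    return lower, upper


if __name__ == "__main__":
    for correct in (10, 22):  # zero-shot and schema-informed counts out of 30
        lo, hi = clopper_pearson(correct, 30)
        print(f"{correct}/30: [{lo:.3f}, {hi:.3f}]")
```

With only 30 items, both intervals are wide, which is the substance of the referee's first comment. Note also that the two conditions are scored on the same items, so a paired analysis would be needed to compare them formally.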

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper constructs a new nine-dimension MIF schema and a 30-item calibration set, then reports direct accuracy counts (10/30 zero-shot, 22/30 schema-informed) on register classification. These are empirical measurements on the authors' own labeled items rather than any fitted parameter renamed as a prediction, self-definitional loop, or load-bearing self-citation. No equations, uniqueness theorems, or ansatzes are invoked that reduce the headline Register Gap or model-decoupling claims to the inputs by construction. The evaluation is self-contained against the stated benchmark; external validity concerns exist but do not constitute circularity under the specified criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that context failure is the dominant error mode and that a 30-item set can demonstrate general improvement; no free parameters or invented physical entities are introduced.

axioms (1)
  • [domain assumption] The dominant failure mode of AI systems on Nigerian discourse is context failure rather than translation failure.
    Explicitly stated in the abstract as the motivation for separating surface sentiment from true intent.
invented entities (1)
  • Meaning Intelligence Framework (MIF) · no independent evidence
    purpose: Operationalises nine scored dimensions to capture pragmatic force in Nigerian discourse.
    Newly defined schema introduced in the paper; no independent evidence outside this work is provided in the abstract.

Reference graph

Works this paper leans on

11 extracted references · 9 canonical work pages

  1. Adelani, D. I., et al. (2021). MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9, 1116–1131.

  2. Adelani, D. I., et al. (2023). AfroBench: How good are large language models on African languages? arXiv preprint arXiv:2311.07978.

  3. Iskandardinata, M., Christian, W., & Suhartono, D. (2025). Context-aware pragmatic metacognitive prompting for sarcasm detection. arXiv preprint arXiv:2511.21066.

  4. Lee, J., et al. (2024). Pragmatic metacognitive prompting improves LLM performance on sarcasm detection. arXiv preprint arXiv:2412.04509.

  5. Muhammad, S. H., et al. (2022). NaijaSenti: A Nigerian Twitter sentiment corpus for multilingual sentiment analysis. arXiv preprint arXiv:2201.08277.

  6. Muhammad, S. H., et al. (2023). AfriSenti: A Twitter sentiment analysis benchmark for African languages. arXiv preprint arXiv:2302.08956.

  7. Ochieng, M., et al. (2025). Reasoning beyond labels: Measuring LLM sentiment in low-resource, culturally nuanced contexts. arXiv preprint arXiv:2508.04199.

  8. Oyewusi, W., et al. (2021). Semantic enrichment of Nigerian Pidgin English for contextual sentiment classification. In Proceedings of the AfricaNLP Workshop.

  9. Saeed, M., Bourgonje, P., & Demberg, V. (2024). Implicit discourse relation classification for Nigerian Pidgin. arXiv preprint arXiv:2406.18776.

  10. Shode, I., et al. (2023). NollySenti: Leveraging transfer learning and machine translation for Nigerian movie sentiment classification. arXiv preprint arXiv:2305.10971.

  11. Yu, H., Alabi, J. O., et al. (2025). INJONGO: A multicultural intent detection and slot-filling dataset for 16 African languages. arXiv preprint arXiv:2502.09814.

Supplementary materials: The MIF Master Specification v2.0, Annotation Guidelines v1.0, and the 30-item public calibration set (with gold labels) are available as companion documents. The privat...