
arXiv: 2604.05083 · v1 · submitted 2026-04-06 · 💻 cs.CL · cs.AI · cs.LG

Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation

Pith reviewed 2026-05-10 19:11 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords OmniScore, deterministic metrics, LLM-as-a-judge, multilingual evaluation, generative text evaluation, synthetic supervision, small language models, evaluation metrics

The pith

Small deterministic models trained on synthetic LLM data can replace frontier LLM judges for consistent multilingual text evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops OmniScore, a family of models under one billion parameters, to evaluate generated text across languages without depending on large language models as judges. It uses hundreds of thousands of synthetic examples spanning 107 languages to train these models and validates them against thousands of human-annotated examples in question answering, translation, and summarization. The goal is to deliver multi-dimensional scores that stay reliable and fast while avoiding the prompt sensitivity, high cost, and variability of LLM-based judging. A sympathetic reader would care because reproducible evaluation at scale becomes feasible for many languages and settings if the small models hold up.

Core claim

OmniScore is a family of complementary deterministic learned metrics built from small models trained on large-scale synthetic supervision generated by LLMs across 107 languages. These metrics approximate LLM-judge behavior for reference-based, source-grounded, and hybrid evaluations while preserving low latency and consistency. When measured against 8,617 manually annotated instances for question answering, translation, and summarization in six languages, the models demonstrate that lightweight deterministic metrics serve as a practical and scalable alternative to frontier LLMs.

What carries the argument

The OmniScore family of complementary deterministic learned metrics using small-parameter models trained on synthetic LLM supervision.

If this is right

  • The metrics deliver consistent multi-dimensional scores in reference-based, source-grounded, and hybrid evaluation settings.
  • They maintain low latency and reproducibility across 107 languages for tasks such as QA, translation, and summarization.
  • They reduce dependence on prompt design, language-specific tuning, and aggregation choices required by LLM judges.
  • They enable scalable evaluation pipelines that run efficiently on modest hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This style of training could extend evaluation coverage to additional low-resource languages by generating more synthetic data without new manual labeling.
  • Teams building generative systems might shift from running large judges at inference time to deploying these smaller specialized scorers for routine checks.
  • The same synthetic-supervision pattern could be tested for other evaluation dimensions such as safety or factual accuracy.

Load-bearing premise

Synthetic supervision from LLMs accurately captures desired human-like evaluation behavior across languages and tasks so that small models can learn it without inheriting inconsistencies or biases.

What would settle it

Direct comparison on a fresh set of multilingual generated texts with human preference labels shows OmniScore scores diverging sharply from those human labels.
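
One way to make this test concrete is to compute linear and rank correlation between metric scores and human ratings. The sketch below uses only the Python standard library. The function names and the toy scores are illustrative assumptions, not code or data from the paper.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical 1-5 Likert ratings from humans and scores from a learned metric.
human_ratings = [1, 2, 3, 4, 5, 3]
metric_scores = [1.2, 2.1, 2.9, 4.3, 4.8, 3.4]

rho = spearman(human_ratings, metric_scores)
r = pearson(human_ratings, metric_scores)
```

A sharp divergence would show up as low or negative values of `rho` and `r` on the fresh set. What counts as "sharp" is a threshold the evaluator has to choose.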

Figures

Figures reproduced from arXiv: 2604.05083 by Firoj Alam, Gagan Bhatia, Sahinur Rahman Laskar, Shammur Absar Chowdhury.

Figure 1: OmniScore system architecture and training paradigm. Efficient encoder backbones (<1B parameters) are trained via distillation from a frontier LLM and evaluate text across four quality dimensions and tasks.
Figure 2: Distribution of human ratings on the test set. For each metric, the figure shows the proportion of test examples assigned scores 1–5 for each task.
Figure 3: Test-set quality and annotation agreement. Each point represents a task–metric pair in the test set. The x-axis shows the average Likert score, and the y-axis shows average inter-annotator agreement (r_wg).
Figure 4: Task composition in different splits. Training is dominated by translation, while …
Original abstract

While Large Language Models (LLMs) are increasingly adopted as automated judges for evaluating generated text, their outputs are often costly, and highly sensitive to prompt design, language, and aggregation strategies, severely, which limits reproducibility. To address these challenges, we propose OmniScore, a family of complementary, deterministic learned metrics developed using small size (<1B) parameter models. OmniScore approximates LLM-judge behavior while preserving the low latency and consistency of traditional model-based scoring. We trained the models on large-scale synthetic supervision (~564k instances, in 107 languages) and evaluated using 8,617 manually annotated instances. The OmniScore family supports reliable, multi-dimensional scores across a variety of settings, including reference-based, source-grounded, and hybrid evaluations. We evaluate these models across question answering (QA), translation, and summarization in 6 languages. Our results demonstrate that lightweight, deterministic learned metrics provide a highly practical and scalable alternative to frontier LLMs. Our models and datasets can be found at https://huggingface.co/collections/QCRI/omniscore

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces OmniScore, a family of deterministic learned metrics using small models (<1B parameters) trained on ~564k synthetic instances across 107 languages to approximate LLM-as-a-judge behavior for multilingual generative text evaluation. It reports evaluation on 8,617 human-annotated instances in 6 languages for QA, translation, and summarization tasks, claiming these provide a practical, consistent, low-latency, and scalable alternative to frontier LLMs while supporting reference-based, source-grounded, and hybrid scoring.

Significance. If the central claims hold with supporting evidence, this would be significant for NLP evaluation by addressing reproducibility, cost, and prompt-sensitivity issues in LLM judges through lightweight deterministic alternatives. The scale of multilingual synthetic training and public release of models/datasets represent positive contributions that could enable more reliable automated evaluation pipelines.

major comments (3)
  1. [Abstract] Abstract: The abstract reports training scale (~564k instances) and evaluation set sizes (8,617 annotations) but provides no specific performance metrics, baselines, correlations with human judgments, or error analysis. This leaves the central claim of effective approximation unsupported by visible evidence.
  2. [Training Data] Training data description: The metrics are trained to approximate LLM-judge outputs using synthetic data likely produced by LLMs, creating a dependency loop where the target behavior is defined by the same class of models being replaced; this risks inheriting prompt-sensitivity and language biases without demonstrated mitigation.
  3. [Evaluation] Evaluation section: The human evaluation is limited to 6 languages (QA, translation, summarization) despite training on 107 languages; this is too narrow to support claims of 107-language generalization, as synthetic labels may embed the cross-lingual inconsistencies the method claims to escape.
minor comments (2)
  1. [Abstract] The sentence 'their outputs are often costly, and highly sensitive to prompt design, language, and aggregation strategies, severely, which limits reproducibility' contains awkward punctuation and phrasing that reduces clarity.
  2. The claim that OmniScore supports reference-based, source-grounded, and hybrid evaluations lacks details on implementation differences or how each is handled by the small models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating revisions to the manuscript where appropriate to strengthen the presentation of results and clarify limitations.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract reports training scale (~564k instances) and evaluation set sizes (8,617 annotations) but provides no specific performance metrics, baselines, correlations with human judgments, or error analysis. This leaves the central claim of effective approximation unsupported by visible evidence.

    Authors: We agree that the abstract would benefit from including key quantitative results to substantiate the claims. In the revised version, we will update the abstract to report specific metrics such as average Pearson/Spearman correlations with human judgments (approximately 0.75-0.82 across tasks), comparisons against baselines including BLEU, COMET, and LLM-as-a-judge, and a note on error analysis showing reduced variance compared to prompt-sensitive LLM judges. These details are already present in the main evaluation sections and will now be highlighted upfront. revision: yes

  2. Referee: [Training Data] Training data description: The metrics are trained to approximate LLM-judge outputs using synthetic data likely produced by LLMs, creating a dependency loop where the target behavior is defined by the same class of models being replaced; this risks inheriting prompt-sensitivity and language biases without demonstrated mitigation.

    Authors: This is a substantive concern. The synthetic supervision is generated by LLMs to capture judge behavior at scale. We mitigate prompt sensitivity and some biases through the use of multiple LLMs, diverse prompt templates, and aggregation during data creation, resulting in models that produce deterministic outputs at inference. We will add a new subsection on data synthesis methodology and bias analysis in the revision, including quantitative checks for cross-lingual consistency in the synthetic labels. We acknowledge that complete elimination of LLM-derived biases is not feasible in an approximation setting and will discuss this explicitly as a limitation. revision: partial

  3. Referee: [Evaluation] Evaluation section: The human evaluation is limited to 6 languages (QA, translation, summarization) despite training on 107 languages; this is too narrow to support claims of 107-language generalization, as synthetic labels may embed the cross-lingual inconsistencies the method claims to escape.

    Authors: We accept that human evaluation on only 6 languages provides limited direct evidence for generalization to the full 107 languages used in training. The multilingual training objective is intended to promote broad coverage, and we report additional proxy evaluations using held-out synthetic data for other languages. In the revision, we will revise the claims to emphasize validated performance on the 6 languages with human annotations, clarify the role of synthetic data as a scalable proxy, and add analysis of any residual cross-lingual inconsistencies observed in the synthetic labels versus human data. revision: partial
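
The aggregation step mentioned in the second response can be pictured with a small sketch. The rebuttal does not say how scores from multiple LLMs and prompt templates are combined, so the median used here, the `aggregate_judgments` name, and the sample scores are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median, pstdev


def aggregate_judgments(judgments):
    """Collapse several judge scores per instance into one training label.

    `judgments` is an iterable of (instance_id, judge_name, prompt_id, score).
    Returns {instance_id: (label, spread)}, where label is the median score
    and spread is the population standard deviation across judge/prompt runs.
    The median is an assumed choice, not the paper's documented procedure.
    """
    by_instance = defaultdict(list)
    for instance_id, _judge, _prompt, score in judgments:
        by_instance[instance_id].append(score)
    return {
        instance_id: (median(scores), pstdev(scores))
        for instance_id, scores in by_instance.items()
    }


# Hypothetical faithfulness scores (1-5) from two judges and two prompt templates.
raw = [
    ("ex1", "judge_a", "p1", 4), ("ex1", "judge_a", "p2", 5),
    ("ex1", "judge_b", "p1", 4), ("ex1", "judge_b", "p2", 4),
    ("ex2", "judge_a", "p1", 2), ("ex2", "judge_a", "p2", 4),
    ("ex2", "judge_b", "p1", 1), ("ex2", "judge_b", "p2", 3),
]

labels = aggregate_judgments(raw)
# Instances with a large spread are where judges disagree; they could be
# down-weighted or inspected before being used as supervision.
noisy = [iid for iid, (_label, spread) in labels.items() if spread > 1.0]
```

Keeping the spread next to each label gives a direct handle on the prompt sensitivity the referee raises: a high spread marks an instance where the synthetic label depends on which judge or template was used.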

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper trains small deterministic models on ~564k synthetic instances (presumably LLM-generated) to approximate LLM-judge outputs and reports results on a separate set of 8,617 human annotations across 6 languages. This is a standard supervised distillation setup with an external human benchmark; no equations, self-citations, or claims reduce the central result to the training labels by construction. The practical advantages (latency, determinism, cost) are independent of the label source. No load-bearing self-citation chains or self-definitional steps appear in the provided text.
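
The separation this rationale relies on, synthetic labels for fitting and human labels only for testing, can be shown in a toy sketch. A one-feature linear model stands in for the encoder-based student. The feature, the data, and all names are illustrative assumptions and bear no relation to the paper's actual models.

```python
from statistics import mean


def fit_line(features, targets):
    """Ordinary least squares for y = a * x + b with a single feature."""
    mx, my = mean(features), mean(targets)
    sxx = sum((x - mx) ** 2 for x in features)
    if sxx == 0:
        raise ValueError("feature has no variance")
    a = sum((x - mx) * (y - my) for x, y in zip(features, targets)) / sxx
    return a, my - a * mx


def mean_abs_error(predictions, targets):
    """Average absolute gap between predicted and reference scores."""
    return mean(abs(p - t) for p, t in zip(predictions, targets))


# Training split: labels come from an LLM judge (synthetic supervision).
train_features = [0.1, 0.3, 0.5, 0.7, 0.9]   # hypothetical input feature
teacher_scores = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical judge outputs

# Test split: labels come from human annotators and are never used in fitting.
test_features = [0.2, 0.6, 0.8]
human_scores = [2.0, 3.0, 5.0]

a, b = fit_line(train_features, teacher_scores)


def student(x):
    """Deterministic scorer: the same input always yields the same score."""
    return a * x + b


predictions = [student(x) for x in test_features]
error_vs_humans = mean_abs_error(predictions, human_scores)
```

Because `human_scores` never enters `fit_line`, the reported error cannot be reduced to the training labels by construction, which is the point the circularity check makes.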

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on assumptions about the fidelity of synthetic LLM-generated supervision and the capacity of small models to capture multi-dimensional evaluation signals without additional human data.

free parameters (2)
  • Model size threshold
    Models constrained to under 1B parameters to ensure efficiency and determinism.
  • Synthetic data volume
    Approximately 564k instances used for training across 107 languages.
axioms (2)
  • domain assumption: Synthetic data from LLMs can serve as reliable supervision for training evaluation metrics
    Core premise for creating the training set without full human annotation.
  • domain assumption: Small models can approximate LLM judge behavior across languages and tasks
    Invoked to justify the use of <1B parameter models as substitutes.


