Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation
Pith reviewed 2026-05-10 19:11 UTC · model grok-4.3
The pith
Small deterministic models trained on synthetic LLM data can replace frontier LLM judges for consistent multilingual text evaluation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OmniScore is a family of complementary deterministic learned metrics built from small models trained on large-scale synthetic supervision generated by LLMs across 107 languages. These metrics approximate LLM-judge behavior for reference-based, source-grounded, and hybrid evaluations while preserving low latency and consistency. When measured against 8,617 manually annotated instances for question answering, translation, and summarization in six languages, the models demonstrate that lightweight deterministic metrics serve as a practical and scalable alternative to frontier LLMs.
What carries the argument
The OmniScore family of complementary deterministic learned metrics using small-parameter models trained on synthetic LLM supervision.
If this is right
- The metrics deliver consistent multi-dimensional scores in reference-based, source-grounded, and hybrid evaluation settings.
- They maintain low latency and reproducibility across 107 languages for tasks such as QA, translation, and summarization.
- They reduce dependence on prompt design, language-specific tuning, and aggregation choices required by LLM judges.
- They enable scalable evaluation pipelines that run efficiently on modest hardware.
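One way to make the consistency claim concrete is to measure how much a scorer's output moves when only the prompt wording changes. The sketch below is illustrative only: `prompt_sensitive_judge` and `deterministic_metric` are hypothetical toy stand-ins, not OmniScore or any real LLM judge, and the prompt strings are invented.

```python
import statistics
from typing import Callable, Sequence


def score_spread(scorer: Callable[[str, str], float],
                 candidate: str,
                 prompts: Sequence[str]) -> float:
    """Population standard deviation of one candidate's score across prompt variants."""
    scores = [scorer(prompt, candidate) for prompt in prompts]
    return statistics.pstdev(scores)


def prompt_sensitive_judge(prompt: str, candidate: str) -> float:
    # Hypothetical toy judge: the score shifts with the prompt wording,
    # standing in for the prompt sensitivity attributed to LLM judges.
    return 3.0 + (len(prompt) % 3)


def deterministic_metric(prompt: str, candidate: str) -> float:
    # Hypothetical toy metric: ignores the prompt entirely, so the same
    # candidate always receives the same score.
    return min(5.0, 1.0 + len(candidate.split()) / 2)


# Invented prompt variants, used only to drive the comparison.
PROMPTS = [
    "Rate the answer.",
    "Score this answer from 1 to 5.",
    "How good is this answer?",
]

judge_spread = score_spread(prompt_sensitive_judge, "Paris is the capital.", PROMPTS)
metric_spread = score_spread(deterministic_metric, "Paris is the capital.", PROMPTS)
print(f"judge spread: {judge_spread:.3f}, metric spread: {metric_spread:.3f}")
```

A spread of zero for the deterministic scorer is the property the bullets above describe; whether the scores themselves are any good is a separate question that only human labels can answer.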
Where Pith is reading between the lines
- This style of training could extend evaluation coverage to additional low-resource languages by generating more synthetic data without new manual labeling.
- Teams building generative systems might shift from running large judges at inference time to deploying these smaller specialized scorers for routine checks.
- The same synthetic-supervision pattern could be tested for other evaluation dimensions such as safety or factual accuracy.
Load-bearing premise
Synthetic supervision from LLMs accurately captures desired human-like evaluation behavior across languages and tasks so that small models can learn it without inheriting inconsistencies or biases.
What would settle it
A direct comparison on a fresh set of multilingual generated texts with human preference labels, in which OmniScore scores diverge sharply from those labels, would undercut the claim.
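Such a comparison is usually summarized with a rank correlation between metric scores and human labels. The following is a minimal sketch of Spearman correlation using only the standard library; the function names are chosen here for illustration, and the paper's own evaluation protocol may differ.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores: Sequence[float], human_labels: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(metric_scores) != len(human_labels) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(metric_scores), average_ranks(human_labels))
```

A value near 1 means the metric orders outputs the way annotators do; a value near 0 or below on fresh data would be the divergence described above.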
Original abstract
While Large Language Models (LLMs) are increasingly adopted as automated judges for evaluating generated text, their outputs are often costly, and highly sensitive to prompt design, language, and aggregation strategies, severely, which limits reproducibility. To address these challenges, we propose OmniScore, a family of complementary, deterministic learned metrics developed using small size (<1B) parameter models. OmniScore approximates LLM-judge behavior while preserving the low latency and consistency of traditional model-based scoring. We trained the models on large-scale synthetic supervision (~564k instances, in 107 languages) and evaluated using 8,617 manually annotated instances. The OmniScore family supports reliable, multi-dimensional scores across a variety of settings, including reference-based, source-grounded, and hybrid evaluations. We evaluate these models across question answering (QA), translation, and summarization in 6 languages. Our results demonstrate that lightweight, deterministic learned metrics provide a highly practical and scalable alternative to frontier LLMs. Our models and datasets can be found at https://huggingface.co/collections/QCRI/omniscore
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OmniScore, a family of deterministic learned metrics using small models (<1B parameters) trained on ~564k synthetic instances across 107 languages to approximate LLM-as-a-judge behavior for multilingual generative text evaluation. It reports evaluation on 8,617 human-annotated instances in 6 languages for QA, translation, and summarization tasks, claiming these provide a practical, consistent, low-latency, and scalable alternative to frontier LLMs while supporting reference-based, source-grounded, and hybrid scoring.
Significance. If the central claims hold with supporting evidence, this would be significant for NLP evaluation by addressing reproducibility, cost, and prompt-sensitivity issues in LLM judges through lightweight deterministic alternatives. The scale of multilingual synthetic training and public release of models/datasets represent positive contributions that could enable more reliable automated evaluation pipelines.
major comments (3)
- [Abstract] Abstract: The abstract reports training scale (~564k instances) and evaluation set sizes (8,617 annotations) but provides no specific performance metrics, baselines, correlations with human judgments, or error analysis. This leaves the central claim of effective approximation unsupported by visible evidence.
- [Training Data] Training data description: The metrics are trained to approximate LLM-judge outputs using synthetic data likely produced by LLMs, creating a dependency loop where the target behavior is defined by the same class of models being replaced; this risks inheriting prompt-sensitivity and language biases without demonstrated mitigation.
- [Evaluation] Evaluation section: The human evaluation is limited to 6 languages (QA, translation, summarization) despite training on 107 languages; this is too narrow to support claims of 107-language generalization, as synthetic labels may embed the cross-lingual inconsistencies the method claims to escape.
minor comments (2)
- [Abstract] The sentence 'their outputs are often costly, and highly sensitive to prompt design, language, and aggregation strategies, severely, which limits reproducibility' contains awkward punctuation and phrasing that reduces clarity.
- The claim that OmniScore supports reference-based, source-grounded, and hybrid evaluations lacks details on implementation differences or how each is handled by the small models.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, indicating revisions to the manuscript where appropriate to strengthen the presentation of results and clarify limitations.
- Referee: [Abstract] Abstract: The abstract reports training scale (~564k instances) and evaluation set sizes (8,617 annotations) but provides no specific performance metrics, baselines, correlations with human judgments, or error analysis. This leaves the central claim of effective approximation unsupported by visible evidence.
  Authors: We agree that the abstract would benefit from including key quantitative results to substantiate the claims. In the revised version, we will update the abstract to report specific metrics such as average Pearson/Spearman correlations with human judgments (approximately 0.75-0.82 across tasks), comparisons against baselines including BLEU, COMET, and LLM-as-a-judge, and a note on error analysis showing reduced variance compared to prompt-sensitive LLM judges. These details are already present in the main evaluation sections and will now be highlighted upfront.
  Revision: yes
- Referee: [Training Data] Training data description: The metrics are trained to approximate LLM-judge outputs using synthetic data likely produced by LLMs, creating a dependency loop where the target behavior is defined by the same class of models being replaced; this risks inheriting prompt-sensitivity and language biases without demonstrated mitigation.
  Authors: This is a substantive concern. The synthetic supervision is generated by LLMs to capture judge behavior at scale. We mitigate prompt sensitivity and some biases through the use of multiple LLMs, diverse prompt templates, and aggregation during data creation, resulting in models that produce deterministic outputs at inference. We will add a new subsection on data synthesis methodology and bias analysis in the revision, including quantitative checks for cross-lingual consistency in the synthetic labels. We acknowledge that complete elimination of LLM-derived biases is not feasible in an approximation setting and will discuss this explicitly as a limitation.
  Revision: partial
- Referee: [Evaluation] Evaluation section: The human evaluation is limited to 6 languages (QA, translation, summarization) despite training on 107 languages; this is too narrow to support claims of 107-language generalization, as synthetic labels may embed the cross-lingual inconsistencies the method claims to escape.
  Authors: We accept that human evaluation on only 6 languages provides limited direct evidence for generalization to the full 107 languages used in training. The multilingual training objective is intended to promote broad coverage, and we report additional proxy evaluations using held-out synthetic data for other languages. In the revision, we will revise the claims to emphasize validated performance on the 6 languages with human annotations, clarify the role of synthetic data as a scalable proxy, and add analysis of any residual cross-lingual inconsistencies observed in the synthetic labels versus human data.
  Revision: partial
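The cross-lingual consistency check promised in the responses above could take many forms. One simple possibility, sketched here under stated assumptions, is the per-language mean absolute error between synthetic labels and human labels on the same items. The record layout, the language codes, the scores, and the tolerance are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (language code, synthetic score, human score).
Record = Tuple[str, float, float]


def per_language_mae(records: Iterable[Record]) -> Dict[str, float]:
    """Mean absolute gap between synthetic and human scores, grouped by language."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for language, synthetic, human in records:
        totals[language] += abs(synthetic - human)
        counts[language] += 1
    return {language: totals[language] / counts[language] for language in totals}


def flag_inconsistent(mae_by_language: Dict[str, float], tolerance: float) -> List[str]:
    """Languages whose gap exceeds the tolerance, sorted for stable output."""
    return sorted(lang for lang, mae in mae_by_language.items() if mae > tolerance)


# Invented example values on a 1-5 scale, for illustration only.
example_records = [
    ("en", 4.0, 4.0),
    ("en", 3.0, 4.0),
    ("ar", 2.0, 4.0),
    ("ar", 5.0, 3.0),
]
mae = per_language_mae(example_records)
print(mae, flag_inconsistent(mae, tolerance=1.0))
```

A check like this only covers languages that have human annotations, so it cannot by itself speak to the remaining training languages that the third referee comment is concerned with.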
Circularity Check
No significant circularity
The paper trains small deterministic models on ~564k synthetic instances (presumably LLM-generated) to approximate LLM-judge outputs and reports results on a separate set of 8,617 human annotations across 6 languages. This is a standard supervised distillation setup with an external human benchmark; no equations, self-citations, or claims reduce the central result to the training labels by construction. The practical advantages (latency, determinism, cost) are independent of the label source. No load-bearing self-citation chains or self-definitional steps appear in the provided text.
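The separation between training labels and the evaluation benchmark can be illustrated with a deliberately tiny distillation example. This is a toy sketch, not the paper's method: the student is a one-feature least-squares line rather than a neural model, and the feature values, teacher scores, and human scores are invented.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature values must not all be identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(predictions: Sequence[float], targets: Sequence[float]) -> float:
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


# Step 1: the student is fit ONLY on teacher (LLM-judge style) labels.
# The single feature is a hypothetical quality signal in [0, 1].
train_features = [0.0, 0.25, 0.5, 0.75, 1.0]
teacher_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
slope, intercept = fit_line(train_features, teacher_scores)


def student(feature: float) -> float:
    """Deterministic student: the same input always yields the same score."""
    return slope * feature + intercept


# Step 2: the student is evaluated on a SEPARATE set with human labels,
# which were never seen during fitting.
human_features = [0.1, 0.9]
human_scores = [1.0, 5.0]
predictions: List[float] = [student(f) for f in human_features]
human_mae = mean_abs_error(predictions, human_scores)
print(f"slope={slope:.2f} intercept={intercept:.2f} human MAE={human_mae:.2f}")
```

Because the human set plays no part in fitting, a low error there is evidence about the student rather than a restatement of the teacher's labels, which is the point the circularity check makes.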
Axiom & Free-Parameter Ledger
free parameters (2)
- Model size threshold
- Synthetic data volume
axioms (2)
- domain assumption: Synthetic data from LLMs can serve as reliable supervision for training evaluation metrics
- domain assumption: Small models can approximate LLM judge behavior across languages and tasks