Time to impeach LLM-as-a-judge: Programs are the future of evaluation
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation
OmniScore is a family of lightweight deterministic learned metrics that approximate LLM-judge behavior for reliable multilingual evaluation of generative text in tasks such as QA, translation, and summarization.
- Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations: rankings invert, no single model is universally best, and inter-judge agreement is low.