The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models, together with a reproducible LLM-judge evaluation pipeline.
Are large language models good evaluators for abstractive summarization?
4 Pith papers cite this work.
fields
cs.CL 4
representative citing papers
Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.
IFEval is a new benchmark of 25 verifiable instruction types and ~500 prompts for objective, reproducible evaluation of LLMs' instruction-following abilities.
A tree-of-thoughts inspired hybrid extractive-abstractive LLM prompt yields better legal case judgment summaries than standard extractive or abstractive prompts.
citing papers explorer
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.
- Instruction-Following Evaluation for Large Language Models
IFEval is a new benchmark of 25 verifiable instruction types and ~500 prompts for objective, reproducible evaluation of LLMs' instruction-following abilities.