LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models

Yen-Ting Lin, Yun-Nung Chen · 2023 · arXiv 2305.13711

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models

cs.CL · 2026-03-27 · unverdicted · novelty 7.0

PR-CAD unifies text-to-CAD generation and editing via progressive refinement with LLMs, a new interaction dataset, and RL-enhanced reasoning to achieve better controllability and faithfulness.

The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

cs.CL · 2024-06-06 · accept · novelty 7.0

This systematic survey organizes prompt engineering into a taxonomy of 58 LLM prompting techniques and 40 techniques for other modalities, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.

Poller: Are LLMs Suitable for Evaluating the Poetry Understanding Task?

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

Poller reduces LLM-human disagreement in evaluating Chinese poetry understanding by having LLMs role-play as authors, with reported error reductions of 94.55% and 89.53% on rhetorical techniques and defamiliarization.

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

cs.CL · 2026-05-25 · conditional · novelty 6.0

For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and Matthews correlation all equal a single number on non-degenerate data, Cohen's κ supplies the extra signal on label-rate drift, and a reporting checklist is provided.

Data Selection for Multi-turn Dialogue Instruction Tuning

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

MDS selects better multi-turn dialogues for instruction tuning by combining bin-wise global coverage with local entity-topic and format consistency scoring, outperforming prior selectors on benchmarks.

OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

cs.LG · 2025-11-13 · unverdicted · novelty 6.0

OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.

The Rise and Potential of Large Language Model Based Agents: A Survey

cs.AI · 2023-09-14 · accept · novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

citing papers explorer

PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models

cs.CL · 2026-03-27 · unverdicted · none · ref 8

Poller: Are LLMs Suitable for Evaluating the Poetry Understanding Task?

cs.CL · 2026-06-29 · unverdicted · none · ref 28

Data Selection for Multi-turn Dialogue Instruction Tuning

cs.CL · 2026-04-09 · unverdicted · none · ref 3

OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

cs.LG · 2025-11-13 · unverdicted · none · ref 31
