Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation

2025 · arXiv 2507.21028

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

Self-Refining Topology Optimization via an LLM-Based Multi-Agent Framework

cs.MA · 2026-05-22 · unverdicted · novelty 6.0

TopOptAgents deploys six LLM agents in self-refining loops to automate the full topology optimization workflow and succeeds on problem classes where single LLMs fail.

ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.

SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

SocialGrid benchmark shows even top LLMs achieve below 60% in embodied planning and task completion, with deception detection near random chance regardless of model scale.

citing papers explorer

Self-Refining Topology Optimization via an LLM-Based Multi-Agent Framework · cs.MA · 2026-05-22 · unverdicted · none · ref 22
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation · cs.LG · 2026-04-25 · unverdicted · none · ref 13
SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems · cs.AI · 2026-04-17 · unverdicted · none · ref 3
