Frabench and ufeval: Unified fine-grained evaluation with task and aspect generalization, 2025

URL: https://arxiv · 2025 · arXiv 2505.12795

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

cs.CV · 2026-02-27 · unverdicted · novelty 7.0

DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

cs.CL · 2026-05-22 · unverdicted · novelty 6.0

OpenSkillEval automatically builds realistic tasks from evolving artifacts to audit skill effectiveness in LLM agents, finding that skill use depends on model and framework and that many popular skills do not outperform base agents.

citing papers explorer

Showing 2 of 2 citing papers.

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model · cs.CV · 2026-02-27 · unverdicted · none · ref 9
OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents · cs.CL · 2026-05-22 · unverdicted · none · ref 9
