Frabench and ufeval: Unified fine-grained evaluation with task and aspect generalization, 2025
2 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
OpenSkillEval automatically builds realistic tasks from evolving artifacts to audit skill effectiveness in LLM agents, finding that skill use depends on model and framework and that many popular skills do not outperform base agents.
citing papers explorer
- DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model
  DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.
- OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
  OpenSkillEval automatically builds realistic tasks from evolving artifacts to audit skill effectiveness in LLM agents, finding that skill use depends on model and framework and that many popular skills do not outperform base agents.