Autobench-v: Can large vision-language models benchmark themselves? arXiv preprint arXiv:2410.21259

Han Bao, Yue Huang, Yanbo Wang, Jiayi Ye, Xiangqi Wang, Xiuying Chen, Yue Zhao, Tianyi Zhou, Mohamed Elhoseiny, Xiangliang Zhang · 2024 · arXiv 2410.21259

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models

cs.CL · 2026-04-14 · unverdicted · novelty 7.0

PolicyBench is the first large-scale US-China policy comprehension benchmark for LLMs with 21K cases, paired with PolicyMoE that performs best on application and structured reasoning tasks.

SkillGen: Verified Inference-Time Agent Skill Synthesis

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.

