A2RBench automates creation of verifiable abstract reasoning benchmarks via LLM task generation and cycle-consistency checks, revealing that top LLMs score 39.8% versus humans at 68.5% on representative tasks.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation