LLMEval-Logic is a solver-verified Chinese logical reasoning benchmark with 246 base and 190 hard items that shows frontier LLMs reach only 37.5% hard-item accuracy and 60.16% joint formalization score.
Biometrics , volume =
2 Pith papers cite this work. Polarity classification is still indexing.
years
2026 2representative citing papers
DisaBench supplies a participatory taxonomy of twelve disability harm types, paired benign-adversarial prompts across seven life domains, and human-annotated data showing that standard safety tests miss context-dependent harms.
citing papers explorer
-
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
LLMEval-Logic is a solver-verified Chinese logical reasoning benchmark with 246 base and 190 hard items that shows frontier LLMs reach only 37.5% hard-item accuracy and 60.16% joint formalization score.
-
DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models
DisaBench supplies a participatory taxonomy of twelve disability harm types, paired benign-adversarial prompts across seven life domains, and human-annotated data showing that standard safety tests miss context-dependent harms.