Refusal rate misranks LLMs on bio safety
RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts
Matched prompts show top risk discriminators often refuse fewer queries than high-refusal peers.
Software Engineering
Covers design tools, software metrics, testing and debugging, programming environments, etc. Roughly includes material in all of ACM Subject Classes D.2, except that D.2.4 (program verification) should probably have Logics in Computer Science as the primary subject area.
RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades
A new benchmark of 115 multi-file changes from real projects shows results dropping sharply compared with simpler bug-fix tasks.