Holmes is a probing benchmark compiling over 200 datasets from 270 studies to evaluate linguistic competence across syntax, morphology, semantics, reasoning, and discourse in more than 50 language models.
Kyle Mahowald, Anna A
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2024 2verdicts
UNVERDICTED 2representative citing papers
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
citing papers explorer
-
Holmes: A Benchmark to Assess the Linguistic Competence of Language Models
Holmes is a probing benchmark compiling over 200 datasets from 270 studies to evaluate linguistic competence across syntax, morphology, semantics, reasoning, and discourse in more than 50 language models.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.