Evaluating Gemini in an arena for learning
Artificial intelligence (AI) is poised to transform education, but the research community lacks a robust, general benchmark to evaluate AI models for learning. To assess state-of-the-art support for educational use cases, we ran an "arena for learning" where educators and pedagogy experts conduct blind, head-to-head, multi-turn comparisons of leading AI models. In particular, $N = 189$ educators drew from their experience to role-play realistic learning use cases, interacting with two models sequentially, after which $N = 206$ experts judged which model better supported the user's learning goals. The arena evaluated a slate of state-of-the-art models: Gemini 2.5 Pro, Claude 3.7 Sonnet, GPT-4o, and OpenAI o3. Excluding ties, experts preferred Gemini 2.5 Pro in 73.2% of these match-ups -- ranking it first overall in the arena. Gemini 2.5 Pro also demonstrated markedly higher performance across key principles of good pedagogy. Altogether, these results position Gemini 2.5 Pro as a leading model for learning.
Forward citations
Cited by 3 Pith papers
-
Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild
A PPO policy for deciding topic order and duration on a prerequisite knowledge graph, paired with an LLM for Socratic dialogue, improves student mastery rates and reduces turns compared to baselines and scaled models ...
-
Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
AI explanations in language learning often fail across six dimensions like diagnostic accuracy and self-regulation support, creating hidden risks that demand better evaluation frameworks such as L2-Bench.
-
Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
Introduces L2-Bench benchmark for AI feedback in language education across six dimensions and identifies explainability pitfalls in AI-generated explanations that appear helpful but are flawed.