Evaluating Gemini in an arena for learning

Aditya Srikanth Veerubhotla; Aliya Rysbek; Andrea Huber; Ankit Anand; Avishkar Bhoopchand; Brett Wiltshire; Daniel Gillick; Daniel Kasenberg; Eleni Sgouritsa; Gal Elidan

arXiv: 2505.24477 · v1 · pith:W7TVENUXnew · submitted 2025-05-30 · 💻 cs.CY · cs.AI · cs.LG


LearnLM Team, Google: Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Ankit Anand, Avishkar Bhoopchand, Brett Wiltshire, Daniel Gillick, Daniel Kasenberg, Eleni Sgouritsa, Gal Elidan, Hengrui Liu, Holger Winnemoeller, Irina Jurenka, James Cohan, Jennifer She, Julia Wilkowski, Kaiz Alarakyia, Kevin R. McKee, Komal Singh, Lisa Wang, Markus Kunesch, Miruna Pîslar, Niv Efron, Parsa Mahmoudieh, Pierre-Alexandre Kamienny, Sara Wiltberger, Shakir Mohamed, Shashank Agarwal, Shubham Milind Phal, Sun Jae Lee, Theofilos Strinopoulos, Wei-Jen Ko, Yael Gold-Zamir, Yael Haramaty, Yannis Assael



keywords: learning, gemini, arena, models, experts, cases, educators, leading


Abstract

Artificial intelligence (AI) is poised to transform education, but the research community lacks a robust, general benchmark to evaluate AI models for learning. To assess state-of-the-art support for educational use cases, we ran an "arena for learning" where educators and pedagogy experts conduct blind, head-to-head, multi-turn comparisons of leading AI models. In particular, $N = 189$ educators drew from their experience to role-play realistic learning use cases, interacting with two models sequentially, after which $N = 206$ experts judged which model better supported the user's learning goals. The arena evaluated a slate of state-of-the-art models: Gemini 2.5 Pro, Claude 3.7 Sonnet, GPT-4o, and OpenAI o3. Excluding ties, experts preferred Gemini 2.5 Pro in 73.2% of these match-ups -- ranking it first overall in the arena. Gemini 2.5 Pro also demonstrated markedly higher performance across key principles of good pedagogy. Altogether, these results position Gemini 2.5 Pro as a leading model for learning.
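The "excluding ties" preference rate quoted in the abstract can be illustrated with a minimal sketch. It assumes each expert judgment is recorded as a tuple of the two models compared and the preferred model, with `None` marking a tie. The function name, the record layout, and the placeholder model names are hypothetical; the toy judgments are made up for illustration and are not data from the paper.

```python
def win_rate_excluding_ties(judgments, model):
    """Share of decided (non-tied) match-ups involving `model` that it won.

    `judgments` is an iterable of (model_a, model_b, winner) tuples, where
    `winner` is one of the two model names or None for a tie. This record
    layout is an assumption made for illustration.
    """
    wins = 0
    decided = 0
    for model_a, model_b, winner in judgments:
        if model not in (model_a, model_b):
            continue  # match-up does not involve the model of interest
        if winner is None:
            continue  # ties are dropped from both numerator and denominator
        decided += 1
        if winner == model:
            wins += 1
    if decided == 0:
        raise ValueError(f"no decided match-ups involving {model!r}")
    return wins / decided


# Toy judgments with placeholder model names (not results from the paper).
toy_judgments = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_z", None),       # tie, excluded
    ("model_y", "model_x", "model_y"),
    ("model_z", "model_x", "model_x"),
    ("model_y", "model_z", "model_z"),
]

rate_x = win_rate_excluding_ties(toy_judgments, "model_x")
print(f"model_x preferred in {rate_x:.1%} of decided match-ups")
```

Dropping ties from the denominator means the rate answers "when experts expressed a preference, how often was it for this model?", so it can be higher than the share of all match-ups the model won.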


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild
cs.CL 2026-06 unverdicted novelty 7.0

A PPO policy for deciding topic order and duration on a prerequisite knowledge graph, paired with an LLM for Socratic dialogue, improves student mastery rates and reduces turns compared to baselines and scaled models ...
Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
cs.HC 2026-04 unverdicted novelty 4.0

AI explanations in language learning often fail across six dimensions like diagnostic accuracy and self-regulation support, creating hidden risks that demand better evaluation frameworks such as L2-Bench.
Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
cs.HC 2026-04 unverdicted novelty 4.0

Introduces L2-Bench benchmark for AI feedback in language education across six dimensions and identifies explainability pitfalls in AI-generated explanations that appear helpful but are flawed.