LearnLM: Improving Gemini for Learning

Aditya Srikanth Veerubhotla; Aliya Rysbek; Amy Wang; Andrea Huber; Ankit Anand; Avishkar Bhoopchand; Brett Wiltshire; Brian Veprek; Daniel Gillick; Daniel Kasenberg

arXiv: 2412.16429 · v3 · pith:IXCFBIQKnew · submitted 2024-12-21 · cs.CY · cs.AI · cs.LG

LearnLM Team, Google: Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Brett Wiltshire, Brian Veprek, Daniel Gillick, Daniel Kasenberg

Derek Ahmed, Irina Jurenka, James Cohan, Jennifer She, Julia Wilkowski, Kaiz Alarakyia, Kevin R. McKee, Lisa Wang, Markus Kunesch, Mike Schaekermann, Miruna Pîslar, Nikhil Joshi, Parsa Mahmoudieh, Paul Jhun, Sara Wiltberger, Shakir Mohamed, Shashank Agarwal, Shubham Milind Phal, Sun Jae Lee, Theofilos Strinopoulos, Wei-Jen Ko, Amy Wang, Ankit Anand, Avishkar Bhoopchand, Dan Wild, Divya Pandya, Filip Bar, Garth Graham, Holger Winnemoeller, Mahvish Nagda, Prateek Kolhar, Renee Schneider, Shaojian Zhu, Stephanie Chan, Steve Yadlowsky, Viknesh Sounderajah, Yannis Assael

classification cs.CY cs.AI cs.LG

keywords learning, model, pedagogical, gemini, learnlm, behavior, desired, following

Today's generative AI systems are tuned to present information by default, rather than engage users in service of learning as a human tutor would. To address the wide range of potential education use cases for these systems, we reframe the challenge of injecting pedagogical behavior as one of pedagogical instruction following, where training and evaluation examples include system-level instructions describing the specific pedagogy attributes present or desired in subsequent model turns. This framing avoids committing our models to any particular definition of pedagogy, and instead allows teachers or developers to specify desired model behavior. It also clears a path to improving Gemini models for learning, by enabling the addition of our pedagogical data to post-training mixtures, alongside their rapidly expanding set of capabilities. Both represent important changes from our initial tech report. We show how training with pedagogical instruction following produces a LearnLM model (available on Google AI Studio) that experts substantially prefer across a diverse set of learning scenarios, with average preference strengths of +31% over GPT-4o, +11% over Claude 3.5 Sonnet, and +13% over the Gemini 1.5 Pro model on which LearnLM was based.
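
As a rough illustration of the framing described in the abstract, a single training or evaluation example might pair a system-level instruction with the conversation turns it governs. The sketch below is hypothetical: the class names, field names, bracketed serialization format, and sample instruction are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str  # assumed roles: "user" or "model"
    text: str


@dataclass
class PedagogyExample:
    # System-level instruction describing the pedagogy attributes
    # desired in the model turns that follow (hypothetical field).
    system_instruction: str
    turns: list[Turn] = field(default_factory=list)


def to_training_text(example: PedagogyExample) -> str:
    """Flatten an example into one string, system instruction first."""
    lines = [f"[system] {example.system_instruction}"]
    lines += [f"[{turn.role}] {turn.text}" for turn in example.turns]
    return "\n".join(lines)


# Illustrative example only; the wording is invented for this sketch.
example = PedagogyExample(
    system_instruction=(
        "Do not reveal the final answer. "
        "Guide the learner with one question at a time."
    ),
    turns=[
        Turn("user", "What is 3/4 + 1/8?"),
        Turn("model", "What common denominator could we use for 4 and 8?"),
    ],
)

print(to_training_text(example))
```

Under this assumed layout, a teacher or developer changes the desired tutoring behavior by editing only `system_instruction`, which mirrors the abstract's point that the model is not committed to a single definition of pedagogy.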

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Curiosity as Linguistic Intervention: Using LLM Tutoring Dialogues to Influence Exploratory Learning Behavior
cs.CL 2026-06 unverdicted novelty 7.0

Curiosity-oriented linguistic interventions in LLM tutoring dialogues increased exploratory learner behaviors up to 2.4x across 270 conversations spanning multiple models and domains.

Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild
cs.CL 2026-06 unverdicted novelty 7.0

A PPO policy for deciding topic order and duration on a prerequisite knowledge graph, paired with an LLM for Socratic dialogue, improves student mastery rates and reduces turns compared to baselines and scaled models ...

The Tutoring Effectiveness Index: Predicting LLM Math Tutor Quality from Four Conversation Signals
cs.CY 2026-05 unverdicted novelty 6.0

The Tutoring Effectiveness Index (TEI) uses four signals from LLM conversations to select math tutoring responses, raising student improvement rates from 59.0% to 81.9% at N=8 on a frozen DeepSeek-R1-8B model without ...

LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring
cs.CL 2026-05 unverdicted novelty 6.0

Training-free prompt optimization methods, including five new education-focused ones, surpass the strongest RL-trained baseline across five conditions on two OOD suites while showing distinct teaching behavior patterns.

Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows
cs.AI 2026-05 unverdicted novelty 6.0

EduAgentBench is a new source-grounded benchmark that evaluates tutor agents across pedagogical judgment, situated multi-turn tutoring, and Canvas-style workflow completion, finding frontier models capable of basic ju...

Behavior Latticing: Inferring User Motivations from Unstructured Interactions
cs.HC 2026-04 unverdicted novelty 6.0

Behavior latticing synthesizes connections across unstructured user interactions to generate insights into underlying motivations, yielding deeper and more accurate user understanding than task-only models.

Mitigating LLM biases toward spurious social contexts using direct preference optimization
cs.AI 2026-04 unverdicted novelty 6.0

Debiasing-DPO reduces bias to spurious social contexts by 84% and improves predictive accuracy by 52% on average for LLMs evaluating U.S. classroom transcripts.

Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident
cs.CL 2026-02 unverdicted novelty 6.0

LLMs simulating student think-alouds in multi-step chemistry tutoring produce overly coherent, verbose, and confident reasoning that overestimates learner success compared to 630 human utterances.

Reinforcement Learning for Special Education: Aligning LLM Tutors to Diverse Learners through Disability-Adaptive Training
cs.CY 2026-05 unverdicted novelty 5.0

Special-R1 combines two-dimensional adaptive prompts and a disability-conditioned Thinking Reward in RL training, lifting persona-aware Fit by 1.65 and SPED Helpfulness by 0.048 on a 690-dialogue test set while stayin...

Uncertainty-Aware Generation and Decision-Making Under Ambiguity
cs.CL 2026-06 unverdicted novelty 4.0

Uncertainty-aware algorithms based on Bayesian decision theory improve generation utility on tutoring and reviewing tasks while risk-averse methods can degrade performance under high ambiguity, with conformal predicti...

Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
cs.HC 2026-04 unverdicted novelty 4.0

Introduces the L2-Bench benchmark for AI feedback in language education across six dimensions and identifies explainability pitfalls in AI-generated explanations that appear helpful but are flawed.

Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
cs.HC 2026-04 unverdicted novelty 4.0

AI explanations in language learning often fail across six dimensions like diagnostic accuracy and self-regulation support, creating hidden risks that demand better evaluation frameworks such as L2-Bench.