LearnLM: Improving Gemini for Learning
Today's generative AI systems are tuned to present information by default, rather than engage users in service of learning as a human tutor would. To address the wide range of potential education use cases for these systems, we reframe the challenge of injecting pedagogical behavior as one of pedagogical instruction following, where training and evaluation examples include system-level instructions describing the specific pedagogy attributes present or desired in subsequent model turns. This framing avoids committing our models to any particular definition of pedagogy, and instead allows teachers or developers to specify desired model behavior. It also clears a path to improving Gemini models for learning (by enabling the addition of our pedagogical data to post-training mixtures) alongside their rapidly expanding set of capabilities. Both represent important changes from our initial tech report. We show how training with pedagogical instruction following produces a LearnLM model (available on Google AI Studio) that experts substantially prefer across a diverse set of learning scenarios, with average preference strengths of +31% over GPT-4o, +11% over Claude 3.5 Sonnet, and +13% over the Gemini 1.5 Pro model on which LearnLM was based.
Forward citations
Cited by 12 Pith papers
- Curiosity as Linguistic Intervention: Using LLM Tutoring Dialogues to Influence Exploratory Learning Behavior
  Curiosity-oriented linguistic interventions in LLM tutoring dialogues increased exploratory learner behaviors up to 2.4x across 270 conversations spanning multiple models and domains.
- Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild
  A PPO policy for deciding topic order and duration on a prerequisite knowledge graph, paired with an LLM for Socratic dialogue, improves student mastery rates and reduces turns compared to baselines and scaled models ...
- The Tutoring Effectiveness Index: Predicting LLM Math Tutor Quality from Four Conversation Signals
  The Tutoring Effectiveness Index (TEI) uses four signals from LLM conversations to select math tutoring responses, raising student improvement rates from 59.0% to 81.9% at N=8 on a frozen DeepSeek-R1-8B model without ...
- LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring
  Training-free prompt optimization methods, including five new education-focused ones, surpass the strongest RL-trained baseline across five conditions on two OOD suites while showing distinct teaching behavior patterns.
- Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows
  EduAgentBench is a new source-grounded benchmark that evaluates tutor agents across pedagogical judgment, situated multi-turn tutoring, and Canvas-style workflow completion, finding frontier models capable of basic ju...
- Behavior Latticing: Inferring User Motivations from Unstructured Interactions
  Behavior latticing synthesizes connections across unstructured user interactions to generate insights into underlying motivations, yielding deeper and more accurate user understanding than task-only models.
- Mitigating LLM biases toward spurious social contexts using direct preference optimization
  Debiasing-DPO reduces bias to spurious social contexts by 84% and improves predictive accuracy by 52% on average for LLMs evaluating U.S. classroom transcripts.
- Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident
  LLMs simulating student think-alouds in multi-step chemistry tutoring produce overly coherent, verbose, and confident reasoning that overestimates learner success compared to 630 human utterances.
- Reinforcement Learning for Special Education: Aligning LLM Tutors to Diverse Learners through Disability-Adaptive Training
  Special-R1 combines two-dimensional adaptive prompts and a disability-conditioned Thinking Reward in RL training, lifting persona-aware Fit by 1.65 and SPED Helpfulness by 0.048 on a 690-dialogue test set while stayin...
- Uncertainty-Aware Generation and Decision-Making Under Ambiguity
  Uncertainty-aware algorithms based on Bayesian decision theory improve generation utility on tutoring and reviewing tasks while risk-averse methods can degrade performance under high ambiguity, with conformal predicti...
- Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
  Introduces L2-Bench benchmark for AI feedback in language education across six dimensions and identifies explainability pitfalls in AI-generated explanations that appear helpful but are flawed.
- Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
  AI explanations in language learning often fail across six dimensions like diagnostic accuracy and self-regulation support, creating hidden risks that demand better evaluation frameworks such as L2-Bench.