Formalizing Learning from Language Feedback with Provable Guarantees

Adith Swaminathan; Aditya Modi; Allen Nie; Ching-An Cheng; Ruijie Zheng; Wanqiao Xu

arXiv: 2506.10341 · v2 · pith:KAFUXR4Anew · submitted 2025-06-12 · cs.LG · cs.CL

Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng

keywords: learning, language, feedback, despite, dimension, eluder, empirical, formalize

Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. Despite impressive empirical demonstrations, so far a principled framing of these decision problems remains lacking. We formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce $\textit{transfer eluder dimension}$ as a measure to characterize the hardness of LLF. We formalize the intuition that information in the language feedback governs the learning complexity, and demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension. Across several empirical domains, we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark an important step towards designing principled interactive learning algorithms using generic language feedback.
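
For readers unfamiliar with the term, the "no-regret" property mentioned in the abstract can be sketched with the standard definition of cumulative regret. The symbols below are assumptions chosen for illustration and are not taken from the paper: $r^\star$ stands for the latent reward function, $a^\star$ for an optimal action under $r^\star$, and $a_t$ for the action the learner takes in round $t$.

$$
\mathrm{Regret}(T) = \sum_{t=1}^{T} \bigl( r^\star(a^\star) - r^\star(a_t) \bigr),
\qquad
\lim_{T \to \infty} \frac{\mathrm{Regret}(T)}{T} = 0 .
$$

Under this reading, an algorithm is no-regret when its regret grows sublinearly in $T$, so its average per-round loss relative to the optimal action vanishes. The abstract states that the guarantee for $\texttt{HELiX}$ scales with the transfer eluder dimension; the exact form of that bound is not given on this page and is not reproduced here.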

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning from Language Feedback via Variational Policy Distillation
cs.LG 2026-05 unverdicted novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming f...