pith. sign in

Training language models with language feedback at scale

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

fields

cs.LG 4 cs.AI 1

years

2026 4 2023 1

verdicts

UNVERDICTED 5

clear filters

representative citing papers

Learning from Language Feedback via Variational Policy Distillation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.

Reinforcing Human Behavior Simulation via Verbal Feedback

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.

RL with Learnable Textual Feedback: A Bilevel Approach

cs.LG · 2026-05-23 · unverdicted · novelty 5.0

Bi-NAC frames RL with textual feedback as a Stackelberg bilevel program and reports that 2B and 6B models trained this way outperform larger GRPO baselines on MATH-500 and GPQA.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • Learning from Language Feedback via Variational Policy Distillation cs.LG · 2026-05-14 · unverdicted · none · ref 29

    VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.

  • Reinforcing Human Behavior Simulation via Verbal Feedback cs.LG · 2026-05-19 · unverdicted · none · ref 31

    DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.

  • ICRL: Learning to Internalize Self-Critique with Reinforcement Learning cs.AI · 2026-05-13 · unverdicted · none · ref 5

    ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.

  • RL with Learnable Textual Feedback: A Bilevel Approach cs.LG · 2026-05-23 · unverdicted · none · ref 15

    Bi-NAC frames RL with textual feedback as a Stackelberg bilevel program and reports that 2B and 6B models trained this way outperform larger GRPO baselines on MATH-500 and GPQA.