pith. sign in

Training language models with language feedback at scale

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

fields

cs.LG 3 cs.AI 1

years

2026 3 2023 1

verdicts

UNVERDICTED 4

representative citing papers

Learning from Language Feedback via Variational Policy Distillation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.

Reinforcing Human Behavior Simulation via Verbal Feedback

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.

citing papers explorer

Showing 4 of 4 citing papers.

  • Learning from Language Feedback via Variational Policy Distillation cs.LG · 2026-05-14 · unverdicted · none · ref 29

    VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.

  • Reinforcing Human Behavior Simulation via Verbal Feedback cs.LG · 2026-05-19 · unverdicted · none · ref 31

    DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.

  • ICRL: Learning to Internalize Self-Critique with Reinforcement Learning cs.AI · 2026-05-13 · unverdicted · none · ref 5

    ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.

  • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment cs.LG · 2023-04-13 · unverdicted · none · ref 132

    RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.