arXiv preprint arXiv:1909.12238 , year=

V-mpo: On-policy maximum a posteriori policy optimization for discrete, continuous control , author= · 1909 · arXiv 1909.12238

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.

Improving alignment of dialogue agents via targeted human judgements

cs.LG · 2022-09-28 · unverdicted · novelty 6.0

Sparrow uses targeted rule-based human feedback and evidence provision to outperform baselines in preference while violating rules only 8% of the time under adversarial probing.

citing papers explorer

Showing 3 of 3 citing papers.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution cs.CL · 2023-09-28 · unverdicted · none · ref 299
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex cs.LG · 2026-05-07 · unverdicted · none · ref 21 · 2 links
Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.
Improving alignment of dialogue agents via targeted human judgements cs.LG · 2022-09-28 · unverdicted · none · ref 6
Sparrow uses targeted rule-based human feedback and evidence provision to outperform baselines in preference while violating rules only 8% of the time under adversarial probing.

arXiv preprint arXiv:1909.12238 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer