We need to determine the product of these two integers
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
citing papers explorer
- MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
MaPPO incorporates prior reward knowledge into a Maximum a Posteriori objective for LLM preference optimization, generalizing DPO and variants while supporting offline and online settings.
- Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.