PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.
Prin- cipled data selection for alignment: The hidden risks of difficult examples.arXiv preprint arXiv:2502.09650
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4representative citing papers
DharmaOCR models reach 0.925 and 0.911 extraction scores with 0.40% and 0.20% degeneration rates on a new benchmark covering printed, handwritten, and legal documents, outperforming open-source and commercial baselines via SFT plus DPO.
ShaPO improves LLM safety robustness over standard preference optimization by enforcing worst-case objectives via selective geometry control at token and reward levels.
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.
citing papers explorer
-
The Hidden Bias of Process Reward Models:PRISM for Rewarding the Right Reasoning
PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.
-
DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines
DharmaOCR models reach 0.925 and 0.911 extraction scores with 0.40% and 0.20% degeneration rates on a new benchmark covering printed, handwritten, and legal documents, outperforming open-source and commercial baselines via SFT plus DPO.
-
Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control
ShaPO improves LLM safety robustness over standard preference optimization by enforcing worst-case objectives via selective geometry control at token and reward levels.
-
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.