DPA-GRPO trains a generator-verifier pair via group-relative policy optimization on paired counterfactual actions, improving structured output accuracy on TaxCalcBench over zero-shot and generator-only baselines.
Taxcalcbench: Evaluating frontier models on the tax calculation task.arXiv preprint arXiv:2507.16126
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Interactive Critique-Revision Training for Reliable Structured LLM Generation
DPA-GRPO trains a generator-verifier pair via group-relative policy optimization on paired counterfactual actions, improving structured output accuracy on TaxCalcBench over zero-shot and generator-only baselines.