From Reasoning to Code: GRPO Optimization for Underrepresented Languages

Andrea Gurioli; Bianca Raimondi; Federico Pennino; Massimo Rondelli; Maurizio Gabbrielli

arxiv: 2506.11027 · v3 · pith:4E3Y5CIOnew · submitted 2025-05-20 · cs.LG · cs.AI · cs.PL

keywords: languages, code, reasoning, underrepresented, approach, feedback, grpo, optimization

Abstract

Generating accurate and executable code using Large Language Models (LLMs) remains a significant challenge for underrepresented programming languages, such as Prolog and Lisp, due to the scarcity of public training data compared to high-resource languages like Python. This paper introduces a generalizable Reinforcement Learning (RL) approach that combines small-scale versions of the Qwen2.5-Coder model with Group Relative Policy Optimization (GRPO) to enable effective code generation through reasoning. To address the limitations of sparse datasets, we integrate execution-driven feedback directly into the RL loop, utilizing a reward system that exploits both logical correctness and structural formatting. Experimental results on the GSM8K dataset demonstrate significant improvements in reasoning quality and code accuracy across underrepresented languages. These findings underscore the potential of our approach to benefit a wide range of programming languages lacking extensive training resources by leveraging symbolic reasoning and interpreter-based feedback.
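The abstract names two ingredients: a reward that mixes structural formatting with execution-based correctness, and GRPO's group-relative scoring of sampled completions. The sketch below illustrates that general shape only and is not the authors' implementation. The `<reasoning>`/`<code>` tag layout, the weights `w_correct` and `w_format`, and the `run_program` callback (a stand-in for a real interpreter call, for example to a Prolog system) are all hypothetical. The advantage formula is the standard GRPO normalization, (reward minus group mean) divided by group standard deviation; the population standard deviation used here is an assumption, and some implementations use the sample version.

```python
import re
import statistics
from typing import Callable, List, Optional

# Hypothetical output layout: the abstract does not specify the actual tags.
FORMAT_PATTERN = re.compile(
    r"<reasoning>.*?</reasoning>\s*<code>(.*?)</code>", re.DOTALL
)


def reward(
    completion: str,
    expected: str,
    run_program: Callable[[str], Optional[str]],
    w_correct: float = 1.0,  # assumed weight, not taken from the paper
    w_format: float = 0.5,   # assumed weight, not taken from the paper
) -> float:
    """Score one completion on structural formatting plus execution result."""
    match = FORMAT_PATTERN.search(completion)
    if match is None:
        return 0.0  # malformed output: no format credit, nothing to execute
    score = w_format
    # run_program stands in for an interpreter; it returns None on failure.
    output = run_program(match.group(1).strip())
    if output is not None and output.strip() == expected.strip():
        score += w_correct
    return score


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the group sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In a training loop, each prompt would yield a group of sampled completions; their rewards would be passed through `group_relative_advantages`, so a completion is reinforced only to the extent that it beats its siblings, with no separate value network required.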

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

StaRPO: Stability-Augmented Reinforcement Policy Optimization
cs.AI · 2026-04 · unverdicted · novelty 5.0

StaRPO improves LLM reasoning by adding stability metrics based on the autocorrelation function and path efficiency to RL policy optimization, yielding higher accuracy and fewer logic errors on reasoning benchmarks.