Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

Balaji Padmanabhan; Kaushik Dutta; Sonal Prabhune

arxiv: 2512.12858 · v3 · submitted 2025-12-14 · 💻 cs.LG · cs.AI

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

Sonal Prabhune , Balaji Padmanabhan , Kaushik Dutta This is my paper

Pith reviewed 2026-05-16 22:02 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords LLM consistencyGroup Relative Policy Optimizationinformation consistencyreinforcement learningprompt variabilityrecommendation systemsenterprise AI

0 comments

The pith

Adapting Group Relative Policy Optimization to groups of equivalent prompts reduces variability in LLM recommendations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models frequently produce inconsistent recommendations when users rephrase the same request in slightly different ways, even when the underlying meaning stays identical. In enterprise settings such as investment advice or job recommendations, this variability erodes trust and creates compliance risks because users expect the same factual content regardless of wording or prior chat history. The paper shows that treating sets of semantically equivalent prompts as groups, resetting context between them, and applying Group Relative Policy Optimization with entropy-based rewards for both helpfulness and stability directly trains models to keep information content stable. Experiments on investment and job recommendation tasks confirm the fine-tuned models exhibit lower response variability than the untuned baseline. This approach reframes output differences not as useful diversity but as a flaw that reinforcement learning can correct.

Core claim

By adapting Group Relative Policy Optimization to treat collections of semantically equivalent prompts as groups, resetting conversational context to isolate phrasing effects, and using entropy-based rewards that balance helpfulness with stability, the resulting model produces recommendations whose information content remains consistent across prompt variants on investment and job recommendation tasks.

What carries the argument

Group Relative Policy Optimization applied to prompt-variant groups, with entropy-based helpfulness and stability rewards plus context resets to enforce information invariance.

If this is right

Enterprise systems can enforce invariant policy or onboarding information independent of user phrasing.
Variability becomes a tunable parameter rather than an inherent property of generative models.
GRPO extends from reasoning tasks to direct alignment for content stability in recommendation domains.
Compliance and user-experience metrics improve when models no longer change facts across equivalent inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same grouping and reward structure could apply to medical or legal summaries where factual invariance is required.
Automated semantic clustering of prompts would be needed to scale beyond manually defined groups.
Hybrid use with retrieval methods might add factuality on top of the achieved stability.

Load-bearing premise

Groups of semantically equivalent prompts can be reliably identified and context resets isolate phrasing effects without side effects on overall recommendation quality.

What would settle it

If the GRPO-tuned model still produces materially different information content on new, unseen groups of equivalent prompts, or if context resets measurably degrade helpfulness scores, the consistency claim would not hold.

read the original abstract

Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability when prompts are phrased with minor differences, even when semantically equivalent. Such inconsistency undermines trust, complicates compliance, and disrupts user experience. While personalization is desirable in certain contexts, many enterprise scenarios, such as HR onboarding, customer support, or policy disclosure, require invariant information delivery regardless of phrasing or prior conversational history. Existing approaches, including retrieval-augmented generation (RAG) and temperature tuning, improve factuality or reduce stochasticity, but cannot guarantee stability across equivalent prompts. In this paper, we propose a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) to directly optimize for consistency. Unlike prior applications of GRPO, which have been limited to reasoning and code generation, we adapt GRPO to enforce the stability of information content across groups of semantically equivalent prompts. We introduce entropy-based helpfulness and stability rewards, treating prompt variants as groups and resetting conversational context to isolate phrasing effects. Experiments on investment and job recommendation tasks show that our GRPO-fine-tuned model reduces variability compared to the baseline LLM model. To our knowledge, this is a novel application of GRPO for aligning LLMs toward information consistency, reframing variability not as an acceptable feature of generative diversity, but as a correctable flaw in enterprise deployments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adapts GRPO with entropy rewards to reduce output variability across rephrased prompts in recommendation tasks, but supplies almost no experimental detail to back the claim.

read the letter

The main takeaway is that this work takes Group Relative Policy Optimization, previously applied to reasoning and code, and points it at a concrete enterprise issue: LLMs changing their recommendations when the same question is asked with slightly different wording. They add entropy-based stability rewards, treat prompt variants as groups, and reset context between them. On investment and job recommendation tasks the abstract claims lower variability than a plain baseline model. That framing is useful because it treats inconsistency as a fixable flaw rather than acceptable diversity when compliance or trust is on the line.

Referee Report

3 major / 2 minor

Summary. The paper proposes adapting Group Relative Policy Optimization (GRPO) to fine-tune LLMs for information consistency across semantically equivalent prompts. It introduces entropy-based helpfulness and stability rewards, treats prompt variants as groups, resets conversational context to isolate phrasing effects, and reports that the resulting model exhibits reduced variability relative to a baseline LLM on investment and job recommendation tasks.

Significance. If the central claim holds after proper validation, the work would provide a practical RL-based method for reducing undesirable output variability in enterprise LLM deployments where invariant information delivery is required. The reframing of variability as a correctable flaw rather than generative diversity is a useful conceptual shift, and the extension of GRPO beyond reasoning/code tasks is a modest but clear novelty.

major comments (3)

[Abstract] Abstract: the headline claim that the GRPO-fine-tuned model 'reduces variability compared to the baseline LLM model' is stated without any quantitative metrics, error bars, baseline details, statistical tests, or effect sizes, so the magnitude and reliability of the reported improvement cannot be assessed.
[Method] Method section: the procedure for constructing and validating groups of semantically equivalent prompts is not described or ablated; because the stability reward is defined directly on these groups, the absence of this detail makes it impossible to determine whether the observed consistency gain is produced by the GRPO objective or by the prompt-construction process itself.
[Experiments] Experiments section: no ablation or control experiment isolates the contribution of the conversational-context reset from the GRPO update; without this, it is unclear whether the stability improvement arises from the claimed information-consistency mechanism or from side effects of the reset.

minor comments (2)

[Abstract] The abstract would be clearer if it explicitly defined the two entropy-based rewards and stated the precise form of the GRPO objective used.
[Method] Notation for the stability reward should be introduced once and used consistently; the current description leaves the weighting between helpfulness and stability rewards implicit.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where they strengthen the presentation of our results.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim that the GRPO-fine-tuned model 'reduces variability compared to the baseline LLM model' is stated without any quantitative metrics, error bars, baseline details, statistical tests, or effect sizes, so the magnitude and reliability of the reported improvement cannot be assessed.

Authors: We agree that the abstract would benefit from quantitative support. In the revised version we will add concise metrics (e.g., mean entropy reduction and standard deviation across prompt groups), a brief baseline comparison, and an effect-size summary while keeping the abstract within length limits; full tables with error bars and statistical tests will remain in the Experiments section. revision: yes
Referee: [Method] Method section: the procedure for constructing and validating groups of semantically equivalent prompts is not described or ablated; because the stability reward is defined directly on these groups, the absence of this detail makes it impossible to determine whether the observed consistency gain is produced by the GRPO objective or by the prompt-construction process itself.

Authors: We accept this point. The revised Method section will explicitly describe the prompt-group construction pipeline (paraphrasing model plus human validation) and the semantic-equivalence criteria used. We will also add an ablation that trains with randomly grouped prompts versus our validated groups, thereby isolating the contribution of the grouping procedure from the GRPO objective itself. revision: yes
Referee: [Experiments] Experiments section: no ablation or control experiment isolates the contribution of the conversational-context reset from the GRPO update; without this, it is unclear whether the stability improvement arises from the claimed information-consistency mechanism or from side effects of the reset.

Authors: The context reset is a deliberate design choice to isolate phrasing effects, yet we recognize the value of an explicit control. The revised Experiments section will include a new ablation that runs the identical GRPO training without the reset step; results will be reported side-by-side with the main setting to quantify any incremental benefit attributable to the reset versus the stability reward. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses standard RL on externally defined rewards

full rationale

The paper adapts GRPO with entropy-based helpfulness and stability rewards applied to prompt groups, but the optimization follows standard policy gradient updates without any fitted parameter being renamed as a prediction or any self-referential definition of the target consistency metric. The abstract and described method treat group identification and context reset as input procedures rather than derived outputs that loop back into the equations. No self-citation chain is load-bearing for the central claim, and the reported reduction in variability is presented as an empirical outcome of the RL objective rather than a mathematical identity. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Abstract-only view limits visibility into exact parameters; assumes standard RL axioms plus domain-specific prompt equivalence and context reset.

free parameters (1)

reward weighting between helpfulness and stability
Balance between the two entropy-based rewards must be chosen or tuned but is not specified.

axioms (2)

domain assumption Semantically equivalent prompts should produce identical core information content
Invoked to justify treating prompt variants as groups for stability optimization.
domain assumption Resetting conversational context isolates phrasing effects
Used to ensure measured variability stems only from prompt wording.

pith-pipeline@v0.9.0 · 5566 in / 1211 out tokens · 30290 ms · 2026-05-16T22:02:05.510083+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

[1]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Urvashi Khandelwal, Lintao Wolf, Devang Choudhary, Barlas Oguz, Sebastian Riedel, Luke Zettlemoyer, Veselin Stoyanov, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. InAdvances in Neural Information Processing Systems, 2020

work page 2020
[2]

The curious case of neural text degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019

work page 2019
[3]

Federal court allows collective action lawsuit over alleged age bias in ai hiring, May

Holland & Knight LLP. Federal court allows collective action lawsuit over alleged age bias in ai hiring, May

work page
[4]

Coverage ofMobley v. Workday

work page
[5]

Another employer faces ai hiring bias lawsuit, December 2024

Fisher Phillips LLP. Another employer faces ai hiring bias lawsuit, December 2024. Coverage ofHarper v. Sirius XM

work page 2024
[6]

Bc tribunal confirms companies remain liable for information provided by ai chatbots, February 2024

American Bar Association. Bc tribunal confirms companies remain liable for information provided by ai chatbots, February 2024. Business Law Today analysis

work page 2024
[7]

Raine et al

A. Raine et al. Raine v. openai: Wrongful death complaint, 2024. Ongoing U.S. litigation alleging chatbot-related harm

work page 2024
[8]

Sensitivity and robustness of large language models to prompt variations

Chunying Gan et al. Sensitivity and robustness of large language models to prompt variations. InPACLIC, 2023

work page 2023
[9]

What did i do wrong? quantifying llms' sensitivity and consistency to prompt engineering

Y . Sharma et al. What did i do wrong? quantifying llms’ sensitivity and consistency to prompt rephrasings.arXiv preprint arXiv:2406.12334, 2024

work page arXiv 2024
[10]

Liu et al

H. Liu et al. Aligning with logic: Measuring, evaluating and improving logical consistency of llms.arXiv preprint arXiv:2410.02205, 2024

work page arXiv 2024
[11]

Zhang et al

M. Zhang et al. The effect of sampling temperature on problem solving in large language models.arXiv preprint arXiv:2402.05201, 2024. 11 Sonal Prabhune et al

work page arXiv 2024
[12]

Does temperature 0 guarantee deterministic llm outputs?, 2025

Vincent Schmalbach. Does temperature 0 guarantee deterministic llm outputs?, 2025

work page 2025
[13]

Jiuding Sun, Chantal Shaib, and Byron C. Wallace. Evaluating the zero-shot robustness of instruction-tuned language models. InICLR, 2024

work page 2024
[14]

Improving the robustness of large language models via consistency alignment

Yukun Zhao et al. Improving the robustness of large language models via consistency alignment. InLREC- COLING, 2024

work page 2024
[15]

Wu et al

H. Wu et al. Harnessing response consistency for superior llm performance: The promise and peril of answer- augmented prompting.Electronics, 13(23):4581, 2024

work page 2024
[16]

Improving consistency in large language models through chain of guidance

Hardik Raj et al. Improving consistency in large language models through chain of guidance. InOpenReview, 2025

work page 2025
[17]

Improving consistency in retrieval-augmented systems with group similarity rewards.arXiv preprint arXiv:2510.04392, 2025

Faisal Hamman, Chenyang Zhu, Anoop Kumar, Xujun Peng, Sanghamitra Dutta, Daben Liu, and Alfy Samuel. Improving consistency in retrieval-augmented systems with group similarity rewards.arXiv preprint arXiv:2510.04392, 2025

work page arXiv 2025
[18]

The unequal opportunities of large language models: Examining demographic biases in job recommendations by chatgpt and llama

Abel Salinas, Parth Shah, Yuzhong Huang, Robert McCormack, and Fred Morstatter. The unequal opportunities of large language models: Examining demographic biases in job recommendations by chatgpt and llama. In Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, pages 1–15, 2023

work page 2023
[19]

Sumei Hu. The effect of artificial intelligence-assisted personalized learning on student learning outcomes: A meta-analysis based on 31 empirical research papers.Science Insights Education Frontiers, 24(1):3873–3894, 2024

work page 2024
[20]

The role of digital health technologies in women’s health, empowerment, and gender equality: Project report

World Health Organization Regional Office for Europe. The role of digital health technologies in women’s health, empowerment, and gender equality: Project report. Technical report, World Health Organization Europe, March

work page
[21]

WHO Europe technical document, 8 March 2024

work page 2024
[22]

Towards a standard for identifying and managing bias in artificial intelligence

Reva Schwartz, Apostol Vassilev, Kristen Greene, Lori Perine, Andrew Burt, and Patrick Hall. Towards a standard for identifying and managing bias in artificial intelligence. Technical Report 1270, National Institute of Standards and Technology, March 2022. NIST Special Publication 1270

work page 2022
[23]

Regulation (eu) 2024/1689 of the european parliament and of the council on artificial intelligence (ai act), August 2025

European Parliament and Council of the European Union. Regulation (eu) 2024/1689 of the european parliament and of the council on artificial intelligence (ai act), August 2025. Official Journal of the European Union, L 211, 12 August 2024

work page 2024
[24]

Recommendation on the ethics of artificial intelligence

United Nations Educational, Scientific and Cultural Organization. Recommendation on the ethics of artificial intelligence. Technical report, UNESCO, 2022. Adopted at the 41st Session of the UNESCO General Conference

work page 2022
[25]

Oecd principles on artificial intelligence

Organisation for Economic Co-operation and Development. Oecd principles on artificial intelligence. Technical report, OECD Council, May 2019. Adopted on 22 May 2019

work page 2019
[26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Da Guo et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

ReCode: Reinforcing Code Generation with Reasoning-Process Rewards

Lishui Fan, Yu Zhang, Mouxiang Chen, and Zhongxin Liu. Posterior-grpo: Rewarding reasoning processes in code generation.arXiv preprint arXiv:2508.05170, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Improving llm-generated code quality with grpo

Maxime Robeyns and Laurence Aitchison. Improving llm-generated code quality with grpo. InRLC Workshop on RL Beyond Rewards, 2025

work page 2025
[30]

Llama-3.2-1b-instruct

Meta AI. Llama-3.2-1b-instruct. https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct , 2024. Accessed: 2025-11-12

work page 2024
[31]

Llama-3.2-1b-instruct

Unsloth AI. Llama-3.2-1b-instruct. https://huggingface.co/unsloth/Llama-3.2-1B-Instruct , 2024. Optimized and fine-tuned by Unsloth for efficient training and inference. Accessed: 2025-11-12

work page 2024
[32]

Do llms have a gender (entropy) bias?arXiv preprint arXiv:2505.20343, 2025

Sonal Prabhune, Balaji Padmanabhan, and Kaushik Dutta. Do llms have a gender (entropy) bias?arXiv preprint arXiv:2505.20343, 2025

work page arXiv 2025
[33]

Unsloth, 2023

Michael Han Daniel Han and Unsloth team. Unsloth, 2023. 12 Information-Consistent Language Model Recommendations through Group Relative Policy Optimization A Reward Function Implementation Listing 1: Combined Reward Function for GRPO Training defcombined_reward ( prompts , c o m p l e t i o n s , a l p h a = 0 . 4 , b e t a = 0 . 6 , ** kwargs ) : h e l p...

work page 2023

[1] [1]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Urvashi Khandelwal, Lintao Wolf, Devang Choudhary, Barlas Oguz, Sebastian Riedel, Luke Zettlemoyer, Veselin Stoyanov, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. InAdvances in Neural Information Processing Systems, 2020

work page 2020

[2] [2]

The curious case of neural text degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019

work page 2019

[3] [3]

Federal court allows collective action lawsuit over alleged age bias in ai hiring, May

Holland & Knight LLP. Federal court allows collective action lawsuit over alleged age bias in ai hiring, May

work page

[4] [4]

Coverage ofMobley v. Workday

work page

[5] [5]

Another employer faces ai hiring bias lawsuit, December 2024

Fisher Phillips LLP. Another employer faces ai hiring bias lawsuit, December 2024. Coverage ofHarper v. Sirius XM

work page 2024

[6] [6]

Bc tribunal confirms companies remain liable for information provided by ai chatbots, February 2024

American Bar Association. Bc tribunal confirms companies remain liable for information provided by ai chatbots, February 2024. Business Law Today analysis

work page 2024

[7] [7]

Raine et al

A. Raine et al. Raine v. openai: Wrongful death complaint, 2024. Ongoing U.S. litigation alleging chatbot-related harm

work page 2024

[8] [8]

Sensitivity and robustness of large language models to prompt variations

Chunying Gan et al. Sensitivity and robustness of large language models to prompt variations. InPACLIC, 2023

work page 2023

[9] [9]

What did i do wrong? quantifying llms' sensitivity and consistency to prompt engineering

Y . Sharma et al. What did i do wrong? quantifying llms’ sensitivity and consistency to prompt rephrasings.arXiv preprint arXiv:2406.12334, 2024

work page arXiv 2024

[10] [10]

Liu et al

H. Liu et al. Aligning with logic: Measuring, evaluating and improving logical consistency of llms.arXiv preprint arXiv:2410.02205, 2024

work page arXiv 2024

[11] [11]

Zhang et al

M. Zhang et al. The effect of sampling temperature on problem solving in large language models.arXiv preprint arXiv:2402.05201, 2024. 11 Sonal Prabhune et al

work page arXiv 2024

[12] [12]

Does temperature 0 guarantee deterministic llm outputs?, 2025

Vincent Schmalbach. Does temperature 0 guarantee deterministic llm outputs?, 2025

work page 2025

[13] [13]

Jiuding Sun, Chantal Shaib, and Byron C. Wallace. Evaluating the zero-shot robustness of instruction-tuned language models. InICLR, 2024

work page 2024

[14] [14]

Improving the robustness of large language models via consistency alignment

Yukun Zhao et al. Improving the robustness of large language models via consistency alignment. InLREC- COLING, 2024

work page 2024

[15] [15]

Wu et al

H. Wu et al. Harnessing response consistency for superior llm performance: The promise and peril of answer- augmented prompting.Electronics, 13(23):4581, 2024

work page 2024

[16] [16]

Improving consistency in large language models through chain of guidance

Hardik Raj et al. Improving consistency in large language models through chain of guidance. InOpenReview, 2025

work page 2025

[17] [17]

Improving consistency in retrieval-augmented systems with group similarity rewards.arXiv preprint arXiv:2510.04392, 2025

Faisal Hamman, Chenyang Zhu, Anoop Kumar, Xujun Peng, Sanghamitra Dutta, Daben Liu, and Alfy Samuel. Improving consistency in retrieval-augmented systems with group similarity rewards.arXiv preprint arXiv:2510.04392, 2025

work page arXiv 2025

[18] [18]

The unequal opportunities of large language models: Examining demographic biases in job recommendations by chatgpt and llama

Abel Salinas, Parth Shah, Yuzhong Huang, Robert McCormack, and Fred Morstatter. The unequal opportunities of large language models: Examining demographic biases in job recommendations by chatgpt and llama. In Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, pages 1–15, 2023

work page 2023

[19] [19]

Sumei Hu. The effect of artificial intelligence-assisted personalized learning on student learning outcomes: A meta-analysis based on 31 empirical research papers.Science Insights Education Frontiers, 24(1):3873–3894, 2024

work page 2024

[20] [20]

The role of digital health technologies in women’s health, empowerment, and gender equality: Project report

World Health Organization Regional Office for Europe. The role of digital health technologies in women’s health, empowerment, and gender equality: Project report. Technical report, World Health Organization Europe, March

work page

[21] [21]

WHO Europe technical document, 8 March 2024

work page 2024

[22] [22]

Towards a standard for identifying and managing bias in artificial intelligence

Reva Schwartz, Apostol Vassilev, Kristen Greene, Lori Perine, Andrew Burt, and Patrick Hall. Towards a standard for identifying and managing bias in artificial intelligence. Technical Report 1270, National Institute of Standards and Technology, March 2022. NIST Special Publication 1270

work page 2022

[23] [23]

Regulation (eu) 2024/1689 of the european parliament and of the council on artificial intelligence (ai act), August 2025

European Parliament and Council of the European Union. Regulation (eu) 2024/1689 of the european parliament and of the council on artificial intelligence (ai act), August 2025. Official Journal of the European Union, L 211, 12 August 2024

work page 2024

[24] [24]

Recommendation on the ethics of artificial intelligence

United Nations Educational, Scientific and Cultural Organization. Recommendation on the ethics of artificial intelligence. Technical report, UNESCO, 2022. Adopted at the 41st Session of the UNESCO General Conference

work page 2022

[25] [25]

Oecd principles on artificial intelligence

Organisation for Economic Co-operation and Development. Oecd principles on artificial intelligence. Technical report, OECD Council, May 2019. Adopted on 22 May 2019

work page 2019

[26] [26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Da Guo et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

ReCode: Reinforcing Code Generation with Reasoning-Process Rewards

Lishui Fan, Yu Zhang, Mouxiang Chen, and Zhongxin Liu. Posterior-grpo: Rewarding reasoning processes in code generation.arXiv preprint arXiv:2508.05170, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

Improving llm-generated code quality with grpo

Maxime Robeyns and Laurence Aitchison. Improving llm-generated code quality with grpo. InRLC Workshop on RL Beyond Rewards, 2025

work page 2025

[30] [30]

Llama-3.2-1b-instruct

Meta AI. Llama-3.2-1b-instruct. https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct , 2024. Accessed: 2025-11-12

work page 2024

[31] [31]

Llama-3.2-1b-instruct

Unsloth AI. Llama-3.2-1b-instruct. https://huggingface.co/unsloth/Llama-3.2-1B-Instruct , 2024. Optimized and fine-tuned by Unsloth for efficient training and inference. Accessed: 2025-11-12

work page 2024

[32] [32]

Do llms have a gender (entropy) bias?arXiv preprint arXiv:2505.20343, 2025

Sonal Prabhune, Balaji Padmanabhan, and Kaushik Dutta. Do llms have a gender (entropy) bias?arXiv preprint arXiv:2505.20343, 2025

work page arXiv 2025

[33] [33]

Unsloth, 2023

Michael Han Daniel Han and Unsloth team. Unsloth, 2023. 12 Information-Consistent Language Model Recommendations through Group Relative Policy Optimization A Reward Function Implementation Listing 1: Combined Reward Function for GRPO Training defcombined_reward ( prompts , c o m p l e t i o n s , a l p h a = 0 . 4 , b e t a = 0 . 6 , ** kwargs ) : h e l p...

work page 2023