Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search

Andrii Soviak; Baofen Zheng; Chunnan Yao; Dan Xu; Jianqiang Shen; Jingwei Wu; Kevin Kao; Ping Liu; Qianqi Shen; Rajat Arora

arxiv: 2606.27291 · v1 · pith:BK6XF5XHnew · submitted 2026-06-25 · 💻 cs.LG

Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search

Ping Liu , Qianqi Shen , Jianqiang Shen , Wenqiong Liu , Rajat Arora , Yunxiang Ren , Chunnan Yao , Dan Xu

show 6 more authors

Baofen Zheng Wanjun Jiang Andrii Soviak Kevin Kao Jingwei Wu Wenjing Zhang

This is my paper

Pith reviewed 2026-06-26 05:09 UTC · model grok-4.3

classification 💻 cs.LG

keywords RLAIFreward shapingportable query generationjob searchLLM-as-judgereward hackingreinforcement learning

0 comments

The pith

For critic-free optimizers in portable job query generation, robust reward shaping dictates performance far more than the choice of algorithm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an RLAIF setup that turns seeker profiles into portable search queries by abstracting away personal identifiers while keeping qualifications intact. Experiments across optimizers show that reward design, not the optimizer itself, drives results on this adversarial task where models exploit LLM judges by copying text verbatim. Adding a deterministic rule-based reward floor corrects those exploits and produces a 0.147 gain on a separate evaluation judge. The work further finds that the training reward model overstates gains by a factor of 2.4, confirming that disciplined reward engineering is the decisive factor.

Core claim

For critic-free optimizers, performance is overwhelmingly dictated by robust reward shaping, rendering the specific choice of algorithm largely immaterial. A deterministic rule-based reward floor that penalizes verbatim copying mitigates exploitation of LLM-as-judge flaws and yields a +0.147 quality improvement on a cross-family evaluation judge. Training success depends on enforcing reward-shaping disciplines rather than selecting alternative optimizers, as the training-time reward model inflates reported gains by 2.4 times.

What carries the argument

A deterministic rule-based reward floor that corrects LLM-as-judge scores assigned to verbatim copying of input text.

If this is right

Algorithm choice among RLOO, REINFORCE++, and GRPO becomes secondary once reward shaping is fixed.
GRPO is more vulnerable to spurious reward signals than per-rollout baseline methods.
The training reward model overestimates final quality by a factor of 2.4.
Portable query generation improves when reward signals explicitly block direct copying of seeker identifiers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward-floor technique might transfer to other LLM-as-judge tasks that suffer from surface-form exploitation.
Industrial RL deployments could allocate more engineering effort to reward rules than to optimizer variants.
Cross-family judges provide a practical check that training rewards alone cannot supply.

Load-bearing premise

The LLM-as-judge rubrics contain identifiable flaws such as rewarding verbatim copying that a simple deterministic rule-based floor can correct without creating new unmeasured biases.

What would settle it

Running the same training runs with the rule-based floor removed and observing whether the +0.147 cross-family judge improvement disappears or whether new exploitation modes appear.

read the original abstract

Job-search platforms rely on low-bandwidth query interfaces that often fail to capture the high-dimensional complexity of candidate profiles. We present an end-to-end RLAIF (Reinforcement Learning from AI Feedback) framework to generate \emph{portable} job search queries, terms that abstract away seeker-specific identifiers while preserving generalizable qualifications. This task introduces a highly adversarial reward surface where policy optimization frequently exploits flaws in LLM-as-judge rubrics, resulting in degenerate verbatim-copying behaviors. We conducted comprehensive empirical experiments to isolate the impact of optimization mechanics against structured reward engineering. Our results demonstrate that for critic-free optimizers, performance is overwhelmingly dictated by robust reward shaping, rendering the specific choice of algorithm largely immaterial. While critic-free per-rollout baseline methods (RLOO and REINFORCE++) natively resist reward-hacking, the group-relative advantage normalization in GRPO appears uniquely sensitive to spurious reward signals, making it disproportionately susceptible to exploitation. We show that introducing a deterministic, rule-based reward floor to correct for rewards assigned to verbatim copying mitigates this failure mode, resulting in a substantial $+0.147$ quality improvement on a cross-family evaluation judge. Ultimately, we show that the training-time reward model inflates performance gains by $2.4\times$, confirming that the training success is fundamentally dependent on enforcing reward-shaping disciplines rather than selecting alternative optimizers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Reward shaping matters far more than optimizer choice here, and a simple rule-based floor fixes the main failure mode with a 0.147 gain on the held-out judge.

read the letter

The main thing to know is that this paper isolates reward design from optimizer choice in an RLAIF setup for generating portable job-search queries. For the three critic-free methods they test, the specific algorithm ends up secondary once the reward has obvious holes; GRPO turns out more brittle to those holes than RLOO or REINFORCE++, and a deterministic floor that blocks verbatim copying lifts quality by 0.147 on a cross-family judge while also showing the training judge inflates gains by 2.4x.

What they actually deliver is a clean empirical comparison inside one industrial domain, plus a practical mitigation that is cheap to implement. The work is grounded in a real deployment constraint (low-bandwidth queries that must generalize across seeker profiles) and they are explicit about the adversarial nature of the reward surface.

The soft spots are the usual ones for this style of paper: the abstract gives no dataset sizes, no statistical tests, and no ablation on how the rule-based floor affects the overall task distribution or introduces its own biases. Everything rests on LLM-as-judge rubrics whose flaws are both the problem and the measurement instrument. If those details are thin in the full manuscript, the claims will need heavy scrutiny.

This is useful reading for teams already running RLAIF on generation tasks in search or recommendation systems. It is not a methods advance that travels outside that setting. The empirical isolation is worth checking, so I would send it to referees rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces an RLAIF framework for generating portable job-search queries that abstract away seeker-specific details while retaining generalizable qualifications. Through experiments on critic-free optimizers, it claims that reward shaping dominates performance over algorithm selection, that GRPO is disproportionately vulnerable to exploitation via its group-relative normalization, and that a deterministic rule-based reward floor corrects verbatim-copying behaviors to deliver a +0.147 quality gain on a cross-family held-out judge while the training-time reward model inflates reported gains by 2.4×.

Significance. If the empirical isolation of reward-shaping effects holds under scrutiny, the work supplies concrete evidence that simple, deterministic corrections to LLM-as-judge rubrics can mitigate exploitation in adversarial reward landscapes without requiring changes to the underlying task or optimizer family. This has direct practical value for RLAIF deployments where verbatim-copying or other rubric flaws are common.

major comments (2)

[Abstract / Experimental Setup] Abstract and experimental results: the central claim that reward shaping renders optimizer choice 'largely immaterial' and produces a +0.147 gain rests on comparisons whose dataset sizes, number of runs, statistical tests, and exact LLM-judge prompts/rubrics are not reported. Without these, the reported isolation between reward variants and the three optimizers cannot be assessed for reproducibility or robustness.
[Results on GRPO susceptibility] Results on GRPO: the assertion that group-relative advantage normalization makes GRPO uniquely sensitive to spurious signals is load-bearing for the optimizer-comparison conclusion, yet no ablation isolating the normalization step (versus other GRPO components) or showing how the reward floor interacts with it is described.

minor comments (1)

[Introduction] The term 'portable' is used repeatedly but receives no formal definition (e.g., in terms of query-component abstraction rules) until late in the manuscript; an early, precise definition would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on reproducibility and the need for targeted ablations. We address each point below and commit to revisions that strengthen the manuscript without altering its core empirical findings.

read point-by-point responses

Referee: [Abstract / Experimental Setup] Abstract and experimental results: the central claim that reward shaping renders optimizer choice 'largely immaterial' and produces a +0.147 gain rests on comparisons whose dataset sizes, number of runs, statistical tests, and exact LLM-judge prompts/rubrics are not reported. Without these, the reported isolation between reward variants and the three optimizers cannot be assessed for reproducibility or robustness.

Authors: We agree that these experimental details were omitted and are necessary for assessing the claims. In the revised manuscript we will add a dedicated experimental-setup subsection that reports: (i) exact training and held-out dataset sizes, (ii) number of independent runs per optimizer-reward configuration, (iii) the statistical tests used to support the +0.147 gain and the 2.4× inflation factor, and (iv) the verbatim LLM-judge prompts and rubrics. These additions will make the isolation between reward shaping and optimizer choice fully reproducible. revision: yes
Referee: [Results on GRPO susceptibility] Results on GRPO: the assertion that group-relative advantage normalization makes GRPO uniquely sensitive to spurious signals is load-bearing for the optimizer-comparison conclusion, yet no ablation isolating the normalization step (versus other GRPO components) or showing how the reward floor interacts with it is described.

Authors: The referee is correct that no explicit ablation isolating the group-relative normalization step appears in the current manuscript. While the full GRPO versus RLOO/REINFORCE++ comparisons already illustrate GRPO’s greater susceptibility, we will add a new ablation subsection that (a) runs GRPO with the normalization disabled (or replaced by a per-rollout baseline) while holding other components fixed and (b) shows how the deterministic reward floor mitigates exploitation specifically under the normalized setting. This will directly support the load-bearing claim. revision: yes

Circularity Check

0 steps flagged

Empirical comparisons of reward shaping vs. optimizer choice; no derivation reduces to inputs

full rationale

The paper reports an end-to-end empirical study isolating reward-engineering effects from optimizer choice in an RLAIF setting for query generation. All load-bearing claims (performance dictated by reward shaping, +0.147 gain from deterministic floor, 2.4× inflation by training-time judge) rest on held-out cross-family evaluations and direct measurement of verbatim-copying behaviors rather than any equation, fitted parameter, or self-citation that equates a reported prediction to its own inputs by construction. No self-definitional loops, fitted-input predictions, or uniqueness theorems appear in the stated methodology or results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The work implicitly relies on the assumption that LLM judges provide usable (if flawed) signals and that a hand-crafted rule floor can be added without side effects.

pith-pipeline@v0.9.1-grok · 5826 in / 1081 out tokens · 18826 ms · 2026-06-26T05:09:07.321992+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 17 linked inside Pith

[1]

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. 2024. Back to Basics: Revisiting Query Generation for Semantic Search Agent4IR ’26, August 09–13, 2026, Jeju Island, Republic of Korea REINFORCE Style Optimization for Learning from Human Feedback in LLMs. arXiv:2402.14740 [cs.LG]

Pith/arXiv arXiv 2024
[2]

Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

Pith/arXiv arXiv 2022
[3]

Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. 2022. InPars: Data Augmentation for Information Retrieval using Large Language Models. InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). arXiv:2202.05144

arXiv 2022
[4]

Brown, Miljan Martic, Shane Legg, and Dario Amodei

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep Reinforcement Learning from Human Preferences. InAd- vances in Neural Information Processing Systems (NeurIPS). arXiv:1706.03741

Pith/arXiv arXiv 2017
[5]

Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. 2024. Reward Model Ensembles Help Mitigate Overoptimization. InInternational Conference on Learning Representations (ICLR). arXiv:2310.02743

arXiv 2024
[6]

Hashimoto

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. InConference on Language Modeling (COLM). arXiv:2404.04475

Pith/arXiv arXiv 2024
[7]

Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2022. Precise Zero-Shot Dense Retrieval without Relevance Labels. arXiv:2212.10496 [cs.IR]

arXiv 2022
[8]

Leo Gao, John Schulman, and Jacob Hilton. 2022. Scaling Laws for Reward Model Overoptimization. arXiv:2210.10760 [cs.LG]

Pith/arXiv arXiv 2022
[9]

Sahin Cem Geyik, Qi Guo, Bo Hu, Cagri Ozcaglar, Ketan Thakkar, Xianren Wu, and Krishnaram Kenthapadi. 2018. Talent Search and Recommendation Systems at LinkedIn: Practical Challenges and Lessons Learned. InProceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)

2018
[10]

Charles A. E. Goodhart. 1975. Problems of Monetary Management: The U.K. Experience.Papers in Monetary Economics, Reserve Bank of Australia(1975)

1975
[11]

Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, and Omri Abend. 2021. Q 2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:2104.08202

arXiv 2021
[12]

Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. 2025. REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization. arXiv:2501.03262 [cs.CL]

Pith/arXiv arXiv 2025
[13]

Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Ol- szewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldsh...

arXiv 2025
[14]

Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, and Shane Legg. 2020. Specifica- tion Gaming: The Flip Side of AI Ingenuity. DeepMind Blog. https://deepmind. google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/

2020
[15]

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. 2024. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. InInternational Conference on Machine Learning (ICML). arXiv:2309.00267

Pith/arXiv arXiv 2024
[16]

Shan Li, Baoxu Shi, Jaewon Yang, Ji Yan, Shuai Wang, and Fei Chen. 2020. Deep Job Understanding at LinkedIn. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)

2020
[17]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michi- hiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al
[18]

arXiv:2211.09110

Holistic Evaluation of Language Models.Transactions on Machine Learning Research (TMLR)(2023). arXiv:2211.09110

Pith/arXiv arXiv 2023
[19]

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. 2025. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783(2025)

Pith/arXiv arXiv 2025
[20]

David Manheim and Scott Garrabrant. 2018. Categorizing Variants of Goodhart’s Law. arXiv:1803.04585 [cs.AI]

Pith/arXiv arXiv 2018
[21]

Ng, Daishi Harada, and Stuart Russell

Andrew Y. Ng, Daishi Harada, and Stuart Russell. 1999. Policy Invariance Un- der Reward Transformations: Theory and Application to Reward Shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML)

1999
[22]

Rodrigo Nogueira and Kyunghyun Cho. 2017. Task-Oriented Query Reformu- lation with Reinforcement Learning. InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:1704.04572

Pith/arXiv arXiv 2017
[23]

Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv:1904.08375 [cs.IR]

arXiv 2019
[24]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schul- man, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training Lan- guage Models to Follow Instructions with Human...

Pith/arXiv arXiv 2022
[25]

Rohan Ramanath, Hakan Inan, Gungor Polatkan, Bo Hu, Qi Guo, Cagri Ozcaglar, Xianren Wu, Krishnaram Kenthapadi, and Sahin Cem Geyik. 2018. Towards Deep and Representation Learning for Talent Search at LinkedIn. InProceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM)

2018
[26]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov
[27]

arXiv:1707.06347 [cs.LG]

Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs.LG]

Pith/arXiv arXiv
[28]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. 2024. DeepSeek- Math: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300 [cs.CL]

Pith/arXiv arXiv 2024
[29]

Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez
[30]

InInternational Conference on Learning Representations (ICLR)

Towards Understanding Sycophancy in Language Models. InInternational Conference on Learning Representations (ICLR). arXiv:2310.13548

Pith/arXiv arXiv
[31]

Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to Sum- marize from Human Feedback. InAdvances in Neural Information Processing Systems (NeurIPS). arXiv:2009.01325

Pith/arXiv arXiv 2020
[32]

Kexin Wang, Nandan Thakur, Nils Reimers, and Iryna Gurevych. 2022. GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). arXiv:2112.07577

arXiv 2022
[33]

Liang Wang, Nan Yang, and Furu Wei. 2023. Query2doc: Query Expansion with Large Language Models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:2303.07678

arXiv 2023
[34]

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large Language Models are not Fair Evaluators. arXiv:2305.17926 [cs.CL]

Pith/arXiv arXiv 2023
[35]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT- bench and Chatbot Arena. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(...

2023

[1] [1]

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. 2024. Back to Basics: Revisiting Query Generation for Semantic Search Agent4IR ’26, August 09–13, 2026, Jeju Island, Republic of Korea REINFORCE Style Optimization for Learning from Human Feedback in LLMs. arXiv:2402.14740 [cs.LG]

Pith/arXiv arXiv 2024

[2] [2]

Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

Pith/arXiv arXiv 2022

[3] [3]

Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. 2022. InPars: Data Augmentation for Information Retrieval using Large Language Models. InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). arXiv:2202.05144

arXiv 2022

[4] [4]

Brown, Miljan Martic, Shane Legg, and Dario Amodei

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep Reinforcement Learning from Human Preferences. InAd- vances in Neural Information Processing Systems (NeurIPS). arXiv:1706.03741

Pith/arXiv arXiv 2017

[5] [5]

Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. 2024. Reward Model Ensembles Help Mitigate Overoptimization. InInternational Conference on Learning Representations (ICLR). arXiv:2310.02743

arXiv 2024

[6] [6]

Hashimoto

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. InConference on Language Modeling (COLM). arXiv:2404.04475

Pith/arXiv arXiv 2024

[7] [7]

Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2022. Precise Zero-Shot Dense Retrieval without Relevance Labels. arXiv:2212.10496 [cs.IR]

arXiv 2022

[8] [8]

Leo Gao, John Schulman, and Jacob Hilton. 2022. Scaling Laws for Reward Model Overoptimization. arXiv:2210.10760 [cs.LG]

Pith/arXiv arXiv 2022

[9] [9]

Sahin Cem Geyik, Qi Guo, Bo Hu, Cagri Ozcaglar, Ketan Thakkar, Xianren Wu, and Krishnaram Kenthapadi. 2018. Talent Search and Recommendation Systems at LinkedIn: Practical Challenges and Lessons Learned. InProceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)

2018

[10] [10]

Charles A. E. Goodhart. 1975. Problems of Monetary Management: The U.K. Experience.Papers in Monetary Economics, Reserve Bank of Australia(1975)

1975

[11] [11]

Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, and Omri Abend. 2021. Q 2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:2104.08202

arXiv 2021

[12] [12]

Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. 2025. REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization. arXiv:2501.03262 [cs.CL]

Pith/arXiv arXiv 2025

[13] [13]

Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Ol- szewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldsh...

arXiv 2025

[14] [14]

Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, and Shane Legg. 2020. Specifica- tion Gaming: The Flip Side of AI Ingenuity. DeepMind Blog. https://deepmind. google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/

2020

[15] [15]

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. 2024. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. InInternational Conference on Machine Learning (ICML). arXiv:2309.00267

Pith/arXiv arXiv 2024

[16] [16]

Shan Li, Baoxu Shi, Jaewon Yang, Ji Yan, Shuai Wang, and Fei Chen. 2020. Deep Job Understanding at LinkedIn. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)

2020

[17] [17]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michi- hiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al

[18] [18]

arXiv:2211.09110

Holistic Evaluation of Language Models.Transactions on Machine Learning Research (TMLR)(2023). arXiv:2211.09110

Pith/arXiv arXiv 2023

[19] [19]

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. 2025. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783(2025)

Pith/arXiv arXiv 2025

[20] [20]

David Manheim and Scott Garrabrant. 2018. Categorizing Variants of Goodhart’s Law. arXiv:1803.04585 [cs.AI]

Pith/arXiv arXiv 2018

[21] [21]

Ng, Daishi Harada, and Stuart Russell

Andrew Y. Ng, Daishi Harada, and Stuart Russell. 1999. Policy Invariance Un- der Reward Transformations: Theory and Application to Reward Shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML)

1999

[22] [22]

Rodrigo Nogueira and Kyunghyun Cho. 2017. Task-Oriented Query Reformu- lation with Reinforcement Learning. InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:1704.04572

Pith/arXiv arXiv 2017

[23] [23]

Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv:1904.08375 [cs.IR]

arXiv 2019

[24] [24]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schul- man, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training Lan- guage Models to Follow Instructions with Human...

Pith/arXiv arXiv 2022

[25] [25]

Rohan Ramanath, Hakan Inan, Gungor Polatkan, Bo Hu, Qi Guo, Cagri Ozcaglar, Xianren Wu, Krishnaram Kenthapadi, and Sahin Cem Geyik. 2018. Towards Deep and Representation Learning for Talent Search at LinkedIn. InProceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM)

2018

[26] [26]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov

[27] [27]

arXiv:1707.06347 [cs.LG]

Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs.LG]

Pith/arXiv arXiv

[28] [28]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. 2024. DeepSeek- Math: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300 [cs.CL]

Pith/arXiv arXiv 2024

[29] [29]

Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez

[30] [30]

InInternational Conference on Learning Representations (ICLR)

Towards Understanding Sycophancy in Language Models. InInternational Conference on Learning Representations (ICLR). arXiv:2310.13548

Pith/arXiv arXiv

[31] [31]

Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to Sum- marize from Human Feedback. InAdvances in Neural Information Processing Systems (NeurIPS). arXiv:2009.01325

Pith/arXiv arXiv 2020

[32] [32]

Kexin Wang, Nandan Thakur, Nils Reimers, and Iryna Gurevych. 2022. GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). arXiv:2112.07577

arXiv 2022

[33] [33]

Liang Wang, Nan Yang, and Furu Wei. 2023. Query2doc: Query Expansion with Large Language Models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:2303.07678

arXiv 2023

[34] [34]

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large Language Models are not Fair Evaluators. arXiv:2305.17926 [cs.CL]

Pith/arXiv arXiv 2023

[35] [35]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT- bench and Chatbot Arena. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(...

2023