pith. machine review for the scientific record.

arxiv: 2604.27495 · v1 · submitted 2026-04-30 · 💻 cs.CL · cs.AI

Recognition: unknown

Debiasing Reward Models via Causally Motivated Inference-Time Intervention

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 09:08 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: reward models · debiasing · inference-time intervention · neuron suppression · LLM alignment · spurious features · preference annotation

The pith

Small reward models achieve alignment performance comparable to 70B models by suppressing fewer than 2% of neurons correlated with bias attributes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a targeted intervention on neurons whose activations correlate with predefined bias attributes, such as response length, can reduce reward model sensitivity to multiple spurious features at inference time. This method avoids the trade-offs that arise when prior approaches focus only on length correction. A sympathetic reader would care because it demonstrates that much smaller 2B and 7B reward models, after editing under 2 percent of their neurons, can annotate preferences effectively enough for large language models to reach alignment levels previously requiring state-of-the-art 70B models on AlpacaEval and MT-Bench. The work also locates the relevant bias signals mainly in early layers, clarifying where these models exploit unwanted cues.

Core claim

By first identifying neurons whose activations are strongly correlated with predefined bias attributes and then suppressing those neurons' activations, the method reduces reward model sensitivity to spurious features across diverse bias types without inducing performance trade-offs. When the edited small reward models (2B and 7B) are used for preference annotation, they enable large language models to reach alignment levels comparable to a state-of-the-art 70B reward model on AlpacaEval and MT-Bench.
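
A minimal sketch of that identification step, as far as the abstract describes it: rank-correlate each neuron's pooled activation with a bias attribute (here, response length) over a validation set, then keep the extreme tails by Spearman's ρ (Figures 4 and 5 reference the top and bottom 500). The pooling, the value of k, and the array interface below are illustrative assumptions, not the paper's code.

    import numpy as np
    from scipy.stats import spearmanr

    def find_bias_neurons(activations, bias_values, k=500):
        """activations: (n_examples, n_neurons) pooled activations of one layer.
        bias_values: (n_examples,) bias attribute per example, e.g. length.
        Returns indices of the k most negatively and k most positively
        rank-correlated neurons (the paper's "bias-specific neurons")."""
        rhos = np.array([
            spearmanr(activations[:, j], bias_values)[0]
            for j in range(activations.shape[1])
        ])
        order = np.argsort(rhos)  # ascending by Spearman's rho
        return np.concatenate([order[:k], order[-k:]]), rhos

    # Synthetic check: plant one length-tracking neuron and recover it.
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(1000, 512))
    lengths = rng.integers(10, 800, size=1000).astype(float)
    acts[:, 7] += 0.01 * lengths  # hypothetical length-correlated neuron
    idx, _ = find_bias_neurons(acts, lengths, k=5)
    assert 7 in idx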

What carries the argument

Neuron-level intervention that suppresses activations of neurons identified by correlation with predefined bias attributes.
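
The abstract names the operation ("suppresses these signals") but not its exact form. One plausible inference-time realization, assuming a PyTorch reward model, is a forward hook that overwrites the identified neurons' outputs with a fixed value such as zero or a validation-set mean; the layer path in the usage comment is hypothetical, not the paper's API.

    import torch

    def make_suppression_hook(neuron_idx, replacement=0.0):
        """Forward hook that suppresses bias-specific neurons at inference time.
        neuron_idx: 1-D LongTensor of neuron indices within the hooked layer.
        replacement: scalar, or a tensor broadcastable to the selected slice
        (e.g. per-neuron activation means computed on the validation set)."""
        def hook(module, inputs, output):
            out = output.clone()  # leave the module's own output untouched
            out[..., neuron_idx] = replacement
            return out  # a returned tensor replaces the layer's output
        return hook

    # Hypothetical usage on an early MLP layer (the paper locates bias
    # signals mainly in early layers); the attribute path is assumed:
    # handle = reward_model.layers[2].mlp.register_forward_hook(
    #     make_suppression_hook(torch.as_tensor(idx)))
    # debiased_reward = reward_model(input_ids)
    # handle.remove()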

If this is right

  • Reward models exhibit reduced sensitivity to multiple spurious features such as response length.
  • No core performance degradation occurs on standard reward-model benchmarks.
  • Preference annotations from the edited 2B and 7B models produce large-language-model alignment gains matching those from a 70B model.
  • Bias signals concentrate primarily in early layers of the reward models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The early-layer concentration implies that bias mitigation can be performed with interventions limited to the initial stages of processing.
  • The same correlation-based identification could be tested on bias attributes beyond those predefined in the study.
  • If the approach generalizes, it offers a route to make reward-model training more robust by regularizing only the identified neurons rather than the entire model.

Load-bearing premise

Neurons whose activations merely correlate with bias attributes are the causal drivers of those biases, and suppressing them will not impair the reward model's ability to judge genuine preference quality.

What would settle it

An experiment in which the neuron-suppression intervention is applied to a 7B reward model yet the resulting large-language-model alignment scores on AlpacaEval and MT-Bench remain clearly below those obtained with an unedited 70B reward model, or in which controlled tests still show strong preference for longer responses after intervention.
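
Making that criterion operational is straightforward; below is a hedged sketch of the controlled length test, where score stands in for any reward-model call and the pairs are placeholders for data in which the shorter response is verifiably better.

    def length_preference_rate(score, pairs):
        """pairs: list of (prompt, good_short, bad_long) triples in which the
        short response is known to be the better one. Returns the fraction of
        pairs where the reward model still ranks the long response higher."""
        wins_for_long = sum(score(p, bad) > score(p, good) for p, good, bad in pairs)
        return wins_for_long / len(pairs)

    # rate_before = length_preference_rate(raw_rm_score, pairs)
    # rate_after = length_preference_rate(debiased_rm_score, pairs)
    # The claim fails if rate_after stays near rate_before, or if the edited
    # 7B RM's AlpacaEval / MT-Bench results trail the unedited 70B RM.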

Figures

Figures reproduced from arXiv: 2604.27495 by Kazutoshi Shinoda, Kosuke Nishida, Kyosuke Nishida.

Figure 1: RESPONSE A has superior formatting (length, bold text, paragraphs, etc.), but it is not truthful. RESPONSE B is concise and truthful. Skilled human annotators would disfavor A due to the false answer. In contrast, the FsfairX (Dong et al., 2024) reward model prefers A due to its format; however, this issue is mitigated by applying our method.
Figure 2: Overview of our CIRM. (a) On the prepared validation set, we compute the activations of each neuron …
Figure 3: Causal graphs for reward models.
Figure 4: Histogram of neurons with the top and bottom 500 Spearman's ρ …
Figure 5: Histogram of neurons with the top and bottom 500 Spearman's ρ …
Figure 6: Distributions of neurons in GRM with top and …
original abstract

Reward models (RMs) play a central role in aligning large language models (LLMs) with human preferences. However, RMs are often sensitive to spurious features such as response length. Existing inference-time approaches for mitigating these biases typically focus exclusively on response length, resulting in performance trade-offs. In this paper, we propose causally motivated intervention for mitigating multiple types of biases in RMs at inference time. Our method first identifies neurons whose activations are strongly correlated with predefined bias attributes, and applies neuron-level intervention that suppresses these signals. We evaluate our method on RM benchmarks and observe reductions in sensitivity to spurious features across diverse bias types, without inducing performance trade-offs. Moreover, when used for preference annotation, small RMs (2B and 7B) with our method, which edits less than 2% of all the neurons in RMs, enable LLMs to improve alignment, achieving performance comparable to that of a state-of-the-art 70B RM on AlpacaEval and MT-Bench. Further analysis reveals that bias signals are primarily encoded by neurons in early layers, shedding light on the internal mechanisms of bias exploitation in RMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that reward models (RMs) for LLM alignment can be debiased at inference time via a causally motivated intervention: neurons whose activations correlate strongly with predefined bias attributes (e.g., response length) are identified and suppressed, editing <2% of neurons. This reduces sensitivity to multiple spurious features without performance trade-offs on RM benchmarks. When the debiased small RMs (2B/7B) are used for preference annotation, the resulting LLMs achieve alignment performance on AlpacaEval and MT-Bench comparable to a 70B RM. Additional analysis indicates bias signals concentrate in early layers.

Significance. If the central claims hold, the work is significant for offering an efficient, low-parameter intervention that makes small RMs competitive with much larger ones in downstream alignment, potentially lowering compute costs. The neuron-level analysis provides mechanistic insight into how RMs exploit spurious features. The absence of trade-offs and the scale of the reported gains (small RM + intervention ≈ 70B RM) would be a notable practical contribution if supported by rigorous causal validation and full experimental controls.

major comments (3)
  1. [Method] Method section (neuron identification procedure): the approach selects neurons solely by correlation of activations with predefined bias attributes and then applies suppression. No do-operator, counterfactual ablation, or causal graph is used to confirm these neurons lie on the causal path from bias attribute to reward output. This leaves open whether suppression removes the actual bias mechanism or merely correlated signals, which directly undermines the 'causally motivated' framing and the claim that core preference evaluation remains intact. (A minimal form of such an ablation control is sketched after this report.)
  2. [Experiments] Experimental results and headline claim (AlpacaEval/MT-Bench comparisons): the assertion that editing <2% of neurons in 2B/7B RMs yields preference annotations enabling LLM alignment comparable to a 70B RM requires explicit evidence that the intervention does not degrade reward accuracy on unbiased inputs. The abstract states 'no performance trade-offs,' but without reported ablations on neuron selection criteria, validation sets free of the predefined biases, or controls for shared upstream features, the support for the central empirical claim cannot be assessed.
  3. [Method] Bias attribute definition and generalizability: the method presupposes a fixed set of 'predefined bias attributes.' If these attributes are chosen post-hoc or overlap with legitimate quality signals, the intervention risks either incomplete debiasing or unintended degradation. The paper should clarify the selection process and test on held-out or emergent bias types.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a concise statement of the exact neuron-selection threshold and suppression operation (e.g., activation scaling factor) to allow immediate replication.
  2. [Experiments] Tables reporting RM benchmark results should include standard deviations across multiple runs and explicit comparison to the unmodified small RM baseline to quantify the intervention's isolated effect.
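
The ablation control asked for in the first major comment can be cashed out cheaply, as sketched below: compare the reward shift from suppressing the correlation-selected neurons against the shift from suppressing an equally sized random set. The score_with_suppressed interface is a hypothetical stand-in, not the paper's API; a markedly larger shift for the selected set on bias-manipulated inputs is the kind of evidence correlation alone cannot supply.

    import random

    def causal_contrast(score_with_suppressed, inputs, bias_idx, n_neurons, seed=0):
        """score_with_suppressed(x, idx) -> reward for x with neurons idx suppressed.
        Returns mean |reward shift| for the bias-selected set and for a random
        control set of the same size."""
        rand_idx = random.Random(seed).sample(range(n_neurons), len(bias_idx))
        base = [score_with_suppressed(x, []) for x in inputs]        # unedited
        sel = [score_with_suppressed(x, bias_idx) for x in inputs]   # selected
        rnd = [score_with_suppressed(x, rand_idx) for x in inputs]   # control

        def mean_shift(scores):
            return sum(abs(s - b) for s, b in zip(scores, base)) / len(base)

        return mean_shift(sel), mean_shift(rnd)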

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the detailed and constructive referee report. We appreciate the feedback on the methodological framing, empirical controls, and generalizability. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

point-by-point responses
  1. Referee: [Method] Method section (neuron identification procedure): the approach selects neurons solely by correlation of activations with predefined bias attributes and then applies suppression. No do-operator, counterfactual ablation, or causal graph is used to confirm these neurons lie on the causal path from bias attribute to reward output. This leaves open whether suppression removes the actual bias mechanism or merely correlated signals, which directly undermines the 'causally motivated' framing and the claim that core preference evaluation remains intact.

    Authors: We thank the referee for this precise observation. Our approach is motivated by causal intervention concepts from mechanistic interpretability, in which we identify neurons strongly associated with bias attributes via correlation and suppress their activations to reduce spurious influence on the reward output. However, we agree that the method does not include formal causal identification steps such as do-operators or an explicit causal graph. We will revise the manuscript to describe the technique more accurately as 'correlation-driven neuron suppression motivated by causal considerations in bias encoding' and add a dedicated limitations paragraph discussing the distinction between correlation-based identification and full causal validation. This preserves the practical contribution while clarifying the scope of the causal claim. revision: partial

  2. Referee: [Experiments] Experimental results and headline claim (AlpacaEval/MT-Bench comparisons): the assertion that editing <2% of neurons in 2B/7B RMs yields preference annotations enabling LLM alignment comparable to a 70B RM requires explicit evidence that the intervention does not degrade reward accuracy on unbiased inputs. The abstract states 'no performance trade-offs,' but without reported ablations on neuron selection criteria, validation sets free of the predefined biases, or controls for shared upstream features, the support for the central empirical claim cannot be assessed.

    Authors: We agree that stronger controls are needed to support the no-trade-off claim. The current results show that debiased RMs retain competitive performance on standard RM benchmarks (which contain diverse inputs) and that downstream LLMs achieve comparable alignment scores. To directly address the concern, we will add ablations in the revised version: (1) performance on validation sets constructed to minimize the predefined bias attributes, (2) comparisons with alternative neuron selection criteria (e.g., random selection or selection by unrelated attributes), and (3) controls that isolate the effect of shared upstream features. These additions will provide explicit evidence that the targeted suppression does not degrade core reward accuracy on unbiased inputs. revision: yes

  3. Referee: [Method] Bias attribute definition and generalizability: the method presupposes a fixed set of 'predefined bias attributes.' If these attributes are chosen post-hoc or overlap with legitimate quality signals, the intervention risks either incomplete debiasing or unintended degradation. The paper should clarify the selection process and test on held-out or emergent bias types.

    Authors: The bias attributes were chosen based on well-documented spurious correlations reported in prior reward model literature (e.g., length bias, toxicity proxies). We will expand the method section to explicitly document this selection rationale with citations. Regarding generalizability, we demonstrate results across several predefined attributes but did not evaluate completely held-out emergent biases. We will add a clarification paragraph and, where feasible, include an additional experiment on one held-out bias type to illustrate robustness. We also note that the preservation of RM benchmark performance provides indirect evidence against unintended degradation of legitimate signals. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical method: neuron identification via correlation with predefined bias attributes followed by suppression intervention, evaluated on RM benchmarks and downstream alignment tasks. No mathematical derivations, equations, or predictions are described that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The central claims rest on benchmark results rather than tautological logic, so no step is flagged for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Assessment based solely on abstract; full paper details unavailable.

axioms (1)
  • domain assumption Correlation between neuron activations and bias attributes identifies causally relevant bias signals
    The method relies on this to select neurons for intervention.
invented entities (1)
  • bias-correlated neurons · no independent evidence
    purpose: Target for suppression to remove spurious feature sensitivity
    Postulated via correlation analysis; no independent falsifiable evidence provided in abstract.

pith-pipeline@v0.9.0 · 5506 in / 1246 out tokens · 57877 ms · 2026-05-07T09:08:08.946692+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 19 canonical work pages · 3 internal anchors

  1. [1]

    Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. https://doi.org/10.1145/3292500.3330701 Optuna: A next-generation hyperparameter optimization framework. In KDD, pages 2623–2631

  2. [2]

    Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644

  3. [3]

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, and 1 others. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862

  4. [4]

    James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In NIPS

  5. [5]

    Ralph Allan Bradley and Milton E. Terry. 1952. http://www.jstor.org/stable/2334029 Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345

  6. [6]

    Lichang Chen, Chen Zhu, Jiuhai Chen, Davit Soselia, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, and Bryan Catanzaro. 2024. https://openreview.net/forum?id=zcIV8OQFVF ODIN: Disentangled reward mitigates hacking in RLHF. In ICML

  7. [7]

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2024. https://openreview.net/forum?id=BOorDpKHiJ UltraFeedback: Boosting language models with scaled AI feedback. In ICML

  8. [8]

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. 2024. https://openreview.net/forum?id=a13aYUU9eU RLHF workflow: From reward modeling to online RLHF. Transactions on Machine Learning Research

  9. [9]

    Yann Dubois, Percy Liang, and Tatsunori Hashimoto. 2024. https://openreview.net/forum?id=CybBmzWBX0 Length-controlled AlpacaEval: A simple debiasing of automatic evaluators. In COLM

  10. [10]

    Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alexander Nicholas D'Amour, Krishnamurthy Dj Dvijotham, Adam Fisch, Katherine A Heller, Stephen Robert Pfohl, Deepak Ramachandran, Peter Shaw, and Jonathan Berant. 2024. https://openreview.net/forum?id=5u1GpUkKtG Helping or herding? reward model ensembles mitigate but do not eliminate reward h...

  11. [11]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. https://doi.org/10.5281/zenodo.12608602 The languag...

  12. [12]

    Gemma Team. 2024. https://doi.org/10.34740/KAGGLE/M/3301 Gemma

  13. [13]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  14. [14]

    Zeyu Huang, Zihan Qiu, Zili Wang, Edoardo M Ponti, and Ivan Titov. 2024. Post-hoc reward calibration: A case study on length bias. arXiv preprint arXiv:2409.17407

  15. [15]

    Takeshi Kojima, Itsuki Okimura, Yusuke Iwasawa, Hitomi Yanaka, and Yutaka Matsuo. 2024. https://doi.org/10.18653/v1/2024.naacl-long.384 On the multilingual ability of decoder-based pre-trained language models: Finding and controlling language-specific neurons. In NAACL, pages 6919–6971

  16. [16]

    Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. 2025. https://doi.org/10.18653/v1/2025.findings-naacl.96 RewardBench: Evaluating reward models for language modeling. In Findings of the Association for Computational L...

  17. [17]

    Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. https://doi.org/10.18653/v1/2022.acl-long.229 TruthfulQA: Measuring how models mimic human falsehoods. In ACL, pages 3214–3252

  18. [18]

    Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. 2025. https://openreview.net/forum?id=QEHrmQPBdd RM-Bench: Benchmarking reward models of language models with subtlety and style. In ICLR

  19. [19]

    R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. https://doi.org/10.18653/v1/P19-1334 Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In ACL, pages 3428–3448

  20. [20]

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/6f1d43d5a82a37e89b0665b33bf3a182-Paper-Conference.pdf Locating and editing factual associations in GPT. In NeurIPS

  21. [21]

    Yu Meng, Mengzhou Xia, and Danqi Chen. 2024. https://openreview.net/forum?id=3Tzcot1LKb SimPO: Simple preference optimization with a reference-free reward. In NeurIPS

  22. [22]

    Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, and Dacheng Tao. 2024. https://openreview.net/forum?id=3XnBVK9sD6 InfoRM: Mitigating reward hacking in RLHF via information-theoretic reward modeling. In NeurIPS

  23. [23]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. https://openreview.net/forum?id=TG8KACxEON Training language ...

  24. [24]

    Junsoo Park, Seungyeon Jwa, Ren Meiying, Daeyoung Kim, and Sanghyuk Choi. 2024a. https://doi.org/10.18653/v1/2024.findings-emnlp.57 OffsetBias: Leveraging debiased data for tuning evaluators. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1043–1067

  25. [25]

    Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. 2024b. https://doi.org/10.18653/v1/2024.findings-acl.297 Disentangling length from quality in direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4998–5017

  26. [26]

    Judea Pearl. 2009. Causality: Models, Reasoning and Inference, 2nd edition. Cambridge University Press, USA

  27. [27]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. https://openreview.net/forum?id=HPuSIXJaa9 Direct preference optimization: Your language model is secretly a reward model. In NeurIPS

  28. [28]

    Alexandre Rame, Nino Vieillard, Leonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, and Johan Ferret. 2024. https://openreview.net/forum?id=s7RDnNUJy6 WARM: On the benefits of weight averaged reward models. In ICML

  29. [29]

    Kazutoshi Shinoda, Saku Sugawara, and Akiko Aizawa. 2021. https://doi.org/10.18653/v1/2021.mrqa-1.6 Can question generation debias question answering models? A case study on question–context lexical overlap. In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, pages 63–72

  30. [30]

    Kazutoshi Shinoda, Saku Sugawara, and Akiko Aizawa. 2022. https://doi.org/10.18653/v1/2022.blackboxnlp-1.35 Look to the right: Mitigating relative position bias in extractive question answering. In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 418–425

  31. [31]

    Kazutoshi Shinoda, Saku Sugawara, and Akiko Aizawa. 2023. https://doi.org/10.1609/aaai.v37i11.26590 Which shortcut solution do question answering models prefer to learn? In AAAI, pages 13564–13572

  32. [32]

    Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. 2024. https://openreview.net/forum?id=G8LaO1P0xv A long way to go: Investigating length correlations in RLHF. In COLM

  33. [33]

    C. Spearman. 1904. http://www.jstor.org/stable/1412159 The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101

  34. [34]

    Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. https://proceedings.mlr.press/v70/sundararajan17a.html Axiomatic attribution for deep networks. In ICML, pages 3319–3328

  35. [35]

    Prasetya Ajie Utama, Nafise Sadat Moosavi, and Iryna Gurevych. 2020. https://aclanthology.org/2020.acl-main.770/ Mind the trade-off: Debiasing NLU models without degrading the in-distribution performance. In ACL, pages 8717–8729

  36. [36]

    Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf Investigating gender bias in language models using causal mediation analysis. In NeurIPS

  37. [37]

    Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, and Yi Dong. 2025. https://openreview.net/forum?id=MnfHxPP5gs HelpSteer2-Preference: Complementing ratings with preferences. In ICLR

  38. [38]

    Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev. 2024. https://openreview.net/forum?id=PvVKUFhaNy HelpSteer2: Open-source dataset for training top-performing reward models. In NeurIPS

  39. [39]

    Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, and Tong Zhang. 2024. https://openreview.net/forum?id=jwh9MHEfmY Regularizing hidden states enables learning generalizable reward model for LLMs. In NeurIPS

  40. [40]

    Xuanchang Zhang, Wei Xiong, Lichang Chen, Tianyi Zhou, Heng Huang, and Tong Zhang. 2025. https://doi.org/10.18653/v1/2025.acl-long.1308 From lists to emojis: How format bias affects model alignment. In ACL, pages 26940–26961
