ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks

Jiayin Wang; Tianxiang Xu; Xiaoyan Zhu; Xin Lai

arxiv: 2605.17458 · v1 · pith:HZAUJL73new · submitted 2026-05-17 · 💻 cs.LG

Pith reviewed 2026-05-20 15:06 UTC · model grok-4.3

classification 💻 cs.LG

keywords: text classification, reinforcement learning, preference modeling, reward model, confidence calibration, policy optimization, classification improvement

The pith

ClaHF converts standard classification labels into ranked preference signals that a reward model uses to optimize the classifier via reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text classification models trained only on single labels per example overlook relative quality differences among possible outputs for the same input. ClaHF generates multiple candidate predictions per input, derives their relative ordering from the model's own scores, and trains a reward model to capture both the single best prediction and the ordering of the rest. Reinforcement learning then updates the classifier using these derived preferences instead of isolated labels. The method is evaluated on eight tasks spanning different scenarios and yields gains in accuracy plus better calibration of predictive confidence. All preference data comes from the original labeled examples, so no new human annotations are introduced.

Core claim

ClaHF constructs multiple candidate predictions together with their relative ranking relations, and jointly models the Top-1 preference and the ordering among non-optimal candidates within a reward model. This design converts conventional label supervision into preference signals that are directly applicable to policy optimization.

What carries the argument

A reward model that jointly scores the top-ranked candidate and the relative ordering among the remaining candidates, converting label data into preference pairs usable for reinforcement learning policy optimization.
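
The conversion from a single label to preference pairs can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `rank_candidates` and `preference_pairs`, the example classes, and the `scores` used to order the non-optimal candidates are all hypothetical. The only points taken from the text are that one candidate is the Top-1 choice and that the remaining candidates carry a relative order.

```python
from typing import Dict, List, Tuple


def rank_candidates(candidates: List[str], gold: str,
                    scores: Dict[str, float]) -> List[str]:
    """Order candidates best-first: the labeled class, then the rest by score.

    Assumption: `scores` is any caller-supplied ordering signal for the
    non-optimal candidates; the source does not fix how it is computed.
    """
    unique = list(dict.fromkeys(candidates))  # drop repeated samples, keep order
    rest = sorted((c for c in unique if c != gold),
                  key=lambda c: scores.get(c, 0.0), reverse=True)
    return [gold] + rest


def preference_pairs(ranking: List[str]) -> List[Tuple[str, str]]:
    """Expand a best-first ranking into (preferred, rejected) pairs."""
    return [(ranking[i], ranking[j])
            for i in range(len(ranking))
            for j in range(i + 1, len(ranking))]


# Hypothetical three-class sentiment example with one repeated sample.
ranking = rank_candidates(
    ["neutral", "positive", "negative", "neutral"],
    gold="positive",
    scores={"neutral": 0.30, "negative": 0.05, "positive": 0.65},
)
pairs = preference_pairs(ranking)
print(ranking)
print(pairs)
```

The first `len(ranking) - 1` pairs all have the labeled class on the preferred side (the Top-1 signal); the remaining pairs encode the ordering among non-optimal candidates.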

If this is right

Classification accuracy rises because the model learns from comparative quality rather than isolated labels.
Predictive confidence becomes better calibrated as the reward model distinguishes optimal from near-optimal outputs.
Decision boundaries shift to reflect preference ordering, reducing over- or under-confidence on borderline cases.
The same pipeline applies across multiple language models without task-specific redesign.
No extra human labeling is needed beyond the original supervised dataset.
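
Calibration claims of this kind are commonly checked with expected calibration error (ECE), the metric named in the Figure 3 caption. The sketch below uses the usual equal-width binning definition; the bin count, the bin-edge convention, and the example numbers are assumptions, and the paper's exact evaluation settings are not stated on this page.

```python
from typing import List


def expected_calibration_error(confidences: List[float],
                               correct: List[bool],
                               n_bins: int = 10) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins.

    Each confidence in [0, 1] falls into bin floor(conf * n_bins); a
    confidence of exactly 1.0 is placed in the last bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += 1.0 if hit else 0.0
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        gap = abs(hit_sum[b] / count[b] - conf_sum[b] / count[b])
        ece += (count[b] / n) * gap
    return ece


# Hypothetical predictions: two confident hits, two borderline cases.
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55],
                                 [True, True, True, False])
print(round(ece, 4))
```

A lower value means that stated confidence tracks observed accuracy more closely; a perfectly calibrated model scores 0.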

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The ranking construction step could be adapted to regression or structured prediction tasks where multiple outputs can be scored and ordered.
In low-data regimes the method might extract more signal from each labeled example by generating internal comparisons.
One could test whether performance holds when the initial model used to create rankings is deliberately under-trained or noisy.

Load-bearing premise

The method assumes that relative ranking relations among candidate predictions can be reliably constructed from model outputs alone, without additional human annotations, and that these constructed rankings provide a sufficiently accurate preference signal for the reward model.

What would settle it

Compare results when the reward model receives the model-derived rankings versus randomly shuffled rankings on the same tasks; if accuracy and calibration gains vanish under shuffled rankings, the contribution of the constructed preferences is isolated.
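
The shuffled-ranking control described above could be set up along these lines. This is a sketch under stated assumptions: the helper names `shuffled_control` and `pairwise_agreement` are hypothetical, every position (including the top one) is shuffled, and the training and evaluation loop that would consume the two ranking sets is left out.

```python
import random
from typing import List, Sequence


def shuffled_control(rankings: Sequence[Sequence[str]],
                     seed: int = 0) -> List[List[str]]:
    """Return a randomly permuted copy of each ranking (inputs are untouched)."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    shuffled = []
    for ranking in rankings:
        perm = list(ranking)
        rng.shuffle(perm)
        shuffled.append(perm)
    return shuffled


def pairwise_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of item pairs that two rankings of the same items order alike."""
    pos_b = {item: i for i, item in enumerate(b)}
    pairs = [(a[i], a[j])
             for i in range(len(a)) for j in range(i + 1, len(a))]
    if not pairs:
        return 1.0
    same = sum(1 for x, y in pairs if pos_b[x] < pos_b[y])
    return same / len(pairs)


# Hypothetical best-first rankings for two inputs.
derived = [["positive", "neutral", "negative"],
           ["sports", "politics", "tech", "health"]]
control = shuffled_control(derived, seed=13)
for original, permuted in zip(derived, control):
    print(permuted, pairwise_agreement(original, permuted))
```

Reporting the agreement score alongside the downstream metrics shows how much ordering information the control actually destroyed, since a random permutation can by chance leave some pairs intact.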

Figures

Figures reproduced from arXiv: 2605.17458 by Jiayin Wang, Tianxiang Xu, Xiaoyan Zhu, Xin Lai.

**Figure 2.** Overall framework of ClaHF. (a) SFT to provide high-quality initialization. (b) Automatic construction

**Figure 3.** Performance heatmaps of ClaHF in terms of Acc, F1, and ECE over a continuous

Original abstract

Text classification models are typically trained via supervised fine-tuning (SFT). However, SFT essentially performs behavior cloning from instance-wise labels and thus fails to adequately capture relative preference relations among samples, which limits the model's ability to shape decision boundaries and calibrate predictive confidence. In this paper, we propose ClaHF, a human feedback-inspired reinforcement learning (RL) framework for text classification that integrates preference modeling and RL optimization into the classification pipeline without requiring additional human annotations. Unlike prior work that relies solely on instance-wise supervision, ClaHF constructs multiple candidate predictions together with their relative ranking relations, and jointly models the Top-1 preference and the ordering among non-optimal candidates within a reward model (RM). This design converts conventional label supervision into preference signals that are directly applicable to policy optimization. We conduct systematic evaluations on eight classification tasks spanning three categories of scenarios. Results demonstrate that ClaHF consistently improves both classification performance and confidence calibration across diverse language models (LMs). The data and code are available at https://anonymous.4open.science/r/ClaHF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ClaHF turns standard classification labels into synthetic preference pairs for RL optimization without new annotations. However, the rankings come from model outputs, and the abstract supplies no numbers for checking whether the method actually helps.

The main thing to know is that this paper takes the RLHF preference-modeling trick and tries to make it work for text classification. They generate several candidate predictions per input, build relative rankings among them from the model itself, and train a reward model that scores both the top choice and the ordering of the rest. That preference signal then drives policy optimization instead of plain supervised fine-tuning.

Referee Report

3 major / 2 minor

Summary. The paper proposes ClaHF, a human feedback-inspired reinforcement learning framework for text classification. It constructs multiple candidate predictions along with their relative ranking relations from model outputs without requiring additional human annotations, then jointly models the Top-1 preference and the ordering among non-optimal candidates inside a reward model. This converts standard label supervision into preference signals usable for policy optimization, with claimed consistent gains in both accuracy and calibration on eight tasks across three scenario categories.

Significance. If the empirical improvements hold and the synthetic rankings supply non-circular supervisory signal, the work would be significant for enabling scalable RL-style optimization on classification problems without extra labeling. The joint Top-1 plus ordering modeling and the release of data and code are positive features that could support further research on preference-based fine-tuning for discriminative tasks.

major comments (3)

[Abstract] Abstract: the central claim that ClaHF 'consistently improves both classification performance and confidence calibration' is unsupported by any numerical results, baseline comparisons, statistical tests, or details on how candidates and rankings are generated, rendering the empirical support for the main contribution unverifiable from the manuscript text.
[Method] Method section (construction of rankings): the relative ranking relations are stated to be built from model outputs alone; if these rankings are derived from the same logits, sampling, or scores used to produce the candidates, the resulting preference signal is likely correlated with the original model and may simply reinforce existing errors rather than provide independent information for the reward model and subsequent policy optimization.
[Experiments] Experiments: no description is given of the eight tasks, the exact procedure for generating and ranking candidates, the loss used to jointly train the Top-1 and ordering components of the reward model, or direct comparisons against strong SFT and alternative RL baselines, all of which are load-bearing for validating that the proposed design improves decision boundaries and calibration.

minor comments (2)

[Abstract] The anonymous code link should be replaced with a permanent repository before publication.
[Method] Notation for the reward model components (Top-1 preference versus ordering loss) should be defined explicitly with equations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications based on the content of the paper and indicating revisions where they will strengthen the presentation.

Referee: [Abstract] Abstract: the central claim that ClaHF 'consistently improves both classification performance and confidence calibration' is unsupported by any numerical results, baseline comparisons, statistical tests, or details on how candidates and rankings are generated, rendering the empirical support for the main contribution unverifiable from the manuscript text.

Authors: We appreciate the referee noting the need for greater specificity in the abstract. The abstract is a concise summary; the full manuscript provides numerical results across eight tasks in the Experiments section, including accuracy and calibration metrics with comparisons to SFT and RL baselines, along with details on candidate generation via sampling and ranking construction from labels. To improve verifiability from the opening text, we will revise the abstract to include brief quantitative highlights of the observed gains. revision: yes
Referee: [Method] Method section (construction of rankings): the relative ranking relations are stated to be built from model outputs alone; if these rankings are derived from the same logits, sampling, or scores used to produce the candidates, the resulting preference signal is likely correlated with the original model and may simply reinforce existing errors rather than provide independent information for the reward model and subsequent policy optimization.

Authors: We take this concern about potential circularity seriously. Candidates are indeed sampled from the model, but the relative rankings are not derived purely from the same logits or scores; instead, the ground-truth label is used to designate the Top-1 preferred candidate (the one matching the label) and to order non-optimal candidates according to their similarity or overlap with the reference label. This converts the existing label supervision into a non-circular preference signal for training the reward model. We will add a clarifying subsection and illustrative example in the revised Method section to make this distinction explicit. revision: yes
Referee: [Experiments] Experiments: no description is given of the eight tasks, the exact procedure for generating and ranking candidates, the loss used to jointly train the Top-1 and ordering components of the reward model, or direct comparisons against strong SFT and alternative RL baselines, all of which are load-bearing for validating that the proposed design improves decision boundaries and calibration.

Authors: We acknowledge that these elements could be presented more explicitly. The manuscript describes the eight tasks (spanning sentiment, topic, and other classification scenarios) in Section 4.1, details candidate generation through temperature and top-p sampling plus ranking via label alignment in Section 3, specifies the joint reward model loss (preference loss for Top-1 combined with pairwise ranking loss) in the corresponding equations, and reports direct comparisons to SFT, DPO, and PPO baselines with both accuracy and calibration results in the experimental tables. To address the comment fully, we will expand these descriptions with pseudocode and additional baseline discussion in the revision. revision: yes
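
The joint reward model loss mentioned in this response is described only at a high level. A minimal sketch follows, assuming a weighted sum of the form L_rm = α·L_top1 + (1-α)·L_pairwise, with a softmax cross-entropy for the Top-1 term and a Bradley-Terry style logistic loss for the pairwise term. Both concrete loss choices, the default `alpha`, and all function names are assumptions, not the paper's definitions.

```python
import math
from typing import List


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def top1_loss(rewards: List[float]) -> float:
    """Softmax cross-entropy with index 0 (the preferred candidate) as target."""
    m = max(rewards)
    log_z = m + math.log(sum(math.exp(r - m) for r in rewards))
    return log_z - rewards[0]


def pairwise_loss(rewards: List[float]) -> float:
    """Mean logistic loss over ordered pairs among the non-top candidates.

    `rewards` is best-first, so for i < j the reward at i should exceed j.
    """
    rest = rewards[1:]
    losses = [_softplus(-(rest[i] - rest[j]))
              for i in range(len(rest)) for j in range(i + 1, len(rest))]
    return sum(losses) / len(losses) if losses else 0.0


def reward_model_loss(rewards: List[float], alpha: float = 0.5) -> float:
    """Weighted sum: alpha * top-1 term + (1 - alpha) * pairwise term."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * top1_loss(rewards) + (1.0 - alpha) * pairwise_loss(rewards)


# Hypothetical reward-model scores for candidates listed best-first.
well_ordered = reward_model_loss([3.0, 1.0, -1.0])
mis_ordered = reward_model_loss([-1.0, 1.0, 3.0])
print(well_ordered, mis_ordered)
```

Scores that agree with the intended ranking give a lower loss than scores that invert it, and `alpha` controls how much weight goes to identifying the best candidate versus ordering the rest.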

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The paper describes ClaHF as constructing candidate predictions and relative rankings from model outputs to form preference signals for a reward model and subsequent policy optimization, converting instance-wise labels into usable RL signals. No equations, derivations, or self-citation chains are exhibited that reduce the claimed performance gains or preference modeling to a quantity fitted directly from the same inputs by construction. The central mechanism is presented as an independent modeling choice evaluated empirically across eight tasks, with no load-bearing step that renames a fitted parameter as a prediction or imports uniqueness via self-citation. This matches the expectation of a self-contained framework against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract provides no explicit free parameters, mathematical axioms, or newly postulated physical entities; the central addition is the procedural construction of preference signals from candidate predictions.

axioms (1)

Domain assumption: relative ranking relations among non-optimal candidate predictions supply useful training signal for the reward model.
Invoked when the paper states that the RM jointly models Top-1 preference and ordering among non-optimal candidates.

pith-pipeline@v0.9.0 · 5718 in / 1238 out tokens · 70997 ms · 2026-05-20T15:06:42.817308+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "ClaHF constructs multiple candidate predictions together with their relative ranking relations... jointly models the Top-1 preference and the ordering among non-optimal candidates within a reward model (RM)"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: $L_{\mathrm{rm}} = \alpha L_{\mathrm{top1}} + (1-\alpha) L_{\mathrm{pairwise}}$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
