pith. machine review for the scientific record.

arxiv: 2604.17501 · v1 · submitted 2026-04-19 · 💻 cs.CL

Recognition: unknown

CoAct: Co-Active LLM Preference Learning with Human-AI Synergy

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:22 UTC · model grok-4.3

classification 💻 cs.CL
keywords preference learning · LLM alignment · self-rewarding · active learning · human-AI collaboration · reasoning benchmarks · self-consistency · instruction generation

The pith

CoAct improves LLM reasoning by using self-consistency to blend reliable AI self-labels with targeted human feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CoAct as a way to create high-quality preference data for aligning LLMs without relying entirely on expensive human labels or risky AI-only labels. It claims the method works by checking how consistently the model agrees with its own outputs to decide which data can be trusted for self-rewarding and which needs human review. Human input is then used not just to correct labels but to steer the model toward generating new instructions that stay within its current abilities. This hybrid loop is shown to deliver clear gains on math and instruction benchmarks compared with baselines that use only one approach. A sympathetic reader would care because it points to a practical way to scale alignment data collection as models grow.

Core claim

CoAct synergistically combines self-rewarding and active learning through strategic human-AI collaboration. It leverages self-consistency to identify both reliable self-labeled data and samples that require oracle verification. Additionally, oracle feedback guides the model to generate new instructions within its solvable capability.

What carries the argument

The CoAct framework, which uses self-consistency checks to separate trustworthy self-labeled preference data from samples that need human oracle verification, then applies the oracle input to generate fresh, solvable instructions.
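To make that carrier concrete, here is a minimal sketch of what such a consistency-based router could look like, assuming consistency is measured as the majority-vote share of each sampled response's final answer. The function names, the record layout, and the threshold tau are illustrative choices, not values from the paper.

```python
from collections import Counter

def consistency_scores(responses):
    """Majority-vote share of each response's final answer.

    responses: list of (response_text, final_answer) pairs sampled for one
    instruction. Both the pairing and the scoring rule are assumptions.
    """
    counts = Counter(answer for _, answer in responses)
    n = len(responses)
    return [counts[answer] / n for _, answer in responses]

def route_instruction(instruction, responses, tau=0.7):
    """Self-label when the consistency signal is strong, otherwise defer.

    tau is a hypothetical confidence threshold, not a value from the paper.
    Ties between the most and least consistent response are ignored here.
    """
    scores = consistency_scores(responses)
    best = max(range(len(responses)), key=lambda i: scores[i])
    worst = min(range(len(responses)), key=lambda i: scores[i])
    if scores[best] >= tau:
        # Trusted self-label: the most consistent response becomes the chosen
        # side of the preference pair, the least consistent the rejected side.
        return {"instruction": instruction,
                "chosen": responses[best][0],
                "rejected": responses[worst][0],
                "source": "self"}
    # Low consistency: defer the preference judgment to the human oracle.
    return {"instruction": instruction,
            "responses": responses,
            "source": "oracle"}
```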

If this is right

  • CoAct delivers average gains of +13.25% on GSM8K, +8.19% on MATH, and +13.16% on WebInstruct while beating all compared baselines.
  • Most preference data can be handled by the model itself, with human effort reserved for uncertain cases only.
  • Oracle feedback can expand the set of instructions the model can reliably solve without capability mismatch.
  • The same consistency signal serves dual purposes: validating self-labels and deciding when to request human input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could reduce total human annotation cost enough to make preference tuning feasible for smaller research groups or specialized domains.
  • If consistency remains a stable signal as models improve, the fraction of cases needing humans might shrink over successive training rounds.
  • The generated instructions could be reused as a growing pool of solvable examples for future self-rewarding stages.

Load-bearing premise

Self-consistency can be trusted to correctly separate good AI-generated labels from those that truly need human correction, and human feedback can steer instruction generation without creating bias or exceeding the model's real capability.
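A direct way to probe that premise is the quantity the paper's Figure 4 plots: the incorrect rate among the high-consistency samples the filter would accept. Below is a minimal sketch of that measurement, assuming each sample carries a consistency score and a ground-truth correctness flag; the field names and the threshold grid are illustrative, not taken from the paper.

```python
def incorrect_rate_by_threshold(samples, thresholds):
    """Fraction of accepted self-labels that are wrong, per acceptance threshold.

    samples: list of (consistency, is_correct) pairs, where is_correct marks
    whether the self-chosen response matches the benchmark reference answer.
    """
    rates = {}
    for t in thresholds:
        accepted = [ok for c, ok in samples if c >= t]
        # None signals that no sample clears this threshold.
        rates[t] = sum(not ok for ok in accepted) / len(accepted) if accepted else None
    return rates

# Example: sweep a coarse grid of acceptance thresholds on toy data.
# print(incorrect_rate_by_threshold([(0.9, True), (0.8, False), (0.4, True)],
#                                   [0.5, 0.7, 0.9]))
```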

What would settle it

Applying CoAct to a new reasoning benchmark and finding no gain over a strong self-rewarding baseline, or discovering that the newly generated instructions produce preference pairs that humans rate as lower quality than the original data.

Figures

Figures reproduced from arXiv: 2604.17501 by Kaize Ding, Mihir Parmar, Ruiyao Xu, Tiankai Yang, Yue Zhao, Zhengyu Hu.

Figure 1
Figure 1: (a) Self-rewarding uses AI self-labeled data to construct preference pairs; (b) Active preference learning uses human annotation to ensure data quality; (c) Our framework COACT combines both approaches through human-AI collaboration.
Figure 2
Figure 2: Overview of the COACT framework. COACT combines three key components: ➀ self-consistency-based preference construction, ➁ strategic oracle annotation selection, and ➂ oracle-guided instruction augmentation to generate new training data within the model's capability.
Figure 3
Figure 3: Left: Average majority vote share across iterations, showing the percentage of samples where the most …
Figure 4
Figure 4: Sensitivity analysis and ablation study. Left: Incorrect rate for different high-consistency selection …
Figure 5
Figure 5: Out-of-domain generalization on GPQA and MMLU-Pro.
read the original abstract

Learning from preference-based feedback has become an effective approach for aligning LLMs across diverse tasks. However, high-quality human-annotated preference data remains expensive and scarce. Existing methods address this challenge through either self-rewarding, which scales by using purely AI-generated labels but risks unreliability, or active learning, which ensures quality through oracle annotation but cannot fully leverage unlabeled data. In this paper, we present CoAct, a novel framework that synergistically combines self-rewarding and active learning through strategic human-AI collaboration. CoAct leverages self-consistency to identify both reliable self-labeled data and samples that require oracle verification. Additionally, oracle feedback guides the model to generate new instructions within its solvable capability. Evaluated on three reasoning benchmarks across two model families, CoAct achieves average improvements of +13.25% on GSM8K, +8.19% on MATH, and +13.16% on WebInstruct, consistently outperforming all baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CoAct, a framework for LLM preference learning that synergistically combines self-rewarding and active learning through human-AI collaboration. It uses self-consistency to identify reliable self-labeled data and samples requiring oracle verification, while leveraging oracle feedback to generate new solvable instructions. Evaluated on GSM8K, MATH, and WebInstruct across two model families, it reports average gains of +13.25%, +8.19%, and +13.16% respectively, outperforming baselines.

Significance. If the gains prove robust and attributable to the proposed filtering and generation mechanisms rather than ancillary factors, CoAct could meaningfully reduce reliance on expensive human annotations while addressing reliability issues in pure self-rewarding methods. The approach targets a practical bottleneck in scalable LLM alignment.

major comments (2)
  1. [Experiments section (results and ablations)] The central empirical claims (+13.25% GSM8K etc.) rest on the premise that self-consistency reliably separates trustworthy self-labels from those needing oracle input and that oracle-guided instruction generation avoids bias or mismatch. However, no precision/recall of the consistency filter against ground truth, no ablation removing the oracle step, and no analysis of distribution shift in generated instructions are provided. This leaves the attribution of gains to the CoAct synergy unverified.
  2. [§4 (Evaluation) and abstract] The abstract and results report consistent outperformance but supply no information on statistical tests, run-to-run variance, exact data splits, or full baseline implementations. Without these, it is impossible to assess whether the reported margins are reliable or could arise from implementation differences.
minor comments (2)
  1. [Abstract] Clarify the two model families and their sizes in the abstract or introduction for immediate context.
  2. [§4] Ensure all baselines are described with implementation details (e.g., prompting strategies, reward models) in the experimental setup.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional empirical validation would strengthen the paper. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Experiments section (results and ablations)] The central empirical claims (+13.25% GSM8K etc.) rest on the premise that self-consistency reliably separates trustworthy self-labels from those needing oracle input and that oracle-guided instruction generation avoids bias or mismatch. However, no precision/recall of the consistency filter against ground truth, no ablation removing the oracle step, and no analysis of distribution shift in generated instructions are provided. This leaves the attribution of gains to the CoAct synergy unverified.

    Authors: We agree these analyses would better substantiate the claims. In the revised manuscript we will add precision/recall metrics for the self-consistency filter against ground-truth solutions on GSM8K and MATH. We will also include an ablation that disables the oracle verification step while keeping all other components fixed, to isolate its contribution. Finally, we will quantify distribution shift by comparing instruction difficulty (via solution length and required reasoning steps), topic coverage, and lexical overlap between oracle-generated instructions and the original datasets. These results will be reported in an expanded Experiments section. revision: yes

  2. Referee: [§4 (Evaluation) and abstract] The abstract and results report consistent outperformance but supply no information on statistical tests, run-to-run variance, exact data splits, or full baseline implementations. Without these, it is impossible to assess whether the reported margins are reliable or could arise from implementation differences.

    Authors: We will revise §4 to report (i) statistical significance via paired t-tests or bootstrap resampling over multiple seeds, (ii) run-to-run standard deviations for all main results, and (iii) exact train/validation/test splits together with preprocessing steps. We will also expand the baseline descriptions with complete hyperparameter settings, prompt templates, and decoding configurations to enable reproduction. While space constraints limit changes to the abstract, the key reliability details will be summarized in the results tables and appendix. revision: yes
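The bootstrap computation the authors commit to here is straightforward to specify. A minimal sketch, assuming per-seed accuracies paired between CoAct and a baseline and a conventional 95% interval; the function name, the pairing-by-seed assumption, and the resample count are illustrative, not details from the paper.

```python
import random

def bootstrap_gain_ci(coact_scores, baseline_scores, n_boot=10_000, seed=0):
    """Mean accuracy gain of CoAct over a baseline, with a bootstrap 95% CI.

    coact_scores / baseline_scores: per-seed (or per-run) accuracies, paired
    by position. All names and defaults are assumptions for illustration.
    """
    rng = random.Random(seed)
    diffs = [c - b for c, b in zip(coact_scores, baseline_scores)]
    boots = []
    for _ in range(n_boot):
        resample = [rng.choice(diffs) for _ in diffs]
        boots.append(sum(resample) / len(resample))
    boots.sort()
    low = boots[int(0.025 * n_boot)]
    high = boots[int(0.975 * n_boot)]
    return sum(diffs) / len(diffs), (low, high)

# Example with three hypothetical seeds per system:
# print(bootstrap_gain_ci([0.71, 0.69, 0.73], [0.60, 0.58, 0.62]))
```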

Circularity Check

0 steps flagged

No significant circularity; empirical framework with external benchmarks

full rationale

The paper describes an empirical framework (CoAct) for combining self-rewarding and active learning via self-consistency filtering and oracle-guided instruction generation. No equations, derivations, or first-principles predictions are present in the abstract or described structure. Reported gains (+13.25% GSM8K etc.) are measured on independent external benchmarks (GSM8K, MATH, WebInstruct) rather than quantities defined by the method itself. No self-citation chains, fitted inputs renamed as predictions, or ansatzes appear in the provided text. The approach is self-contained as a practical human-AI synergy method evaluated against baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the framework is described at the level of high-level strategy without mathematical or architectural details.

pith-pipeline@v0.9.0 · 5479 in / 1096 out tokens · 51271 ms · 2026-05-10T05:22:56.521389+00:00 · methodology

discussion (0)

