pith. machine review for the scientific record.

arxiv: 2402.11411 · v1 · submitted 2024-02-18 · 💻 cs.LG · cs.CL · cs.CV

Recognition: 2 theorem links · Lean Theorem

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 10:54 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV
keywords vision language models · hallucinations · preference optimization · direct preference optimization · multimodal alignment · instruction tuning · reinforcement learning from human feedback

The pith

Preference fine-tuning on automatically generated hallucinated responses aligns vision and language modalities in large models while cutting hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision large language models frequently generate answers that do not match the input image. The paper frames this hallucination problem as imperfect alignment between separately trained vision and language components. It introduces POVID, which builds preference data by treating accurate ground-truth answers as preferred responses and creating dispreferred ones through two automated steps: prompting GPT-4V to insert plausible hallucinations or distorting the image to elicit errors from the model. These pairs feed into Direct Preference Optimization to adjust the model. Experiments across benchmarks demonstrate reduced hallucinations together with gains on standard tasks that surpass prior alignment methods.
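The construction is simple enough to sketch. Below is a minimal, hypothetical rendering of the two-stage pipeline; `gpt4v_inject` and `vllm_generate` are stand-ins for the GPT-4V API and the target model, and the Gaussian-noise distortion is an assumed instantiation of "distorting the image," which the paper's abstract does not pin down.

```python
import random
from dataclasses import dataclass

import numpy as np
from PIL import Image


@dataclass
class PreferencePair:
    image: Image.Image
    prompt: str
    preferred: str      # ground-truth answer from the instruction data
    dispreferred: str   # automatically generated hallucinated answer


def distort_image(image: Image.Image) -> Image.Image:
    """Assumed distortion: additive Gaussian pixel noise (one plausible
    reading of 'distort the image'; the paper may use something else)."""
    arr = np.asarray(image).astype(np.float32)
    noisy = arr + np.random.normal(0.0, 25.0, size=arr.shape)
    return Image.fromarray(noisy.clip(0, 255).astype(np.uint8))


def build_pair(image, prompt, ground_truth, gpt4v_inject, vllm_generate):
    """gpt4v_inject and vllm_generate are hypothetical callables standing
    in for the external GPT-4V API and the target VLLM, respectively."""
    # Stage 1: inject plausible hallucinations into the correct answer.
    injected = gpt4v_inject(ground_truth, image)
    # Stage 2: distort the image so the VLLM hallucinates on its own.
    triggered = vllm_generate(distort_image(image), prompt)
    # For illustration, sample one dispreferred response per pair; the
    # paper integrates both strategies into a single DPO pipeline.
    return PreferencePair(image, prompt, ground_truth,
                          random.choice([injected, triggered]))
```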

Core claim

The paper claims that hallucinations in instruction-following vision large language models stem from incomplete modality alignment during joint training and can be addressed by preference optimization on automatically constructed data. Preferred responses come directly from ground-truth instructions, while dispreferred responses are produced in two stages—GPT-4V hallucination injection into correct answers and image distortion that triggers the model's own hallucination behavior—then optimized via Direct Preference Optimization without any human data collection.
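These pairs feed the standard Direct Preference Optimization objective (Rafailov et al., 2023). Written with the image v added to the conditioning, as the multimodal case is usually stated (the paper may use a modified variant; this is the base form):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(v,\,x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
    \log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid v,x)}{\pi_{\mathrm{ref}}(y_w\mid v,x)}
      -\beta\log\frac{\pi_\theta(y_l\mid v,x)}{\pi_{\mathrm{ref}}(y_l\mid v,x)}
    \right)\right]
```

Here y_w is the ground-truth (preferred) response, y_l a generated dispreferred one, and π_ref a frozen reference copy of the model.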

What carries the argument

POVID, the two-stage automated generation of dispreferred responses (GPT-4V hallucination injection plus image distortion) that supplies preference pairs for Direct Preference Optimization.

If this is right

  • Hallucination rates fall on multiple evaluation benchmarks.
  • Overall accuracy rises on standard vision-language tasks beyond previous alignment techniques.
  • The method scales without requiring human-generated preference data or perfect expert supervision.
  • Both generation strategies can be combined within a single Direct Preference Optimization run (a loss sketch follows this list).
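A minimal sketch of that single-run loss, assuming summed per-response log-probabilities from the policy and a frozen reference model have already been computed; this is not the paper's code, just the standard DPO loss over a batch of pairs.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Each argument is a (batch,) tensor of summed log-probs, e.g.
    policy_logp_w = log pi_theta(y_w | image, prompt)."""
    # Implicit reward margins under the DPO reparameterization.
    chosen = policy_logp_w - ref_logp_w
    rejected = policy_logp_l - ref_logp_l
    # Maximize the log-sigmoid of the scaled margin difference.
    return -F.logsigmoid(beta * (chosen - rejected)).mean()


# Pairs from both generation stages can live in one dataset, so a single
# optimization run covers GPT-4V-injected and distortion-triggered negatives.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-12.0]), torch.tensor([-14.8]))
```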

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same automated preference-pair construction could be tested on other multimodal errors such as spatial reasoning failures or object-counting mistakes.
  • Preference data generated this way might transfer to align components in non-vision multimodal systems like audio-language models.
  • Running the process iteratively—using the tuned model to generate new dispreferred pairs—could produce further gains without external models.

Load-bearing premise

The two-stage automated generation of dispreferred responses produces high-quality preference pairs that accurately reflect and correct the model's hallucination behavior when used in Direct Preference Optimization.

What would settle it

Apply the full POVID pipeline to a held-out vision-language model and measure hallucination rates on a benchmark such as POPE or CHAIR; if rates show no reduction or benchmark scores fail to rise relative to the untuned baseline, the central claim would be falsified.
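A hedged sketch of the proposed test, using POPE-style yes/no object probes; `model.answer` is a hypothetical interface returning a short "yes"/"no" string, and probe construction is assumed done upstream.

```python
def pope_hallucination_rate(model, probes):
    """probes: list of (image, object_name, present: bool) triples.
    Counts 'yes' answers for absent objects, i.e. object hallucinations."""
    false_yes = total_negatives = 0
    for image, obj, present in probes:
        answer = model.answer(image, f"Is there a {obj} in the image?")
        if not present:
            total_negatives += 1
            if answer.strip().lower().startswith("yes"):
                false_yes += 1  # claimed an absent object: a hallucination
    return false_yes / max(total_negatives, 1)


# The central claim is falsified if the tuned model's rate is not below
# the untuned baseline's rate on the same probe set.
```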

Original abstract

Instruction-following Vision Large Language Models (VLLMs) have achieved significant progress recently on a variety of tasks. These approaches merge strong pre-trained vision models and large language models (LLMs). Since these components are trained separately, the learned representations need to be aligned with joint training on additional image-language pairs. This procedure is not perfect and can cause the model to hallucinate - provide answers that do not accurately reflect the image, even when the core LLM is highly factual and the vision backbone has sufficiently complete representations. In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning. Specifically, we propose POVID to generate feedback data with AI models. We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data. First, we prompt GPT-4V to inject plausible hallucinations into the correct answer. Second, we distort the image to trigger the inherent hallucination behavior of the VLLM. This is an automated approach that does not rely on human data generation or require a perfect expert, which makes it easily scalable. Finally, both of these generation strategies are integrated into an RLHF pipeline via Direct Preference Optimization. In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but also improve model performance across standard benchmarks, outperforming prior approaches. Our data and code are available at https://github.com/YiyangZhou/POVID.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces POVID, a scalable automated pipeline for generating preference data to fine-tune Vision Large Language Models (VLLMs) via Direct Preference Optimization (DPO). Ground-truth answers serve as preferred responses; dispreferred responses are created in two stages—prompting GPT-4V to inject plausible hallucinations into correct answers, and distorting input images to elicit the VLLM’s inherent hallucinations—before integrating the pairs into an RLHF-style DPO training loop. Experiments are reported to show both reduced hallucinations and improved performance on standard benchmarks, outperforming prior approaches.

Significance. If the results are reproducible and the preference pairs prove high-quality, the work supplies a practical, human-annotation-free method for modality alignment that simultaneously mitigates hallucinations and lifts benchmark scores, which would be a useful contribution to reliable multimodal models.

major comments (2)
  1. [Methods / Data Generation] The central claim that DPO on these pairs reduces hallucinations and improves performance rests on the assumption that the two-stage dispreferred responses faithfully capture the target VLLM’s hallucination distribution. Because GPT-4V is an external, stronger model, its injected hallucinations may not match the visual-grounding failures of the specific VLLM under training; likewise, image distortion can alter semantic content and render the original ground-truth invalid. The manuscript should include quantitative validation (e.g., overlap statistics between GPT-4V hallucinations and the VLLM’s own errors on a held-out set) or an ablation that replaces the generated pairs with human-annotated ones to test whether the automated data is load-bearing for the reported gains. (A minimal overlap-statistic sketch follows the minor comments below.)
  2. [Experiments] The abstract asserts “outperforming prior approaches” across “broad benchmarks,” yet the provided description supplies no concrete metrics, baseline tables, ablation results, or statistical significance tests. Without these details it is impossible to assess whether the claimed improvements are robust or merely marginal; the manuscript must present full comparison tables (including exact benchmark names, scores, and variance) and control experiments that isolate the contribution of each data-generation stage.
minor comments (2)
  1. [Abstract] The phrase “broad benchmarks” is vague; naming the specific datasets (e.g., VQA, GQA, POPE) and reporting at least the headline hallucination and accuracy deltas would make the summary self-contained.
  2. [Notation / Setup] The manuscript should clarify whether the same VLLM is used both for image-distortion triggering and for final evaluation, or whether a different model is employed; inconsistent usage could affect reproducibility.
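On major comment 1, the requested validation could be as simple as an overlap statistic between the object sets hallucinated via GPT-4V injection and those the VLLM produces itself on held-out data. A minimal sketch, with the extraction of hallucinated object mentions assumed done upstream (e.g., by a CHAIR-style object matcher):

```python
def jaccard_overlap(injected: set, model_errors: set) -> float:
    """Jaccard similarity between hallucinated-object sets: 1.0 means the
    GPT-4V-injected errors exactly match the VLLM's own error profile."""
    if not injected and not model_errors:
        return 1.0
    return len(injected & model_errors) / len(injected | model_errors)


# Example: high overlap would support using GPT-4V negatives as a proxy
# for the target VLLM's own failure distribution.
print(jaccard_overlap({"fork", "chair", "dog"}, {"chair", "dog", "cat"}))
```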

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our work. We address each of the major comments below and outline the revisions we plan to make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Methods / Data Generation] The central claim that DPO on these pairs reduces hallucinations and improves performance rests on the assumption that the two-stage dispreferred responses faithfully capture the target VLLM’s hallucination distribution. Because GPT-4V is an external, stronger model, its injected hallucinations may not match the visual-grounding failures of the specific VLLM under training; likewise, image distortion can alter semantic content and render the original ground-truth invalid. The manuscript should include quantitative validation (e.g., overlap statistics between GPT-4V hallucinations and the VLLM’s own errors on a held-out set) or an ablation that replaces the generated pairs with human-annotated ones to test whether the automated data is load-bearing for the reported gains.

    Authors: We agree that validating the quality of the generated preference pairs is important for substantiating our claims. While our method uses GPT-4V to generate plausible hallucinations and image distortions to elicit model-specific errors, we recognize that these may not perfectly align with the target VLLM's distribution. In the revised manuscript, we will add quantitative analysis, such as comparing the types and frequencies of hallucinations generated by our pipeline versus those observed directly from the VLLM on a held-out set. We will also include an ablation study using a subset of human-annotated preference pairs to demonstrate that our automated approach achieves comparable gains, thereby showing that the data generation is effective and load-bearing. This will help confirm the robustness of our findings without relying on human annotation at scale.

    revision: partial

  2. Referee: [Experiments] The abstract asserts “outperforming prior approaches” across “broad benchmarks,” yet the provided description supplies no concrete metrics, baseline tables, ablation results, or statistical significance tests. Without these details it is impossible to assess whether the claimed improvements are robust or merely marginal; the manuscript must present full comparison tables (including exact benchmark names, scores, and variance) and control experiments that isolate the contribution of each data-generation stage.

    Authors: We apologize if the experimental details were not sufficiently clear in the initial version. The full manuscript does contain comprehensive results, including tables with specific benchmark names (such as POPE, HallusionBench, and others), performance scores, comparisons to baselines like LLaVA and other alignment methods, ablation studies on the two stages of dispreferred data generation, and multiple runs for variance. To address this comment directly, we will revise the paper to include more prominent tables in the main body, add statistical significance tests where appropriate, and provide additional control experiments that isolate the contribution of the GPT-4V injection stage versus the image distortion stage. These revisions will make the improvements and their robustness more transparent.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims.

full rationale

The paper presents an empirical method (POVID) for automated preference data generation via external GPT-4V hallucination injection and image distortion, followed by standard DPO fine-tuning on VLLMs. No mathematical derivations, equations, or parameter-fitting steps are described that reduce by construction to self-defined inputs or predictions. Central claims rest on experimental benchmark results rather than internal self-reference. Any author overlap with the DPO method is a standard citation to prior published work and does not load-bear or force the present empirical outcomes, which remain independently falsifiable via external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that automatically generated preference pairs will produce generalizable alignment improvements via DPO without side effects on other capabilities.

axioms (1)
  • domain assumption: Automatically generated preference pairs from GPT-4V and image distortion will effectively reduce hallucinations when optimized with DPO.
    This is the core unproven premise linking the data generation step to the claimed performance gains.

pith-pipeline@v0.9.0 · 5574 in / 1282 out tokens · 108054 ms · 2026-05-17T10:54:31.659670+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Visual Preference Optimization with Rubric Rewards

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

  2. Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.

  3. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG · 2026-05 · conditional · novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  4. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  5. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM · 2026-05 · unverdicted · novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  6. Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.

  7. Online Self-Calibration Against Hallucination in Vision-Language Models

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...

  8. When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    Hallucinations in LVLMs largely arise from textual priors in prompts, and can be reduced by fine-tuning with preference optimization on grounded vs. hallucinated response pairs.

  9. R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

    cs.CV · 2026-04 · conditional · novelty 6.0

    R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.

  10. S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

  11. HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    HTDC mitigates hallucinations in LVLMs by triggering calibration only at hesitation-prone decoding steps via contrasts with visual-nullification and semantic-nullification probes.

  12. Visual-RFT: Visual Reinforcement Fine-Tuning

    cs.CV · 2025-03 · conditional · novelty 6.0

    Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

  13. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

    cs.CL · 2024-11 · conditional · novelty 6.0

    Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.

  14. Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

    cs.LG · 2026-05 · unverdicted · novelty 5.0

    UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.

  15. DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

    cs.AI · 2026-04 · unverdicted · novelty 5.0

    DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLH...

  16. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    cs.CV · 2024-07 · conditional · novelty 5.0

    InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

  17. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV · 2024-04 · accept · novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  18. Seed1.5-VL Technical Report

    cs.CV · 2025-05 · unverdicted · novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

155 extracted references · 155 canonical work pages · cited by 17 Pith papers · 21 internal anchors
