pith. machine review for the scientific record.

arxiv: 2402.11411 · v1 · submitted 2024-02-18 · 💻 cs.LG · cs.CL · cs.CV

Recognition: 2 theorem links · Lean Theorem

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 10:54 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV
keywords vision language models · hallucinations · preference optimization · direct preference optimization · multimodal alignment · instruction tuning · reinforcement learning from human feedback

The pith

Preference fine-tuning on automatically generated hallucinated responses aligns vision and language modalities in large models while cutting hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision large language models frequently generate answers that do not match the input image. The paper frames this hallucination problem as imperfect alignment between separately trained vision and language components. It introduces POVID, which builds preference data by treating accurate ground-truth answers as preferred responses and creating dispreferred ones through two automated steps: prompting GPT-4V to insert plausible hallucinations or distorting the image to elicit errors from the model. These pairs feed into Direct Preference Optimization to adjust the model. Experiments across benchmarks demonstrate reduced hallucinations together with gains on standard tasks that surpass prior alignment methods.
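The construction is simple enough to sketch. Below is a minimal, hypothetical rendering of the two-stage pipeline; `gpt4v_inject` and `vllm_generate` are stand-ins for the GPT-4V API and the target model, and the Gaussian-noise distortion is an assumed instantiation of "distorting the image," which the paper's abstract does not pin down.

```python
import random
from dataclasses import dataclass

import numpy as np
from PIL import Image


@dataclass
class PreferencePair:
    image: Image.Image
    prompt: str
    preferred: str      # ground-truth answer from the instruction data
    dispreferred: str   # automatically generated hallucinated answer


def distort_image(image: Image.Image) -> Image.Image:
    """Assumed distortion: additive Gaussian pixel noise (one plausible
    reading of 'distort the image'; the paper may use something else)."""
    arr = np.asarray(image).astype(np.float32)
    noisy = arr + np.random.normal(0.0, 25.0, size=arr.shape)
    return Image.fromarray(noisy.clip(0, 255).astype(np.uint8))


def build_pair(image, prompt, ground_truth, gpt4v_inject, vllm_generate):
    """gpt4v_inject and vllm_generate are hypothetical callables standing
    in for the external GPT-4V API and the target VLLM, respectively."""
    # Stage 1: inject plausible hallucinations into the correct answer.
    injected = gpt4v_inject(ground_truth, image)
    # Stage 2: distort the image so the VLLM hallucinates on its own.
    triggered = vllm_generate(distort_image(image), prompt)
    # For illustration, sample one dispreferred response per pair; the
    # paper integrates both strategies into a single DPO pipeline.
    return PreferencePair(image, prompt, ground_truth,
                          random.choice([injected, triggered]))
```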

Core claim

The paper claims that hallucinations in instruction-following vision large language models stem from incomplete modality alignment during joint training and can be addressed by preference optimization on automatically constructed data. Preferred responses come directly from ground-truth instructions, while dispreferred responses are produced in two stages—GPT-4V hallucination injection into correct answers and image distortion that triggers the model's own hallucination behavior—then optimized via Direct Preference Optimization without any human data collection.
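These pairs feed the standard Direct Preference Optimization objective (Rafailov et al., 2023). Written with the image v added to the conditioning, as the multimodal case is usually stated (the paper may use a modified variant; this is the base form):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(v,\,x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
    \log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid v,x)}{\pi_{\mathrm{ref}}(y_w\mid v,x)}
      -\beta\log\frac{\pi_\theta(y_l\mid v,x)}{\pi_{\mathrm{ref}}(y_l\mid v,x)}
    \right)\right]
```

Here y_w is the ground-truth (preferred) response, y_l a generated dispreferred one, and π_ref a frozen reference copy of the model.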

What carries the argument

POVID, the two-stage automated generation of dispreferred responses (GPT-4V hallucination injection plus image distortion) that supplies preference pairs for Direct Preference Optimization.

If this is right

  • Hallucination rates fall on multiple evaluation benchmarks.
  • Overall accuracy rises on standard vision-language tasks beyond previous alignment techniques.
  • The method scales without requiring human-generated preference data or perfect expert supervision.
  • Both generation strategies can be combined within a single Direct Preference Optimization run (a loss sketch follows this list).
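A minimal sketch of that single-run loss, assuming summed per-response log-probabilities from the policy and a frozen reference model have already been computed; this is not the paper's code, just the standard DPO loss over a batch of pairs.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Each argument is a (batch,) tensor of summed log-probs, e.g.
    policy_logp_w = log pi_theta(y_w | image, prompt)."""
    # Implicit reward margins under the DPO reparameterization.
    chosen = policy_logp_w - ref_logp_w
    rejected = policy_logp_l - ref_logp_l
    # Maximize the log-sigmoid of the scaled margin difference.
    return -F.logsigmoid(beta * (chosen - rejected)).mean()


# Pairs from both generation stages can live in one dataset, so a single
# optimization run covers GPT-4V-injected and distortion-triggered negatives.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-12.0]), torch.tensor([-14.8]))
```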

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same automated preference-pair construction could be tested on other multimodal errors such as spatial reasoning failures or object-counting mistakes.
  • Preference data generated this way might transfer to align components in non-vision multimodal systems like audio-language models.
  • Running the process iteratively—using the tuned model to generate new dispreferred pairs—could produce further gains without external models.

Load-bearing premise

The two-stage automated generation of dispreferred responses produces high-quality preference pairs that accurately reflect and correct the model's hallucination behavior when used in Direct Preference Optimization.

What would settle it

Apply the full POVID pipeline to a held-out vision-language model and measure hallucination rates on a benchmark such as POPE or CHAIR; if rates show no reduction or benchmark scores fail to rise relative to the untuned baseline, the central claim would be falsified.
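A hedged sketch of the proposed test, using POPE-style yes/no object probes; `model.answer` is a hypothetical interface returning a short "yes"/"no" string, and probe construction is assumed done upstream.

```python
def pope_hallucination_rate(model, probes):
    """probes: list of (image, object_name, present: bool) triples.
    Counts 'yes' answers for absent objects, i.e. object hallucinations."""
    false_yes = total_negatives = 0
    for image, obj, present in probes:
        answer = model.answer(image, f"Is there a {obj} in the image?")
        if not present:
            total_negatives += 1
            if answer.strip().lower().startswith("yes"):
                false_yes += 1  # claimed an absent object: a hallucination
    return false_yes / max(total_negatives, 1)


# The central claim is falsified if the tuned model's rate is not below
# the untuned baseline's rate on the same probe set.
```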

Original abstract

Instruction-following Vision Large Language Models (VLLMs) have achieved significant progress recently on a variety of tasks. These approaches merge strong pre-trained vision models and large language models (LLMs). Since these components are trained separately, the learned representations need to be aligned with joint training on additional image-language pairs. This procedure is not perfect and can cause the model to hallucinate - provide answers that do not accurately reflect the image, even when the core LLM is highly factual and the vision backbone has sufficiently complete representations. In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning. Specifically, we propose POVID to generate feedback data with AI models. We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data. First, we prompt GPT-4V to inject plausible hallucinations into the correct answer. Second, we distort the image to trigger the inherent hallucination behavior of the VLLM. This is an automated approach that does not rely on human data generation or require a perfect expert, which makes it easily scalable. Finally, both of these generation strategies are integrated into an RLHF pipeline via Direct Preference Optimization. In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but also improve model performance across standard benchmarks, outperforming prior approaches. Our data and code are available at https://github.com/YiyangZhou/POVID.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces POVID, a scalable automated pipeline for generating preference data to fine-tune Vision Large Language Models (VLLMs) via Direct Preference Optimization (DPO). Ground-truth answers serve as preferred responses; dispreferred responses are created in two stages—prompting GPT-4V to inject plausible hallucinations into correct answers, and distorting input images to elicit the VLLM’s inherent hallucinations—before integrating the pairs into an RLHF-style DPO training loop. Experiments are reported to show both reduced hallucinations and improved performance on standard benchmarks, outperforming prior approaches.

Significance. If the results are reproducible and the preference pairs prove high-quality, the work supplies a practical, human-annotation-free method for modality alignment that simultaneously mitigates hallucinations and lifts benchmark scores, which would be a useful contribution to reliable multimodal models.

major comments (2)
  1. [Methods / Data Generation] The central claim that DPO on these pairs reduces hallucinations and improves performance rests on the assumption that the two-stage dispreferred responses faithfully capture the target VLLM’s hallucination distribution. Because GPT-4V is an external, stronger model, its injected hallucinations may not match the visual-grounding failures of the specific VLLM under training; likewise, image distortion can alter semantic content and render the original ground-truth invalid. The manuscript should include quantitative validation (e.g., overlap statistics between GPT-4V hallucinations and the VLLM’s own errors on a held-out set) or an ablation that replaces the generated pairs with human-annotated ones to test whether the automated data is load-bearing for the reported gains. (A minimal overlap-statistic sketch follows the minor comments below.)
  2. [Experiments] The abstract asserts “outperforming prior approaches” across “broad benchmarks,” yet the provided description supplies no concrete metrics, baseline tables, ablation results, or statistical significance tests. Without these details it is impossible to assess whether the claimed improvements are robust or merely marginal; the manuscript must present full comparison tables (including exact benchmark names, scores, and variance) and control experiments that isolate the contribution of each data-generation stage.
minor comments (2)
  1. [Abstract] The phrase “broad benchmarks” is vague; naming the specific datasets (e.g., VQA, GQA, POPE) and reporting at least the headline hallucination and accuracy deltas would make the summary self-contained.
  2. [Notation / Setup] The manuscript should clarify whether the same VLLM is used both for image-distortion triggering and for final evaluation, or whether a different model is employed; inconsistent usage could affect reproducibility.
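On major comment 1, the requested validation could be as simple as an overlap statistic between the object sets hallucinated via GPT-4V injection and those the VLLM produces itself on held-out data. A minimal sketch, with the extraction of hallucinated object mentions assumed done upstream (e.g., by a CHAIR-style object matcher):

```python
def jaccard_overlap(injected: set, model_errors: set) -> float:
    """Jaccard similarity between hallucinated-object sets: 1.0 means the
    GPT-4V-injected errors exactly match the VLLM's own error profile."""
    if not injected and not model_errors:
        return 1.0
    return len(injected & model_errors) / len(injected | model_errors)


# Example: high overlap would support using GPT-4V negatives as a proxy
# for the target VLLM's own failure distribution.
print(jaccard_overlap({"fork", "chair", "dog"}, {"chair", "dog", "cat"}))
```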

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our work. We address each of the major comments below and outline the revisions we plan to make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Methods / Data Generation] The central claim that DPO on these pairs reduces hallucinations and improves performance rests on the assumption that the two-stage dispreferred responses faithfully capture the target VLLM’s hallucination distribution. Because GPT-4V is an external, stronger model, its injected hallucinations may not match the visual-grounding failures of the specific VLLM under training; likewise, image distortion can alter semantic content and render the original ground-truth invalid. The manuscript should include quantitative validation (e.g., overlap statistics between GPT-4V hallucinations and the VLLM’s own errors on a held-out set) or an ablation that replaces the generated pairs with human-annotated ones to test whether the automated data is load-bearing for the reported gains.

    Authors: We agree that validating the quality of the generated preference pairs is important for substantiating our claims. While our method uses GPT-4V to generate plausible hallucinations and image distortions to elicit model-specific errors, we recognize that these may not perfectly align with the target VLLM's distribution. In the revised manuscript, we will add quantitative analysis, such as comparing the types and frequencies of hallucinations generated by our pipeline versus those observed directly from the VLLM on a held-out set. We will also include an ablation study using a subset of human-annotated preference pairs to demonstrate that our automated approach achieves comparable gains, thereby showing that the data generation is effective and load-bearing. This will help confirm the robustness of our findings without relying on human annotation at scale.

    revision: partial

  2. Referee: [Experiments] The abstract asserts “outperforming prior approaches” across “broad benchmarks,” yet the provided description supplies no concrete metrics, baseline tables, ablation results, or statistical significance tests. Without these details it is impossible to assess whether the claimed improvements are robust or merely marginal; the manuscript must present full comparison tables (including exact benchmark names, scores, and variance) and control experiments that isolate the contribution of each data-generation stage.

    Authors: We apologize if the experimental details were not sufficiently clear in the initial version. The full manuscript does contain comprehensive results, including tables with specific benchmark names (such as POPE, HallusionBench, and others), performance scores, comparisons to baselines like LLaVA and other alignment methods, ablation studies on the two stages of dispreferred data generation, and multiple runs for variance. To address this comment directly, we will revise the paper to include more prominent tables in the main body, add statistical significance tests where appropriate, and provide additional control experiments that isolate the contribution of the GPT-4V injection stage versus the image distortion stage. These revisions will make the improvements and their robustness more transparent.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims.

full rationale

The paper presents an empirical method (POVID) for automated preference data generation via external GPT-4V hallucination injection and image distortion, followed by standard DPO fine-tuning on VLLMs. No mathematical derivations, equations, or parameter-fitting steps are described that reduce by construction to self-defined inputs or predictions. Central claims rest on experimental benchmark results rather than internal self-reference. Any author overlap with the DPO method is a standard citation to prior published work and does not load-bear or force the present empirical outcomes, which remain independently falsifiable via external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that automatically generated preference pairs will produce generalizable alignment improvements via DPO without side effects on other capabilities.

axioms (1)
  • domain assumption: Automatically generated preference pairs from GPT-4V and image distortion will effectively reduce hallucinations when optimized with DPO.
    This is the core unproven premise linking the data generation step to the claimed performance gains.

pith-pipeline@v0.9.0 · 5574 in / 1282 out tokens · 108054 ms · 2026-05-17T10:54:31.659670+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Visual Preference Optimization with Rubric Rewards

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

  2. Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.

  3. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG · 2026-05 · conditional · novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  4. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  5. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM · 2026-05 · unverdicted · novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  6. Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.

  7. Online Self-Calibration Against Hallucination in Vision-Language Models

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...

  8. When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    Hallucinations in LVLMs largely arise from textual priors in prompts, and can be reduced by fine-tuning with preference optimization on grounded vs. hallucinated response pairs.

  9. R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

    cs.CV · 2026-04 · conditional · novelty 6.0

    R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.

  10. S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

  11. HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    HTDC mitigates hallucinations in LVLMs by triggering calibration only at hesitation-prone decoding steps via contrasts with visual-nullification and semantic-nullification probes.

  12. Visual-RFT: Visual Reinforcement Fine-Tuning

    cs.CV · 2025-03 · conditional · novelty 6.0

    Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

  13. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

    cs.CL · 2024-11 · conditional · novelty 6.0

    Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.

  14. Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

    cs.LG · 2026-05 · unverdicted · novelty 5.0

    UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.

  15. DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

    cs.AI · 2026-04 · unverdicted · novelty 5.0

    DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLH...

  16. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    cs.CV · 2024-07 · conditional · novelty 5.0

    InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

  17. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV · 2024-04 · accept · novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  18. Seed1.5-VL Technical Report

    cs.CV · 2025-05 · unverdicted · novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

155 extracted references · 155 canonical work pages · cited by 17 Pith papers · 21 internal anchors
