Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
Pith reviewed 2026-05-17 10:54 UTC · model grok-4.3
The pith
Preference fine-tuning on automatically generated hallucinated responses aligns vision and language modalities in large models while cutting hallucinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that hallucinations in instruction-following vision large language models stem from incomplete modality alignment during joint training and can be addressed by preference optimization on automatically constructed data. Preferred responses come directly from ground-truth instructions, while dispreferred responses are produced in two stages—GPT-4V hallucination injection into correct answers and image distortion that triggers the model's own hallucination behavior—then optimized via Direct Preference Optimization without any human data collection.
What carries the argument
POVID, the two-stage automated generation of dispreferred responses (GPT-4V hallucination injection plus image distortion) that supplies preference pairs for Direct Preference Optimization.
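To make the pipeline concrete, here is a minimal sketch of how such preference pairs could be assembled, assuming hypothetical `gpt4v_inject` and `vllm_generate` wrappers and a simple Gaussian-blur distortion; POVID's actual prompts, distortion schedule, and data format are not reproduced here.

```python
from dataclasses import dataclass
from PIL import Image, ImageFilter


@dataclass
class PreferencePair:
    image_path: str
    prompt: str
    preferred: str      # ground-truth instruction response
    dispreferred: str   # automatically generated hallucinated response


def distort(image: Image.Image, blur_radius: float = 8.0) -> Image.Image:
    # Heavy blur degrades the visual evidence, so the VLLM falls back on
    # language priors and surfaces its inherent hallucination behavior.
    return image.filter(ImageFilter.GaussianBlur(radius=blur_radius))


def build_pairs(examples, gpt4v_inject, vllm_generate):
    # `gpt4v_inject(prompt, answer)` and `vllm_generate(image, prompt)` are
    # assumed wrappers around GPT-4V and the target VLLM; they stand in for
    # the two stages and are not part of the paper's released code.
    pairs = []
    for ex in examples:
        image = Image.open(ex["image_path"]).convert("RGB")
        # Stage 1: ask GPT-4V to inject plausible hallucinations
        # (extra objects, wrong attributes) into the correct answer.
        hallucinated = gpt4v_inject(ex["prompt"], ex["answer"])
        pairs.append(PreferencePair(ex["image_path"], ex["prompt"],
                                    ex["answer"], hallucinated))
        # Stage 2: distort the image and let the target VLLM hallucinate.
        noisy_answer = vllm_generate(distort(image), ex["prompt"])
        pairs.append(PreferencePair(ex["image_path"], ex["prompt"],
                                    ex["answer"], noisy_answer))
    return pairs
```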
If this is right
- Hallucination rates fall on multiple evaluation benchmarks.
- Overall accuracy rises on standard vision-language tasks beyond previous alignment techniques.
- The method scales without requiring human-generated preference data or perfect expert supervision.
- Both generation strategies can be combined within a single Direct Preference Optimization run.
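Whichever stage produced the dispreferred side, the underlying objective is the standard DPO loss; the sketch below is a generic formulation rather than POVID's training code, and assumes per-response log-likelihoods have already been computed for the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of (preferred, dispreferred) pairs.

    Each batch element is one pair, regardless of whether the dispreferred
    response came from GPT-4V injection or from image distortion, so both
    generation strategies can feed a single optimization run."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin: push the policy to prefer
    # the ground-truth response over the hallucinated one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```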
Where Pith is reading between the lines
- The same automated preference-pair construction could be tested on other multimodal errors such as spatial reasoning failures or object-counting mistakes.
- Preference data generated this way might transfer to align components in non-vision multimodal systems like audio-language models.
- Running the process iteratively—using the tuned model to generate new dispreferred pairs—could produce further gains without external models.
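The iterative variant raised in the last bullet could look like the loop below, reusing hypothetical helpers in the spirit of the pair-construction sketch above; nothing like this is reported in the paper, so it is purely a speculative extension.

```python
def iterative_povid(model, examples, build_pairs_fn, dpo_finetune_fn, rounds: int = 3):
    # Speculative self-improvement loop: after each DPO round, the freshly
    # tuned model (rather than GPT-4V) generates the next batch of
    # dispreferred responses from distorted images, so no external model
    # is needed after the first round.
    for _ in range(rounds):
        pairs = build_pairs_fn(model, examples)   # dispreferred side from current model
        model = dpo_finetune_fn(model, pairs)     # one DPO pass on the new pairs
    return model
```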
Load-bearing premise
The two-stage automated generation of dispreferred responses produces high-quality preference pairs that accurately reflect and correct the model's hallucination behavior when used in Direct Preference Optimization.
What would settle it
Apply the full POVID pipeline to a held-out vision-language model and measure hallucination rates on a benchmark such as POPE or CHAIR; if rates show no reduction or benchmark scores fail to rise relative to the untuned baseline, the central claim would be falsified.
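A minimal sketch of that falsification test on a POPE-style yes/no benchmark follows, assuming a hypothetical `model.answer(image, question)` interface; the pass/fail criterion (any accuracy gain over the untuned baseline) and the use of accuracy rather than CHAIR's object-level counts are illustrative choices.

```python
def pope_accuracy(model, samples):
    """samples: iterable of dicts with 'image', 'question' (e.g. "Is there a
    dog in the image?"), and a binary 'label' of 'yes' or 'no'."""
    correct = 0
    for s in samples:
        pred = model.answer(s["image"], s["question"]).strip().lower()
        correct += int(pred.startswith(s["label"]))
    return correct / len(samples)


def settles_it(baseline_model, povid_model, samples, min_gain: float = 0.0):
    # The central claim is falsified if the POVID-tuned model shows no
    # reduction in hallucination (no accuracy gain) over the untuned baseline.
    base = pope_accuracy(baseline_model, samples)
    tuned = pope_accuracy(povid_model, samples)
    return tuned - base > min_gain, {"baseline": base, "povid": tuned}
```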
Original abstract
Instruction-following Vision Large Language Models (VLLMs) have achieved significant progress recently on a variety of tasks. These approaches merge strong pre-trained vision models and large language models (LLMs). Since these components are trained separately, the learned representations need to be aligned with joint training on additional image-language pairs. This procedure is not perfect and can cause the model to hallucinate - provide answers that do not accurately reflect the image, even when the core LLM is highly factual and the vision backbone has sufficiently complete representations. In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning. Specifically, we propose POVID to generate feedback data with AI models. We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data. First, we prompt GPT-4V to inject plausible hallucinations into the correct answer. Second, we distort the image to trigger the inherent hallucination behavior of the VLLM. This is an automated approach, which does not rely on human data generation or require a perfect expert, which makes it easily scalable. Finally, both of these generation strategies are integrated into an RLHF pipeline via Direct Preference Optimization. In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but improve model performance across standard benchmarks, outperforming prior approaches. Our data and code are available at https://github.com/YiyangZhou/POVID.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces POVID, a scalable automated pipeline for generating preference data to fine-tune Vision Large Language Models (VLLMs) via Direct Preference Optimization (DPO). Ground-truth answers serve as preferred responses; dispreferred responses are created in two stages—prompting GPT-4V to inject plausible hallucinations into correct answers, and distorting input images to elicit the VLLM’s inherent hallucinations—before integrating the pairs into an RLHF-style DPO training loop. Experiments are reported to show both reduced hallucinations and improved performance on standard benchmarks, outperforming prior approaches.
Significance. If the results are reproducible and the preference pairs prove high-quality, the work supplies a practical, human-annotation-free method for modality alignment that simultaneously mitigates hallucinations and lifts benchmark scores, which would be a useful contribution to reliable multimodal models.
major comments (2)
- [Methods / Data Generation] The central claim that DPO on these pairs reduces hallucinations and improves performance rests on the assumption that the two-stage dispreferred responses faithfully capture the target VLLM’s hallucination distribution. Because GPT-4V is an external, stronger model, its injected hallucinations may not match the visual-grounding failures of the specific VLLM under training; likewise, image distortion can alter semantic content and render the original ground-truth invalid. The manuscript should include quantitative validation (e.g., overlap statistics between GPT-4V hallucinations and the VLLM’s own errors on a held-out set; a sketch of one such overlap measure follows this list) or an ablation that replaces the generated pairs with human-annotated ones to test whether the automated data is load-bearing for the reported gains.
- [Experiments] The abstract asserts “outperforming prior approaches” across “broad benchmarks,” yet the provided description supplies no concrete metrics, baseline tables, ablation results, or statistical significance tests. Without these details it is impossible to assess whether the claimed improvements are robust or merely marginal; the manuscript must present full comparison tables (including exact benchmark names, scores, and variance) and control experiments that isolate the contribution of each data-generation stage.
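For the overlap validation requested in the first major comment, one simple proxy is the Jaccard overlap between hallucinated-object vocabularies; the sketch below assumes a hypothetical `extract_objects` helper and is not a validation the manuscript itself reports.

```python
def hallucinated_objects(responses, ground_truth_objects, extract_objects):
    """Objects mentioned in the responses but absent from the ground truth.
    `extract_objects` is an assumed noun-phrase/object extractor."""
    mentioned = set()
    for r in responses:
        mentioned |= set(extract_objects(r))
    return mentioned - set(ground_truth_objects)


def hallucination_overlap(gpt4v_responses, vllm_responses,
                          ground_truth_objects, extract_objects):
    # Jaccard overlap between the objects GPT-4V injects and the objects the
    # target VLLM hallucinates on its own for the same held-out images.
    injected = hallucinated_objects(gpt4v_responses, ground_truth_objects, extract_objects)
    native = hallucinated_objects(vllm_responses, ground_truth_objects, extract_objects)
    union = injected | native
    return len(injected & native) / len(union) if union else 0.0
```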
minor comments (2)
- [Abstract] The phrase “broad benchmarks” is vague; naming the specific datasets (e.g., VQA, GQA, POPE) and reporting at least the headline hallucination and accuracy deltas would make the summary self-contained.
- [Notation / Setup] The manuscript should clarify whether the same VLLM is used both for image-distortion triggering and for final evaluation, or whether a different model is employed; inconsistent usage could affect reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our work. We address each of the major comments below and outline the revisions we plan to make to strengthen the manuscript.
Point-by-point responses
Referee: [Methods / Data Generation] The central claim that DPO on these pairs reduces hallucinations and improves performance rests on the assumption that the two-stage dispreferred responses faithfully capture the target VLLM’s hallucination distribution. Because GPT-4V is an external, stronger model, its injected hallucinations may not match the visual-grounding failures of the specific VLLM under training; likewise, image distortion can alter semantic content and render the original ground-truth invalid. The manuscript should include quantitative validation (e.g., overlap statistics between GPT-4V hallucinations and the VLLM’s own errors on a held-out set) or an ablation that replaces the generated pairs with human-annotated ones to test whether the automated data is load-bearing for the reported gains.
Authors: We agree that validating the quality of the generated preference pairs is important for substantiating our claims. While our method uses GPT-4V to generate plausible hallucinations and image distortions to elicit model-specific errors, we recognize that these may not perfectly align with the target VLLM's distribution. In the revised manuscript, we will add quantitative analysis, such as comparing the types and frequencies of hallucinations generated by our pipeline versus those observed directly from the VLLM on a held-out set. We will also include an ablation study using a subset of human-annotated preference pairs to demonstrate that our automated approach achieves comparable gains, thereby showing that the data generation is effective and load-bearing. This will help confirm the robustness of our findings without relying on human annotation at scale.
revision: partial
Referee: [Experiments] The abstract asserts “outperforming prior approaches” across “broad benchmarks,” yet the provided description supplies no concrete metrics, baseline tables, ablation results, or statistical significance tests. Without these details it is impossible to assess whether the claimed improvements are robust or merely marginal; the manuscript must present full comparison tables (including exact benchmark names, scores, and variance) and control experiments that isolate the contribution of each data-generation stage.
Authors: We apologize if the experimental details were not sufficiently clear in the initial version. The full manuscript does contain comprehensive results, including tables with specific benchmark names (such as POPE, HallusionBench, and others), performance scores, comparisons to baselines like LLaVA and other alignment methods, ablation studies on the two stages of dispreferred data generation, and multiple runs for variance. To address this comment directly, we will revise the paper to include more prominent tables in the main body, add statistical significance tests where appropriate, and provide additional control experiments that isolate the contribution of the GPT-4V injection stage versus the image distortion stage. These revisions will make the improvements and their robustness more transparent.
revision: yes
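One form the promised significance testing could take is a paired bootstrap over per-example correctness on a shared benchmark split; the sketch below uses hypothetical 0/1 correctness vectors for the baseline and the POVID-tuned model, and is illustrative rather than a description of the authors' planned analysis.

```python
import random


def paired_bootstrap_pvalue(baseline_correct, povid_correct,
                            n_resamples: int = 10_000, seed: int = 0) -> float:
    """One-sided paired bootstrap: how often does a resampled POVID accuracy
    fail to exceed the resampled baseline accuracy on the same examples?
    Both inputs are equal-length lists of 0/1 correctness per benchmark item."""
    assert len(baseline_correct) == len(povid_correct)
    rng = random.Random(seed)
    n = len(baseline_correct)
    worse_or_equal = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]        # resample items with replacement
        base_acc = sum(baseline_correct[i] for i in idx) / n
        povid_acc = sum(povid_correct[i] for i in idx) / n
        worse_or_equal += povid_acc <= base_acc
    return worse_or_equal / n_resamples
```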
Circularity Check
No significant circularity detected in derivation or claims.
full rationale
The paper presents an empirical method (POVID) for automated preference data generation via external GPT-4V hallucination injection and image distortion, followed by standard DPO fine-tuning on VLLMs. No mathematical derivations, equations, or parameter-fitting steps are described that reduce by construction to self-defined inputs or predictions. Central claims rest on experimental benchmark results rather than internal self-reference. Any overlap with the DPO method amounts to a standard citation of prior published work; it neither carries the load of nor forces the present empirical outcomes, which remain independently falsifiable via external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Automatically generated preference pairs from GPT-4V and image distortion will effectively reduce hallucinations when optimized with DPO.
Forward citations
Cited by 18 Pith papers
- Visual Preference Optimization with Rubric Rewards
  rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
- Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
  A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
  Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
  Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
- Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
  LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
- Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
  MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
- Online Self-Calibration Against Hallucination in Vision-Language Models
  OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...
- When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
  Hallucinations in LVLMs largely arise from textual priors in prompts, and can be reduced by fine-tuning with preference optimization on grounded vs. hallucinated response pairs.
- R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs
  R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.
- S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
  S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.
- HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models
  HTDC mitigates hallucinations in LVLMs by triggering calibration only at hesitation-prone decoding steps via contrasts with visual-nullification and semantic-nullification probes.
- Visual-RFT: Visual Reinforcement Fine-Tuning
  Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
  Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
- Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
  UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
- DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
  DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLH...
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
  InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.