Aligning Large Multimodal Models with Factually Augmented RLHF
Pith reviewed 2026-05-15 17:51 UTC · model grok-4.3
The pith
Factually Augmented RLHF aligns large multimodal models to reduce hallucinations and reaches 94 percent of text-only GPT-4's performance level on LLaVA-Bench.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We adapt Reinforcement Learning from Human Feedback to vision-language alignment by asking human annotators to compare two responses and select the more hallucinated one, then train the model to maximize simulated human rewards. We introduce Factually Augmented RLHF, which supplies the reward model with additional factual information such as image captions and ground-truth multi-choice options to reduce reward hacking. Training data is further strengthened by mixing GPT-4-generated vision instructions with previously available human-written image-text pairs. Evaluated on the new MMHAL-BENCH benchmark that penalizes hallucinations, the resulting model, the first LMM trained with RLHF, reaches 94% of the performance level of text-only GPT-4 on LLaVA-Bench and improves on MMHAL-BENCH by 60% over other baselines.
What carries the argument
Factually Augmented RLHF, the algorithm that augments the reward model with image captions and ground-truth options to prevent reward hacking during multimodal alignment training.
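As a rough illustration of the augmentation step, the sketch below assembles a reward-model input with and without the extra facts. The function name `build_reward_input`, the field labels, the prompt layout, and the toy caption are all hypothetical; the source states only that image captions and ground-truth multi-choice options are supplied to the reward model.

```python
from typing import Optional, Sequence


def build_reward_input(
    question: str,
    response: str,
    caption: Optional[str] = None,
    options: Optional[Sequence[str]] = None,
) -> str:
    """Assemble reward-model input text, prepending facts when available.

    Hypothetical layout for illustration, not the paper's prompt template.
    """
    parts = []
    if caption:
        parts.append(f"Image caption: {caption}")
    if options:
        parts.append("Ground-truth options: " + "; ".join(options))
    parts.append(f"Question: {question}")
    parts.append(f"Response: {response}")
    return "\n".join(parts)


# Toy example: the same question/response pair, with and without a caption.
plain = build_reward_input("What color is the cap?", "It is red.")
augmented = build_reward_input(
    "What color is the cap?",
    "It is red.",
    caption="A fire hydrant with a yellow cap on a sidewalk.",
)
```

With the caption present, a reward model has a textual fact it can check the response against, which is the intuition behind using augmentation to curb reward hacking.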
If this is right
- The model reaches 94 percent of the performance level of text-only GPT-4 on the LLaVA-Bench dataset.
- It improves on the MMHAL-BENCH benchmark by 60 percent over other baselines.
- Mixing GPT-4-generated data with human-written image-text pairs improves general multimodal capabilities.
- The full code, model weights, and training data are released for public use.
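The two headline numbers above are ratios, and the arithmetic can be sketched as follows. This assumes the 94 percent figure is the model's score divided by the reference score, and the 60 percent figure is a relative gain over a baseline; the helper names and the toy inputs are hypothetical and are not taken from the paper.

```python
def relative_score(model_score: float, reference_score: float) -> float:
    """Model score expressed as a percentage of a reference score."""
    if reference_score == 0:
        raise ValueError("reference_score must be non-zero")
    return 100.0 * model_score / reference_score


def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (new - baseline) / baseline


# Toy inputs chosen only to make the arithmetic visible; not from the paper.
print(relative_score(47.0, 50.0))      # 94.0
print(relative_improvement(8.0, 5.0))  # 60.0
```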
Where Pith is reading between the lines
- The same factual-augmentation step could be tested on alignment methods other than RLHF to check whether reward hacking is reduced more broadly.
- If the method scales, it may improve reliability of multimodal systems in applications such as visual question answering where factual grounding is critical.
- Hybrid training that combines synthetic GPT-4 outputs with human image-text pairs may offer a practical route for future multimodal alignment work.
Load-bearing premise
That adding image captions and ground-truth options to the reward model reliably prevents reward hacking without introducing new biases or reducing generalization on open-ended questions.
What would settle it
A controlled experiment in which the model continues to hallucinate at baseline rates on questions whose answers are not directly supplied by the added captions or options, or where performance falls on open-ended benchmarks that lack the factual augmentations.
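One way such a comparison might be tabulated is sketched below: each evaluated response is tagged with a subset label and a hallucination judgment, and the rate is computed per subset. The subset names, the record format, and the toy records are assumptions made for illustration; no such breakdown is reported in the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def hallucination_rate_by_subset(
    records: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Return the fraction of hallucinated responses per subset label."""
    totals = defaultdict(int)
    hallucinated = defaultdict(int)
    for subset, is_hallucinated in records:
        totals[subset] += 1
        hallucinated[subset] += int(is_hallucinated)
    return {subset: hallucinated[subset] / totals[subset] for subset in totals}


# Placeholder records: (subset label, was the response judged hallucinated?)
toy_records = [
    ("facts_available", False),
    ("facts_available", False),
    ("facts_available", True),
    ("facts_unavailable", True),
    ("facts_unavailable", False),
]
rates = hallucination_rate_by_subset(toy_records)
```

A large gap between the two rates on real data would point toward the failure mode described above.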
Original abstract
Large Multimodal Models (LMM) are built across modalities and the misalignment between two modalities can result in "hallucination", generating textual outputs that are not grounded by the multimodal information in context. To address the multimodal misalignment issue, we adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment, where human annotators are asked to compare two responses and pinpoint the more hallucinated one, and the vision-language model is trained to maximize the simulated human rewards. We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information such as image captions and ground-truth multi-choice options, which alleviates the reward hacking phenomenon in RLHF and further improves the performance. We also enhance the GPT-4-generated training data (for vision instruction tuning) with previously available human-written image-text pairs to improve the general capabilities of our model. To evaluate the proposed approach in real-world scenarios, we develop a new evaluation benchmark MMHAL-BENCH with a special focus on penalizing hallucinations. As the first LMM trained with RLHF, our approach achieves remarkable improvement on the LLaVA-Bench dataset with the 94% performance level of the text-only GPT-4 (while previous best methods can only achieve the 87% level), and an improvement by 60% on MMHAL-BENCH over other baselines. We opensource our code, model, data at https://llava-rlhf.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Factually Augmented RLHF (FA-RLHF) to align large multimodal models (LMMs) and mitigate hallucinations arising from modality misalignment. Human annotators compare pairs of responses to identify the more hallucinated one, and these comparisons train a reward model that is augmented with factual information such as image captions and ground-truth multi-choice options. The approach also enhances GPT-4-generated vision instruction tuning data with human-written image-text pairs. A new benchmark, MMHAL-BENCH, is proposed with a focus on penalizing hallucinations. The authors report that their model reaches 94% of text-only GPT-4 performance on LLaVA-Bench (up from 87% for the previous best methods) and achieves a 60% improvement on MMHAL-BENCH over other baselines. Code, model, and data are open-sourced.
Significance. If the empirical gains hold under scrutiny, this work represents a significant advance as the first application of RLHF to LMM alignment. The introduction of factual augmentation to address reward hacking is a novel contribution, and the new MMHAL-BENCH benchmark could become a standard for evaluating hallucination in multimodal models. Open-sourcing facilitates reproducibility and further research in multimodal alignment.
major comments (2)
- [Abstract and §4 (Experiments)] The headline performance claims (94% of GPT-4 on LLaVA-Bench and 60% improvement on MMHAL-BENCH) are presented without statistical details such as standard deviations, number of evaluation runs, or ablation tables isolating the effect of factual augmentation versus data enhancement. This makes it difficult to rule out that gains stem from hyperparameter tuning or data selection rather than the proposed method.
- [§3 (Method, Factually Augmented RLHF)] The assumption that augmenting the reward model with image captions and ground-truth options reliably prevents reward hacking without introducing biases is central but under-supported. Since these factual anchors are unavailable for open-ended prompts at inference time, the reward model may overfit to closed-ended cues, potentially limiting generalization. Experiments comparing performance on purely open-ended vs. multiple-choice questions would strengthen this claim.
minor comments (2)
- [§2 (Related Work)] The discussion of prior RLHF applications could include more recent multimodal alignment works for completeness.
- [Figure 1] The diagram illustrating the FA-RLHF pipeline would benefit from clearer labeling of the factual augmentation step.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional statistical reporting, ablation studies, and generalization experiments where feasible.
Point-by-point responses
- Referee: [Abstract and §4 (Experiments)] The headline performance claims (94% of GPT-4 on LLaVA-Bench and 60% improvement on MMHAL-BENCH) are presented without statistical details such as standard deviations, number of evaluation runs, or ablation tables isolating the effect of factual augmentation versus data enhancement. This makes it difficult to rule out that gains stem from hyperparameter tuning or data selection rather than the proposed method.
  Authors: We agree that additional statistical rigor would strengthen the presentation. In the revised manuscript, we report standard deviations computed over 5 independent evaluation runs for the key metrics on both LLaVA-Bench and MMHAL-BENCH. We have also added a dedicated ablation table in §4.3 that isolates the contribution of factual augmentation in the reward model from the GPT-4 data enhancement step. These results show that factual augmentation accounts for a substantial portion of the observed gains beyond data selection effects alone.
  Revision: yes
- Referee: [§3 (Method, Factually Augmented RLHF)] The assumption that augmenting the reward model with image captions and ground-truth options reliably prevents reward hacking without introducing biases is central but under-supported. Since these factual anchors are unavailable for open-ended prompts at inference time, the reward model may overfit to closed-ended cues, potentially limiting generalization. Experiments comparing performance on purely open-ended vs. multiple-choice questions would strengthen this claim.
  Authors: We clarify that factual anchors (captions and ground-truth options) are used exclusively during reward model training to provide stronger supervision and reduce reward hacking; they are not available to the reward model or policy at inference. To address generalization concerns, the revised §4 now includes a breakdown of results on purely open-ended question subsets versus multiple-choice subsets of MMHAL-BENCH. Performance gains from FA-RLHF remain consistent in the open-ended setting, indicating that the learned reward model does not rely on closed-ended cues at test time.
  Revision: yes
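The per-metric reporting mentioned in the first response (a mean and standard deviation over repeated evaluation runs) can be sketched with the standard library. The function name `summarize_runs` and the toy scores are placeholders and are not results from the paper; the sample (n-1) standard deviation is used here as one common choice, which the source does not specify.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) across evaluation runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder scores for five runs; illustrative only.
toy_scores = [2.0, 4.0, 4.0, 4.0, 6.0]
mean, std = summarize_runs(toy_scores)
```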
Circularity Check
No circularity in empirical benchmark results
full rationale
The paper reports empirical performance gains on LLaVA-Bench (94% of GPT-4) and MMHAL-BENCH (60% improvement) from Factually Augmented RLHF. These are direct benchmark scores on evaluation sets, with no equations, fitted parameters, or derivations presented that reduce the claimed improvements to quantities defined or fitted on the same data. The factual augmentation is described as an input to the reward model, but the results remain independent external measurements. No self-citation chains or uniqueness theorems are invoked to support the central claims.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: human preference judgments over paired responses can be modeled by a scalar reward function.
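A common way to make this assumption concrete is the Bradley-Terry model, in which the probability that one response is preferred is the sigmoid of the difference between two scalar rewards. The sketch below computes the corresponding pairwise loss for a single comparison. It shows the standard formulation, not a loss confirmed by the source; in this setting the "preferred" response would be the one annotators judged less hallucinated, and the function name is hypothetical.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference under Bradley-Terry.

    P(preferred wins) = sigmoid(reward_preferred - reward_rejected)
    loss = -log(sigmoid(diff)) = log(1 + exp(-diff)), evaluated stably.
    """
    diff = reward_preferred - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, rewrite to avoid overflow in exp(-diff).
    return -diff + math.log1p(math.exp(diff))
```

The loss equals log 2 when the two rewards are equal and shrinks toward zero as the preferred response's reward pulls ahead, so minimizing it over many comparisons pushes the scalar reward to agree with the annotators.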
Forward citations
Cited by 19 Pith papers
- CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs
  CCTVBench exposes a large gap between standard QA accuracy and contrastive consistency in traffic video reasoning for multimodal LLMs and introduces C-TCD to narrow that gap.
- OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems
  OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.
- Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
  Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...
- Unified Reward Model for Multimodal Understanding and Generation
  UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
- Dual-Pathway Circuits of Object Hallucination in Vision-Language Models
  Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
- CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
  CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
- MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
  MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
- When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
  Hallucinations in LVLMs largely arise from textual priors in prompts, and can be reduced by fine-tuning with preference optimization on grounded vs. hallucinated response pairs.
- S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
  S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.
- Mitigating Multimodal Hallucination via Phase-wise Self-reward
  PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
- Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
  DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Visual-RFT: Visual Reinforcement Fine-Tuning
  Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- Generating Place-Based Compromises Between Two Points of View
  Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
- Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
  Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
- MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
  An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- A Survey on Hallucination in Large Vision-Language Models
  This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.