arxiv: 2309.14525 · v1 · submitted 2023-09-25 · 💻 cs.CV · cs.CL

Aligning Large Multimodal Models with Factually Augmented RLHF

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, Trevor Darrell

Pith reviewed 2026-05-15 17:51 UTC · model grok-4.3

keywords: large multimodal models · RLHF · hallucination · vision-language alignment · reward hacking · MMHAL-BENCH · LLaVA-Bench · factually augmented RLHF

The pith

Factually Augmented RLHF aligns large multimodal models to cut hallucinations and reaches the 94 percent performance level of text-only GPT-4 on LLaVA-Bench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large multimodal models often produce textual outputs not grounded in their visual inputs because of misalignment between the two modalities. The paper adapts RLHF to this setting by collecting human comparisons that identify the more hallucinated response in each pair, then training the model to maximize the resulting reward signal. To stop the model from exploiting gaps in that signal, the reward model is augmented with image captions and ground-truth options. The same approach also mixes GPT-4 synthetic data with existing human image-text pairs. On a new hallucination-focused benchmark and on LLaVA-Bench, the resulting model shows clear gains over prior methods.

Core claim

We adapt Reinforcement Learning from Human Feedback to vision-language alignment by asking human annotators to compare two responses and select the more hallucinated one, then train the model to maximize simulated human rewards. We introduce Factually Augmented RLHF, which supplies the reward model with additional factual information such as image captions and ground-truth multi-choice options to reduce reward hacking. Training data is further strengthened by mixing GPT-4-generated vision instructions with previously available human-written image-text pairs. Evaluated on the new MMHAL-BENCH benchmark that penalizes hallucinations, the resulting model, the first LMM trained with RLHF, reaches a 60% improvement over other baselines; on LLaVA-Bench it reaches the 94% performance level of text-only GPT-4.
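The comparison step described above can be sketched in code. This is a minimal illustration, assuming the pairwise logistic (Bradley-Terry) loss that is common in RLHF reward modeling; the text here does not state which loss the authors use, and the function names `preference_loss` and `loss_from_annotation` are hypothetical.

```python
import math


def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the preferred response wins.

    Assumes a Bradley-Terry model: P(preferred wins) = sigmoid(margin).
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)), split by sign so exp() never overflows.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def loss_from_annotation(reward_a: float, reward_b: float,
                         more_hallucinated: str) -> float:
    """Turn a 'which response is more hallucinated' label into a loss.

    The response NOT flagged by the annotator is treated as preferred.
    """
    if more_hallucinated == "a":
        return preference_loss(reward_b, reward_a)
    if more_hallucinated == "b":
        return preference_loss(reward_a, reward_b)
    raise ValueError("more_hallucinated must be 'a' or 'b'")


if __name__ == "__main__":
    # Reward model scores response A higher; annotator flagged B: small loss.
    print(loss_from_annotation(2.0, -1.0, more_hallucinated="b"))
    # Same scores, but annotator flagged A: large loss.
    print(loss_from_annotation(2.0, -1.0, more_hallucinated="a"))
```

The loss is small when the reward model already scores the less hallucinated response higher, and large when it ranks the pair the wrong way round.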

What carries the argument

Factually Augmented RLHF, the algorithm that augments the reward model with image captions and ground-truth options to prevent reward hacking during multimodal alignment training.
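One way to picture the augmentation is as extra text placed in the reward model's input. The sketch below is an assumption-laden illustration, not the authors' template: the function name `build_reward_input`, the field labels, and the ordering of the parts are all hypothetical.

```python
from typing import Optional, Sequence


def build_reward_input(question: str,
                       response: str,
                       caption: Optional[str] = None,
                       options: Optional[Sequence[str]] = None,
                       answer_index: Optional[int] = None) -> str:
    """Assemble the text a reward model would score (hypothetical format).

    With no caption or options this is a plain RLHF reward input;
    with them, the reward model can check the response against facts.
    """
    parts = []
    if caption:
        parts.append(f"Image caption: {caption}")
    if options:
        listed = "; ".join(f"({i}) {opt}" for i, opt in enumerate(options))
        parts.append(f"Options: {listed}")
        if answer_index is not None:
            parts.append(f"Ground-truth option: ({answer_index}) {options[answer_index]}")
    parts.append(f"Question: {question}")
    parts.append(f"Response: {response}")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_reward_input(
        question="What color is the hydrant cap?",
        response="The cap is red.",
        caption="A yellow-capped fire hydrant on a sidewalk.",
    ))
```

With the caption present, a response that contradicts it is easier for the reward model to penalize, which is the intuition behind reducing reward hacking.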

If this is right

The model reaches 94 percent of the performance level of text-only GPT-4 on the LLaVA-Bench dataset.
It delivers a 60 percent improvement on the MMHAL-BENCH benchmark relative to prior baselines.
Mixing GPT-4-generated data with human-written image-text pairs improves general multimodal capabilities.
The full code, model weights, and training data are released for public use.
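The two headline figures are ratios, and it helps to be explicit about which ratio each one is. The sketch below assumes "94 percent of the performance level" means model score divided by reference score, and "60 percent improvement" means relative gain over a baseline score; the exact definitions used by the benchmarks are not given in this summary, and the numbers in the usage lines are placeholders, not reported results.

```python
def relative_score(model_score: float, reference_score: float) -> float:
    """Model score as a percentage of a reference system's score."""
    if reference_score == 0:
        raise ValueError("reference_score must be non-zero")
    return 100.0 * model_score / reference_score


def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Percentage gain of new_score over baseline_score."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return 100.0 * (new_score - baseline_score) / baseline_score


if __name__ == "__main__":
    # Placeholder values chosen only to illustrate the arithmetic.
    print(relative_score(47, 50))
    print(relative_improvement(8, 5))
```

Note that the two numbers are not comparable: one is measured against a reference system, the other against a baseline model.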

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same factual-augmentation step could be tested on alignment methods other than RLHF to check whether reward hacking is reduced more broadly.
If the method scales, it may improve reliability of multimodal systems in applications such as visual question answering where factual grounding is critical.
Hybrid training that combines synthetic GPT-4 outputs with human image-text pairs may offer a practical route for future multimodal alignment work.

Load-bearing premise

That adding image captions and ground-truth options to the reward model reliably prevents reward hacking without introducing new biases or reducing generalization on open-ended questions.

What would settle it

A controlled experiment in which the model continues to hallucinate at baseline rates on questions whose answers are not directly supplied by the added captions or options, or where performance falls on open-ended benchmarks that lack the factual augmentations.

Original abstract

Large Multimodal Models (LMM) are built across modalities and the misalignment between two modalities can result in "hallucination", generating textual outputs that are not grounded by the multimodal information in context. To address the multimodal misalignment issue, we adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment, where human annotators are asked to compare two responses and pinpoint the more hallucinated one, and the vision-language model is trained to maximize the simulated human rewards. We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information such as image captions and ground-truth multi-choice options, which alleviates the reward hacking phenomenon in RLHF and further improves the performance. We also enhance the GPT-4-generated training data (for vision instruction tuning) with previously available human-written image-text pairs to improve the general capabilities of our model. To evaluate the proposed approach in real-world scenarios, we develop a new evaluation benchmark MMHAL-BENCH with a special focus on penalizing hallucinations. As the first LMM trained with RLHF, our approach achieves remarkable improvement on the LLaVA-Bench dataset with the 94% performance level of the text-only GPT-4 (while previous best methods can only achieve the 87% level), and an improvement by 60% on MMHAL-BENCH over other baselines. We opensource our code, model, data at https://llava-rlhf.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Factually Augmented RLHF gives measurable benchmark gains on LMM hallucinations by adding captions and options to the reward model, but the generalization case still needs checking.

Two things to know about this paper: Factually Augmented RLHF improves performance on hallucination benchmarks for large multimodal models by adding factual information to the reward model, and the authors release a new benchmark, MMHAL-BENCH, for this purpose. What is actually new is augmenting the reward model with image captions and ground-truth options to prevent reward hacking during RLHF for vision-language alignment. They also enhance the training data with human-written image-text pairs. The results show clear improvements over baselines: the 94% performance level of text-only GPT-4 on LLaVA-Bench and a 60% improvement on MMHAL-BENCH. Releasing the code, model, and data is a good step.

The paper does well to focus on a real deployment problem and to provide an empirical method for addressing multimodal misalignment. Soft spots include the absence of detailed ablations, statistical analysis, or error breakdowns, which makes it hard to fully trust the source of the gains. The weakest assumption is that factual augmentation reliably avoids new biases and works for open-ended questions; the stress-test concern about favoring closed-ended data has merit if the reward model overfits to the provided anchors. The full paper would need generalization tests to confirm this.

This paper is for researchers in computer vision and NLP working on large multimodal models and alignment techniques, and it offers value to those looking for ways to make these models more reliable in practice. It deserves a serious referee because the idea is concrete and the results are promising on important metrics. Recommendation: send to peer review for further evaluation.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Factually Augmented RLHF (FA-RLHF) to align large multimodal models (LMMs) and mitigate hallucinations arising from modality misalignment. Human annotators compare pairs of responses to identify the more hallucinated one, training a reward model that is augmented with factual information such as image captions and ground-truth multi-choice options. The approach also enhances GPT-4-generated vision instruction tuning data with human-written image-text pairs. A new benchmark MMHAL-BENCH is proposed focusing on hallucination penalties. The authors report that their model reaches 94% of text-only GPT-4 performance on LLaVA-Bench (improving from previous 87%) and achieves a 60% improvement on MMHAL-BENCH over baselines. Code, model, and data are open-sourced.

Significance. If the empirical gains hold under scrutiny, this work represents a significant advance as the first application of RLHF to LMM alignment. The introduction of factual augmentation to address reward hacking is a novel contribution, and the new MMHAL-BENCH benchmark could become a standard for evaluating hallucination in multimodal models. Open-sourcing facilitates reproducibility and further research in multimodal alignment.

major comments (2)

[Abstract and §4 (Experiments)] The headline performance claims (94% of GPT-4 on LLaVA-Bench and 60% improvement on MMHAL-BENCH) are presented without statistical details such as standard deviations, number of evaluation runs, or ablation tables isolating the effect of factual augmentation versus data enhancement. This makes it difficult to rule out that gains stem from hyperparameter tuning or data selection rather than the proposed method.
[§3 (Method, Factually Augmented RLHF)] The assumption that augmenting the reward model with image captions and ground-truth options reliably prevents reward hacking without introducing biases is central but under-supported. Since these factual anchors are unavailable for open-ended prompts at inference time, the reward model may overfit to closed-ended cues, potentially limiting generalization. Experiments comparing performance on purely open-ended vs. multiple-choice questions would strengthen this claim.

minor comments (2)

[§2 (Related Work)] The discussion of prior RLHF applications could include more recent multimodal alignment works for completeness.
[Figure 1] The diagram illustrating the FA-RLHF pipeline would benefit from clearer labeling of the factual augmentation step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional statistical reporting, ablation studies, and generalization experiments where feasible.

Referee: [Abstract and §4 (Experiments)] The headline performance claims (94% of GPT-4 on LLaVA-Bench and 60% improvement on MMHAL-BENCH) are presented without statistical details such as standard deviations, number of evaluation runs, or ablation tables isolating the effect of factual augmentation versus data enhancement. This makes it difficult to rule out that gains stem from hyperparameter tuning or data selection rather than the proposed method.

Authors: We agree that additional statistical rigor would strengthen the presentation. In the revised manuscript, we report standard deviations computed over 5 independent evaluation runs for the key metrics on both LLaVA-Bench and MMHAL-BENCH. We have also added a dedicated ablation table in §4.3 that isolates the contribution of factual augmentation in the reward model from the GPT-4 data enhancement step. These results show that factual augmentation accounts for a substantial portion of the observed gains beyond data selection effects alone.
revision: yes

Referee: [§3 (Method, Factually Augmented RLHF)] The assumption that augmenting the reward model with image captions and ground-truth options reliably prevents reward hacking without introducing biases is central but under-supported. Since these factual anchors are unavailable for open-ended prompts at inference time, the reward model may overfit to closed-ended cues, potentially limiting generalization. Experiments comparing performance on purely open-ended vs. multiple-choice questions would strengthen this claim.

Authors: We clarify that factual anchors (captions and ground-truth options) are used exclusively during reward model training to provide stronger supervision and reduce reward hacking; they are not available to the reward model or policy at inference. To address generalization concerns, the revised §4 now includes a breakdown of results on purely open-ended question subsets versus multiple-choice subsets of MMHAL-BENCH. Performance gains from FA-RLHF remain consistent in the open-ended setting, indicating that the learned reward model does not rely on closed-ended cues at test time.
revision: yes
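The reporting that the referee asks for (spread across evaluation runs, broken down by question type) is simple to compute. The sketch below is one hedged way to do it with Python's standard library; the record layout, the subset names `open_ended` and `multiple_choice`, and every score in the usage example are hypothetical placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, float]  # (subset name, run id, score) -- assumed layout


def summarize_runs(records: Iterable[Record]) -> Dict[str, Tuple[float, float, int]]:
    """Return {subset: (mean, sample std, number of runs)}."""
    by_subset = defaultdict(list)
    for subset, _run_id, score in records:
        by_subset[subset].append(score)
    summary = {}
    for subset, scores in by_subset.items():
        # Sample standard deviation needs at least two runs.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[subset] = (mean(scores), spread, len(scores))
    return summary


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    demo = [("open_ended", i, s) for i, s in enumerate([2.0, 2.5, 3.0])]
    demo += [("multiple_choice", i, s) for i, s in enumerate([3.0, 3.0, 3.0])]
    for name, (avg, sd, n) in summarize_runs(demo).items():
        print(f"{name}: {avg:.2f} +/- {sd:.2f} over {n} runs")
```

Reporting the two subsets side by side is what would show whether gains persist when no caption or option list anchors the answer.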

Circularity Check

0 steps flagged

No circularity in empirical benchmark results

The paper reports empirical performance gains on LLaVA-Bench (94% of GPT-4) and MMHAL-BENCH (60% improvement) from Factually Augmented RLHF. These are direct benchmark scores on evaluation sets, with no equations, fitted parameters, or derivations presented that reduce the claimed improvements to quantities defined or fitted on the same data. The factual augmentation is described as an input to the reward model, but the results remain independent external measurements. No self-citation chains or uniqueness theorems are invoked to support the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach inherits standard RLHF assumptions and adds factual anchors; no new physical entities or ungrounded constants are introduced.

axioms (1)

domain assumption: Human preference judgments over paired responses can be modeled by a scalar reward function.
Core premise of RLHF, transferred to the multimodal setting.
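Stated formally, the axiom is usually written as a Bradley-Terry preference model. The notation below is an assumption introduced for illustration ($x$ for the image and prompt, $y_a$ and $y_b$ for the two responses, $r_\theta$ for the scalar reward); this summary does not give the paper's own symbols.

```latex
P(y_a \succ y_b \mid x)
  = \sigma\bigl(r_\theta(x, y_a) - r_\theta(x, y_b)\bigr),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

Under this reading, factual augmentation would amount to conditioning the reward on extra facts $f$ (captions, ground-truth options), that is, scoring $r_\theta(x, f, y)$ in place of $r_\theta(x, y)$.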

pith-pipeline@v0.9.0 · 5612 in / 1182 out tokens · 43253 ms · 2026-05-15T17:51:53.974531+00:00 · methodology

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs
cs.CV · 2026-04 · unverdicted · novelty 8.0
CCTVBench exposes a large gap between standard QA accuracy and contrastive consistency in traffic video reasoning for multimodal LLMs and introduces C-TCD to narrow that gap.

OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems
cs.DB · 2026-05 · conditional · novelty 7.0
OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
cs.CV · 2026-04 · conditional · novelty 7.0
Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...

Unified Reward Model for Multimodal Understanding and Generation
cs.CV · 2025-03 · unverdicted · novelty 7.0
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

Dual-Pathway Circuits of Object Hallucination in Vision-Language Models
cs.CV · 2026-05 · unverdicted · novelty 6.0
Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.

CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
cs.CV · 2026-05 · unverdicted · novelty 6.0
CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
cs.CL · 2026-04 · unverdicted · novelty 6.0
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
cs.CV · 2026-04 · unverdicted · novelty 6.0
Hallucinations in LVLMs largely arise from textual priors in prompts, and can be reduced by fine-tuning with preference optimization on grounded vs. hallucinated response pairs.

S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
cs.CV · 2026-04 · unverdicted · novelty 6.0
S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

Mitigating Multimodal Hallucination via Phase-wise Self-reward
cs.CV · 2026-04 · unverdicted · novelty 6.0
PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
cs.LG · 2026-04 · unverdicted · novelty 6.0
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV · 2025-04 · conditional · novelty 6.0
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

Visual-RFT: Visual Reinforcement Fine-Tuning
cs.CV · 2025-03 · conditional · novelty 6.0
Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV · 2024-12 · unverdicted · novelty 6.0
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

Generating Place-Based Compromises Between Two Points of View
cs.CL · 2026-04 · unverdicted · novelty 5.0
Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
cs.AI · 2026-04 · unverdicted · novelty 5.0
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
cs.LG · 2025-09 · unverdicted · novelty 5.0
An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.

Hallucination of Multimodal Large Language Models: A Survey
cs.CV · 2024-04 · accept · novelty 5.0
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

A Survey on Hallucination in Large Vision-Language Models
cs.CV · 2024-02 · unverdicted · novelty 3.0
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
