Pith Number

pith:6YETJUN5

pith:2023:6YETJUN5ELIXCCUFJKB6DORLLZ
not attested · not anchored · not stored · refs resolved

Aligning Large Multimodal Models with Factually Augmented RLHF

Chuang Gan, Chunyuan Li, Haotian Liu, Kurt Keutzer, Liang-Yan Gui, Shengcao Cao, Sheng Shen, Trevor Darrell, Yikang Shen, Yiming Yang, Yu-Xiong Wang, Zhiqing Sun

Factually augmented RLHF aligns large multimodal models to cut hallucinations and reach 94 percent of text-only GPT-4's performance on LLaVA-Bench.

arxiv:2309.14525 v1 · 2023-09-25 · cs.CV · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{6YETJUN5ELIXCCUFJKB6DORLLZ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

As the first LMM trained with RLHF, our approach achieves remarkable improvement on the LLaVA-Bench dataset with the 94% performance level of the text-only GPT-4 (while previous best methods can only achieve the 87% level), and an improvement by 60% on MMHAL-BENCH over other baselines.

C2 · weakest assumption

That augmenting the reward model with image captions and ground-truth options reliably prevents reward hacking without introducing new biases or reducing generalization on open-ended questions.

C3 · one line summary

Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of text-only GPT-4's performance on LLaVA-Bench and a 60% improvement over other baselines on the new MMHAL-BENCH.

References

40 extracted · 40 resolved · 27 Pith anchors

[1] PaLM 2 Technical Report · arXiv:2305.10403
[2] OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models · arXiv:2308.01390
[3] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond · arXiv:2308.12966
[4] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback · arXiv:2204.05862
[5] Language models are few-shot learners

Formal links

2 machine-checked theorem links

Cited by

39 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:50.660329Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

f60934d1bd22d1710a854a83e1ba2b5e6737dc1ce190bd9a4bcce587715752dd

Aliases

arxiv: 2309.14525 · arxiv_version: 2309.14525v1 · doi: 10.48550/arxiv.2309.14525 · pith_short_12: 6YETJUN5ELIX · pith_short_16: 6YETJUN5ELIXCCUF · pith_short_8: 6YETJUN5
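The short aliases above appear to be prefixes of the full Pith Number. The number itself matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. Both relationships are inferred from the values on this page, not from any Pith specification, so the sketch below is an assumption. The function names `pith_number_from_hash` and `short_aliases` are hypothetical.

```python
import base64

# Canonical hash as printed on this record page.
CANONICAL_HASH = "f60934d1bd22d1710a854a83e1ba2b5e6737dc1ce190bd9a4bcce587715752dd"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Assumed derivation: Base32-encode the raw digest bytes, keep a prefix."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict[str, str]:
    """Assumed rule: each pith_short_N alias is the first N characters."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


number = pith_number_from_hash(CANONICAL_HASH)
print(number)
print(short_aliases(number))
```

For this record the sketch reproduces 6YETJUN5ELIXCCUFJKB6DORLLZ and the three listed short aliases. Agreement on one record does not confirm the rule in general.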
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/6YETJUN5ELIXCCUFJKB6DORLLZ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: f60934d1bd22d1710a854a83e1ba2b5e6737dc1ce190bd9a4bcce587715752dd
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "4a31ccd234c5dd37a44709d78049d1b291502e92a679cc0c02c73eb12bf35fdf",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-09-25T20:59:33Z",
    "title_canon_sha256": "b87d6710b9a70b3477c1ded6d7d8d8fa6c0ab18f6bcb0fae4771d23b5540c209"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2309.14525",
    "kind": "arxiv",
    "version": 1
  }
}
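The verification command in the Agent API section serializes the record as JSON with sorted keys and compact separators, then hashes the UTF-8 bytes with SHA-256. The same steps can be sketched offline in Python against the record printed above. It is an assumption that the JSON shown here is byte-for-byte what the API returns, so compare the digest with the canonical hash and do not assume a match. `canonical_bytes` and `canonical_sha256` are hypothetical helper names.

```python
import hashlib
import json

# Record copied from this page; assumed to match the API's canonical_record.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "4a31ccd234c5dd37a44709d78049d1b291502e92a679cc0c02c73eb12bf35fdf",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2023-09-25T20:59:33Z",
        "title_canon_sha256": "b87d6710b9a70b3477c1ded6d7d8d8fa6c0ab18f6bcb0fae4771d23b5540c209",
    },
    "schema_version": "1.0",
    "source": {"id": "2309.14525", "kind": "arxiv", "version": 1},
}

# Canonical hash as printed on this record page.
EXPECTED = "f60934d1bd22d1710a854a83e1ba2b5e6737dc1ce190bd9a4bcce587715752dd"


def canonical_bytes(record: dict) -> bytes:
    """Serialize as in the one-liner: sorted keys, no whitespace, raw UTF-8."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


digest = canonical_sha256(RECORD)
print(digest)
print("matches canonical hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the order in which a mirror stores the fields does not affect the digest. Any change to a value does.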