{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:WYVVK7UFB4B2WG5HOU57Q6FZLE","short_pith_number":"pith:WYVVK7UF","schema_version":"1.0","canonical_sha256":"b62b557e850f03ab1ba7753bf878b959185b9fecd2660f4f01f58fff3d4ad2e3","source":{"kind":"arxiv","id":"2306.14565","version":4},"attestation_state":"computed","paper":{"title":"Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Finetuning on a dataset with both positive and negative visual instructions reduces hallucinations in large multi-modal models.","cross_cats":["cs.AI","cs.CE","cs.CL","cs.MM"],"primary_cat":"cs.CV","authors_text":"Fuxiao Liu, Jianfeng Wang, Kevin Lin, Lijuan Wang, Linjie Li, Yaser Yacoob","submitted_at":"2023-06-26T10:26:33Z","abstract_excerpt":"Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we desig"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2306.14565","kind":"arxiv","version":4},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2023-06-26T10:26:33Z","cross_cats_sorted":["cs.AI","cs.CE","cs.CL","cs.MM"],"title_canon_sha256":"9b26ffcce0d957b6538a8e449e36f9f06417f87d058c00bbf90192962d56c9d3","abstract_canon_sha256":"f37358e2dda34cc97dd49e498d272bd67d03fb70082bf90c64217715d5ba88bd"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:22.305666Z","signature_b64":"jTvXmaSKlprHj/Md7iBxdj/zDW7WaY8yD97AljeL1Uvu1bJjFJ/LDHTigM2us4xtVtl7VhFww8IicZ0zq5vTAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"b62b557e850f03ab1ba7753bf878b959185b9fecd2660f4f01f58fff3d4ad2e3","last_reissued_at":"2026-05-17T23:39:22.305051Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:22.305051Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Finetuning on a dataset with both positive and negative visual instructions reduces hallucinations in large multi-modal models.","cross_cats":["cs.AI","cs.CE","cs.CL","cs.MM"],"primary_cat":"cs.CV","authors_text":"Fuxiao Liu, Jianfeng Wang, Kevin Lin, Lijuan Wang, Linjie Li, Yaser Yacoob","submitted_at":"2023-06-26T10:26:33Z","abstract_excerpt":"Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we desig"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That GPT-4-generated negative instructions at the three semantic levels accurately capture the hallucination behaviors that matter in real deployments and that the GAVIE GPT-4 judge produces scores that align with human judgment.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Finetuning on a dataset with both positive and negative visual instructions reduces hallucinations in large multi-modal models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"41f4efdb9d353bd8479751c9063691f736ecdc7adee6d75c302ca1626a984ecc"},"source":{"id":"2306.14565","kind":"arxiv","version":4},"verdict":{"id":"eb4d07a3-e44a-4685-af10-5ffa287c0944","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T17:28:22.683615Z","strongest_claim":"We successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods.","one_line_summary":"A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That GPT-4-generated negative instructions at the three semantic levels accurately capture the hallucination behaviors that matter in real deployments and that the GAVIE GPT-4 judge produces scores that align with human judgment.","pith_extraction_headline":"Finetuning on a dataset with both positive and negative visual instructions reduces hallucinations in large multi-modal models."},"references":{"count":34,"sample":[{"doi":"","year":2016,"title":"Spice: Semantic propositional image caption evaluation","work_id":"e943296a-40fe-41cb-b64a-72f7e0cbe393","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.5281/zenodo.7733589","year":null,"title":"Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al","work_id":"6f218053-cca5-4fde-92aa-730589931f0c","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1901,"title":"Language models are few-shot learners","work_id":"f93ff324-f230-4a46-97b9-6b103c35585d","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning","work_id":"fb62cd1b-3991-40be-a987-3cfa5772b5b5","ref_index":4,"cited_arxiv_id":"2310.09478","is_internal_anchor":true},{"doi":"","year":null,"title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning","work_id":"f3aac728-ded0-4e55-aa9e-4a1635d4313d","ref_index":5,"cited_arxiv_id":"2305.06500","is_internal_anchor":true}],"resolved_work":34,"snapshot_sha256":"c42e47344267994655ac77e8c18dc1902a38152bd4ef92047ae4c1111450f21d","internal_anchors":15},"formal_canon":{"evidence_count":2,"snapshot_sha256":"4d3785b462baeb498b3d15f0625e31fb795e86846395a492813f6b290151033c"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2306.14565","created_at":"2026-05-17T23:39:22.305140+00:00"},{"alias_kind":"arxiv_version","alias_value":"2306.14565v4","created_at":"2026-05-17T23:39:22.305140+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2306.14565","created_at":"2026-05-17T23:39:22.305140+00:00"},{"alias_kind":"pith_short_12","alias_value":"WYVVK7UFB4B2","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"WYVVK7UFB4B2WG5H","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"WYVVK7UF","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":38,"internal_anchor_count":38,"sample":[{"citing_arxiv_id":"2507.12455","citing_title":"Mitigating Object Hallucinations via Sentence-Level Early Intervention","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2308.12067","citing_title":"MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2412.04468","citing_title":"NVILA: Efficient Frontier Visual Language Models","ref_index":131,"is_internal_anchor":true},{"citing_arxiv_id":"2505.13255","citing_title":"Policy Contrastive Decoding for Robotic Foundation Models","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01733","citing_title":"GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2408.04840","citing_title":"mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2507.21584","citing_title":"TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2311.04257","citing_title":"mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2310.00754","citing_title":"Analyzing and Mitigating Object Hallucination in Large Vision-Language Models","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2309.15112","citing_title":"InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2402.11411","citing_title":"Aligning Modalities in Vision Large Language Models via Preference Fine-tuning","ref_index":164,"is_internal_anchor":true},{"citing_arxiv_id":"2310.14566","citing_title":"HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2512.06581","citing_title":"MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2512.20182","citing_title":"FaithLens: Detecting and Explaining Faithfulness Hallucination","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2401.15947","citing_title":"MoE-LLaVA: Mixture of Experts for Large Vision-Language Models","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2309.17421","citing_title":"The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)","ref_index":78,"is_internal_anchor":true},{"citing_arxiv_id":"2311.16502","citing_title":"MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2305.03726","citing_title":"Otter: A Multi-Modal Model with In-Context Instruction Tuning","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2312.14238","citing_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","ref_index":91,"is_internal_anchor":true},{"citing_arxiv_id":"2402.00253","citing_title":"A Survey on Hallucination in Large Vision-Language Models","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2311.05232","citing_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","ref_index":192,"is_internal_anchor":true},{"citing_arxiv_id":"2404.16821","citing_title":"How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26614","citing_title":"State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26419","citing_title":"Delineating Knowledge Boundaries for Honest Large Vision-Language Models","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08145","citing_title":"Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models","ref_index":21,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/WYVVK7UFB4B2WG5HOU57Q6FZLE","json":"https://pith.science/pith/WYVVK7UFB4B2WG5HOU57Q6FZLE.json","graph_json":"https://pith.science/api/pith-number/WYVVK7UFB4B2WG5HOU57Q6FZLE/graph.json","events_json":"https://pith.science/api/pith-number/WYVVK7UFB4B2WG5HOU57Q6FZLE/events.json","paper":"https://pith.science/paper/WYVVK7UF"},"agent_actions":{"view_html":"https://pith.science/pith/WYVVK7UFB4B2WG5HOU57Q6FZLE","download_json":"https://pith.science/pith/WYVVK7UFB4B2WG5HOU57Q6FZLE.json","view_paper":"https://pith.science/paper/WYVVK7UF","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2306.14565&json=true","fetch_graph":"https://pith.science/api/pith-number/WYVVK7UFB4B2WG5HOU57Q6FZLE/graph.json","fetch_events":"https://pith.science/api/pith-number/WYVVK7UFB4B2WG5HOU57Q6FZLE/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/WYVVK7UFB4B2WG5HOU57Q6FZLE/action/timestamp_anchor","attest_storage":"https://pith.science/pith/WYVVK7UFB4B2WG5HOU57Q6FZLE/action/storage_attestation","attest_author":"https://pith.science/pith/WYVVK7UFB4B2WG5HOU57Q6FZLE/action/author_attestation","sign_citation":"https://pith.science/pith/WYVVK7UFB4B2WG5HOU57Q6FZLE/action/citation_signature","submit_replication":"https://pith.science/pith/WYVVK7UFB4B2WG5HOU57Q6FZLE/action/replication_record"}},"created_at":"2026-05-17T23:39:22.305140+00:00","updated_at":"2026-05-17T23:39:22.305140+00:00"}