{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:C6RUBASQYQQFQ4DIJKLTWVQRZM","short_pith_number":"pith:C6RUBASQ","schema_version":"1.0","canonical_sha256":"17a3408250c4205870684a973b5611cb1d7ab91dd3d6445d7076680005a5045b","source":{"kind":"arxiv","id":"2304.15010","version":1},"attestation_state":"computed","paper":{"title":"LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"LLaMA-Adapter V2 turns LLaMA into an open-ended visual instruction follower by adding only 14 million parameters.","cross_cats":["cs.AI","cs.CL","cs.LG","cs.MM"],"primary_cat":"cs.CV","authors_text":"Aojun Zhou, Conghui He, Hongsheng Li, Jiaming Han, Pan Lu, Peng Gao, Renrui Zhang, Shijie Geng, Wei Zhang, Xiangyu Yue, Yu Qiao, Ziyi Lin","submitted_at":"2023-04-28T17:59:25Z","abstract_excerpt":"How to efficiently transform large language models (LLMs) into instruction followers is recently a popular research direction, while training LLM for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribu"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2304.15010","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2023-04-28T17:59:25Z","cross_cats_sorted":["cs.AI","cs.CL","cs.LG","cs.MM"],"title_canon_sha256":"c3cc9569425a0dcc2b6287f8b1cc9dda45a8c9740ce9805e43e60a104fda39b6","abstract_canon_sha256":"44f870f9d7255528a47e73343d36a3f6ffeee3ff8d7d32938a2bb5d2da548e9a"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.998450Z","signature_b64":"GfYysLXCjHQSnxbcZ/y5rdanhVl5tHtgYWmKtD2SGiQk9rBN4LDHK3GaoIyeWEjXB30vQKJprzHL+rSQuxvoAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"17a3408250c4205870684a973b5611cb1d7ab91dd3d6445d7076680005a5045b","last_reissued_at":"2026-05-17T23:38:52.997953Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.997953Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"LLaMA-Adapter V2 turns LLaMA into an open-ended visual instruction follower by adding only 14 million parameters.","cross_cats":["cs.AI","cs.CL","cs.LG","cs.MM"],"primary_cat":"cs.CV","authors_text":"Aojun Zhou, Conghui He, Hongsheng Li, Jiaming Han, Pan Lu, Peng Gao, Renrui Zhang, Shijie Geng, Wei Zhang, Xiangyu Yue, Yu Qiao, Ziyi Lin","submitted_at":"2023-04-28T17:59:25Z","abstract_excerpt":"How to efficiently transform large language models (LLMs) into instruction followers is recently a popular research direction, while training LLM for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribu"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Compared to the original LLaMA-Adapter, our LLaMA-Adapter V2 can perform open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the early-fusion placement and the disjoint-parameter joint training will continue to prevent task interference and maintain generalization when the instruction data distribution shifts or when larger base models are used.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LLaMA-Adapter V2 turns LLaMA into an open-ended visual instruction follower by adding only 14 million parameters.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c9d348bc3257a2da07569ba9ae3472017e71220e651ebc57c1ec811a410510be"},"source":{"id":"2304.15010","kind":"arxiv","version":1},"verdict":{"id":"4ed12321-98b1-4490-beda-aebc663e4e58","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T08:36:50.707814Z","strongest_claim":"Compared to the original LLaMA-Adapter, our LLaMA-Adapter V2 can perform open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA.","one_line_summary":"LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the early-fusion placement and the disjoint-parameter joint training will continue to prevent task interference and maintain generalization when the instruction data distribution shifts or when larger base models are used.","pith_extraction_headline":"LLaMA-Adapter V2 turns LLaMA into an open-ended visual instruction follower by adding only 14 million parameters."},"references":{"count":79,"sample":[{"doi":"","year":null,"title":"https://sharegpt.com/","work_id":"42263fdc-d42c-4b2a-8562-4f2dc47ecf6c","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Flamingo: a visual language model for few-shot learning","work_id":"31e3af5c-9fec-43d9-b533-5bb70172dd15","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Bottom-up and top-down attention for image captioning and visual question answering","work_id":"40ff759e-80ef-4b6f-86f5-11ea4321f5c8","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1901,"title":"Lan- guage models are few-shot learners","work_id":"5b23bebc-10b7-4150-9a97-e3f37825079e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Conceptual 12m: Pushing web-scale image-text pre- training to recognize long-tail visual concepts","work_id":"c9a39a05-4f9a-45e3-92e2-f310468325af","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":79,"snapshot_sha256":"dd7cf17851379fca4c7b1bc3da5a359dac627288662817514af94575b0695681","internal_anchors":19},"formal_canon":{"evidence_count":2,"snapshot_sha256":"6b6edd9601be9795336ad80bbde7cdb959d9fbaefb4514280dc3923e30cfe686"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2304.15010","created_at":"2026-05-17T23:38:52.998032+00:00"},{"alias_kind":"arxiv_version","alias_value":"2304.15010v1","created_at":"2026-05-17T23:38:52.998032+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2304.15010","created_at":"2026-05-17T23:38:52.998032+00:00"},{"alias_kind":"pith_short_12","alias_value":"C6RUBASQYQQF","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"C6RUBASQYQQFQ4DI","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"C6RUBASQ","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":46,"internal_anchor_count":46,"sample":[{"citing_arxiv_id":"2308.12067","citing_title":"MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2408.12935","citing_title":"AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions","ref_index":230,"is_internal_anchor":true},{"citing_arxiv_id":"2501.05067","citing_title":"LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2503.13821","citing_title":"Stitch-a-Demo: Video Demonstrations from Multistep Descriptions","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2503.16549","citing_title":"MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21059","citing_title":"Multimodal LLMs under Pairwise Modalities","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17341","citing_title":"Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2308.08089","citing_title":"DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory","ref_index":254,"is_internal_anchor":true},{"citing_arxiv_id":"2307.06435","citing_title":"A Comprehensive Overview of Large Language Models","ref_index":156,"is_internal_anchor":true},{"citing_arxiv_id":"2503.17352","citing_title":"OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2509.19602","citing_title":"Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2311.04257","citing_title":"mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2510.27359","citing_title":"GD-FPS: Growth-Driven Feedforward Parameter Selection for Efficient Fine-Tuning","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2311.12871","citing_title":"An Embodied Generalist Agent in 3D World","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2309.15112","citing_title":"InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2305.17926","citing_title":"Large Language Models are not Fair Evaluators","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2401.16420","citing_title":"InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2512.02764","citing_title":"PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2403.14624","citing_title":"MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2406.16860","citing_title":"Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2512.10554","citing_title":"Grounding Everything in Tokens for Multimodal Large Language Models","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2407.12580","citing_title":"E5-V: Universal Embeddings with Multimodal Large Language Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2407.01284","citing_title":"We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2512.17435","citing_title":"ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2512.19219","citing_title":"Selective LoRA for Visual Tokens and Attention Heads","ref_index":7,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/C6RUBASQYQQFQ4DIJKLTWVQRZM","json":"https://pith.science/pith/C6RUBASQYQQFQ4DIJKLTWVQRZM.json","graph_json":"https://pith.science/api/pith-number/C6RUBASQYQQFQ4DIJKLTWVQRZM/graph.json","events_json":"https://pith.science/api/pith-number/C6RUBASQYQQFQ4DIJKLTWVQRZM/events.json","paper":"https://pith.science/paper/C6RUBASQ"},"agent_actions":{"view_html":"https://pith.science/pith/C6RUBASQYQQFQ4DIJKLTWVQRZM","download_json":"https://pith.science/pith/C6RUBASQYQQFQ4DIJKLTWVQRZM.json","view_paper":"https://pith.science/paper/C6RUBASQ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2304.15010&json=true","fetch_graph":"https://pith.science/api/pith-number/C6RUBASQYQQFQ4DIJKLTWVQRZM/graph.json","fetch_events":"https://pith.science/api/pith-number/C6RUBASQYQQFQ4DIJKLTWVQRZM/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/C6RUBASQYQQFQ4DIJKLTWVQRZM/action/timestamp_anchor","attest_storage":"https://pith.science/pith/C6RUBASQYQQFQ4DIJKLTWVQRZM/action/storage_attestation","attest_author":"https://pith.science/pith/C6RUBASQYQQFQ4DIJKLTWVQRZM/action/author_attestation","sign_citation":"https://pith.science/pith/C6RUBASQYQQFQ4DIJKLTWVQRZM/action/citation_signature","submit_replication":"https://pith.science/pith/C6RUBASQYQQFQ4DIJKLTWVQRZM/action/replication_record"}},"created_at":"2026-05-17T23:38:52.998032+00:00","updated_at":"2026-05-17T23:38:52.998032+00:00"}