{"work":{"id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","openalex_id":null,"doi":null,"arxiv_id":"2304.10592","raw_key":null,"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","authors":null,"authors_text":"Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny","year":2023,"venue":"cs.CV","abstract":"The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can possess numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from hand-drawn drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, teaching users how to cook based on food photos, and so on. In our experiment, we found that the model trained on short image caption pairs could produce unnatural language outputs (e.g., repetition and fragmentation). To address this problem, we curate a detailed image description dataset in the second stage to finetune the model, which consequently improves the model's generation reliability and overall usability. 
Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/.","external_url":"https://arxiv.org/abs/2304.10592","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T09:05:35.807435+00:00","pith_arxiv_id":"2304.10592","created_at":"2026-05-09T05:50:25.657121+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","render_title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models"},"hub":{"state":{"work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":196,"external_cited_by_count":null,"distinct_field_count":12,"first_pith_cited_at":"2023-03-28T17:59:12+00:00","last_pith_cited_at":"2026-05-22T15:46:10+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-06T02:49:54.627653+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":46},{"context_role":"baseline","n":3},{"context_role":"method","n":3},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"background","n":45},{"context_polarity":"baseline","n":3},{"context_polarity":"use_method","n":3},{"context_polarity":"support","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","claims":[{"claim_text":"The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. 
These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"1 Multimodal Large Language Models Multimodal Large Language Models (MLLMs) serve as the cognitive backbone for object-centric tasks, evolving from complex modular alignment to streamlined end-to-end integration. Early foundational works explored diverse bridging mechanisms: Flamingo [3] pioneered Gated Cross-Attention, while BLIP-2 [81], InstructBLIP [35], and MiniGPT-4 [242] introduced the Q-Former. More recently, the field has largely converged on coupling vision encoders with LLMs via lightweight","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Flamingo [2], by connecting pretrained language models and vision encoders akin to CLIP, initially showcases promising few-shot multimodal understanding capabilities. The increasing availability and progress of LLMs have popularized the fusion of pretrained vision encoders with LLMs, forming a common approach to train extensive vision-language models (VLMs). The BLIP series [49, 48], MiniGPT4 [109], and LLaVA [57] exhibit encouraging results by linking vision encoders with LLMs and training on ","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"detection [147], privacy protection [148], [149], federated learning [150], and social biases mitigation [151]. In this section, we pay more focus on three representative downstream tasks: visual instruction following, continual learning, and context window extension. 
1) Visual Instruct Following: Several studies, including VL-BART [152], MiniGPT-4 [153], and LLaVA [154], have successfully extended the capabilities of LLMs, initially designed for pure text, to comprehend and generate response","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793-16803, 2022. [62] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. [63] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language unde","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023. 10 [187] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 1, 3, 11 [188] Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, et al. Ghost ","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023. [59] Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, and Min Zhang. Uni-DPO: A unified paradigm for dynamic preference optimization of LLMs. arXiv preprint arXiv:2506.10054, 2025. [60] OpenAI. GPT-4V(ision) system card, 2023. [61] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 
MiniGPT-4: Enhancing vision-languag","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (11 contexts).","role_counts":[{"n":11,"context_role":"background"}]},"error":null,"updated_at":"2026-05-15T14:57:51.305258+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"ebed79d3-5e72-4235-8a6f-2093314337cb","orcid":null,"display_name":"Deyao Zhu"},{"id":"cd014512-c569-4d7d-8bc5-7636747577d7","orcid":null,"display_name":"Jun Chen"},{"id":"50762c33-5933-47e0-8ad0-f347cc2ea7a2","orcid":null,"display_name":"Xiaoqian Shen"},{"id":"ac38562a-352e-451a-bcaf-39fba254be2c","orcid":null,"display_name":"Xiang Li"},{"id":"75d658b4-b853-4900-afa9-22da0df16f87","orcid":null,"display_name":"Mohamed Elhoseiny"}]},"error":null,"updated_at":"2026-05-15T14:57:51.674757+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T06:47:32.452545+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":30},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":26},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":21},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":20},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any 
Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":20},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":20},{"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","shared_citers":20},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":17},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":16},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":15},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":15},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":15},{"title":"mplug-owl: Modularization empowers large language models with multimodality","work_id":"74a7deb6-48be-4132-9d35-882cc5870ebd","shared_citers":15},{"title":"Evaluating Object Hallucination in Large Vision-Language Models","work_id":"66d8ac3e-c134-4995-b528-550afa17586f","shared_citers":14},{"title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning","work_id":"f3aac728-ded0-4e55-aa9e-4a1635d4313d","shared_citers":13},{"title":"Kosmos-2: Grounding Multimodal Large Language Models to the World","work_id":"46e7f9e9-24c6-49af-b7d5-96159fa6f443","shared_citers":13},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":12},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":12},{"title":"Aligning large 
multi-modal model with robust instruction tuning","work_id":"ba8e8164-e47f-42d6-83ad-696cb57ee79a","shared_citers":11},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":11},{"title":"MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts","work_id":"e22c3789-9e71-4242-b6ea-3e60e06e2b66","shared_citers":11},{"title":"Otter: A multi-modal model with in-context instruction tuning","work_id":"33cb3a7a-6091-48db-a246-802bbb055f43","shared_citers":11},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":11},{"title":"Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic","work_id":"44525076-312a-4259-b79c-134cd7eeb297","shared_citers":11}],"time_series":[{"n":13,"year":2023},{"n":13,"year":2024},{"n":1,"year":2025},{"n":59,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T06:47:40.743468+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T06:47:28.390329+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","claims":[{"claim_text":"The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. 
However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"1 Multimodal Large Language Models Multimodal Large Language Models (MLLMs) serve as the cognitive backbone for object-centric tasks, evolving from complex modular alignment to streamlined end-to-end integration. Early foundational works explored diverse bridging mechanisms: Flamingo [3] pioneered Gated Cross-Attention, while BLIP-2 [81], InstructBLIP [35], and MiniGPT-4 [242] introduced the Q-Former. More recently, the field has largely converged on coupling vision encoders with LLMs via lightweight","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Flamingo [2], by connecting pretrained language models and vision encoders akin to CLIP, initially showcases promising few-shot multimodal understanding capabilities. The increasing availability and progress of LLMs have popularized the fusion of pretrained vision encoders with LLMs, forming a common approach to train extensive vision-language models (VLMs). The BLIP series [49, 48], MiniGPT4 [109], and LLaVA [57] exhibit encouraging results by linking vision encoders with LLMs and training on ","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"detection [147], privacy protection [148], [149], federated learning [150], and social biases mitigation [151]. In this section, we pay more focus on three representative downstream tasks: visual instruction following, continual learning, and context window extension. 
1) Visual Instruct Following: Several studies, including VL-BART [152], MiniGPT-4 [153], and LLaVA [154], have successfully extended the capabilities of LLMs, initially designed for pure text, to comprehend and generate response","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793-16803, 2022. [62] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. [63] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language unde","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023. 10 [187] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 1, 3, 11 [188] Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, et al. Ghost ","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023. [59] Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, and Min Zhang. Uni-DPO: A unified paradigm for dynamic preference optimization of LLMs. arXiv preprint arXiv:2506.10054, 2025. [60] OpenAI. GPT-4V(ision) system card, 2023. [61] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 
MiniGPT-4: Enhancing vision-languag","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (11 contexts).","role_counts":[{"n":11,"context_role":"background"}]},"error":null,"updated_at":"2026-05-15T14:57:51.678569+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","claims":[{"claim_text":"The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T06:47:40.748625+00:00"}},"summary":{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","claims":[{"claim_text":"The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. 
However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":30},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":26},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":21},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":20},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":20},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":20},{"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","shared_citers":20},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":17},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":16},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":15},{"title":"LLaVA-OneVision: Easy Visual 
Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":15},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":15},{"title":"mplug-owl: Modularization empowers large language models with multimodality","work_id":"74a7deb6-48be-4132-9d35-882cc5870ebd","shared_citers":15},{"title":"Evaluating Object Hallucination in Large Vision-Language Models","work_id":"66d8ac3e-c134-4995-b528-550afa17586f","shared_citers":14},{"title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning","work_id":"f3aac728-ded0-4e55-aa9e-4a1635d4313d","shared_citers":13},{"title":"Kosmos-2: Grounding Multimodal Large Language Models to the World","work_id":"46e7f9e9-24c6-49af-b7d5-96159fa6f443","shared_citers":13},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":12},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":12},{"title":"Aligning large multi-modal model with robust instruction tuning","work_id":"ba8e8164-e47f-42d6-83ad-696cb57ee79a","shared_citers":11},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":11},{"title":"MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts","work_id":"e22c3789-9e71-4242-b6ea-3e60e06e2b66","shared_citers":11},{"title":"Otter: A multi-modal model with in-context instruction tuning","work_id":"33cb3a7a-6091-48db-a246-802bbb055f43","shared_citers":11},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":11},{"title":"Shikra: Unleashing Multimodal LLM's Referential Dialogue 
Magic","work_id":"44525076-312a-4259-b79c-134cd7eeb297","shared_citers":11}],"time_series":[{"n":13,"year":2023},{"n":13,"year":2024},{"n":1,"year":2025},{"n":59,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"ebed79d3-5e72-4235-8a6f-2093314337cb","orcid":null,"display_name":"Deyao Zhu","source":"manual","import_confidence":0.72},{"id":"cd014512-c569-4d7d-8bc5-7636747577d7","orcid":null,"display_name":"Jun Chen","source":"manual","import_confidence":0.72},{"id":"75d658b4-b853-4900-afa9-22da0df16f87","orcid":null,"display_name":"Mohamed Elhoseiny","source":"manual","import_confidence":0.72},{"id":"ac38562a-352e-451a-bcaf-39fba254be2c","orcid":null,"display_name":"Xiang Li","source":"manual","import_confidence":0.72},{"id":"50762c33-5933-47e0-8ad0-f347cc2ea7a2","orcid":null,"display_name":"Xiaoqian Shen","source":"manual","import_confidence":0.72}]}}