{"total":24,"items":[{"citing_arxiv_id":"2606.02576","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2026-06-01T17:59:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProtoAda uses format-aware prototypes for better task routing and geometry-aware consolidation to reduce interference in multimodal continual instruction tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02502","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning","primary_cat":"cs.CL","submitted_at":"2026-06-01T17:11:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CRAM uses adaptive MoE with centroid routing and orthogonality constraints to enable parameter-efficient multimodal continual instruction tuning while mitigating forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14938","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-14T15:13:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Octopus introduces history-free gradient orthogonalization in a two-stage finetuning framework to achieve state-of-the-art continual learning results for multimodal LLMs on the UCIT 
benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10765","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:59:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"37 54.43 67.92 66.53 65.11 67.48 - w/o Cross-Modal Attn. 69.39 59.44 94.14 61.11 47.77 67.87 66.55 64.22 66.31 -1.17 w/o Null-Space Proj. 69.94 55.78 83.6261.4849.83 65.4367.43 65.3864.86 -2.62 Grounding (RefCOCO) [19, 30], VQAv2 [9], and OCR-VQA [31]. The second is UCIT [10], which contains six sequential tasks: ArxivQA [ 21], CLEVR-Math [22], IconQA [28], ImageNet-R [13], VizWiz-caption [12], and Flickr30k [32]. Together, these two benchmarks let us evaluate our method in both a widely used MCIT setting and a cleaner setting with reduced data-overlap concerns. 
Comparison Methods. We compare DRAPE with classic prompt-based continual learning approaches, including CODA-Prompt [36], DualPrompt [41], and L2P [54], as well as recent MCIT"},{"citing_arxiv_id":"2604.16930","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering","primary_cat":"cs.CV","submitted_at":"2026-04-18T09:28:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14016","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MAny: Merge Anything for Multimodal Continual Instruction Tuning","primary_cat":"cs.LG","submitted_at":"2026-04-15T15:57:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MAny addresses dual-forgetting in multimodal continual instruction tuning via CPM and LPM merging strategies, delivering up to 8.57% accuracy gains on UCIT benchmarks without additional training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.17726","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM","primary_cat":"cs.CV","submitted_at":"2025-05-23T10:43:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with 
Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.10479","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-04-14T17:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"pretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021. 6 [82] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507-2521, 2022. 6 23 [83] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021. 6 [84] Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. 
Ovis: Structural embedding alignment for multimodal large language model."},{"citing_arxiv_id":"2504.09925","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding","primary_cat":"cs.CV","submitted_at":"2025-04-14T06:33:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.07536","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL","primary_cat":"cs.CL","submitted_at":"2025-03-10T17:04:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"XiXuan, Yifan Xu, Shudan Zhang, Hanyu Lai, Jiadai Sun, Xinyue Yang, et al. Visualagentbench: Towards large multimodal models as visual agents. In ICLR, 2025. 3 [43] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In ACL, 2021. 13 [44] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. 
IconQA: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021. 13 [45] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and A."},{"citing_arxiv_id":"2502.02871","ref_index":116,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning","primary_cat":"cs.CL","submitted_at":"2025-02-05T04:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.10302","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding","primary_cat":"cs.CV","submitted_at":"2024-12-13T17:37:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"• Response: <|ref|><description><|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|> In this format, <|grounding|>, <|ref|>, <|/ref|>, <|det|>, <|/det|> are special tokens. The <object> placeholder represents phrases like \"an object within the red bounding box\" while <description> is the model's description of the detected object (e.g., \"cat\"). Grounded conversation. 
We construct our grounded conversation data using [62, 72] to further enhance the model's capabilities established during the pretraining phase. Text-Only datasets. To maintain the language ability of the model, we also use text-only instruction-tuning datasets [4, 6, 18, 19, 68, 70, 84, 91, 98] during the SFT stage. 4. Training Methodology 4.1. Training Pipelines DeepSeek-VL2 is trained through a three-stage pipeline: (1) an initial stage where we train the"},{"citing_arxiv_id":"2412.05271","ref_index":167,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Laion-ZH [203], Laion-EN [203], Laion-COCO [204], LLaVAR [305], InternVL-SA-1B-Caption [113], Captioning MMInstruct [155], GRIT-Caption [194], ShareGPT4V [29], LVIS-Instruct-4V [244], ShareCaptioner [29], OmniCorpus [133], ShareGPT4o [35] GQA [98], OKVQA [178], A-OKVQA [205], Visual7W [317], VisText [226], VSR [147], TallyQA [2], General QA Objects365-YorN [208], IconQA [167], Stanford40 [273], VisDial [51], VQAv2 [74], Hateful-Memes [111] MAVIS [300], GeomVerse [107], MetaMath-Rendered [281], MapQA [23], GeoQA+ [20], Geometry3K [164], Mathematics UniGeo [26], GEOS [206], CLEVR-Math [144] ChartQA [181], PlotQA [187], FigureQA [105], LRV-Instruction [148], ArxivQA [132], MMC-Inst [149], TabMWP [166], DVQA [104], UniChart [182], SimChart9K [263], Chart2Text [191], FinTabNet 
[312], Chart"},{"citing_arxiv_id":"2411.10442","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization","primary_cat":"cs.CL","submitted_at":"2024-11-15T18:59:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"2 per preference pair, compared to 992.7 tokens for the divide-and-conquer approach used in RLAIF-V. Thus, the cost of our pipeline is only 57.5% of that of RLAIF-V. Additionally, a comparison with other recent data pipelines [25, 71, 120] is also presented in Section 5.2.2. 3 Task Dataset General VQA VQAv2 [30], GQA [35], OKVQA [64], IconQA [60] Science AI2D [40], ScienceQA [61], M3CoT [16] Chart ChartQA [65], DVQA [38], MapQA [13] Mathematics GeoQA+ [12], CLEVR-Math [52], Geometry3K [59], GEOS [85], GeomVerse [39], Geo170K [28] OCR OCRVQA [69], InfoVQA [67], TextVQA [86], STVQA [8], SROIE [34] Document DocVQA [66] Table 1. Datasets used to build our preference dataset. 3.2. 
Multimodal Preference Dataset"},{"citing_arxiv_id":"2410.13848","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2024-10-17T17:58:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.04840","ref_index":237,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models","primary_cat":"cs.CV","submitted_at":"2024-08-09T03:25:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.01800","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","primary_cat":"cs.CV","submitted_at":"2024-08-03T15:02:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment 
optimizations.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Part-1&2 data are concatenated sequentially in the SFT phase. Part-1 focuses on bolstering basic recognition capabilities, while part-2 aims to enhance advanced capabilities in generating detailed responses and following human instructions. Category Sources Size Part-1 Short Caption Flickr-30K [81], COCO [59] 560K VQA FM-IQA [34], VGQA [51], IconQA [69], GQA [44], VQAv2 [6] 1.4M CLEVR [46], VizWiz [38], Visual7W [122], COCO-QA [84] Knowledge OKVQA [72], A-OKVQA [87], KVQA [88], ScienceQA [70] 60K Grounding RefCOCO [109] 570K Reasoning COMVINT [32], VCR [114], NLVR [94], LRV [60] 135K Math GeoQA [19], SMART-101 [24] 125K OCR DocVQA [74], TextVQA [91], OCR-VQA [77], ST-VQA [12], VisualMRC [96], DVQA [47]"},{"citing_arxiv_id":"2407.03320","ref_index":98,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","primary_cat":"cs.CV","submitted_at":"2024-07-03T17:59:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Vision Capability Enhancement WanJuan [46], Flicker [160], MMC-Inst [82], RCTW-17 [130], CTW [165], LSVT [137], ReCTs [175], ArT [28] Table 1. Datasets used for Pre-Training. The data are collected from diverse sources for the three objectives. 
Task Dataset Caption ShareGPT4V [17], COCO [21], Nocaps [1] General QA VQAv2 [4], GQA [53], OK-VQA [105] VD [32], RD [16], VSR [81], ALLaVA-QA [15] Multi-Turn QA MMDU [92] Science QA AI2D [61], SQA [98], TQA [62], IconQA [97] Chart QA DVQA [58], ChartQA [106], ChartQA-AUG [106] Math QA MathQA [161], Geometry3K [96], TabMWP [99], CLEVR-MATH [80], Super [75] World Knowledge QA A-OKVQA [127], KVQA [128], ViQuAE [65] OCR QA TextVQA [133], OCR-VQA [109], ST-VQA [11] HD-OCR QA InfoVQA [108], DocVQA [107], TabFact [20], WTQ [117], DeepForm [139], Visual MRC [140]"},{"citing_arxiv_id":"2403.05525","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek-VL: Towards Real-World Vision-Language Understanding","primary_cat":"cs.AI","submitted_at":"2024-03-08T18:46:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.03766","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MobileVLM V2: Faster and Stronger Baseline for Vision Language Model","primary_cat":"cs.CV","submitted_at":"2024-02-06T07:16:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MobileVLM V2 shows that 1.7B and 3B parameter vision-language models can reach or exceed the performance of 3B and 7B+ models on common VLM benchmarks via targeted design and data 
improvements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.14238","ref_index":100,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","primary_cat":"cs.CV","submitted_at":"2023-12-21T18:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[98] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 11, 12, 13 [99] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021. 5, 17 [100] Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, and Yelong Shen. An empirical study of scaling instruct-tuned large multimodal models. arXiv preprint arXiv:2309.09958, 2023. 3 [101] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. 
arXiv preprint arXiv:1306."},{"citing_arxiv_id":"2310.09478","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning","primary_cat":"cs.CV","submitted_at":"2023-10-14T03:22:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.15112","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition","primary_cat":"cs.CV","submitted_at":"2023-09-26T17:58:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"InternLM-XComposer generates articles with seamlessly integrated images and achieves state-of-the-art results on vision-language benchmarks including MME, MMBench, and Seed-Bench.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"utilizing the Chinese CLIP [92]. The volume of text data is counted in terms of the number of tokens. The In-house Concept data is collected from public websites, including over 11 million vision-language concepts from public websites. 
Task Dataset Multi-task training Caption COCO [12], SUB [12], TextCaps [79] VQA VQAv2 [2], GQA [34], OK-VQA [57], IConQA [56] Text-VQA [80], SQA [55], VSR [47], OCR-VQA [58], VIGC [88] IQG VQAv2 [2], OK-VQA [57], A-OKVQA [76] Conversation Visual Dialog [19], LLaVA-150k [50] Instruction tuning Composition In-house data (Refer to Sec.3.3) Conversation LLaVA-150k [50], Alpaca-en&zh [83] ShareGPT-en&zh [15], Oasst-en&zh [38], LRV [48] Table 2. Datasets used for Supervised Fine-Tuning."},{"citing_arxiv_id":"2308.12067","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets","primary_cat":"cs.LG","submitted_at":"2023-08-23T11:27:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MM-LIMA uses proposed quality metrics and a trainable selector to pick 200 high-quality multimodal instruction examples and outperforms MiniGPT-4 on evaluations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}