{"work":{"id":"63d03f4d-15f4-4583-8286-913c19f02294","openalex_id":null,"doi":null,"arxiv_id":"2301.12597","raw_key":null,"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","authors":null,"authors_text":"Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi","year":2023,"venue":"cs.CV","abstract":"The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.","external_url":"https://arxiv.org/abs/2301.12597","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T13:43:28.713190+00:00","pith_arxiv_id":"2301.12597","created_at":"2026-05-10T08:12:26.530877+00:00","updated_at":"2026-06-29T13:43:28.713190+00:00","title_quality_ok":true,"display_title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","render_title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"},"hub":{"state":{"work_id":"63d03f4d-15f4-4583-8286-913c19f02294","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":115,"external_cited_by_count":null,"distinct_field_count":11,"first_pith_cited_at":"2022-08-31T06:18:10+00:00","last_pith_cited_at":"2026-06-17T00:42:00+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-30T02:39:24.912766+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":31},{"context_role":"method","n":6},{"context_role":"baseline","n":2},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"background","n":30},{"context_polarity":"use_method","n":6},{"context_polarity":"unclear","n":2},{"context_polarity":"baseline","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","claims":[{"claim_text":"The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative ","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"VisualEncoder GMHRA QFormer Linear LargeLanguageModel ###Human:Q ###Assistant:A VideoEmbedding Q K/V 10M 15MStage1:Alignment reasoning Stage2:InstructionTuning conversation detaileddescription concisedescription 7K 4K 2K 2K 3K InternVideo N Architecture O Data (b) Data Figure 2: Architecture and training paradigm of VideoChat-Embed. It is built on BLIP-2 [18] and StableVicuna [10]. The training contains two-stage alignment and instruction tuning. 3.2.1 Architecture In this paper, we instantiate ","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding. arXiv preprint arXiv:2501.01926 (2025). [101] Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. 2023. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665 (2023). [102] Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and ","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"as the important application of MLLMs in visual reasoning. Pre-training: This stream of MLLMs intends to support differ- ent modalities using unified end-to-end models. For instance, Flamingo [269] applies gated cross-attention to fuse vision and language modalities, which are collected from pre-trained and frozen visual encoder and LLM, respectively. Moreover, BLIP- 2 [270] proposes a two-stage strategy to pre-train a Querying Transformer (Q-Former) for the alignment between vision and language","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation. In International conference on machine learning, pages 12888-12900. PMLR, 2022. [97] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. [98] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encod","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"the two LLMs are predominantly pre-trained on English corpus, they are limited in multi-language support, such as Chinese. In contrast, Qwen [58] is a bilingual LLM that supports Chinese and English well. It should be noted that scaling up the parameter size of LLMs also brings additional gains, similar to the case of increasing input resolution. Specifically, Liu et al . [50], [61] find that simply scaling up LLM from 7B to 13B brings comprehensive improvement on various benchmarks. Fur- thermo","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"els rely heavily on pose accuracy and handcrafted behavioral priors. In contrast, we introduce a new paradigm where pre- trained MLLMs are fine-tuned to capture subtle behavioral dynamics and produce zero-shot dominance predictions from raw interaction videos. Multimodal Large Language Models for Scientific Rea- soning.Multimodal large language models (MLLMs), such as BLIP-2 [25], OpenFlamingo [26], and InternVL [27], have shown remarkable success in visual reasoning, captioning, and video under","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (28 contexts).","role_counts":[{"n":28,"context_role":"background"},{"n":5,"context_role":"method"},{"n":2,"context_role":"baseline"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-23T17:34:41.380539+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"60f21ccf-1677-407b-9064-53afffb645cb","orcid":null,"display_name":"Junnan Li"},{"id":"1c8b5fd0-ee78-4913-a1d8-c86401ba55e2","orcid":null,"display_name":"Dongxu Li"},{"id":"6afdd6a1-b779-4959-a4c5-b79b653cc623","orcid":null,"display_name":"Silvio Savarese"},{"id":"ee428ba2-be34-4c84-8190-5ff6a7d12044","orcid":null,"display_name":"Steven Hoi"}]},"error":null,"updated_at":"2026-05-23T17:34:41.374283+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T18:06:36.286726+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":18},{"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","shared_citers":18},{"title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning","work_id":"f3aac728-ded0-4e55-aa9e-4a1635d4313d","shared_citers":12},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":11},{"title":"Language is not all you need: Aligning perception with language models","work_id":"2a1e0563-79f5-4521-8293-b8b1aebf7cee","shared_citers":11},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":11},{"title":"Flamingo: a Visual Language Model for Few-Shot Learning","work_id":"a110f764-38dc-41b2-a802-53744ecea1fc","shared_citers":10},{"title":"PaLM-E: An Embodied Multimodal Language Model","work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","shared_citers":10},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":9},{"title":"mplug-owl: Modularization empowers large lan- guage models with multimodality","work_id":"74a7deb6-48be-4132-9d35-882cc5870ebd","shared_citers":9},{"title":"Evaluating Object Hallucination in Large Vision-Language Models","work_id":"66d8ac3e-c134-4995-b528-550afa17586f","shared_citers":8},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":8},{"title":"Otter: A multi-modal model with in-context instruction tuning","work_id":"33cb3a7a-6091-48db-a246-802bbb055f43","shared_citers":8},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":7},{"title":"Microsoft COCO Captions: Data Collection and Evaluation Server","work_id":"b3d6fb46-4169-4a28-8f7e-2ca6774211da","shared_citers":7},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":7},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":7},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":6},{"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","shared_citers":6},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":6},{"title":"Multimodal-gpt: A vision and language model for dialogue with humans","work_id":"e5fb1f2e-4ed2-454f-87a3-9e9c40f8fa31","shared_citers":6},{"title":"Scaling Instruction-Finetuned Language Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","shared_citers":6},{"title":"Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic","work_id":"44525076-312a-4259-b79c-134cd7eeb297","shared_citers":6}],"time_series":[{"n":19,"year":2023},{"n":6,"year":2024},{"n":2,"year":2025},{"n":17,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T18:06:32.564110+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T18:06:28.042619+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","claims":[{"claim_text":"The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative ","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"VisualEncoder GMHRA QFormer Linear LargeLanguageModel ###Human:Q ###Assistant:A VideoEmbedding Q K/V 10M 15MStage1:Alignment reasoning Stage2:InstructionTuning conversation detaileddescription concisedescription 7K 4K 2K 2K 3K InternVideo N Architecture O Data (b) Data Figure 2: Architecture and training paradigm of VideoChat-Embed. It is built on BLIP-2 [18] and StableVicuna [10]. The training contains two-stage alignment and instruction tuning. 3.2.1 Architecture In this paper, we instantiate ","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding. arXiv preprint arXiv:2501.01926 (2025). [101] Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. 2023. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665 (2023). [102] Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and ","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"as the important application of MLLMs in visual reasoning. Pre-training: This stream of MLLMs intends to support differ- ent modalities using unified end-to-end models. For instance, Flamingo [269] applies gated cross-attention to fuse vision and language modalities, which are collected from pre-trained and frozen visual encoder and LLM, respectively. Moreover, BLIP- 2 [270] proposes a two-stage strategy to pre-train a Querying Transformer (Q-Former) for the alignment between vision and language","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation. In International conference on machine learning, pages 12888-12900. PMLR, 2022. [97] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. [98] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encod","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"the two LLMs are predominantly pre-trained on English corpus, they are limited in multi-language support, such as Chinese. In contrast, Qwen [58] is a bilingual LLM that supports Chinese and English well. It should be noted that scaling up the parameter size of LLMs also brings additional gains, similar to the case of increasing input resolution. Specifically, Liu et al . [50], [61] find that simply scaling up LLM from 7B to 13B brings comprehensive improvement on various benchmarks. Fur- thermo","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"els rely heavily on pose accuracy and handcrafted behavioral priors. In contrast, we introduce a new paradigm where pre- trained MLLMs are fine-tuned to capture subtle behavioral dynamics and produce zero-shot dominance predictions from raw interaction videos. Multimodal Large Language Models for Scientific Rea- soning.Multimodal large language models (MLLMs), such as BLIP-2 [25], OpenFlamingo [26], and InternVL [27], have shown remarkable success in visual reasoning, captioning, and video under","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (28 contexts).","role_counts":[{"n":28,"context_role":"background"},{"n":5,"context_role":"method"},{"n":2,"context_role":"baseline"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-23T17:34:41.385771+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","claims":[{"claim_text":"The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T18:06:32.570822+00:00"}},"summary":{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","claims":[{"claim_text":"The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":18},{"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","shared_citers":18},{"title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning","work_id":"f3aac728-ded0-4e55-aa9e-4a1635d4313d","shared_citers":12},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":11},{"title":"Language is not all you need: Aligning perception with language models","work_id":"2a1e0563-79f5-4521-8293-b8b1aebf7cee","shared_citers":11},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":11},{"title":"Flamingo: a Visual Language Model for Few-Shot Learning","work_id":"a110f764-38dc-41b2-a802-53744ecea1fc","shared_citers":10},{"title":"PaLM-E: An Embodied Multimodal Language Model","work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","shared_citers":10},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":9},{"title":"mplug-owl: Modularization empowers large lan- guage models with multimodality","work_id":"74a7deb6-48be-4132-9d35-882cc5870ebd","shared_citers":9},{"title":"Evaluating Object Hallucination in Large Vision-Language Models","work_id":"66d8ac3e-c134-4995-b528-550afa17586f","shared_citers":8},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":8},{"title":"Otter: A multi-modal model with in-context instruction tuning","work_id":"33cb3a7a-6091-48db-a246-802bbb055f43","shared_citers":8},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":7},{"title":"Microsoft COCO Captions: Data Collection and Evaluation Server","work_id":"b3d6fb46-4169-4a28-8f7e-2ca6774211da","shared_citers":7},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":7},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":7},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":6},{"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","shared_citers":6},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":6},{"title":"Multimodal-gpt: A vision and language model for dialogue with humans","work_id":"e5fb1f2e-4ed2-454f-87a3-9e9c40f8fa31","shared_citers":6},{"title":"Scaling Instruction-Finetuned Language Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","shared_citers":6},{"title":"Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic","work_id":"44525076-312a-4259-b79c-134cd7eeb297","shared_citers":6}],"time_series":[{"n":19,"year":2023},{"n":6,"year":2024},{"n":2,"year":2025},{"n":17,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"1c8b5fd0-ee78-4913-a1d8-c86401ba55e2","orcid":null,"display_name":"Dongxu Li","source":"manual","import_confidence":0.72},{"id":"60f21ccf-1677-407b-9064-53afffb645cb","orcid":null,"display_name":"Junnan Li","source":"manual","import_confidence":0.72},{"id":"6afdd6a1-b779-4959-a4c5-b79b653cc623","orcid":null,"display_name":"Silvio Savarese","source":"manual","import_confidence":0.72},{"id":"ee428ba2-be34-4c84-8190-5ff6a7d12044","orcid":null,"display_name":"Steven Hoi","source":"manual","import_confidence":0.72}]}}