{"work":{"id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","openalex_id":null,"doi":null,"arxiv_id":"2103.00020","raw_key":null,"title":"Learning Transferable Visual Models From Natural Language Supervision","authors":null,"authors_text":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal","year":2021,"venue":"cs.CV","abstract":"State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. 
We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.","external_url":"https://arxiv.org/abs/2103.00020","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T06:10:23.706045+00:00","pith_arxiv_id":"2103.00020","created_at":"2026-05-08T18:28:58.224515+00:00","updated_at":"2026-05-25T06:10:23.706045+00:00","title_quality_ok":true,"display_title":"Learning Transferable Visual Models From Natural Language Supervision","render_title":"Learning Transferable Visual Models From Natural Language Supervision"},"hub":{"state":{"work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":171,"external_cited_by_count":null,"distinct_field_count":16,"first_pith_cited_at":"2021-05-11T17:50:24+00:00","last_pith_cited_at":"2026-05-22T17:49:59+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-04T12:57:36.004690+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":36},{"context_role":"method","n":8},{"context_role":"baseline","n":4},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":34},{"context_polarity":"use_method","n":8},{"context_polarity":"baseline","n":4},{"context_polarity":"unclear","n":2},{"context_polarity":"support","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Learning Transferable Visual Models From Natural Language Supervision","claims":[{"claim_text":"State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. 
Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (i","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"We can then use the exact same sampling procedure as used for regular DDIM, but with the modified noise predictions ˆϵθ(xt) instead of ϵθ(xt). Algorithm 2 summaries the corresponding sampling algorithm. 4.3 Scaling Classifier Gradients To apply classifier guidance to a large scale generative task, we train classification models on ImageNet. Our classifier architecture is simply the downsampling trunk of the UNet model with an attention pool [49] at the 8x8 layer to produce the final output. We train t","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"these annotations, FLARE further introduces a dual-regime evaluation protocol that benchmarks every model under both a caption-based regime (detailed captions) and a query-based regime (user-style queries) on the same gallery, isolating the impact of query formulation on model assessment. Benchmarking 15 representative contrastive and LLM-based retrievers [22, 27, 8, 29, 17, 11, 31, 21, 9, 32, 13, 35, 28, 24] reveals two phenomena that demonstrate the value of FLARE. First, switching from captions","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"16429. [52] J. Hou, B. Graham, M. Nießner, and S. Xie. Exploring data-efficient 3d scene understanding with contrastive scene contexts, 2021. URL https://arxiv.org/abs/2012.09165. [53] X. Wu, X. Wen, X. Liu, and H. Zhao. Masked scene contrast: A scalable framework for unsupervised 3d representation learning, 2023. URL https://arxiv.org/abs/2303.14191. [54] X. Wu, L. Jiang, P.-S. 
Wang, Z. Liu, X. Liu, Y. Qiao, W. Ouyang, T. He, and H. Zhao. Point transformer v3: Simpler, faster, stronger, 2024. ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Adaptive testing and debugging of nlp models. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022. https://aclanthology.org/2022.acl-long.230/. [83] Fereshteh Sadeghi, C Lawrence Zitnick, and Ali Farhadi. Visalogy: Answering visual analogy questions. In Advances in Neural Information Processing Systems (NeurIPS), 2015. [84] Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Ra","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"the EDM-framework [51] and significantly shift the noise schedule towards higher noise values, which we find to be essential for high-resolution finetuning. See Section 4 for a detailed discussion of the latter. Data Curation Pretraining on large-scale datasets [80] is an essential ingredient for powerful models in several tasks such as discriminative text-image [66, 105] and language [27, 63, 67] modeling. By leveraging efficient language-image representations such as CLIP [47, 66, 105], data","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"is to develop a general-purpose assistant that can effectively follow multi-modal vision-and-language instructions, aligned with human intent to complete various real-world tasks in the wild [4, 27, 26]. 
To this end, the community has witnessed an emergent interest in developing language-augmented foundation vision models [ 27, 16], with strong capabilities in open-world visual understanding such as classification [ 40, 21, 57, 54, 39], detection [ 29, 62, 33], segmentation [ 25, 63, 58] and cap","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Learning Transferable Visual Models From Natural Language Supervision because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (5 contexts).","role_counts":[{"n":5,"context_role":"background"},{"n":3,"context_role":"method"},{"n":1,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-15T15:18:02.632431+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"9783c850-24f9-4444-91ee-29b7660c744b","orcid":null,"display_name":"Alec Radford"},{"id":"04ee47f7-43ba-4624-9776-dd299d8d3026","orcid":null,"display_name":"Jong Wook Kim"},{"id":"d98936eb-a9ed-42a8-a792-b3fe74a0446c","orcid":null,"display_name":"Chris Hallacy"},{"id":"b2b074ec-f0aa-41f0-97ca-ac5ce457b2f6","orcid":null,"display_name":"Aditya Ramesh"},{"id":"f70d2447-4305-4f98-83d2-580f38c8b2db","orcid":null,"display_name":"Gabriel Goh"},{"id":"a38b4d86-1e7d-4c3f-9d8f-f17a82ad89ef","orcid":null,"display_name":"Sandhini Agarwal"}]},"error":null,"updated_at":"2026-05-15T15:18:04.041674+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T06:27:24.652213+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":11},{"title":"High-Resolution Image Synthesis with Latent Diffusion 
Models","work_id":"f0270d36-2952-47fb-84c1-95e3ec341126","shared_citers":11},{"title":"Flamingo: a Visual Language Model for Few-Shot Learning","work_id":"a110f764-38dc-41b2-a802-53744ecea1fc","shared_citers":10},{"title":"Microsoft COCO: Common Objects in Context","work_id":"384118d4-9bc9-444c-abfa-3125ee3ca314","shared_citers":10},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":9},{"title":"Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding","work_id":"af16442b-a46f-469d-8818-c37b53a504c7","shared_citers":9},{"title":"Denoising Diffusion Probabilistic Models","work_id":"dc023f4e-7c79-471c-b713-deeb559ba010","shared_citers":8},{"title":"Hierarchical Text-Conditional Image Generation with CLIP Latents","work_id":"0c6a768b-70b8-4242-bb0e-459f1008c9fc","shared_citers":8},{"title":"Le, Yun- Hsuan Sung, Zhen Li, and Tom Duerig","work_id":"d28390f3-8b21-4b2a-a523-473a16c2e43a","shared_citers":8},{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":7},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":7},{"title":"Classifier-Free Diffusion Guidance","work_id":"acf2c588-c088-4a6c-938e-150ad7c666d7","shared_citers":7},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":7},{"title":"Denoising Diffusion Implicit Models","work_id":"8fa2128b-d18c-405c-ac92-0e669cf89ac0","shared_citers":7},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":7},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":7},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any 
Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":7},{"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","shared_citers":7},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":6},{"title":"GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models","work_id":"34430d19-7919-48ce-88a5-17b3bfe2192e","shared_citers":6},{"title":"Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection","work_id":"3757dc8f-79d5-4beb-a03b-eb4c9a33427d","shared_citers":6},{"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","shared_citers":6},{"title":"Scaling Autoregressive Models for Content-Rich Text-to-Image Generation","work_id":"0a105815-ff2e-43ce-8566-966cdcae1af4","shared_citers":6},{"title":"Score-Based Generative Modeling through Stochastic Differential Equations","work_id":"d9110e53-a5d4-4794-a4c5-a575e91c31ad","shared_citers":6}],"time_series":[{"n":3,"year":2021},{"n":6,"year":2022},{"n":6,"year":2023},{"n":2,"year":2024},{"n":74,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T06:27:04.195412+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T06:27:20.562269+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Learning Transferable Visual Models From Natural Language Supervision","claims":[{"claim_text":"State-of-the-art computer vision systems are trained to predict a fixed set 
of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (i","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"We can then use the exact same sampling procedure as used for regular DDIM, but with the modified noise predictions ˆϵθ(xt) instead of ϵθ(xt). Algorithm 2 summaries the corresponding sampling algorithm. 4.3 Scaling Classifier Gradients To apply classifier guidance to a large scale generative task, we train classification models on ImageNet. Our classifier architecture is simply the downsampling trunk of the UNet model with an attention pool [49] at the 8x8 layer to produce the final output. We train t","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"these annotations, FLARE further introduces a dual-regime evaluation protocol that benchmarks every model under both a caption-based regime (detailed captions) and a query-based regime (user-style queries) on the same gallery, isolating the impact of query formulation on model assessment. Benchmarking 15 representative contrastive and LLM-based retrievers [22, 27, 8, 29, 17, 11, 31, 21, 9, 32, 13, 35, 28, 24] reveals two phenomena that demonstrate the value of FLARE. First, switching from captions","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"16429. [52] J. Hou, B. Graham, M. Nießner, and S. Xie. Exploring data-efficient 3d scene understanding with contrastive scene contexts, 2021. URL https://arxiv.org/abs/2012.09165. [53] X. Wu, X. Wen, X. 
Liu, and H. Zhao. Masked scene contrast: A scalable framework for unsupervised 3d representation learning, 2023. URL https://arxiv.org/abs/2303.14191. [54] X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y. Qiao, W. Ouyang, T. He, and H. Zhao. Point transformer v3: Simpler, faster, stronger, 2024. ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Adaptive testing and debugging of nlp models. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022. https://aclanthology.org/2022.acl-long.230/. [83] Fereshteh Sadeghi, C Lawrence Zitnick, and Ali Farhadi. Visalogy: Answering visual analogy questions. In Advances in Neural Information Processing Systems (NeurIPS), 2015. [84] Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Ra","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"the EDM-framework [51] and significantly shift the noise schedule towards higher noise values, which we find to be essential for high-resolution finetuning. See Section 4 for a detailed discussion of the latter. Data Curation Pretraining on large-scale datasets [80] is an essential ingredient for powerful models in several tasks such as discriminative text-image [66, 105] and language [27, 63, 67] modeling. By leveraging efficient language-image representations such as CLIP [47, 66, 105], data","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"is to develop a general-purpose assistant that can effectively follow multi-modal vision-and-language instructions, aligned with human intent to complete various real-world tasks in the wild [4, 27, 26]. 
To this end, the community has witnessed an emergent interest in developing language-augmented foundation vision models [ 27, 16], with strong capabilities in open-world visual understanding such as classification [ 40, 21, 57, 54, 39], detection [ 29, 62, 33], segmentation [ 25, 63, 58] and cap","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Learning Transferable Visual Models From Natural Language Supervision because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (5 contexts).","role_counts":[{"n":5,"context_role":"background"},{"n":3,"context_role":"method"},{"n":1,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-15T15:18:04.046469+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Learning Transferable Visual Models From Natural Language Supervision","claims":[{"claim_text":"State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. 
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (i","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Learning Transferable Visual Models From Natural Language Supervision because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T06:27:20.567056+00:00"}},"summary":{"title":"Learning Transferable Visual Models From Natural Language Supervision","claims":[{"claim_text":"State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. 
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (i","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Learning Transferable Visual Models From Natural Language Supervision because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":11},{"title":"High-Resolution Image Synthesis with Latent Diffusion Models","work_id":"f0270d36-2952-47fb-84c1-95e3ec341126","shared_citers":11},{"title":"Flamingo: a Visual Language Model for Few-Shot Learning","work_id":"a110f764-38dc-41b2-a802-53744ecea1fc","shared_citers":10},{"title":"Microsoft COCO: Common Objects in Context","work_id":"384118d4-9bc9-444c-abfa-3125ee3ca314","shared_citers":10},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":9},{"title":"Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding","work_id":"af16442b-a46f-469d-8818-c37b53a504c7","shared_citers":9},{"title":"Denoising Diffusion Probabilistic Models","work_id":"dc023f4e-7c79-471c-b713-deeb559ba010","shared_citers":8},{"title":"Hierarchical Text-Conditional Image Generation with CLIP Latents","work_id":"0c6a768b-70b8-4242-bb0e-459f1008c9fc","shared_citers":8},{"title":"Le, Yun- Hsuan Sung, Zhen Li, and Tom Duerig","work_id":"d28390f3-8b21-4b2a-a523-473a16c2e43a","shared_citers":8},{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":7},{"title":"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models","work_id":"63d03f4d-15f4-4583-8286-913c19f02294","shared_citers":7},{"title":"Classifier-Free Diffusion 
Guidance","work_id":"acf2c588-c088-4a6c-938e-150ad7c666d7","shared_citers":7},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":7},{"title":"Denoising Diffusion Implicit Models","work_id":"8fa2128b-d18c-405c-ac92-0e669cf89ac0","shared_citers":7},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":7},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":7},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":7},{"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","shared_citers":7},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":6},{"title":"GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models","work_id":"34430d19-7919-48ce-88a5-17b3bfe2192e","shared_citers":6},{"title":"Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection","work_id":"3757dc8f-79d5-4beb-a03b-eb4c9a33427d","shared_citers":6},{"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","shared_citers":6},{"title":"Scaling Autoregressive Models for Content-Rich Text-to-Image Generation","work_id":"0a105815-ff2e-43ce-8566-966cdcae1af4","shared_citers":6},{"title":"Score-Based Generative Modeling through Stochastic Differential Equations","work_id":"d9110e53-a5d4-4794-a4c5-a575e91c31ad","shared_citers":6}],"time_series":[{"n":3,"year":2021},{"n":6,"year":2022},{"n":6,"year":2023},{"n":2,"year":2024},{"n":74,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"b2b074ec-f0aa-41f0-97ca-ac5ce457b2f6","orcid":null,"display_name":"Aditya 
Ramesh","source":"manual","import_confidence":0.72},{"id":"9783c850-24f9-4444-91ee-29b7660c744b","orcid":null,"display_name":"Alec Radford","source":"manual","import_confidence":0.72},{"id":"d98936eb-a9ed-42a8-a792-b3fe74a0446c","orcid":null,"display_name":"Chris Hallacy","source":"manual","import_confidence":0.72},{"id":"f70d2447-4305-4f98-83d2-580f38c8b2db","orcid":null,"display_name":"Gabriel Goh","source":"manual","import_confidence":0.72},{"id":"04ee47f7-43ba-4624-9776-dd299d8d3026","orcid":null,"display_name":"Jong Wook Kim","source":"manual","import_confidence":0.72},{"id":"a38b4d86-1e7d-4c3f-9d8f-f17a82ad89ef","orcid":null,"display_name":"Sandhini Agarwal","source":"manual","import_confidence":0.72}]}}