{"work":{"id":"df6f48b3-5792-47c7-9614-cb856ea31ad9","openalex_id":null,"doi":null,"arxiv_id":"2407.07726","raw_key":null,"title":"PaliGemma: A versatile 3B VLM for transfer","authors":null,"authors_text":"Lucas Beyer, Andreas Steiner, Andr\\'e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz","year":2024,"venue":"cs.CV","abstract":"PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.","external_url":"https://arxiv.org/abs/2407.07726","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T13:43:28.792665+00:00","pith_arxiv_id":"2407.07726","created_at":"2026-05-09T06:05:35.123941+00:00","updated_at":"2026-06-29T13:43:28.792665+00:00","title_quality_ok":false,"display_title":"PaliGemma: A versatile 3B VLM for transfer","render_title":"PaliGemma: A versatile 3B VLM for transfer"},"hub":{"state":{"work_id":"df6f48b3-5792-47c7-9614-cb856ea31ad9","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external 
citations","pith_inbound_count":102,"external_cited_by_count":null,"distinct_field_count":8,"first_pith_cited_at":"2024-08-03T15:02:21+00:00","last_pith_cited_at":"2026-06-22T10:42:36+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T14:38:56.588102+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":19},{"context_role":"method","n":6},{"context_role":"baseline","n":5},{"context_role":"dataset","n":2}],"polarity_counts":[{"context_polarity":"background","n":19},{"context_polarity":"use_method","n":6},{"context_polarity":"baseline","n":4},{"context_polarity":"use_dataset","n":2},{"context_polarity":"unclear","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"PaliGemma: A versatile 3B VLM for transfer","claims":[{"claim_text":"PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"on ReasonVOS and Molmo2-Track, which require complex reasoning and occlusion handling skills. 2) Gemini 2.5 Pro is the strongest API model, but it still struggles to generate accurate object tracks. Model AI2D test [65] ChartQA test [102] DocVQA test [104] InfoQA test [105] TextVQA val [130] VQA v2.0 val [47] RWQA [158] MMMU val [179] MathVista testmini [96] CountBench [13] PixMoCount test [29] MuirBench [142] MMIU [106] Blink val [41] Img QA avg. 
MultiImg QA avg. Average API call only GPT-5 [11","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Connect-the-Dots contains 100 images spanning three subsets: 21 randomly generated dot patterns, 30 connect-the-dots puzzles derived from sil- houette SVGs, and 49 worksheet-style images collected from online sources. Models must locate each dot and connect them in order (Secs. B.1 and D.3 and Fig. 2). 2.CountingObjects contains746imagesdrawnfromCountBench[5,34], and Pixmo-Count [14]. We include object counts from 0 to 10 and filter out unsuitable Pixmo-Count examples. Models must count the targ","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"tokens to preserve the decomposition capability learned previously. The model is then fine-tuned to generate executable action chunks conditioned on the 4D representation and the chunk-level action prompt, supervised by the flow-matching loss in Sec. 3.3. 5 Experiments 5.1 Experiment Setup Implementation Details.Our model adopts the pretrained weights of PaliGemma [3] from 𝜋0.5 [13] as the VLM backbone, DINOv2 [27] from VGGT [32] as the geometry encoder, and Gemma-300M [29] equipped with a struc","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"narrow but consequential: in 3-7B VLMs, reliability is read more reliably off hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits than off attention-map sharpness. 1 Introduction Vision-language models can answer richly compositional questions about images, yet routinely producefluentmistakes: confident, well-formed answers that are not supported by the pixels they purport to describe [3, 18, 27]. 
For deployment in settings where errors carry cost (scientific image","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"uate the effectiveness of our 3D-specific distillation approach. First, we report results for theteacher model (LLaV A-3D- 7B)[4], which provides an upper-bound performance. Second, we compare against lightweight2D vision-language models of similar scale to demonstrate the benefit of 3D-specific distillation (Table I), including LLaV A-v1.5-7B (without 3D training), MobileVLM-2B [34], and PaliGemma-3B [35]. Fi- nally, we includeablation variantsof our model to analyze the contribution of individ","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"green block\" while evaluation reverses the concept order to \"stack green block on blue block\". Al- though the required stacking skill is unchanged, standard SFT repeats the training behavior, whereas DeLockfollows the new instruction and succeeds. To examine the representation-level cause, we vi- sualize vision-language cross-attention in the PaliGemma [61] backbone using instruction tokens as queries and image patches as keys (Figure 4(a)). Standard SFT shows a collapsed attention pattern, cont","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks PaliGemma: A versatile 3B VLM for transfer because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (19 contexts).","role_counts":[{"n":19,"context_role":"background"},{"n":6,"context_role":"method"},{"n":5,"context_role":"baseline"},{"n":2,"context_role":"dataset"}]},"error":null,"updated_at":"2026-06-29T13:39:03.239660+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"1bfa3006-3f8d-4961-84b1-36b6aa3f09d4","orcid":null,"display_name":"Lucas Beyer"},{"id":"a9d14cf6-0c1c-4bb7-bbba-39be634a3fa5","orcid":null,"display_name":"Andreas Steiner"},{"id":"19bc41f1-d7a2-4d7f-8d1e-28f5e85d3715","orcid":null,"display_name":"Andr\\'e Susano Pinto"},{"id":"360699f9-8488-4024-9d3d-bd37260c7616","orcid":null,"display_name":"Alexander Kolesnikov"},{"id":"466ac76c-2740-42d7-a7f4-523b8528b0e0","orcid":null,"display_name":"Xiao Wang"},{"id":"8d7c92de-8af7-47f1-bcde-3b822c6e07cc","orcid":null,"display_name":"Daniel Salz"}]},"error":null,"updated_at":"2026-06-29T13:39:02.692460+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T17:26:46.364080+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":16},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":16},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":13},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":12},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any 
Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":11},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":10},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":10},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":10},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":9},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":9},{"title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","work_id":"50eec732-2d41-432f-9dcf-ac7fff235ea5","shared_citers":9},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":8},{"title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","work_id":"83a8f966-6cfa-4f21-81f3-87440aae238f","shared_citers":7},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":7},{"title":"Gemini Robotics: Bringing AI into the Physical World","work_id":"f7c5ce10-8364-4fbe-964f-2802b81c3a98","shared_citers":6},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":6},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":6},{"title":"X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model","work_id":"13faca8d-e96d-4e6c-a441-9f2683d11934","shared_citers":6},{"title":"$\\pi^{*}_{0.6}$: a VLA That Learns From 
Experience","work_id":"7c1b3355-694a-44c6-880f-631e897e1713","shared_citers":5},{"title":"AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems","work_id":"f797e9ec-510f-43a7-8a0c-18009ce332e5","shared_citers":5},{"title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation","work_id":"4b158d3e-3dff-4412-85cd-baa879465a5e","shared_citers":5},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":5},{"title":"Evaluating Real-World Robot Manipulation Policies in Simulation","work_id":"7f4ca6cb-1b94-454c-9623-b52441b74b61","shared_citers":5},{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","work_id":"04f46bb3-4346-47e8-bf09-c75d91f96e87","shared_citers":5}],"time_series":[{"n":3,"year":2024},{"n":4,"year":2025},{"n":38,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T17:26:28.924020+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T17:26:33.289482+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"PaliGemma: A versatile 3B VLM for transfer","claims":[{"claim_text":"PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. 
We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"on ReasonVOS and Molmo2-Track, which require complex reasoning and occlusion handling skills. 2) Gemini 2.5 Pro is the strongest API model, but it still struggles to generate accurate object tracks. Model AI2D test [65] ChartQA test [102] DocVQA test [104] InfoQA test [105] TextVQA val [130] VQA v2.0 val [47] RWQA [158] MMMU val [179] MathVista testmini [96] CountBench [13] PixMoCount test [29] MuirBench [142] MMIU [106] Blink val [41] Img QA avg. MultiImg QA avg. Average API call only GPT-5 [11","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Connect-the-Dots contains 100 images spanning three subsets: 21 randomly generated dot patterns, 30 connect-the-dots puzzles derived from sil- houette SVGs, and 49 worksheet-style images collected from online sources. Models must locate each dot and connect them in order (Secs. B.1 and D.3 and Fig. 2). 2.CountingObjects contains746imagesdrawnfromCountBench[5,34], and Pixmo-Count [14]. We include object counts from 0 to 10 and filter out unsuitable Pixmo-Count examples. Models must count the targ","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"tokens to preserve the decomposition capability learned previously. The model is then fine-tuned to generate executable action chunks conditioned on the 4D representation and the chunk-level action prompt, supervised by the flow-matching loss in Sec. 3.3. 
5 Experiments 5.1 Experiment Setup Implementation Details.Our model adopts the pretrained weights of PaliGemma [3] from 𝜋0.5 [13] as the VLM backbone, DINOv2 [27] from VGGT [32] as the geometry encoder, and Gemma-300M [29] equipped with a struc","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"narrow but consequential: in 3-7B VLMs, reliability is read more reliably off hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits than off attention-map sharpness. 1 Introduction Vision-language models can answer richly compositional questions about images, yet routinely producefluentmistakes: confident, well-formed answers that are not supported by the pixels they purport to describe [3, 18, 27]. For deployment in settings where errors carry cost (scientific image","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"uate the effectiveness of our 3D-specific distillation approach. First, we report results for theteacher model (LLaV A-3D- 7B)[4], which provides an upper-bound performance. Second, we compare against lightweight2D vision-language models of similar scale to demonstrate the benefit of 3D-specific distillation (Table I), including LLaV A-v1.5-7B (without 3D training), MobileVLM-2B [34], and PaliGemma-3B [35]. Fi- nally, we includeablation variantsof our model to analyze the contribution of individ","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"green block\" while evaluation reverses the concept order to \"stack green block on blue block\". Al- though the required stacking skill is unchanged, standard SFT repeats the training behavior, whereas DeLockfollows the new instruction and succeeds. To examine the representation-level cause, we vi- sualize vision-language cross-attention in the PaliGemma [61] backbone using instruction tokens as queries and image patches as keys (Figure 4(a)). 
Standard SFT shows a collapsed attention pattern, cont","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks PaliGemma: A versatile 3B VLM for transfer because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (19 contexts).","role_counts":[{"n":19,"context_role":"background"},{"n":6,"context_role":"method"},{"n":5,"context_role":"baseline"},{"n":2,"context_role":"dataset"}]},"error":null,"updated_at":"2026-06-29T13:39:02.113142+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"PaliGemma: A versatile 3B VLM for transfer","claims":[{"claim_text":"PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks PaliGemma: A versatile 3B VLM for transfer because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T17:26:20.712900+00:00"}},"summary":{"title":"PaliGemma: A versatile 3B VLM for transfer","claims":[{"claim_text":"PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. 
We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks PaliGemma: A versatile 3B VLM for transfer because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":16},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":16},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":13},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":12},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":11},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":10},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":10},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":10},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":9},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":9},{"title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","work_id":"50eec732-2d41-432f-9dcf-ac7fff235ea5","shared_citers":9},{"title":"Qwen-VL: A Versatile Vision-Language Model for 
Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":8},{"title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","work_id":"83a8f966-6cfa-4f21-81f3-87440aae238f","shared_citers":7},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":7},{"title":"Gemini Robotics: Bringing AI into the Physical World","work_id":"f7c5ce10-8364-4fbe-964f-2802b81c3a98","shared_citers":6},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":6},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":6},{"title":"X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model","work_id":"13faca8d-e96d-4e6c-a441-9f2683d11934","shared_citers":6},{"title":"$\\pi^{*}_{0.6}$: a VLA That Learns From Experience","work_id":"7c1b3355-694a-44c6-880f-631e897e1713","shared_citers":5},{"title":"AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems","work_id":"f797e9ec-510f-43a7-8a0c-18009ce332e5","shared_citers":5},{"title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation","work_id":"4b158d3e-3dff-4412-85cd-baa879465a5e","shared_citers":5},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":5},{"title":"Evaluating Real-World Robot Manipulation Policies in Simulation","work_id":"7f4ca6cb-1b94-454c-9623-b52441b74b61","shared_citers":5},{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and 
Success","work_id":"04f46bb3-4346-47e8-bf09-c75d91f96e87","shared_citers":5}],"time_series":[{"n":3,"year":2024},{"n":4,"year":2025},{"n":38,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"360699f9-8488-4024-9d3d-bd37260c7616","orcid":null,"display_name":"Alexander Kolesnikov","source":"manual","import_confidence":0.72},{"id":"a9d14cf6-0c1c-4bb7-bbba-39be634a3fa5","orcid":null,"display_name":"Andreas Steiner","source":"manual","import_confidence":0.72},{"id":"19bc41f1-d7a2-4d7f-8d1e-28f5e85d3715","orcid":null,"display_name":"André Susano Pinto","source":"manual","import_confidence":0.72},{"id":"8d7c92de-8af7-47f1-bcde-3b822c6e07cc","orcid":null,"display_name":"Daniel Salz","source":"manual","import_confidence":0.72},{"id":"1bfa3006-3f8d-4961-84b1-36b6aa3f09d4","orcid":null,"display_name":"Lucas Beyer","source":"manual","import_confidence":0.72},{"id":"466ac76c-2740-42d7-a7f4-523b8528b0e0","orcid":null,"display_name":"Xiao Wang","source":"manual","import_confidence":0.72}]}}