{"work":{"id":"cc02a01e-7218-47dc-8e66-3333e7e4adec","openalex_id":null,"doi":null,"arxiv_id":"2211.09110","raw_key":null,"title":"Holistic Evaluation of Language Models","authors":null,"authors_text":"Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D","year":2022,"venue":"cs.CL","abstract":"Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. 
We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.","external_url":"https://arxiv.org/abs/2211.09110","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T05:25:23.632622+00:00","pith_arxiv_id":"2211.09110","created_at":"2026-05-08T18:44:01.759503+00:00","updated_at":"2026-05-25T05:25:23.632622+00:00","title_quality_ok":true,"display_title":"Holistic Evaluation of Language Models","render_title":"Holistic Evaluation of Language Models"},"hub":{"state":{"work_id":"cc02a01e-7218-47dc-8e66-3333e7e4adec","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":105,"external_cited_by_count":null,"distinct_field_count":16,"first_pith_cited_at":"2022-11-09T18:48:09+00:00","last_pith_cited_at":"2026-05-21T22:03:25+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-04T09:07:27.444418+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":22},{"context_role":"dataset","n":2}],"polarity_counts":[{"context_polarity":"background","n":21},{"context_polarity":"use_dataset","n":2},{"context_polarity":"unclear","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Holistic Evaluation of Language Models","claims":[{"claim_text":"Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, 
and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness).","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, et al. 2021. Few-shot learning with multilingual language models. arXiv preprint arXiv:2112.10668. [61] Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, et al. 2019. Choosing transfer languages for cross-lingual learning. arXiv preprint arXiv:1905.12688. [62] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"We formulate the eviction policy with greedy H2 as a variant of dynamic submodular maximization. The analysis shows that it results in a similar generative process as the one using the H2 eviction policy. We perform extensive experiments on OPT, LLaMA, and GPT-NeoX on a single NVIDIA A100 (80GB) GPU to evaluate H2O across a range of tasks from lm-eval-harness [15] and HELM [16]. We implement H2O on top of FlexGen that can easily adapt different cache eviction techniques to produce a system wit","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"acl-long.229 [41] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. 
In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2511-2522. https://doi.org/10.18653/v1/2023.emnlp-main.153 [42] Yuhan Luo, Xinning Gui, Xianghua Ding, Xi Zheng, Rie Helene Hernandez, Zhuoyang Li, and Qiurong Song. 2025. Reflecting Upon The Unintende","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"instruction following and conversational abilities [31, 2, 30, 8, 52, 48, 14]. Once aligned with humans, these chat models are strongly preferred by human users over the original, unaligned models on which they are built. However, the heightened user preference does not always correspond to improved scores on traditional LLM benchmarks - benchmarks like MMLU [19] and HELM [24] cannot effectively tell the difference between these aligned models and the base models. This phenomenon suggests that ","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"In recent years, the development of various benchmarks has significantly enhanced the evaluation of Large Language Models (LLMs). For instance, GLUE [37] and its successor SuperGLUE [38], have played a pivotal role in advancing language understanding tasks, setting the stage for more specialized evaluations. Other recent benchmarks, including MMLU [18], HELM [22], BigBench [32], HellaSwag [45], and the AI2 Reasoning Challenge (ARC) [12], have broadened the scope by assessing capabilities acro","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Counterfactual fairness. Advances in neural information processing systems, 30, 2017. [29] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611-626, 2023. [30] P. 
Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. Holistic evaluati","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Holistic Evaluation of Language Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (22 contexts).","role_counts":[{"n":22,"context_role":"background"},{"n":2,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-22T23:44:04.778772+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"1151f7b5-4762-4a80-b79b-147a68616c43","orcid":null,"display_name":"Percy Liang"},{"id":"e10a60aa-4833-4871-b90d-03d4ba0e0a5b","orcid":null,"display_name":"Rishi Bommasani"},{"id":"04b52f6a-436d-44cb-b04d-637a53e02f51","orcid":null,"display_name":"Tony Lee"},{"id":"33a0ad28-2bf8-458a-ac4c-6ffb97e032c2","orcid":null,"display_name":"Dimitris Tsipras"},{"id":"d726813d-a21a-4cc3-b904-9f10b0802b1d","orcid":null,"display_name":"Dilara Soylu"},{"id":"df375447-789e-42c9-bf39-844bb6794689","orcid":null,"display_name":"Michihiro Yasunaga"}]},"error":null,"updated_at":"2026-05-22T23:44:06.017697+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T11:49:52.365084+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models","work_id":"bb63abb3-0d50-4362-b97c-b5e725b03b39","shared_citers":13},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":11},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":11},{"title":"GPT-4 Technical 
Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":10},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":10},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":9},{"title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","work_id":"d0c30cd7-81e1-4159-a87f-f6adca77ff08","shared_citers":8},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":8},{"title":"Scaling Instruction-Finetuned Language Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","shared_citers":8},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":8},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":7},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":7},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":7},{"title":"Scaling Language Models: Methods, Analysis & Insights from Training Gopher","work_id":"47ce8be9-e500-407d-af41-ac2d132215eb","shared_citers":7},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":7},{"title":"Think you have Solved Question Answering? 
Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":7},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":7},{"title":"Emergent Abilities of Large Language Models","work_id":"6ea3375b-837c-4640-a175-be7525aa3c6d","shared_citers":6},{"title":"LaMDA: Language Models for Dialog Applications","work_id":"1b66d0a5-f6ae-4332-8025-c662dc64b238","shared_citers":6},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":6},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":6},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":6},{"title":"Toolformer: Language Models Can Teach Themselves to Use Tools","work_id":"9bce40c8-cfd7-4983-80e0-c3bd4402322a","shared_citers":6},{"title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","work_id":"3322fa86-1768-4677-8425-dd326b45e078","shared_citers":6}],"time_series":[{"n":1,"year":2022},{"n":6,"year":2023},{"n":1,"year":2024},{"n":1,"year":2025},{"n":49,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T11:49:50.335216+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T11:49:54.767691+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Holistic 
Evaluation of Language Models","claims":[{"claim_text":"Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness).","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, et al. 2021. Few-shot learning with multilingual language models. arXiv preprint arXiv:2112.10668. [61] Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, et al. 2019. Choosing transfer languages for cross-lingual learning. arXiv preprint arXiv:1905.12688. [62] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"We formulate the eviction policy with greedy H2 as a variant of dynamic submodular maximization. The analysis shows that it results in a similar generative process as the one using the H2 eviction policy. We perform extensive experiments on OPT, LLaMA, and GPT-NeoX on a single NVIDIA A100 (80GB) GPU to evaluate H2O across a range of tasks from lm-eval-harness [15] and HELM [16]. 
We implement H2O on top of FlexGen that can easily adapt different cache eviction techniques to produce a system wit","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"acl-long.229 [41] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2511-2522. https://doi.org/10.18653/v1/2023.emnlp-main.153 [42] Yuhan Luo, Xinning Gui, Xianghua Ding, Xi Zheng, Rie Helene Hernandez, Zhuoyang Li, and Qiurong Song. 2025. Reflecting Upon The Unintende","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"instruction following and conversational abilities [31, 2, 30, 8, 52, 48, 14]. Once aligned with humans, these chat models are strongly preferred by human users over the original, unaligned models on which they are built. However, the heightened user preference does not always correspond to improved scores on traditional LLM benchmarks - benchmarks like MMLU [19] and HELM [24] cannot effectively tell the difference between these aligned models and the base models. This phenomenon suggests that ","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"In recent years, the development of various benchmarks has significantly enhanced the evaluation of Large Language Models (LLMs). For instance, GLUE [37] and its successor SuperGLUE [38], have played a pivotal role in advancing language understanding tasks, setting the stage for more specialized evaluations. 
Other recent benchmarks, including MMLU [18], HELM [22], BigBench [32], HellaSwag [45], and the AI2 Reasoning Challenge (ARC) [12], have broadened the scope by assessing capabilities acro","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Counterfactual fairness. Advances in neural information processing systems, 30, 2017. [29] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611-626, 2023. [30] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. Holistic evaluati","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Holistic Evaluation of Language Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (22 contexts).","role_counts":[{"n":22,"context_role":"background"},{"n":2,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-22T23:44:04.783281+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Holistic Evaluation of Language Models","claims":[{"claim_text":"Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. 
question answering for neglected English dialects, metrics for trustworthiness).","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Holistic Evaluation of Language Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T11:49:56.762901+00:00"}},"summary":{"title":"Holistic Evaluation of Language Models","claims":[{"claim_text":"Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness).","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Holistic Evaluation of Language Models because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models","work_id":"bb63abb3-0d50-4362-b97c-b5e725b03b39","shared_citers":13},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":11},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":11},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":10},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":10},{"title":"On the Opportunities and Risks of Foundation 
Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":9},{"title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","work_id":"d0c30cd7-81e1-4159-a87f-f6adca77ff08","shared_citers":8},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":8},{"title":"Scaling Instruction-Finetuned Language Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","shared_citers":8},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":8},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":7},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":7},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":7},{"title":"Scaling Language Models: Methods, Analysis & Insights from Training Gopher","work_id":"47ce8be9-e500-407d-af41-ac2d132215eb","shared_citers":7},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":7},{"title":"Think you have Solved Question Answering? 
Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":7},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":7},{"title":"Emergent Abilities of Large Language Models","work_id":"6ea3375b-837c-4640-a175-be7525aa3c6d","shared_citers":6},{"title":"LaMDA: Language Models for Dialog Applications","work_id":"1b66d0a5-f6ae-4332-8025-c662dc64b238","shared_citers":6},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":6},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":6},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":6},{"title":"Toolformer: Language Models Can Teach Themselves to Use Tools","work_id":"9bce40c8-cfd7-4983-80e0-c3bd4402322a","shared_citers":6},{"title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","work_id":"3322fa86-1768-4677-8425-dd326b45e078","shared_citers":6}],"time_series":[{"n":1,"year":2022},{"n":6,"year":2023},{"n":1,"year":2024},{"n":1,"year":2025},{"n":49,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"d726813d-a21a-4cc3-b904-9f10b0802b1d","orcid":null,"display_name":"Dilara Soylu","source":"manual","import_confidence":0.72},{"id":"33a0ad28-2bf8-458a-ac4c-6ffb97e032c2","orcid":null,"display_name":"Dimitris Tsipras","source":"manual","import_confidence":0.72},{"id":"df375447-789e-42c9-bf39-844bb6794689","orcid":null,"display_name":"Michihiro Yasunaga","source":"manual","import_confidence":0.72},{"id":"1151f7b5-4762-4a80-b79b-147a68616c43","orcid":null,"display_name":"Percy Liang","source":"manual","import_confidence":0.72},{"id":"e10a60aa-4833-4871-b90d-03d4ba0e0a5b","orcid":null,"display_name":"Rishi 
Bommasani","source":"manual","import_confidence":0.72},{"id":"04b52f6a-436d-44cb-b04d-637a53e02f51","orcid":null,"display_name":"Tony Lee","source":"manual","import_confidence":0.72}]}}