Building Trust in the Skies: A Knowledge-Grounded LLM-based Framework for Aviation Safety

Alisa Tiselska; Anirudh Iyengar; Dumindu Samaraweera; Hong Liu

arXiv: 2604.13101 · v1 · submitted 2026-04-10 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-10 17:27 UTC · model grok-4.3

keywords: aviation safety, knowledge graphs, large language models, retrieval-augmented generation, hallucination mitigation, trustworthy AI, safety analytics

The pith

LLMs grounded by an aviation safety knowledge graph deliver more accurate and traceable responses than standalone models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an end-to-end system that first uses LLMs to build and update an Aviation Safety Knowledge Graph from multimodal sources, then applies that graph inside a retrieval-augmented generation pipeline to ground and explain the model's outputs. A sympathetic reader would care because aviation safety decisions demand verifiable facts and carry catastrophic risk when models invent details. The implemented framework is reported to handle complex queries while reducing hallucinations and adding traceability links back to source data. This combination targets the core weakness of pure LLM use in regulated, high-stakes domains.

Core claim

The dual-phase pipeline automates construction of an Aviation Safety Knowledge Graph from text, images, and other records using LLMs, then retrieves from that graph to validate, explain, and constrain every generated answer, producing measurable gains in accuracy and traceability over LLM-only baselines.

What carries the argument

The Aviation Safety Knowledge Graph (ASKG), automatically extracted and refreshed by LLMs, then queried inside the RAG layer to anchor and explain every response.
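
As a rough illustration of how a graph can anchor a response, the sketch below retrieves facts about one entity from a toy in-memory triple list and builds a prompt that carries each fact's source identifier. Everything in it is an assumption made for illustration: the `Triple` type, the relation names, the incident and report identifiers, and the `retrieve` and `build_grounded_prompt` helpers are hypothetical and do not come from the paper, whose Figure 5 caption indicates that the real system translates questions into Cypher queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    source: str  # provenance: identifier of the record the fact came from


# Hypothetical toy graph; identifiers and relations are invented for this sketch.
GRAPH = [
    Triple("Incident-001", "INVOLVED_AIRCRAFT", "Cessna 172", "report-A"),
    Triple("Incident-001", "OCCURRED_DURING", "landing", "report-A"),
    Triple("Incident-002", "INVOLVED_AIRCRAFT", "Boeing 737", "report-B"),
]


def retrieve(graph, entity):
    """Return every triple that mentions the entity as subject or object."""
    return [t for t in graph if entity in (t.subject, t.obj)]


def build_grounded_prompt(question, facts):
    """Build a prompt that restricts the model to the retrieved facts.

    Returns None when no facts were retrieved, so the caller can abstain
    instead of letting the model answer without grounding.
    """
    if not facts:
        return None
    lines = [
        f"- {t.subject} {t.relation} {t.obj} [source: {t.source}]" for t in facts
    ]
    return (
        "Answer using only the facts below and cite their sources.\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}"
    )


if __name__ == "__main__":
    facts = retrieve(GRAPH, "Incident-001")
    print(build_grounded_prompt("What happened in Incident-001?", facts))
```

Returning `None` on an empty retrieval is one possible design choice: it forces the calling code to decide explicitly what to do when the graph has nothing to say, which is where an ungrounded model would otherwise be free to invent details.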

Load-bearing premise

Automated LLM extraction from multimodal aviation sources yields an accurate and complete knowledge graph free of systematic gaps that would break downstream reliability.

What would settle it

A side-by-side test on a held-out set of real aviation safety queries where the framework still generates hallucinations or unverifiable answers at rates comparable to a plain LLM.
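
A side-by-side test of this kind reduces to comparing two hallucination rates over the same queries. The sketch below shows one minimal way to do that, assuming each answer has already been labelled by a human reviewer as hallucinated (`True`) or not (`False`). The function names and the labels in the usage section are made up for illustration and are not results from the paper.

```python
def hallucination_rate(labels):
    """Fraction of answers judged hallucinated (True means hallucinated)."""
    if not labels:
        raise ValueError("need at least one labelled answer")
    return sum(labels) / len(labels)


def compare(baseline_labels, framework_labels):
    """Compare hallucination rates of an LLM-only baseline and the framework."""
    base = hallucination_rate(baseline_labels)
    grounded = hallucination_rate(framework_labels)
    return {
        "baseline": base,
        "framework": grounded,
        "absolute_reduction": base - grounded,
    }


if __name__ == "__main__":
    # Made-up labels for eight held-out queries, purely to exercise the code.
    baseline = [True, True, False, True, False, False, True, False]
    framework = [False, True, False, False, False, False, False, False]
    print(compare(baseline, framework))
```

An absolute reduction near zero on a realistic held-out set would be the outcome described above as settling the question against the framework. A real study would also need a significance test and a documented labelling protocol, which this sketch leaves out.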

Figures

Figures reproduced from arXiv: 2604.13101 by Alisa Tiselska, Anirudh Iyengar, Dumindu Samaraweera, Hong Liu.

**Figure 2.** A sub-knowledge graph generated by the pipeline from a sample set

**Figure 3.** Architecture of the aviation safety query system. The red line denotes the data-flow path, while the green line indicates the return path of the

**Figure 4.** The dashboard demonstrates the semantic mapping of a natural

**Figure 5.** Natural language to Cypher translation. The GraphRAG interface

Original abstract

The integration of Large Language Models (LLMs) into aviation safety decision-making represents a significant technological advancement, yet their standalone application poses critical risks due to inherent limitations such as factual inaccuracies, hallucination, and lack of verifiability. These challenges undermine the reliability required for safety-critical environments where errors can have catastrophic consequences. To address these challenges, this paper proposes a novel, end-to-end framework that synergistically combines LLMs and Knowledge Graphs (KGs) to enhance the trustworthiness of safety analytics. The framework introduces a dual-phase pipeline: it first employs LLMs to automate the construction and dynamic updating of an Aviation Safety Knowledge Graph (ASKG) from multimodal sources. It then leverages this curated KG within a Retrieval-Augmented Generation (RAG) architecture to ground, validate, and explain LLM-generated responses. The implemented system demonstrates improved accuracy and traceability over LLM-only approaches, effectively supporting complex querying and mitigating hallucination. Results confirm the framework's capability to deliver context-aware, verifiable safety insights, addressing the stringent reliability requirements of the aviation industry. Future work will focus on enhancing relationship extraction and integrating hybrid retrieval mechanisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This aviation safety paper proposes an LLM-KG-RAG framework but provides no quantitative evidence for its effectiveness.

The punchline on this paper is that it describes a dual-phase framework for using LLMs and knowledge graphs in aviation safety analysis, but the claims of better accuracy and hallucination reduction lack any supporting measurements or comparisons.

What the work actually does is lay out a pipeline where LLMs extract and update an Aviation Safety Knowledge Graph from sources like reports and documents, followed by a RAG component that uses the graph to ground responses. This is a direct response to the well-known issues with LLMs in critical domains, where factual errors can be dangerous. The paper does a good job explaining the motivation and sketching how the system could provide traceable, context-aware insights for complex safety queries. The architecture seems thoughtfully put together for the domain, with attention to dynamic updating of the graph and validation of outputs. It builds on prior ideas of combining LLMs with structured knowledge but applies them to aviation specifics.

The main soft spot is the absence of empirical support. The abstract mentions that the implemented system demonstrates improvements, yet the text provides no numbers on accuracy, no error analysis for the knowledge graph construction, and no baseline tests against plain LLM use. This matters because the reliability of the entire approach hinges on whether the automated extraction from multimodal sources produces a complete and accurate graph. If that step has gaps, such as missing relations in incident data, the downstream benefits won't materialize. It's not a minor omission; it's the foundation.

This kind of paper is for readers working on trustworthy AI systems in regulated sectors like aviation, healthcare, or transportation. Someone exploring RAG techniques for factual grounding might pick up useful structural ideas from the description.

It deserves a serious referee. The problem it targets is important, and the proposed solution is coherent, but it needs the evaluation data to be publishable.

Referee Report

2 major / 1 minor

Summary. The paper proposes a dual-phase framework for aviation safety analytics that first uses LLMs to automate construction and dynamic updating of an Aviation Safety Knowledge Graph (ASKG) from multimodal sources, then integrates the resulting KG into a RAG pipeline to ground, validate, and explain LLM outputs, claiming improved accuracy, traceability, and hallucination mitigation relative to LLM-only baselines.

Significance. If the performance claims were substantiated, the work would address a timely need for verifiable AI in safety-critical domains by combining automated KG construction with retrieval grounding. The approach could serve as a template for reducing hallucinations in high-stakes querying, but the complete absence of quantitative validation limits its immediate contribution.

major comments (2)

Abstract: the assertion that 'the implemented system demonstrates improved accuracy and traceability over LLM-only approaches' is unsupported by any quantitative metrics, evaluation protocol, baseline comparisons, or error analysis, leaving the central empirical claim unsubstantiated.
ASKG construction pipeline: the automated LLM-based extraction and dynamic updating from multimodal sources is described at a high level with no reported precision/recall/F1 scores, coverage analysis of safety-critical relations, or expert gold-standard validation, which directly undermines the reliability of all downstream RAG grounding and hallucination-mitigation claims.
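
The precision, recall, and F1 scores requested in the second major comment could be computed by comparing extracted triples against an expert-built gold standard. The following is a minimal sketch under the assumption that triples are compared by exact match on (subject, relation, object). The function name `prf1` and the example triples are hypothetical and are not taken from the paper.

```python
def prf1(predicted, gold):
    """Exact-match precision, recall, and F1 over sets of triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Invented triples, used only to exercise the function.
    gold = {
        ("Incident-001", "INVOLVED_AIRCRAFT", "Cessna 172"),
        ("Incident-001", "OCCURRED_DURING", "landing"),
        ("Incident-001", "LOCATED_AT", "Airport-X"),
        ("Incident-001", "CAUSED_BY", "crosswind"),
    }
    predicted = {
        ("Incident-001", "INVOLVED_AIRCRAFT", "Cessna 172"),
        ("Incident-001", "OCCURRED_DURING", "landing"),
        ("Incident-001", "CAUSED_BY", "engine failure"),
    }
    print(prf1(predicted, gold))
```

Exact matching is the strictest choice and tends to understate quality when entity names vary in spelling. A real evaluation would likely normalize entities first and report coverage per relation type, as the comment asks.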

minor comments (1)

Abstract: the acronym ASKG is introduced without an immediate parenthetical expansion on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We agree that the current manuscript lacks sufficient quantitative validation to support the performance claims, and we will revise it accordingly to include empirical evaluations.

Referee: Abstract: the assertion that 'the implemented system demonstrates improved accuracy and traceability over LLM-only approaches' is unsupported by any quantitative metrics, evaluation protocol, baseline comparisons, or error analysis, leaving the central empirical claim unsubstantiated.

Authors: We acknowledge this limitation in the current version. The claims in the abstract were based on preliminary internal testing, but we agree they require formal substantiation. In the revised manuscript, we will add a new 'Evaluation' section that details the experimental setup, including quantitative metrics such as accuracy improvements, traceability scores, baseline comparisons against standard LLM approaches, and error analysis. This will directly support the assertions made.

revision: yes

Referee: ASKG construction pipeline: the automated LLM-based extraction and dynamic updating from multimodal sources is described at a high level with no reported precision/recall/F1 scores, coverage analysis of safety-critical relations, or expert gold-standard validation, which directly undermines the reliability of all downstream RAG grounding and hallucination-mitigation claims.

Authors: We concur that the ASKG construction details are presented at a conceptual level without quantitative backing. To rectify this, we will enhance the 'ASKG Construction' section with specifics on the extraction methodology, including the use of LLMs for entity and relation extraction from multimodal data. We will report precision, recall, and F1 scores based on a validation set, provide coverage analysis for safety-critical relations, and describe the expert validation process used to create the gold standard. For dynamic updating, we will include metrics on the accuracy of incremental updates.

revision: yes

Circularity Check

0 steps flagged

No circularity: framework is descriptive with independent implementation claims

The paper describes an end-to-end LLM+KG pipeline for aviation safety without equations, fitted parameters, or any derivation that reduces outputs to inputs by construction. Claims of improved accuracy and hallucination mitigation rest on the implemented RAG system rather than self-referential definitions or self-citation chains. No load-bearing steps match the enumerated circularity patterns; the KG construction step is presented as an engineering choice whose accuracy is asserted via results, not presupposed.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework depends on two unproven domain assumptions about LLM extraction quality and RAG effectiveness, plus one newly postulated entity (the ASKG) whose independent validation is not provided.

axioms (2)

domain assumption: LLMs can reliably extract and structure knowledge from multimodal aviation safety sources into a usable graph.
Invoked in the first phase of the dual-phase pipeline described in the abstract.

domain assumption: Retrieval-augmented generation over the constructed KG will consistently reduce hallucinations and increase traceability for safety queries.
Central premise of the second phase and the claimed improvement over LLM-only baselines.

invented entities (1)

Aviation Safety Knowledge Graph (ASKG): no independent evidence
purpose: Structured knowledge base to ground and validate LLM outputs in aviation safety.
Newly introduced entity constructed by the LLM in phase one; no external validation or independent evidence supplied.
