pith. machine review for the scientific record.

arxiv: 2605.08385 · v1 · submitted 2026-05-08 · 💻 cs.CR

Quantifiable Uncertainty: A Stochastic Consensus Multi-Agent RAG Framework for Robust Malware Detection

Pith reviewed 2026-05-12 00:56 UTC · model grok-4.3

keywords: malware detection · RAG framework · uncertainty estimation · multi-agent systems · ensemble methods · epistemic uncertainty · stochastic consensus · reject option

The pith

A stochastic multi-agent RAG system uses ensemble disagreement scores to reject ambiguous malware samples and reports a 98.4 percent detection rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MAGMA as a retrieval-augmented generation framework that splits malware analysis into semantic code retrieval and probabilistic verification steps. It employs dual-stream embeddings on assembly and pseudo-code to focus on decision-critical functions while ignoring dead code. A stochastic consistency ensemble runs multiple non-deterministic agent evaluations on the retrieved set, from which it derives an Evidence Conflict Score as the Shannon entropy of the prediction distribution. Elevated ECS values are shown to act as a proxy for structural ambiguity, supporting a reject-option policy that avoids forced classifications on uncertain inputs. This yields the reported 98.4 percent detection rate by addressing epistemic uncertainty that standard deep-learning classifiers cannot express.
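To make the ensemble step concrete, here is a minimal sketch of how repeated non-deterministic evaluations could be turned into a prediction distribution. The `toy_agent` function, its `VirtualAllocEx` marker heuristic, and the `n_runs` and `seed` parameters are illustrative assumptions and do not come from the paper; a real agent would presumably be a language-model call sampled at non-zero temperature.

```python
import random
from collections import Counter
from typing import Callable, Dict, Sequence

# Hypothetical agent signature: retrieved function texts plus an RNG in, label out.
Agent = Callable[[Sequence[str], random.Random], str]


def toy_agent(retrieved: Sequence[str], rng: random.Random) -> str:
    """Stand-in for one sampled agent evaluation (illustrative only)."""
    # Fraction of retrieved snippets that contain an assumed suspicious marker.
    hits = sum("VirtualAllocEx" in text for text in retrieved)
    suspicious = hits / max(len(retrieved), 1)
    # Sampling noise: the verdict is drawn at random, biased by the evidence.
    return "malicious" if rng.random() < suspicious else "benign"


def ensemble_distribution(
    agent: Agent, retrieved: Sequence[str], n_runs: int = 10, seed: int = 0
) -> Dict[str, float]:
    """Run the same agent n_runs times and return the share of each verdict."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    rng = random.Random(seed)
    votes = Counter(agent(retrieved, rng) for _ in range(n_runs))
    return {label: count / n_runs for label, count in votes.items()}
```

When every retrieved snippet carries the marker the toy agent is unanimous, while mixed evidence spreads the votes across both labels, and that spread is the quantity the conflict score is meant to measure.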

Core claim

The central claim is that the Evidence Conflict Score, derived from a stochastic consistency ensemble over retrieval-augmented reasoning agents, serves as an effective proxy for structural ambiguity in malware binaries. This enables a principled reject-option policy, and the paper reports a 98.4 percent detection rate that it says exceeds existing monolithic classifiers.

What carries the argument

The Evidence Conflict Score (ECS), defined as the Shannon entropy of the ensemble's predictive distribution, which quantifies disagreement among multiple independent agent evaluations to identify ambiguous cases.
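As a sketch of that definition, the function below computes the Shannon entropy of the empirical vote distribution from a list of ensemble verdicts. The function name, the choice of base-2 logarithms (bits), and the use of raw vote frequencies as the predictive distribution are assumptions made here for illustration; the paper may weight or smooth the distribution differently.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def evidence_conflict_score(votes: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of ensemble votes."""
    if not votes:
        raise ValueError("need at least one vote")
    n = len(votes)
    entropy = 0.0
    for count in Counter(votes).values():
        p = count / n
        # Labels with zero votes never appear in the Counter,
        # so the 0 * log(0) case does not arise.
        entropy -= p * math.log2(p)
    return entropy
```

A unanimous ensemble scores 0 bits and an evenly split binary ensemble scores 1 bit, so intermediate splits fall strictly between the two.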

If this is right

  • Detectors can defer classification on high-ECS samples instead of risking misclassification under evasion attacks.
  • Dual-stream retrieval isolates decision-critical functions, reducing noise from irrelevant code sections.
  • Quantifiable uncertainty allows the system to express epistemic limits that monolithic classifiers hide.
  • The reject-option policy improves overall accuracy by routing uncertain cases to secondary analysis.
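The deferral behaviour described in these points can be sketched as a simple threshold rule on ECS. The `Scored` record, the `threshold` parameter, and the two reported quantities (coverage and accuracy on accepted samples) are assumptions chosen for illustration, not the paper's evaluation protocol.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Scored:
    """One sample after ensemble evaluation (hypothetical record layout)."""
    predicted: str  # majority verdict of the ensemble
    ecs: float      # Evidence Conflict Score for this sample
    truth: str      # ground-truth label, used only for evaluation


def apply_reject_option(
    samples: Sequence[Scored], threshold: float
) -> Tuple[float, Optional[float]]:
    """Return (coverage, accuracy on accepted samples).

    Samples with ECS above the threshold are deferred to secondary analysis
    and excluded from the accuracy figure. Accuracy is None when everything
    is deferred.
    """
    if not samples:
        return 0.0, None
    accepted = [s for s in samples if s.ecs <= threshold]
    coverage = len(accepted) / len(samples)
    if not accepted:
        return coverage, None
    correct = sum(s.predicted == s.truth for s in accepted)
    return coverage, correct / len(accepted)
```

Reporting coverage alongside accuracy matters because any gain from rejection is measured only on the samples the system still classifies; a high accuracy at low coverage means most of the work has been pushed to the secondary analysis.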

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ECS-based deferral could be tested in other adversarial domains such as network intrusion detection where input ambiguity is common.
  • Production deployments might integrate human review loops triggered specifically by high-ECS outputs to reduce analyst workload on clear cases.
  • Future extensions could explore whether ECS correlates with actual evasion success rates in controlled red-team experiments.

Load-bearing premise

That the Evidence Conflict Score from the stochastic ensemble accurately captures structural ambiguity in malware rather than other sources of agent disagreement.

What would settle it

A test set of malware variants engineered with known levels of structural ambiguity (such as varying dead-code insertion) whose ECS values are measured to check whether higher ambiguity reliably produces higher ECS and triggers the reject policy.
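One way such a check could be scored is a rank correlation between the engineered ambiguity level of each variant and its measured ECS. The sketch below implements Spearman's coefficient with average ranks for ties using only the standard library. Spearman correlation is a choice made here, not one named by the paper, and the inputs are assumed to be per-variant ambiguity levels and ECS values.

```python
import math
from typing import List, Sequence


def _average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = _average_ranks(x), _average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

A coefficient near 1 would be consistent with ECS tracking the engineered ambiguity, while a value near 0 would suggest that ensemble disagreement is driven by something else.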

Original abstract

While contemporary deep learning malware detectors define a dominant defense paradigm, their sophistication also exposes them to novel structural evasion attacks, a limitation we attribute to their inherent inability to express epistemic uncertainty. To address this challenge, we present MAGMA, a Retrieval-Augmented Generation (RAG) framework that decouples malware analysis into semantic code retrieval and probabilistic verification. In contrast to monolithic classifiers, MAGMA employs a dual-stream embedding scheme over assembly and pseudo-code representations to isolate Decision-Critical Functions (DCFs) from the noise of dead code. We further introduce a Stochastic Consistency Ensemble, in which multiple instances of the same reasoning agent independently evaluate the retrieval set under non-deterministic sampling. From this ensemble, we derive two complementary metrics: Function Evidence Strength (FES), a weighted aggregation of retrieval confidence, and the Evidence Conflict Score (ECS), defined as the Shannon entropy of the ensemble's predictive distribution. We show that elevated ECS values serve as an effective proxy for structural ambiguity, enabling the system to implement a principled "reject-option" policy. Extensive evaluation demonstrates that MAGMA achieves a 98.4% detection rate, substantially exceeding existing solutions.
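For reference, the entropy definition in the abstract can be written out as follows. The symbols N (number of ensemble runs), K (number of classes) and n_k (votes for class k) are notation assumed here, as is the base-2 logarithm; the abstract does not fix either. The second line is a worked binary example for N = 5.

```latex
\[
  p_k = \frac{n_k}{N}, \qquad
  \mathrm{ECS} = -\sum_{k=1}^{K} p_k \log_2 p_k
  \quad (\text{with } 0 \log_2 0 := 0), \qquad
  0 \le \mathrm{ECS} \le \log_2 K .
\]
\[
  N = 5,\ K = 2: \qquad
  \text{split } 5/0 \mapsto 0, \qquad
  \text{split } 4/1 \mapsto \approx 0.722, \qquad
  \text{split } 3/2 \mapsto \approx 0.971 \ \text{bits}.
\]
```

Because the empirical distribution over N runs can take only finitely many values, a threshold on ECS in the binary case is equivalent to a rule on the vote split. In this example, any threshold between 0.722 and 0.971 rejects exactly the 3/2 splits.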

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes MAGMA, a Retrieval-Augmented Generation (RAG) framework for malware detection that decouples analysis into semantic code retrieval and probabilistic verification. A dual-stream embedding scheme over assembly and pseudo-code isolates Decision-Critical Functions (DCFs). The paper introduces a Stochastic Consistency Ensemble, in which multiple instances of the same reasoning agent independently evaluate the retrieval set under non-deterministic sampling. From the ensemble it derives Function Evidence Strength (FES), a weighted aggregation of retrieval confidence, and the Evidence Conflict Score (ECS), the Shannon entropy of the predictive distribution, which drives a reject option for highly ambiguous inputs. The paper claims that this approach achieves a 98.4% detection rate, substantially outperforming existing solutions.

Significance. If the empirical results hold under rigorous validation, the work could significantly impact the field of malware detection by providing a framework that quantifies uncertainty to handle structural evasion attacks, moving beyond monolithic deep learning classifiers. The use of ensemble-based metrics like FES and ECS offers a principled way to implement reject options, which is valuable for high-stakes security applications.

major comments (2)
  1. [Abstract] The assertion that 'MAGMA achieves a 98.4% detection rate, substantially exceeding existing solutions' is presented without any reference to the datasets employed, baseline methods, evaluation methodology, cross-validation strategy, or ablation studies. This is load-bearing for the central effectiveness claim, as no supporting evidence is supplied to allow assessment of the result.
  2. [Abstract] The statement that 'elevated ECS values serve as an effective proxy for structural ambiguity' is not accompanied by any validation against ground-truth measures of ambiguity, correlation analysis, or experiments demonstrating the reject-option policy's effect on detection performance. Without external grounding, the proxy interpretation rests solely on the internal definition of ECS as Shannon entropy over the ensemble distribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and for recognizing the potential significance of our work in advancing uncertainty-aware malware detection. We address each major comment below and have revised the manuscript to strengthen the presentation of our claims.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'MAGMA achieves a 98.4% detection rate, substantially exceeding existing solutions' is presented without any reference to the datasets employed, baseline methods, evaluation methodology, cross-validation strategy, or ablation studies. This is load-bearing for the central effectiveness claim, as no supporting evidence is supplied to allow assessment of the result.

    Authors: We agree that the abstract, as a concise summary, would benefit from additional context to ground the central claim. The full manuscript provides detailed descriptions of the datasets, baseline comparisons, evaluation methodology, cross-validation strategy, and ablation studies in the Experiments and Evaluation sections. To address this concern directly, we have revised the abstract to include a brief reference to the evaluation setup and key performance context while respecting length constraints.
    revision: yes

  2. Referee: [Abstract] The statement that 'elevated ECS values serve as an effective proxy for structural ambiguity' is not accompanied by any validation against ground-truth measures of ambiguity, correlation analysis, or experiments demonstrating the reject-option policy's effect on detection performance. Without external grounding, the proxy interpretation rests solely on the internal definition of ECS as Shannon entropy over the ensemble distribution.

    Authors: We thank the referee for this observation. The manuscript presents experimental results demonstrating that elevated ECS aligns with cases of structural evasion and that the reject-option policy improves overall detection metrics. To provide stronger external validation as suggested, we have added explicit correlation analysis between ECS and ground-truth measures of structural ambiguity (derived from controlled function modifications) along with quantitative results showing the reject-option's impact on precision and recall.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The paper's core architecture defines FES directly as a weighted aggregation of retrieval confidence scores and ECS as Shannon entropy over the ensemble's predictive distribution; these are explicit computational definitions from the stochastic consistency ensemble outputs rather than derived predictions or first-principles results that reduce to the inputs by construction. The claim that elevated ECS serves as a proxy for structural ambiguity is framed as an empirical observation validated through evaluation, not a mathematical equivalence or self-referential loop. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior work are present in the abstract or high-level description, and the dual-stream embedding and reject-option policy remain independent components without reducing to fitted parameters renamed as predictions. The 98.4% detection rate is an empirical result, not a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 5 invented entities

The central claims rest on several newly introduced concepts and derived metrics whose independent grounding is not evidenced in the abstract; the framework adds these entities without parameter-free derivations or external benchmarks.

axioms (2)
  • [standard math] Shannon entropy of an ensemble's predictive distribution quantifies conflict or ambiguity
    Directly invoked to define the Evidence Conflict Score (ECS)
  • [domain assumption] Dual-stream embeddings on assembly and pseudo-code representations isolate Decision-Critical Functions from dead code noise
    Core premise of the retrieval component
invented entities (5)
  • MAGMA (no independent evidence)
    purpose: Overall RAG framework for uncertainty-aware malware detection
    Newly proposed system name and architecture
  • Decision-Critical Functions (DCFs) (no independent evidence)
    purpose: Focus analysis on key functions while ignoring dead code
    Introduced as output of the dual-stream embedding scheme
  • Stochastic Consistency Ensemble (no independent evidence)
    purpose: Generate multiple independent evaluations under non-deterministic sampling
    Core mechanism for deriving uncertainty metrics
  • Function Evidence Strength (FES) (no independent evidence)
    purpose: Weighted aggregation of retrieval confidence
    Derived metric from the framework
  • Evidence Conflict Score (ECS) (no independent evidence)
    purpose: Measure structural ambiguity via entropy to enable reject-option
    Derived metric enabling the key policy
