arxiv: 2501.00106 · v2 · submitted 2024-12-30 · 💻 cs.SE · cs.AI

LicenseGPT: A Fine-tuned Foundation Model for Publicly Available Dataset License Compliance

Pith reviewed 2026-05-23 05:58 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords dataset license compliance · fine-tuned foundation model · legal AI · license interpretation · software intellectual property · prediction agreement · user study

The pith

LicenseGPT, a model fine-tuned on 500 expert-annotated licenses, achieves 64.3% prediction agreement on dataset license compliance and reduces analysis time by over 94%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LicenseGPT to help interpret ambiguous dataset licenses for commercial AI development. The best existing legal foundation model reaches only 43.75% agreement with expert judgments. Fine-tuned on a curated set of 500 licenses labeled by legal experts, LicenseGPT reaches 64.3% agreement. In an A/B test and user study with software IP lawyers, it cuts analysis time from 108 seconds to 6 seconds per license while maintaining accuracy. Lawyers see it as a useful aid that still requires human review for hard cases.

Core claim

LicenseGPT is created by fine-tuning a foundation model on 500 licenses annotated by legal experts. It improves Prediction Agreement from the best legal FM's 43.75% to 64.30%. In A/B tests and user studies, it reduces the time for software IP lawyers to analyze each license from 108 seconds to 6 seconds without loss of accuracy.
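
The headline numbers can be checked with a few lines of arithmetic. The Python sketch below takes the four figures quoted in the abstract and derives the relative time reduction, the equivalent speed-up factor, and the PA gain in both percentage points and relative terms. The variable and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the paper's abstract.
baseline_seconds = 108.0   # manual analysis time per license
assisted_seconds = 6.0     # analysis time per license with LicenseGPT
best_legal_fm_pa = 43.75   # PA (%) of the best existing legal FM
licensegpt_pa = 64.30      # PA (%) of LicenseGPT


def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


time_reduction_pct = 100 * relative_reduction(baseline_seconds, assisted_seconds)
speedup = baseline_seconds / assisted_seconds
pa_gain_points = licensegpt_pa - best_legal_fm_pa
pa_gain_relative_pct = 100 * pa_gain_points / best_legal_fm_pa

print(f"time reduction: {time_reduction_pct:.2f}%")        # 94.44%
print(f"speed-up factor: {speedup:.0f}x")                  # 18x
print(f"PA gain: {pa_gain_points:.2f} percentage points")  # 20.55
print(f"PA gain relative to baseline: {pa_gain_relative_pct:.2f}%")
```

The 20.55-point PA gain is an absolute difference; relative to the 43.75% baseline it is a gain of roughly 47%.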

What carries the argument

LicenseGPT, a fine-tuned foundation model trained on expert-annotated dataset licenses to predict compliance.

If this is right

  • LicenseGPT outperforms both specialized legal models and general-purpose models on the task.
  • Analysis time drops by 94.44% per license in controlled tests.
  • Lawyers perceive the tool as valuable but still require oversight for complex cases.
  • The model provides a publicly available resource for practitioners.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar fine-tuning approaches could apply to other areas of legal document analysis beyond licenses.
  • Expanding the annotated dataset beyond 500 examples might further improve performance on rare license types.
  • Integration into automated pipelines could change how datasets are selected for AI training.

Load-bearing premise

The 500 expert-annotated licenses represent the full range of ambiguities found in real-world dataset licenses, and the prediction agreement metric reflects actual legal usefulness.

What would settle it

Evaluating LicenseGPT on a fresh collection of 100 dataset licenses not included in the original 500 annotations, measured against new expert judgments.
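
A minimal sketch of how such a held-out check could be scored, in Python. It assumes PA is plain exact-match agreement between the model's label and the expert's label, which the abstract does not confirm, and it attaches a 95% Wilson score interval so that a 100-license result is read together with its sampling uncertainty. The label strings, function names, and the 64-of-100 outcome are hypothetical.

```python
import math


def prediction_agreement(predictions, expert_labels):
    """Share of licenses where the model label equals the expert label.

    Assumption: the paper's PA metric is not defined in the abstract, so
    plain exact-match agreement is used here as a stand-in.
    """
    if len(predictions) != len(expert_labels):
        raise ValueError("predictions and expert_labels must align one-to-one")
    if not predictions:
        raise ValueError("need at least one license to score")
    matches = sum(p == e for p, e in zip(predictions, expert_labels))
    return matches / len(predictions)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical outcome: the model agrees with fresh expert labels on 64 of 100 licenses.
expert = ["commercial_ok"] * 50 + ["not_ok"] * 50
model = ["commercial_ok"] * 32 + ["not_ok"] * 18 + ["not_ok"] * 32 + ["commercial_ok"] * 18

pa = prediction_agreement(model, expert)
low, high = wilson_interval(round(pa * len(expert)), len(expert))
print(f"PA = {pa:.2%}, 95% CI = [{low:.2%}, {high:.2%}]")
```

At n = 100 the interval spans roughly nine percentage points on either side, so a test of this size could separate a result near 64% from the 43.75% baseline but could not resolve small differences between models.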

Figures

Figures reproduced from arXiv: 2501.00106 by Ahmed E. Hassan, Dan Li, Gopi Krishnan Rajbahadur, Jianshan Lin, Jingwen Tan, Xiangfu Song, Zibin Zheng, Zi Li.

Figure 1: Overview of our study design.
… the role of a software IP lawyer, ensuring its responses are legally sound. Defining roles enhances task relevance in specialized domains [46]. Task Definition: We clearly define the task for the model, instructing it to assess whether a dataset can be used commercially, thus maintaining a focused objective [31]. Focus Specification: We direct the model to concentrate on the l…
Figure 2: Heatmap of PA on studied system and user prompts.

Original abstract

Dataset license compliance is a critical yet complex aspect of developing commercial AI products, particularly with the increasing use of publicly available datasets. Ambiguities in dataset licenses pose significant legal risks, making it challenging even for software IP lawyers to accurately interpret rights and obligations. In this paper, we introduce LicenseGPT, a fine-tuned foundation model (FM) specifically designed for dataset license compliance analysis. We first evaluate existing legal FMs (i.e., FMs specialized in understanding and processing legal texts) and find that the best-performing model achieves a Prediction Agreement (PA) of only 43.75%. LicenseGPT, fine-tuned on a curated dataset of 500 licenses annotated by legal experts, significantly improves PA to 64.30%, outperforming both legal and general-purpose FMs. Through an A/B test and user study with software IP lawyers, we demonstrate that LicenseGPT reduces analysis time by 94.44%, from 108 seconds to 6 seconds per license, without compromising accuracy. Software IP lawyers perceive LicenseGPT as a valuable supplementary tool that enhances efficiency while acknowledging the need for human oversight in complex cases. Our work underscores the potential of specialized AI tools in legal practice and offers a publicly available resource for practitioners and researchers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LicenseGPT, a fine-tuned foundation model for dataset license compliance analysis. It reports that the best existing legal FM achieves only 43.75% Prediction Agreement (PA) with expert annotations, while LicenseGPT, trained on 500 expert-annotated licenses, reaches 64.30% PA and outperforms both legal and general-purpose models. An A/B test and user study with software IP lawyers are reported to show a 94.44% reduction in analysis time (from 108 s to 6 s per license) without accuracy loss, with lawyers viewing the model as a useful supplementary tool that still requires human oversight.

Significance. If the 500-license dataset is representative and the evaluation metrics valid, the result would be significant for reducing legal risks in commercial AI development using public datasets. The combination of quantitative PA gains with a practical user study on time savings provides applied value beyond pure model performance. However, the absence of methodological details on data curation and evaluation prevents confirming whether the gains reflect genuine legal utility or artifacts of the experimental setup.

major comments (2)
  1. [Abstract] Abstract: The central claim of PA improvement from 43.75% to 64.30% rests on a 'curated dataset of 500 licenses annotated by legal experts,' but the abstract (and available text) supplies no selection criteria, ambiguity-handling protocol, inter-annotator agreement statistics, or confirmation that the evaluation split is disjoint from fine-tuning data. This directly undermines assessment of whether PA captures real-world license ambiguities rather than annotation artifacts.
  2. [Abstract] Abstract (user study paragraph): The A/B test and user study demonstrating 94.44% time reduction is presented without details on participant recruitment, task construction, how 'accuracy' was independently verified, or controls for selection/expectation bias. These elements are load-bearing for the practical-utility claim and the assertion that accuracy is not compromised.
minor comments (1)
  1. [Abstract] The abstract refers to 'Prediction Agreement (PA)' without defining the metric or how it differs from standard accuracy/F1; a brief definition would improve clarity even if expanded in the main text.
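
Two of the checks requested in the major comments are mechanical and can be sketched in Python: confirming that the evaluation split shares no license with the fine-tuning split, and reporting inter-annotator agreement. Cohen's kappa is used below as one common agreement statistic; the paper does not say which statistic, if any, was computed, and the license identifiers, labels, and function names are hypothetical.

```python
from collections import Counter


def split_overlap(train_ids, eval_ids):
    """Return license identifiers that appear in both splits (should be empty)."""
    return set(train_ids) & set(eval_ids)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical identifiers and labels, for illustration only.
train_ids = ["lic-001", "lic-002", "lic-003"]
eval_ids = ["lic-003", "lic-104"]
leaked = split_overlap(train_ids, eval_ids)  # a non-empty set signals leakage

annotator_a = ["yes", "yes", "no", "no"]
annotator_b = ["yes", "no", "no", "no"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(leaked, kappa)
```

In this toy case the annotators agree on three of four items (observed agreement 0.75) while chance agreement is 0.5, giving a kappa of 0.5; raw agreement alone would overstate how consistent the labels are.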

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on methodological transparency. We agree that the current manuscript lacks sufficient detail on data curation and user study design, which limits evaluation of the claims. We will revise the manuscript to address these points.

point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of PA improvement from 43.75% to 64.30% rests on a 'curated dataset of 500 licenses annotated by legal experts,' but the abstract (and available text) supplies no selection criteria, ambiguity-handling protocol, inter-annotator agreement statistics, or confirmation that the evaluation split is disjoint from fine-tuning data. This directly undermines assessment of whether PA captures real-world license ambiguities rather than annotation artifacts.

    Authors: We agree that these details are missing from the current version and are necessary for readers to assess whether the PA gains reflect genuine improvements. In the revised manuscript we will add a Methods subsection describing the license selection criteria, the ambiguity-handling protocol used by the legal experts, the inter-annotator agreement statistics obtained during annotation, and explicit confirmation that the evaluation split was held out from the fine-tuning data. revision: yes

  2. Referee: [Abstract] Abstract (user study paragraph): The A/B test and user study demonstrating 94.44% time reduction is presented without details on participant recruitment, task construction, how 'accuracy' was independently verified, or controls for selection/expectation bias. These elements are load-bearing for the practical-utility claim and the assertion that accuracy is not compromised.

    Authors: We agree that the user-study description is insufficiently detailed. The revised manuscript will expand the relevant section to cover participant recruitment criteria and process, the construction of the A/B test tasks, the independent verification procedure for accuracy, and the controls implemented to mitigate selection and expectation bias. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results rest on external expert annotations and baselines.

full rationale

The paper's core claims rest on fine-tuning a model on 500 expert-annotated licenses and measuring Prediction Agreement (PA) plus time reduction against those same annotations and external baselines. No equations, self-definitional metrics, fitted-input predictions, or load-bearing self-citations appear in the provided text. The reported gains (43.75% to 64.30% PA; 94.44% time reduction) are presented as direct empirical comparisons rather than quantities derived by construction from the inputs themselves. The derivation chain is therefore self-contained against independent annotations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Only abstract available, so free parameters and axioms cannot be exhaustively listed; main implicit assumption is that fine-tuning on a modest expert-annotated set transfers to real legal practice.

free parameters (1)
  • fine-tuning hyperparameters and data split
    Not specified in abstract but required for the reported performance.
axioms (1)
  • domain assumption: Existing legal foundation models provide a suitable base that can be improved via fine-tuning on domain-specific annotated licenses
    Invoked by the choice to start from legal FMs and fine-tune rather than train from scratch.

pith-pipeline@v0.9.0 · 5775 in / 1405 out tokens · 58081 ms · 2026-05-23T05:58:12.445565+00:00 · methodology


Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 6 internal anchors

  1. [1]

    Lexilaw

    Lexilaw. [Online]. Available: https://github.com/CSHaitao/LexiLaw

  2. [2]

    wisdominterrogatory

    wisdominterrogatory. [Online]. Available: https://github.com/zhihaiLLM/wisdomInterrogatory

  3. [3]

    Jacobsen v. katzer,

    “Jacobsen v. katzer,” pp. 1373–1381, 2008

  4. [4]

    Open data commons public domain dedication and license (pddl),

    “Open data commons public domain dedication and license (pddl),” 2018, Open Data Commons License. [Online]. Available: https://opendatacommons.org/licenses/pddl/

  5. [5]

    Amazon s3,

    “Amazon s3,” 2023, accessed: 2024-10-02. [Online]. Available: https://aws.amazon.com/s3/

  6. [6]

    Datahub,

    “Datahub,” 2023, accessed: 2024-10-02. [Online]. Available: https://datahub.io/

  7. [7]

    Figshare,

    “Figshare,” 2023, accessed: 2024-10-02. [Online]. Available: https://figshare.com/

  8. [8]

    Github,

    “Github,” 2023, accessed: 2024-10-02. [Online]. Available: https://github.com/

  9. [9]

    Gitlab,

    “Gitlab,” 2023, accessed: 2024-10-02. [Online]. Available: https://gitlab.com/

  10. [10]

    Google cloud,

    “Google cloud,” 2023, accessed: 2024-10-02. [Online]. Available: https://cloud.google.com/

  11. [11]

    Hugging face,

    “Hugging face,” 2023, accessed: 2024-10-02. [Online]. Available: https://huggingface.co/

  12. [12]

    Kaggle,

    “Kaggle,” 2023, accessed: 2024-10-02. [Online]. Available: https://www.kaggle.com/

  13. [13]

    Microsoft azure,

    “Microsoft azure,” 2023, accessed: 2024-10-02. [Online]. Available: https://azure.microsoft.com/

  14. [14]

    Opendataology,

    “Opendataology,” 2023, accessed: 2024-10-02. [Online]. Available: http://www.opendataology.com:30800/#/dataSetAll

  15. [15]

    SPDX 3.0 Dataset Profile,

    “SPDX 3.0 Dataset Profile,” 2023, accessed: 2024-10-11. [Online]. Available: https://spdx.github.io/spdx-spec/v3.0/model/Dataset/Dataset/

  16. [16]

    Zenodo,

    “Zenodo,” 2023, accessed: 2024-10-02. [Online]. Available: https://zenodo.org/

  17. [17]

    Github licensing guide,

    “Github licensing guide,” 2024, https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository

  18. [18]

    Open source initiative,

    “Open source initiative,” 2024, available at: https://opensource.org/licenses

  19. [19]

    SPDX AI - Areas of Interest,

    “SPDX AI - Areas of Interest,” 2024, accessed: 2024-10-11. [Online]. Available: https://spdx.dev/learn/areas-of-interest/ai/

  20. [20]

    Tldrlegal: Understand open source licenses,

    “Tldrlegal: Understand open source licenses,” 2024, available at: https://www.tldrlegal.com/

  21. [21]

    Qwen: Open-source pretrained large-scale language model,

    A. D. Academy, “Qwen: Open-source pretrained large-scale language model,” https://modelscope.cn/models/damo, 2023, accessed: 2024-10-04

  22. [22]

    Llama-2: Open and efficient foundation language models,

    M. AI, “Llama-2: Open and efficient foundation language models,” https://ai.meta.com/llama, 2023, accessed: 2024-10-04

  23. [23]

    Zero-shot learning: What, how, and why it matters for nlp,

    N. AI, “Zero-shot learning: What, how, and why it matters for nlp,” 2023, accessed: 2024-10-05. [Online]. Available: https://neptune.ai/blog/zero-shot-learning

  24. [24]

    Software engineering for machine learning: A case study,

    S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi, and T. Zimmermann, “Software engineering for machine learning: A case study,” in 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 2019, pp. 291–300

  25. [25]

    Factsheets: Increasing trust in ai services through supplier’s declarations of conformity,

    M. Arnold, R. K. Bellamy, M. Hind, S. Houde, S. Mehta, A. Mojsilović, R. Nair, K. N. Ramamurthy, A. Olteanu, D. Piorkowski et al., “Factsheets: Increasing trust in ai services through supplier’s declarations of conformity,” IBM Journal of Research and Development, vol. 63, no. 4/5, pp. 6–1, 2019

  26. [26]

    Promptsource: An integrated development environment and repository for natural language prompts,

    S. H. Bach et al., “Promptsource: An integrated development environment and repository for natural language prompts,” 2022

  27. [27]

    Towards Traceability in Data Ecosystems using a Bill of Materials Model

    I. Barclay, A. Preece, I. Taylor, and D. Verma, “Towards traceability in data ecosystems using a bill of materials model,” arXiv preprint arXiv:1904.04253, 2019

  28. [28]

    Towards Standardization of Data Licenses: The Montreal Data License

    M. Benjamin, P. Gagnon, N. Rostamzadeh, C. Pal, Y. Bengio, and A. Shee, “Towards standardization of data licenses: The montreal data license,” arXiv preprint arXiv:1903.12262, 2019

  29. [29]

    Chatgpt-4 performance on legal benchmarks: Evaluating its applicability for specialized tasks,

    M. Bommarito and D. Katz, “Chatgpt-4 performance on legal benchmarks: Evaluating its applicability for specialized tasks,” Artificial Intelligence and Law, 2023. [Online]. Available: https://link.springer.com/article/10.1007/s10506-023-09356-y

  30. [30]

    Analyzing regulatory rules for privacy and security requirements,

    Breaux et al., “Analyzing regulatory rules for privacy and security requirements,” IEEE Transactions on Software Engineering, 2008

  31. [31]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  32. [32]

    Objectives and key results in software teams: Challenges, opportunities and impact on development,

    J. L. Butler, T. Zimmermann, and C. Bird, “Objectives and key results in software teams: Challenges, opportunities and impact on development,” in Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, 2024, pp. 358–368

  33. [33]

    Statistical Power Analysis for the Behavioral Sciences, 2nd ed.

    J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988

  34. [34]

    Creative commons attribution license (cc by),

    C. Commons, “Creative commons attribution license (cc by),” 2013, Creative Commons License. [Online]. Available: https://creativecommons.org/licenses/by/4.0/

  35. [35]

    Chatlaw: A multi-agent collaborative legal assistant with knowledge graph enhanced mixture-of-experts large language model, 2024

    J. Cui, Z. Li, Y. Yan, B. Chen, and L. Yuan, “Chatlaw: Open-source legal large language model with integrated external knowledge bases,” arXiv preprint arXiv:2306.16092, 2023

  36. [36]

    Efficient and effective text encoding for chinese llama and alpaca,

    Y. Cui, Z. Yang, and X. Yao, “Efficient and effective text encoding for chinese llama and alpaca,” arXiv preprint arXiv:2304.08177, 2023

  37. [37]

    Glm: General language model pretraining with autoregressive blank infilling,

    Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang, “Glm: General language model pretraining with autoregressive blank infilling,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 320–335

  38. [38]

    Multiple comparisons among means,

    O. J. Dunn, “Multiple comparisons among means,” Journal of the American statistical association, vol. 56, no. 293, pp. 52–64, 1961

  39. [39]

    What is zero-shot classification?

    H. Face, “What is zero-shot classification?” 2023, accessed: 2024-10-

  40. [40]

    Available: https://huggingface.co/docs/transformers/main/en/task_summary#zero-shot-classification

    [Online]. Available: https://huggingface.co/docs/transformers/main/en/task_summary#zero-shot-classification

  41. [41]

    Datasheets for datasets,

    T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford, “Datasheets for datasets,” Communications of the ACM, vol. 64, no. 12, pp. 86–92, 2021

  42. [42]

    A method for open source license compliance of java applications,

    D. German and M. Di Penta, “A method for open source license compliance of java applications,” IEEE software, vol. 29, no. 3, pp. 58–63, 2012

  43. [43]

    License integration patterns: Addressing license mismatches in component-based development,

    D. M. German and A. E. Hassan, “License integration patterns: Addressing license mismatches in component-based development,” in 2009 IEEE 31st international conference on software engineering. IEEE, 2009, pp. 188–198

  44. [44]

    Large language models: The legal aspects of licensing for commercial purposes,

    GetInData, “Large language models: The legal aspects of licensing for commercial purposes,” 2023, accessed: 2024-10-02. [Online]. Available: https://getindata.com/blog/large-language-models-legal-aspects-licensing-commercial-purposes/

  45. [45]

    Github copilot: Your ai pair programmer,

    GitHub, “Github copilot: Your ai pair programmer,” https://copilot.github.com, 2021, accessed: 2024-07-03

  46. [46]

    Infringement of copyright and moral rights and exceptions to infringement (continued),

    Government of Canada, “Infringement of copyright and moral rights and exceptions to infringement (continued),” 2021, [Last visited on 09-25-2024]. [Online]. Available: https://laws-lois.justice.gc.ca/eng/acts/c-42/page-9.html

  47. [47]

    Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models,

    N. Guha, Nyarko et al., “Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 44123–44279. [Online]. Available: https://proceedin...

  48. [48]

    Fossology: A license compliance tool,

    F. Hansen, B. Becker, C. Chamas, and P. Germain, “Fossology: A license compliance tool,” in IFIP International Conference on Open Source Systems. Springer, 2010, pp. 47–62

  49. [49]

    Key issues in writers’ case against openai explained,

    Harvard Gazette, “Key issues in writers’ case against openai explained,” Sep. 2023. [Online]. Available: https://news.harvard.edu/gazette/story/2023/09/key-issues-in-writers-case-against-openai-explained/

  50. [50]

    Rethinking software engineering in the foundation model era: From task-driven ai copilots to goal-driven ai pair programmers,

    A. E. Hassan, G. A. Oliva, D. Lin, B. Chen, Z. Ming et al., “Rethinking software engineering in the foundation model era: From task-driven ai copilots to goal-driven ai pair programmers,” arXiv preprint arXiv:2404.10225, 2024

  51. [51]

    Lawyer llama technical report,

    Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen, Z. Wu, and Y. Feng, “Lawyer llama technical report,” arXiv preprint arXiv:2305.15062, 2023

  52. [52]

    Arguing regulatory compliance of software requirements,

    S. Ingolfo, A. Siena, J. Mylopoulos, A. Susi, and A. Perini, “Arguing regulatory compliance of software requirements,” Data & Knowledge Engineering, vol. 87, pp. 279–296, 2013

  53. [53]

    Fossology: The open source license compliance tool,

    M. C. Jaeger, G. J. Herzwurm, and J. Böhm, “Fossology: The open source license compliance tool,” International Free and Open Source Software Law Review, vol. 1, no. 2, pp. 153–171, 2009

  54. [54]

    Automating the license compatibility process in open source software with spdx,

    G. M. Kapitsaki, F. Kramer, and N. D. Tselikas, “Automating the license compatibility process in open source software with spdx,” Journal of systems and software, vol. 131, pp. 386–401, 2017

  55. [55]

    Automating the extraction of rights and obligations for regulatory compliance,

    N. Kiyavitskaya, N. Zeni, T. D. Breaux, A. I. Antón, J. R. Cordy, L. Mich, and J. Mylopoulos, “Automating the extraction of rights and obligations for regulatory compliance,” in Proceedings of the 27th International Conference on Conceptual Modeling, Barcelona, Spain, October 20-24, 2008

  56. [56]

    Enforcing the gpl and open source software licenses in the us after jacobsen v. katzer,

    B. M. Kuhn and K. M. Sandler, “Enforcing the gpl and open source software licenses in the us after jacobsen v. katzer,” Berkeley Technology Law Journal, vol. 27, pp. 231–274, 2012

  57. [57]

    Legal documents drafting with fine-tuned pre-trained large language model,

    C.-H. Lin and P.-J. Cheng, “Legal documents drafting with fine-tuned pre-trained large language model,” arXiv preprint arXiv:2406.04202, 2024

  58. [58]

    Lawgpt:chinese legal model,

    M. LiuHongcheng, LiaoYusheng and WangYuhao, “Lawgpt:chinese legal model,”

  59. [59]

    Available: https://github.com/LiuHC0428/LAW_GPT

    [Online]. Available: https://github.com/LiuHC0428/LAW_GPT

  60. [60]

    The data provenance initiative: A large scale audit of dataset licensing & attribution in ai,

    S. Longpre, R. Mahari, A. Chen, N. Obeng-Marnu, D. Sileo, W. Brannon, N. Muennighoff, N. Khazam, J. Kabbara, K. Perisetla et al., “The data provenance initiative: A large scale audit of dataset licensing & attribution in ai,” arXiv preprint arXiv:2310.16787, 2023

  61. [61]

    Model cards for model reporting,

    M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model cards for model reporting,” in Proceedings of the conference on fairness, accountability, and transparency, 2019, pp. 220–229

  62. [62]

    The rise of open source program office,

    H. Munir and C.-E. Mols, “The rise of open source program office,” IT Professional, vol. 23, no. 1, pp. 27–33, 2021

  63. [63]

    A guide to copyright,

    G. of Canada, “A guide to copyright,” 2021, [Last visited on 09-25-2024]. [Online]. Available: https://laws-lois.justice.gc.ca/eng/acts/c-42/page-9.html

  64. [64]

    More information on fair use,

    U. C. Office, “More information on fair use,” 2021, [Last visited on 09-25-2024]. [Online]. Available: https://www.copyright.gov/fair-use/more-info.html

  65. [65]

    OpenAI, “Gpt-4,” https://openai.com/gpt-4, 2023, accessed: 2024-10-04

  66. [66]

    OpenChain AI Study Group Monthly Workshop for North America and Europe: Full Recording,

    OpenChain Project, “OpenChain AI Study Group Monthly Workshop for North America and Europe: Full Recording,” https://openchainproject.org/news/2024/04/09/openchain-ai-study-group-monthly-workshop-for-north-america-and-europe-2024-04-02-full-recording, 2024, last accessed: October 10, 2024

  67. [67]

    Openchain project,

    “Openchain project,” https://openchainproject.org/, OpenChain Project, 2024, accessed: 2024-10-10

  68. [68]

    LicenseGPT,

    OpenDataology, “LicenseGPT,” https://github.com/OpenDataology/LicenseGPT, 2024, GitHub repository, Last accessed: 2024-10-11

  69. [69]

    A review of current trends, techniques, and challenges in large language models (llms),

    R. Patil and V. Gudivada, “A review of current trends, techniques, and challenges in large language models (llms),” Applied Sciences, vol. 14, no. 5, p. 2074, 2024

  70. [70]

    Mitigating dataset harms requires stewardship: Lessons from 1000 papers,

    K. Peng, A. Mathur, and A. Narayanan, “Mitigating dataset harms requires stewardship: Lessons from 1000 papers,” arXiv preprint arXiv:2108.02922, 2021

  71. [71]

    Can i use this publicly available dataset to build commercial ai software? most likely not,

    G. K. Rajbahadur, E. Tuck, L. Zi, Z. Wei, D. Lin, B. Chen, Z. M. Jiang, and D. M. German, “Can i use this publicly available dataset to build commercial ai software? most likely not,” CoRR, abs/2111.02374, pp. 1–1, 2021

  72. [72]

    Self-reflective chain-of-thought reasoning in large language models,

    T. Researcher, “Self-reflective chain-of-thought reasoning in large language models,” 2023

  73. [73]

    Wilson Sonsini Goodrich & Rosati. (2017) Open source software: Risks, compliance, and best practices. [Online]. Available: https://www.wsgr.com/en/insights/open-source-software-risks-compliance-and-best-practices.html

  74. [74]

    Simultaneous statistical inference,

    G. Rupert Jr et al., “Simultaneous statistical inference,” 2012

  75. [75]

    fuzi.mingcha,

    Z. Z. Shiguang Wu, Zhongkun Liu et al., “fuzi.mingcha,” 2023. [Online]. Available: https://github.com/irlab-sdu/fuzi.mingcha

  76. [76]

    B. D. Software. (2023) Open source security and license compliance management. [Online]. Available: https://www.blackducksoftware.com

  77. [77]

    Lawgpt: Chinese-llama tuned with chinese legal knowledge,

    Z. Z. Song Pengxiao and cainiao, “Lawgpt: Chinese-llama tuned with chinese legal knowledge,” 2023. [Online]. Available: https://github.com/pengxiao-song/LaWGPT

  78. [78]

    Responsible ai licenses-a real alternative to generally applicable laws?

    K. Szpyt, “Responsible ai licenses-a real alternative to generally applicable laws?” Revista Ibérica do Direito, vol. 1, no. 2, pp. 178–186, 2020

  79. [79]

    The impact of automated parameter optimization for defect prediction models,

    C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, “The impact of automated parameter optimization for defect prediction models,” IEEE Transactions on Software Engineering, 2018

  80. [80]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

Showing first 80 references.