pith. machine review for the scientific record. sign in

arxiv: 2605.09927 · v1 · submitted 2026-05-11 · ⚛️ physics.optics · physics.data-an

Recognition: 2 theorem links

· Lean Theorem

Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:37 UTC · model grok-4.3

classification ⚛️ physics.optics physics.data-an
keywords information extractionlarge language modelsquantum cascade lasersJSON schemanested structuresscientific literature miningdevice databasesoptical devices
0
0 comments X

The pith

A JSON schema guide lets large language models extract nested quantum cascade laser device structures far more reliably than plain prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a pipeline that uses a predefined JSON schema to direct LLMs in pulling out the detailed, layered structures of quantum cascade lasers from research papers. Traditional prompting often fails to maintain the hierarchical consistency needed for these devices, which involve many interdependent parameters. The schema-guided approach raises performance across 12 models, with the biggest gains for smaller and open-source ones, allowing them to match larger models' capabilities. This matters because it opens a route to automatically building comprehensive databases of device designs from the literature, speeding up innovation in optoelectronics without manual data entry or model retraining.

Core claim

The JSON-Schema Guided Information Extraction (JSG-IE) pipeline transforms the task of extracting nested complex structures from quantum cascade laser literature into a schema-constrained generation problem, delivering structural consistency and accuracy improvements of 5.7% on average and up to 24.1% for mid-tier models, reaching a peak F1 of 83.4% with reasoning-enabled models.

What carries the argument

The JSON-Schema Guided Information Extraction (JSG-IE) pipeline, which converts device architecture extraction into a schema-constrained generation task to enforce correct nesting of laser layer sequences and parameters.

If this is right

  • Automated construction of high-fidelity device databases from literature becomes feasible without fine-tuning.
  • Mid-tier and open-source LLMs achieve extraction fidelity previously limited to much larger models.
  • Data-driven optoelectronic design accelerates through reliable structured data mining.
  • The method applies to other scientific domains that require extraction of complex hierarchical data.
  • No model retraining is needed to obtain consistent nested output across multiple LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The schema approach could transfer to other fields like materials science or molecular biology where papers describe nested hierarchical objects.
  • Schemas will likely need periodic updates as new device architectures emerge, pointing to a need for versioned or adaptive schema systems.
  • Extracted parameter sets could feed directly into simulation tools for rapid device modeling and optimization.
  • Broader use of smaller models for literature curation would reduce reliance on proprietary large-scale AI services.

Load-bearing premise

That a single fixed JSON schema can faithfully represent the full range of nested device architectures described across different papers without loss of critical details or the need for paper-specific adjustments.

What would settle it

A published quantum cascade laser paper whose device structure cannot be fully captured by the fixed schema, producing omissions or requiring schema changes that drop the F1 score below conventional prompting levels.

read the original abstract

The rapid advancement of Large Language Models has transformed scientific research workflows, including enabling the automated extraction of data directly from published literature. Most existing efforts, however, focus on extracting simple labeled key-value entities, whereas many scientific applications require more complex, hierarchically structured data. A representative example is Quantum Cascade Lasers, whose device architectures are defined by tens of interdependent parameters organized in nested layer sequences. In this work we propose a \emph{JSON-Schema Guided Information Extraction Pipeline} (JSG-IE) that enables reliable extraction of deeply structured device data without model fine-tuning. By transforming extraction into a schema-constrained generation task, our approach significantly improves structural consistency and accuracy. Across 12 state-of-the-art LLMs, a properly designed JSON Schema improves performance by 5.7\% over conventional prompting, with the highest $F_1$ score up to 83.4\%, achieved by the reasoning-enabled Kimi-k2-thinking model. Importantly, this performance enhancement is most significant for mid-tier and open-source models, where $F_1$ gains reach as high as 24.1\%, effectively enabling these widely accessible models to achieve extraction fidelity previously restricted to much larger architectures. This framework provides a scalable path toward automated construction of high-fidelity device databases, accelerating data-driven optoelectronic design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a JSON-Schema Guided Information Extraction Pipeline (JSG-IE) that converts LLM-based extraction of Quantum Cascade Laser device architectures into a schema-constrained generation task. It reports that this yields an average 5.7% F1 improvement over conventional prompting across 12 LLMs, with a maximum F1 of 83.4% (Kimi-k2-thinking) and gains up to 24.1% for mid-tier and open-source models, enabling scalable construction of structured device databases without fine-tuning.

Significance. If the evaluation holds, the work provides a practical, no-fine-tuning route to extract deeply nested, interdependent parameters from optoelectronics literature. The differential benefit to smaller models is a notable strength that could broaden access to automated database building in the field.

major comments (3)
  1. [Evaluation / Results] The central performance claims rest on F1 scores computed against held-out literature, yet the manuscript supplies no information on dataset size, number of papers, annotation protocol, or inter-annotator agreement. Without these details the reported 5.7% average lift and 24.1% model-specific gains cannot be assessed for statistical reliability or selection bias.
  2. [Methods / Schema Design] The fixed JSON schema is presented as sufficient to capture all nested layer sequences, material stacks, and parameter interdependencies across the test papers. No coverage analysis, schema-validation procedure, or handling of out-of-schema structures (e.g., variable numbers of active/injector regions or doping profiles) is provided; if any test instance requires field dropping or coercion, the no-fine-tuning and scalability claims are compromised.
  3. [Results] No error analysis or per-structure breakdown (e.g., success rates on deeply nested versus flat parameters) is reported. This omission prevents identification of where schema guidance actually resolves structural inconsistencies versus where it merely masks failures.
minor comments (2)
  1. [Abstract / Experimental Setup] The abstract and main text refer to “twelve models” and “Kimi-k2-thinking” without an explicit table listing all models, their sizes, or open-source status; adding such a table would improve reproducibility.
  2. [Methods] Clarify whether the JSON schema was constructed once from a development subset or iteratively refined on the evaluation papers; the latter would introduce a subtle data-leakage risk.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for improving the clarity and completeness of our manuscript. We address each major comment below and describe the revisions we will implement.

read point-by-point responses
  1. Referee: [Evaluation / Results] The central performance claims rest on F1 scores computed against held-out literature, yet the manuscript supplies no information on dataset size, number of papers, annotation protocol, or inter-annotator agreement. Without these details the reported 5.7% average lift and 24.1% model-specific gains cannot be assessed for statistical reliability or selection bias.

    Authors: We agree that these methodological details are necessary to evaluate the reliability of the reported gains. The current manuscript does not include them. In the revised version we will add a dedicated subsection describing the dataset size and composition, the number of papers, the annotation protocol, and inter-annotator agreement. This addition will allow readers to assess statistical reliability and potential selection bias. revision: yes

  2. Referee: [Methods / Schema Design] The fixed JSON schema is presented as sufficient to capture all nested layer sequences, material stacks, and parameter interdependencies across the test papers. No coverage analysis, schema-validation procedure, or handling of out-of-schema structures (e.g., variable numbers of active/injector regions or doping profiles) is provided; if any test instance requires field dropping or coercion, the no-fine-tuning and scalability claims are compromised.

    Authors: The JSON schema was constructed to represent the nested and interdependent parameters typical of QCL architectures. We will add a coverage analysis section that includes the schema-validation procedure and explicitly states how out-of-schema structures are handled. In the evaluated test set no field dropping or coercion was required; any future out-of-schema cases can be accommodated by schema extension without fine-tuning. These clarifications will be incorporated in the revised manuscript. revision: yes

  3. Referee: [Results] No error analysis or per-structure breakdown (e.g., success rates on deeply nested versus flat parameters) is reported. This omission prevents identification of where schema guidance actually resolves structural inconsistencies versus where it merely masks failures.

    Authors: We concur that a detailed error analysis would better demonstrate the specific contributions of schema guidance. We will add an error-analysis subsection containing a per-structure breakdown (deeply nested layer sequences versus flat parameters) together with representative examples of resolved inconsistencies. This material will be included in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on held-out literature

full rationale

The paper reports an empirical study comparing LLM prompting strategies (conventional vs. JSON-schema-guided) on extraction of QCL device parameters from published papers. F1 scores and gains (5.7% average, up to 24.1% for mid-tier models) are measured against independent ground-truth annotations extracted from the test literature. The JSON schema is a fixed input to the pipeline, not derived from or fitted to the evaluation data; accuracy is not forced by construction. No self-citations, ansatzes, or renamings appear in the load-bearing claims. The derivation chain consists of standard held-out evaluation and is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that LLMs can reliably follow complex JSON schemas for generation and that the chosen schema captures all relevant device parameters without omission.

axioms (1)
  • domain assumption Large language models can be reliably constrained to produce valid, complete JSON matching a supplied schema for scientific nested data.
    Invoked when the pipeline is presented as schema-constrained generation without fine-tuning.

pith-pipeline@v0.9.0 · 5552 in / 1257 out tokens · 41040 ms · 2026-05-12T04:37:41.418356+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 2 internal anchors

  1. [1]

    Nature 621(7980), 672–675 (2023)

    Van Noorden, R., Perkel, J.M.: Ai and sci- ence: what 1,600 researchers think. Nature 621(7980), 672–675 (2023)

  2. [2]

    Nature Machine Intelligence 7(7), 991–998 (2025)

    Miret, S., Krishnan, N.A.: Enabling large lan- guage models for real-world materials discov- ery. Nature Machine Intelligence 7(7), 991–998 (2025)

  3. [3]

    Nature Computational Science 6(3), 301–315 (2026) 11

    Shao, E., Wang, Y., Qian, Y., Pan, Z., Liu, H., Wang, D.: Sciscigpt: advancing human–ai collaboration in the science of science. Nature Computational Science 6(3), 301–315 (2026) 11

  4. [4]

    Science 264(5158), 553–556 (1994)

    Faist, J., Capasso, F., Sivco, D.L., Sirtori, C., Hutchinson, A.L., Cho, A.Y.: Quantum cascade laser. Science 264(5158), 553–556 (1994)

  5. [5]

    science 295(5553), 301–305 (2002)

    Beck, M., Hofstetter, D., Aellen, T., Faist, J., Oesterle, U., Ilegems, M., Gini, E., Mel- chior, H.: Continuous wave operation of a mid-infrared semiconductor laser at room tem- perature. science 295(5553), 301–305 (2002)

  6. [6]

    Nature 415(6874), 883–887 (2002)

    Gmachl, C., Sivco, D.L., Colombelli, R., Capasso, F., Cho, A.Y.: Ultra-broadband semi- conductor laser. Nature 415(6874), 883–887 (2002)

  7. [7]

    Applied Physics Letters 98(18), 181102 (2011)

    Bai, Y., Bandyopadhyay, N., Tsao, S., Slivken, S., Razeghi, M.: Room temperature quantum cascade lasers with 27% wall plug efficiency. Applied Physics Letters 98(18), 181102 (2011)

  8. [8]

    Nature492(7428), 229–233 (2012)

    Hugi, A., Villares, G., Blaser, S., Liu, H.C., Faist, J.: Mid-infrared frequency comb based on a quantum cascade laser. Nature492(7428), 229–233 (2012)

  9. [9]

    Applied Physics Letters 67(21), 3057–3059 (1995)

    Faist, J., Capasso, F., Sirtori, C., Sivco, D.L., Hutchinson, A.L., Cho, A.Y.: Continuous wave operation of a vertical transition quantum cas- cade laser above t= 80 k. Applied Physics Letters 67(21), 3057–3059 (1995)

  10. [10]

    IEEE Photonics Technology Let- ters 9(3), 294–296 (2002)

    Sirtori, C., Faist, J., Capasso, F., Sivco, D.L., Hutchinson, A.L., Cho, A.Y.: Mid-infrared (8.5 µm) semiconductor lasers operating at room temperature. IEEE Photonics Technology Let- ters 9(3), 294–296 (2002)

  11. [11]

    Electronics Letters 35(21), 1848–1849 (1999)

    Page, H., Kruck, P., Barbieri, S., Sirtori, C., Stellmacher, M., Nagle, J.: High peak power (1.1 w)(al) gaas quantum cascade laser emit- ting at 9.7 µm. Electronics Letters 35(21), 1848–1849 (1999)

  12. [12]

    Applied Physics Letters 78(2), 147–149 (2001)

    Faist, J., Beck, M., Aellen, T., Gini, E.: Quantum-cascade lasers based on a bound-to- continuum transition. Applied Physics Letters 78(2), 147–149 (2001)

  13. [13]

    Applied Physics Letters 84(10), 1659–1661 (2004)

    Maulini, R., Beck, M., Faist, J., Gini, E.: Broadband tuning of external cavity bound- to-continuum quantum-cascade lasers. Applied Physics Letters 84(10), 1659–1661 (2004)

  14. [14]

    : Broadband distributed-feedback quantum cas- cade laser array operating from 8.0 to 9.8 µm

    Lee, B.G., Belkin, M.A., Audet, R., MacArthur, J., Diehl, L., Pflügl, C., Capasso, F., Fis- cher, A.M., Gmachl, C.F., Wang, X., et al. : Broadband distributed-feedback quantum cas- cade laser array operating from 8.0 to 9.8 µm. IEEE Photonics Technology Letters 21(13), 914–916 (2009)

  15. [15]

    Applied Physics Letters 78(4), 396–398 (2001)

    Hofstetter, D., Beck, M., Aellen, T., Faist, J.: High-temperature operation of ingaas/alinas quantum cascade lasers at λ≈ 9 µm. Applied Physics Letters 78(4), 396–398 (2001)

  16. [16]

    Applied Physics Letters 86(4), 041109 (2005)

    Blaser, S., Yarekha, D., Hvozdara, L., Bonetti, Y., Muller, A., Giovannini, M., Faist, J.: Room-temperature, continuous-wave, single- mode quantum-cascade lasers at λ≈ 5.4 µm. Applied Physics Letters 86(4), 041109 (2005)

  17. [17]

    IEEE Photonics Technology Letters 21(12), 814–816 (2009)

    Wittmann, A., Gresch, T., Blaser, S., Muller, A., Faist, J.: Distributed-feedback quantum- cascade lasers at 9 µm operating in continuous wave up to 423 k. IEEE Photonics Technology Letters 21(12), 814–816 (2009)

  18. [18]

    Nature Photonics4(2), 95–98 (2010)

    Liu, P.Q., Hoffman, A.J., Escarra, M.D., Franz, K.J., Khurgin, J.B., Dikmelik, Y., Wang, X., Fan, J.-Y., Gmachl, C.F.: Highly power-efficient quantum cascade lasers. Nature Photonics4(2), 95–98 (2010)

  19. [19]

    Physical Review Applied 13(3), 034025 (2020)

    Franckié, M., Faist, J.: Bayesian optimization of terahertz quantum cascade lasers. Physical Review Applied 13(3), 034025 (2020)

  20. [20]

    Applied Physics Letters 101(2) (2012)

    Bismuto, A., Terazzi, R., Hinkov, B., Beck, M., Faist, J.: Fully automatized quantum cascade laser design by genetic optimization. Applied Physics Letters 101(2) (2012)

  21. [21]

    In: 2022 IEEE Photonics Society Summer Topicals Meeting Series (SUM), pp

    Hernandez, A.C., Lyu, M., Gmachl, C.F.: Gen- erating quantum cascade laser datasets for applications in machine learning. In: 2022 IEEE Photonics Society Summer Topicals Meeting Series (SUM), pp. 1–2 (2022). IEEE

  22. [22]

    In: 2023 57th Annual Conference on Information Sciences and Systems (CISS), pp

    Hernandez, A.C., Gmachl, C.F.: Application of machine learning to quantum cascade laser design. In: 2023 57th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6 (2023). IEEE

  23. [23]

    APL Machine Learning 2(3) (2024)

    Correa Hernandez, A., Gmachl, C.F.: A machine learning framework for quantum cas- cade laser design. APL Machine Learning 2(3) (2024)

  24. [24]

    In: 2023 IEEE Photonics Conference (IPC), pp

    Hu, Y., Suri, S., Kirch, J., Knipfer, B., Jacobs, S., Nair, S., Zhou, Z., Yu, Z., Botez, D., Mawst, L.: Active-region design of mid-infrared quan- tum cascade lasers via machine learning. In: 2023 IEEE Photonics Conference (IPC), pp. 1–2 (2023). IEEE

  25. [25]

    Applied Physics Letters 124(24) (2024)

    Hu, Y., Suri, S., Kirch, J., Knipfer, B., Jacobs, S., Nair, S., Zhou, Z., Yu, Z., Botez, D., 12 Mawst, L.: Large-scale data generation for quantum cascade laser active-region design with automated wavefunction identification. Applied Physics Letters 124(24) (2024)

  26. [26]

    Jour- nal of applied physics 97(8) (2005)

    Mirčetić, A., Indjin, D., Ikonić, Z., Harrison, P., Milanović, V., Kelsall, R.W.: Towards auto- mated design of quantum cascade lasers. Jour- nal of applied physics 97(8) (2005)

  27. [27]

    Chemical Science 16(25), 11548–11558 (2025)

    Wang, Q., Zhang, W., Chen, M., Li, X., Xiong, Z., Xiong, J., Fu, Z., Zheng, M.: Nmrextractor: leveraging large language models to construct an experimental nmr database from open- source scientific publications. Chemical Science 16(25), 11548–11558 (2025)

  28. [28]

    Chemical Society Reviews (2025)

    Schilling-Wilhelmi, M., Ríos-García, M., Shabih, S., Gil, M.V., Miret, S., Koch, C.T., Márquez, J.A., Jablonka, K.M.: From text to insight: large language models for chemical data extraction. Chemical Society Reviews (2025)

  29. [29]

    Nature communications 15(1), 1418 (2024)

    Dagdelen, J., Dunn, A., Lee, S., Walker, N., Rosen, A.S., Ceder, G., Persson, K.A., Jain, A.: Structured information extraction from scien- tific text with large language models. Nature communications 15(1), 1418 (2024)

  30. [30]

    MedRxiv (2024)

    Wiest, I.C., Wolf, F., Leßmann, M.-E., Treeck, M., Ferber, D., Zhu, J., Boehme, H., Bressem, K.K., Ulrich, H., Ebert, M.P., et al.: Llm-aix: An open source pipeline for information extrac- tion from unstructured medical text based on privacy preserving large language models. MedRxiv (2024)

  31. [31]

    JMIR cancer 11, 65984 (2025)

    Chen, D., Alnassar, S.A., Avison, K.E., Huang, R.S., Raman, S.: Large language model applica- tions for health information extraction in oncol- ogy: scoping review. JMIR cancer 11, 65984 (2025)

  32. [32]

    In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp

    Bhattacharyya, A., Tripathi, A., Das, U., Kar- makar, A., Pathak, A., Gupta, M.: Information extraction from visually rich documents using llm-based organization of documents into inde- pendent textual segments. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 17241–17256 (2025)

  33. [33]

    Polymer 336, 128875 (2025)

    Chen, Z.-Y., Li, T., Yang, Y., Huang, H.-D., Lin, H., Liu, H., Zhong, G.-J., Li, Z.-M.: Intel- ligent information extraction pipeline driven by large language model for building poly- mer processing database. Polymer 336, 128875 (2025)

  34. [34]

    arXiv preprint arXiv:2602.04602 (2026)

    Sundaram, A.K., Chakraborty, M., Devathi, S.M.K., Prusty, B.P., Batra, R.: Automated extraction of multicomponent alloy data using large language models for sustainable design. arXiv preprint arXiv:2602.04602 (2026)

  35. [35]

    Neurocomputing 618, 129171 (2025)

    Goyal, N., Singh, N.: Named entity recognition and relationship extraction for biomedical text: A comprehensive survey, recent advancements, and future research directions. Neurocomputing 618, 129171 (2025)

  36. [36]

    In: Proceedings of the International Conference “Dialogue, vol

    Sidorova, E., Ivanov, A., Ilina, D., Ovchin- nikova, K., Osmushkin, N., Sery, A.: An approach to information extraction from texts of a limited subject domain based on a chain of large language models. In: Proceedings of the International Conference “Dialogue, vol. 2025 (2025)

  37. [37]

    Applied Soft Computing, 113302 (2025)

    Zhang, X., Cai, S., Shen, X., Yang, H., Hu, W., Zhang, Y.: Efficient unified information extrac- tion model based on large language models. Applied Soft Computing, 113302 (2025)

  38. [38]

    In: Artificial Intelli- gence for Global Security: First IFIP WG 12.13 International Conference, AI4GS 2024, Paris, France, November 19, 2024, Proceedings, vol

    Soltani, S., Limouni, E.: Llm based data anno- tation and augmentation for ner and relation- ship extraction models. In: Artificial Intelli- gence for Global Security: First IFIP WG 12.13 International Conference, AI4GS 2024, Paris, France, November 19, 2024, Proceedings, vol. 743, p. 153 (2025). Springer Nature

  39. [39]

    In: International Conference on Theory and Practice of Digital Libraries, pp

    Ateia, S., Kruschwitz, U., Scholz, M., Koschmider, A., Almohaishi, M.: Llm-based information extraction to support scientific literature research and publication workflows. In: International Conference on Theory and Practice of Digital Libraries, pp. 90–99 (2025). Springer

  40. [40]

    The American Surgeon™, 00031348251390958 (2025)

    Sanli, A.N., Turan, B., Tekcan Sanli, D.E.: Advances in large language model performance: a comparative study of chatgpt-4 and chatgpt-5 on absite questions. The American Surgeon™, 00031348251390958 (2025)

  41. [41]

    Preprints (2025)

    Anderson, I.: Comparative Analysis Between Industrial Design Methodologies Versus the Scientific Method: AI: Claude 3.7 Sonnet. Preprints (2025)

  42. [42]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al.: Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556 (2025)

  43. [43]

    GitHub (2022)

    Mangrulkar, S., Paul, S., Sanh, V., Gugger, 13 S.: PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. GitHub (2022)

  44. [44]

    Iclr 1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.,et al.: Lora: Low-rank adaptation of large language models. Iclr 1(2), 3 (2022)

  45. [45]

    Discover Computing 28(1), 1–20 (2025)

    Song, Y., Lv, C., Zhu, K., Qiu, X.: Lora fine- tuning of llama3 large model for intelligent fishery field. Discover Computing 28(1), 1–20 (2025)

  46. [46]

    Mineru2.5: Adecoupledvision-languagemodelforefficienthigh-resolution document parsing.arXiv preprint arXiv:2509.22186, 2025

    Niu, J., Liu, Z., Gu, Z., Wang, B., Ouyang, L., Zhao, Z., Chu, T., He, T., Wu, F., Zhang, Q., Jin, Z., Liang, G., Zhang, R., Zhang, W., Qu, Y., Ren, Z., Sun, Y., Zheng, Y., Ma, D., Tang, Z., Niu, B., Miao, Z., Dong, H., Qian, S., Zhang, J., Chen, J., Wang, F., Zhao, X., Wei, L., Li, W., Wang, S., Xu, R., Cao, Y., Chen, L., Wu, Q., Gu, H., Lu, L., Wang, K....

  47. [47]

    https://arxiv.org/abs/2409

    Wang, B., Xu, C., Zhao, X., Ouyang, L., Wu, F., Zhao, Z., Xu, R., Liu, K., Qu, Y., Shang, F., Zhang, B., Wei, L., Sui, Z., Li, W., Shi, B., Qiao, Y., Lin, D., He, C.: MinerU: An Open- Source Solution for Precise Document Content Extraction (2024). https://arxiv.org/abs/2409. 18839

  48. [48]

    arXiv preprint arXiv:2407.13773 (2024)

    He, C., Li, W., Jin, Z., Xu, C., Wang, B., Lin, D.: Opendatalab: Empowering general artificial intelligence with open datasets. arXiv preprint arXiv:2407.13773 (2024)

  49. [49]

    Energy Storage Materials 80, 104390 (2025)

    Liu, Y., Liu, D., Yang, Z., Ge, X., Yao, W., Wu, J., Avdeev, M., Shi, S.: A knowledge acquisition automatizing framework from literature exem- plified by na+ activation energy prediction of nasicon solid-state electrolyte. Energy Storage Materials 80, 104390 (2025)

  50. [50]

    GitHub (2021)

    Lyu, M.: ErwinJr2: Software Design for Model- ing Quantum Cascade Lasers. GitHub (2021)

  51. [51]

    PhD thesis, Princeton Uni- versity, Princeton, NJ (May 2021)

    Lyu, M.: Software design for modeling quantum cascade lasers and long wave- length ( ∼16 µm) GaAs/AlGaAs quantum cascade lasers. PhD thesis, Princeton Uni- versity, Princeton, NJ (May 2021). Adviser: Claire Gmachl. http://arks.princeton.edu/ark: /88435/dsp01th83m246x

  52. [52]

    In: Pro- ceedings of the 25th International Conference on World Wide Web, pp

    Pezoa, F., Reutter, J.L., Suarez, F., Ugarte, M., Vrgoč, D.: Foundations of json schema. In: Pro- ceedings of the 25th International Conference on World Wide Web, pp. 263–273 (2016)

  53. [53]

    ACM Transactions on Information Systems (TOIS) 40(2), 1–40 (2021)

    Vuong, T., Andolina, S., Jacucci, G., Ruotsalo, T.: Does more context help? effects of context window and application source on retrieval per- formance. ACM Transactions on Information Systems (TOIS) 40(2), 1–40 (2021)

  54. [54]

    arXiv preprint arXiv:2410.15821 (2024)

    Hawkins, W., Mittelstadt, B., Russell, C.: The effect of fine-tuning on language model toxicity. arXiv preprint arXiv:2410.15821 (2024)

  55. [55]

    Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

    Huang, T., Hu, S., Ilhan, F., Tekin, S.F., Liu, L.: Harmful fine-tuning attacks and defenses for large language models: A survey. arXiv preprint arXiv:2409.18169 (2024)

  56. [56]

    arXiv preprint arXiv:2508.14031 (2025)

    Hahm, D., Min, T., Jin, W., Lee, K.: Unintended misalignment from agentic fine- tuning: Risks and mitigation. arXiv preprint arXiv:2508.14031 (2025)

  57. [57]

    arXiv preprint arXiv:2409.01586 (2024)

    Huang, T., Hu, S., Ilhan, F., Tekin, S.F., Liu, L.: Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation. arXiv preprint arXiv:2409.01586 (2024)

  58. [58]

    journal-article

    Meyen, S., Sigg, D.M., Luxburg, U.v., Franz, V.H.: Group decisions based on confidence weighted majority voting. Cognitive research: principles and implications 6(1), 18 (2021) 14 A JSON Schema Design A.1 JSON Template for QCLs Information Extraction Fig. 3 illustrates the designed JSON template for extracting Quantum Cascade Laser (QCLs) struc- tural inf...