arxiv: 2604.09552 · v1 · pith:FZIS5C7Snew · submitted 2026-01-31 · 💻 cs.IR · cs.AI · cs.CL

MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval

Pith reviewed 2026-05-16 09:25 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords multimodal retrieval · RAG · engineering documentation · ColPali · question answering · DesignQA benchmark · LLM reasoning

The pith

A multimodal retrieval framework improves accuracy on engineering document questions by 41 percent relative to standard RAG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents MCERF, a system that retrieves both text and visual elements from engineering rulebooks using ColPali and then applies one of several reasoning strategies depending on the query type. The approach avoids ingesting complete documents and instead uses targeted retrieval plus LLM reasoning. On the DesignQA benchmark it achieves a 41.1 percent relative accuracy gain over the best prior RAG methods. The result matters for any setting where standards contain dense tables, diagrams, and rules that text-only systems struggle to navigate.

Core claim

MCERF demonstrates that coupling ColPali multimodal retrieval with four hand-crafted reasoning modes and two routing strategies produces substantially more accurate answers to questions drawn from engineering documentation than baseline retrieval-augmented generation, delivering a 41.1% relative accuracy improvement on the DesignQA benchmark while using only partial document access.
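The headline figure is a relative gain: the accuracy difference divided by the baseline accuracy, which is not the same as a gain in percentage points. The sketch below illustrates the arithmetic with placeholder accuracies that are not taken from the paper; `relative_gain` is a hypothetical helper name.

```python
def relative_gain(baseline: float, system: float) -> float:
    """Relative improvement of `system` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (system - baseline) / baseline


# Placeholder accuracies for illustration only (NOT results from the paper).
baseline_acc = 0.50
system_acc = 0.70

gain = relative_gain(baseline_acc, system_acc)
print(f"absolute gain: {system_acc - baseline_acc:.2f} (accuracy points)")
print(f"relative gain: {gain:.1%}")
```

Under these placeholder numbers a 20-point absolute difference corresponds to a much larger relative gain, which is why the baseline accuracy has to be reported alongside the percentage.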

What carries the argument

ColPali-based multimodal retriever combined with modular reasoning pipelines consisting of Hybrid Lookup, Vision-to-Text fusion, High-Reasoning LLM, and Self-Consistency modes, plus single-case and multi-agent routing.
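ColPali [4] ranks pages by late interaction: each query-token embedding is matched to its most similar page-patch embedding, and those maxima are summed into a page score. The following is a minimal sketch of that scoring step only, using toy 2-D vectors in place of real model embeddings. The function names and data are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product against any page patch."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: List[Vector],
               pages: Dict[str, List[Vector]],
               top_k: int = 2) -> List[str]:
    """Return the names of the top_k pages by late-interaction score."""
    scored = sorted(
        ((maxsim_score(query_vecs, vecs), name) for name, vecs in pages.items()),
        reverse=True,
    )
    return [name for _, name in scored[:top_k]]


# Toy embeddings (hypothetical): two query tokens, three candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "page_b": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first token
    "page_c": [[0.5, 0.5]],              # weak match to both tokens
}

print(rank_pages(query, pages, top_k=1))
```

Because only the top-scoring pages are passed on to the reasoning stage, this kind of ranking is what allows partial rather than full document access.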

If this is right

  • Question answering systems for engineering standards can achieve higher accuracy without ingesting entire rulebooks.
  • Vision-language retrieval enables direct use of figures and tables in reasoning chains.
  • Modular design supports future replacement of the underlying retriever or LLM.
  • Adaptive routing improves performance across different query complexities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar pipelines could be adapted for legal or medical documents that mix text with diagrams.
  • Further gains might come from training the routing agent on more diverse engineering corpora.
  • The framework offers a template for building domain-specific multimodal QA systems beyond the tested benchmark.

Load-bearing premise

That the ColPali retrieval and hand-designed reasoning modes will generalize beyond the DesignQA benchmark without benchmark-specific tuning.

What would settle it

A test on a fresh set of engineering rulebooks and questions where accuracy fails to exceed baseline RAG performance would falsify the general improvement claim.

original abstract

Engineering rulebooks and technical standards contain multimodal information like dense text, tables, and illustrations that are challenging for retrieval-augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes a Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, and multiple retrieval and reasoning strategies: (i) Hybrid Lookup mode for explicit rule mentions, (ii) Vision-to-Text fusion for figure- and table-guided queries, (iii) High-Reasoning LLM mode for complex multimodal questions, and (iv) Self-Consistency decision to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of underlying model architecture. Furthermore, this work establishes and compares two routing approaches: a single-case routing approach and a multi-agent system, both of which dynamically allocate queries to optimal pipelines. Evaluation on the DesignQA benchmark illustrates that this system improves average accuracy across all tasks with a relative gain of +41.1% over the best baseline RAG results, which is a significant improvement in multimodal and reasoning-intensive tasks without complete rulebook ingestion. This shows how vision-language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.
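The abstract does not say how the Self-Consistency decision is made. A common reading of the term is majority voting over several sampled answers. The sketch below shows that reading only; `ask` is a hypothetical stand-in for an LLM call, and the canned replies replace real model output.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(ask: Callable[[str], str],
                           question: str,
                           n_samples: int = 5) -> Tuple[str, float]:
    """Sample several answers and return the majority answer with its vote share."""
    answers = [ask(question).strip().lower() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


# Hypothetical stub: canned replies instead of a real (non-deterministic) model.
canned = iter(["yes", "no", "yes", "Yes ", "no"])


def fake_llm(question: str) -> str:
    return next(canned)


winner, agreement = self_consistent_answer(
    fake_llm, "Does the design comply with the rule?"
)
print(winner, agreement)
```

The vote share can double as a rough confidence signal, for example to flag questions where the samples disagree.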

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces MCERF, a multimodal retrieval-augmented generation framework for engineering documentation that pairs the ColPali retriever with four hand-designed reasoning modes (Hybrid Lookup, Vision-to-Text fusion, High-Reasoning LLM, Self-Consistency) and two dynamic routing schemes (single-case and multi-agent). It reports a 41.1% relative accuracy gain over baseline RAG on the DesignQA benchmark while avoiding full rulebook ingestion.

Significance. If the accuracy lift proves robust under fixed, non-oracle routing and is supported by ablations and statistical validation, the modular design could offer a practical template for handling multimodal technical documents (text, tables, figures) where pure text RAG falls short.

major comments (3)
  1. [Evaluation on the DesignQA benchmark] Evaluation section: The abstract and results claim a +41.1% relative gain from 'baseline RAG best results' but supply no explicit baseline configuration, error bars, number of runs, statistical tests, or ablation isolating each mode and router; without these the central empirical claim cannot be verified as robust.
  2. [Routing approaches] Routing approaches: The description of single-case and multi-agent routing does not state whether mode assignment (to Hybrid Lookup, Vision-to-Text, etc.) is performed from query features alone or involves post-hoc selection after inspecting ground truth or test-set performance; oracle routing would make the reported gain an upper bound rather than evidence of a deployable fixed system.
  3. [Introduction and related work] Comparison to prior work: While the manuscript builds on the DesignQA framework [1], it does not report a head-to-head accuracy and efficiency comparison against the original full-text ingestion baseline on the same tasks, leaving unclear how much of the gain is attributable to ColPali plus routing versus simply avoiding complete ingestion.
minor comments (3)
  1. [Abstract] Abstract: the phrasing 'without complete rulebook ingestion' should be quantified (e.g., fraction of pages or tokens actually retrieved) to make the efficiency claim concrete.
  2. Notation: ensure 'ColPali' is introduced with a brief parenthetical description on first use rather than assuming reader familiarity.
  3. Figures: captions for any routing diagrams or accuracy tables should explicitly list the exact metric (e.g., exact-match accuracy) and the number of queries per task.
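Major comment 1 asks for error bars and statistical tests. One way such reporting could look is sketched below: mean and standard deviation per system over repeated runs, plus an exact paired sign-flip permutation test on the per-run differences. The run accuracies are hypothetical placeholders, and the choice of test is an assumption rather than something the paper or the report specifies.

```python
from itertools import product
from statistics import mean, stdev
from typing import List


def paired_sign_flip_p(system: List[float], baseline: List[float]) -> float:
    """Exact two-sided p-value from flipping the sign of each paired difference."""
    diffs = [s - b for s, b in zip(system, baseline)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = [sign * d for sign, d in zip(signs, diffs)]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical per-run accuracies (NOT results from the paper).
baseline_runs = [0.50, 0.52, 0.49, 0.51, 0.50]
system_runs = [0.70, 0.71, 0.69, 0.72, 0.70]

p_value = paired_sign_flip_p(system_runs, baseline_runs)
print(f"baseline: {mean(baseline_runs):.3f} ± {stdev(baseline_runs):.3f}")
print(f"system:   {mean(system_runs):.3f} ± {stdev(system_runs):.3f}")
print(f"paired sign-flip p-value: {p_value:.4f}")
```

With five paired runs, this exact test cannot return a two-sided p-value below 2/32 = 0.0625, which is worth keeping in mind when choosing the number of runs.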

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the empirical claims and clarifications.

point-by-point responses
  1. Referee: [Evaluation on the DesignQA benchmark] Evaluation section: The abstract and results claim a +41.1% relative gain from 'baseline RAG best results' but supply no explicit baseline configuration, error bars, number of runs, statistical tests, or ablation isolating each mode and router; without these the central empirical claim cannot be verified as robust.

    Authors: We agree that additional details are required to verify robustness. In the revised manuscript we will explicitly document the baseline RAG configuration (retriever, LLM, and prompting), report mean accuracy and standard deviation over five independent runs with error bars, include paired statistical significance tests, and provide ablations that isolate the contribution of each reasoning mode and routing scheme. These additions will directly support the reported +41.1% relative gain. revision: yes

  2. Referee: [Routing approaches] Routing approaches: The description of single-case and multi-agent routing does not state whether mode assignment (to Hybrid Lookup, Vision-to-Text, etc.) is performed from query features alone or involves post-hoc selection after inspecting ground truth or test-set performance; oracle routing would make the reported gain an upper bound rather than evidence of a deployable fixed system.

    Authors: Mode assignment in both routing schemes is performed exclusively from query features and content, without access to ground-truth answers or test-set performance. The single-case router employs a lightweight query classifier, while the multi-agent router uses agent deliberation on the query alone. We will add explicit statements and pseudocode in the revised manuscript to confirm the absence of oracle information and to demonstrate that the system is a fixed, deployable pipeline. revision: yes

  3. Referee: [Introduction and related work] Comparison to prior work: While the manuscript builds on the DesignQA framework [1], it does not report a head-to-head accuracy and efficiency comparison against the original full-text ingestion baseline on the same tasks, leaving unclear how much of the gain is attributable to ColPali plus routing versus simply avoiding complete ingestion.

    Authors: We will add a direct head-to-head comparison against the original DesignQA full-text ingestion baseline on the identical DesignQA tasks. The revised evaluation section will report both accuracy and efficiency metrics (retrieval latency, token consumption, and memory usage) to quantify the incremental benefit of the ColPali retriever and routing over full ingestion. revision: yes
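Response 2 states that mode assignment uses query features alone. A minimal sketch of what a non-oracle, single-case router of that kind might look like follows. The rule-ID pattern, the visual cue words, and the mode labels are illustrative assumptions; the authors' actual classifier is not described in the text above.

```python
import re

# Hypothetical pattern for explicit rule mentions such as "V.1.2".
RULE_ID = re.compile(r"\b[A-Z]{1,2}\.\d+(?:\.\d+)*\b")
# Hypothetical cue words suggesting a figure- or table-guided query.
VISUAL_CUE = re.compile(r"\b(?:figure|table|image|diagram|drawing)s?\b", re.IGNORECASE)


def route(query: str) -> str:
    """Pick a reasoning mode from the query text only (no ground truth is visible)."""
    if RULE_ID.search(query):
        return "hybrid_lookup"
    if VISUAL_CUE.search(query):
        return "vision_to_text"
    return "high_reasoning"


examples = [
    "What does rule V.1.2 require?",
    "According to the table, what is the minimum thickness?",
    "Does this design comply with the clearance requirements?",
]
for q in examples:
    print(f"{route(q):15s} <- {q}")
```

Because `route` receives only the query string, it cannot consult answers or test-set performance, which is the property the referee asks the authors to document.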

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains rest on measured performance, not definitional reduction or self-citation chains

full rationale

The paper describes a modular system (ColPali retrieval plus four hand-designed modes and two routing schemes) and reports its measured accuracy on the external DesignQA benchmark, claiming a +41.1% relative gain over baseline RAG. No equations, fitted parameters, or predictions appear; the central result is an empirical comparison rather than a quantity derived by construction from the authors' inputs. The citation to DesignQA [1] supplies the benchmark dataset and prior baseline, not a load-bearing uniqueness theorem or ansatz that the present method reduces to. Hand-designed modes and routing are presented as engineering choices whose effectiveness is evaluated externally on held-out queries, with no indication that the reported lift is obtained by post-hoc oracle selection or by renaming a fitted quantity. The derivation chain is therefore self-contained as a system description plus benchmark measurement.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that ColPali provides effective joint text-image retrieval and that the four hand-specified reasoning modes are sufficient for engineering questions; no new entities or fitted parameters are introduced in the abstract.

axioms (2)
  • domain assumption: ColPali multimodal retriever can jointly index and retrieve text, tables, and figures from engineering documents
    Invoked as the core retrieval component without further justification in the abstract.
  • domain assumption: The DesignQA benchmark is representative of real engineering documentation tasks
    Used as the sole evaluation target.

pith-pipeline@v0.9.0 · 5590 in / 1372 out tokens · 21185 ms · 2026-05-16T09:25:47.462782+00:00 · methodology


Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 4 internal anchors

  1. [1]

    Designqa: A multimodal benchmark for evaluating large language models’ understanding of engineering documentation

    Doris, A. C., Grandi, D., Tomich, R., Alam, M. F., Ataei, M., Cheong, H., and Ahmed, F., 2025, “Designqa: A multimodal benchmark for evaluating large language models’ understanding of engineering documentation,” Journal of Computing and Information Science in Engineering, 25(2), p. 021009

  2. [2]

    Generative Models for Multimodal Document Understanding

    Rombach, R. and Esser, P., 2023, “Generative Models for Multimodal Document Understanding,” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

  3. [3]

    Layout-Aware Pre-training for Visually Rich Document Understanding

    Zhang, W., Li, X., and Wang, H., 2022, “Layout-Aware Pre-training for Visually Rich Document Understanding,” Advances in Neural Information Processing Systems (NeurIPS)

  4. [4]

    ColPali: Efficient Document Retrieval with Vision Language Models

    Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., Hudelot, C., and Colombo, P., 2024, “Colpali: Efficient document retrieval with vision language models,” arXiv preprint arXiv:2407.01449

  5. [5]

    A Comprehensive Review of Vision-Language Models

    Yin, W., Fu, J., and Liu, Z., 2023, “A Comprehensive Review of Vision-Language Models,” arXiv preprint arXiv:2301.05052

  6. [6]

    Multimodal Rag-Driven Anomaly Detection and Classification in Laser Powder Bed Fusion Using Large Language Models

    Naghavi Khanghah, K., Chen, Z., Romeo, L., Yang, Q., Malhotra, R., Imani, F., and Xu, H., 2025, “Multimodal Rag-Driven Anomaly Detection and Classification in Laser Powder Bed Fusion Using Large Language Models,” International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Vol. 89220, American Society of...

  7. [7]

    Agent-based Systems for Complex Task Automation and Reasoning

    Shen, Y., Chen, K., and Jiang, J., 2023, “Agent-based Systems for Complex Task Automation and Reasoning,” International Conference on Learning Representations (ICLR)

  8. [8]

    On the Limits of Retrieval-Augmented Generation for Fact-intensive Tasks

    Gao, T., Yao, W., and Chen, D., 2024, “On the Limits of Retrieval-Augmented Generation for Fact-intensive Tasks,” Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)

  9. [9]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Rocktäschel, T., Grefenstette, E., Kular, H. S., et al., 2020, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 9459–9474

  10. [10]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., and Zhou, D., 2022, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” Advances in Neural Information Processing Systems (NeurIPS)

  11. [11]

    GPT-4o System Card

    OpenAI, 2024, “GPT-4o System Card,”

  12. [12]

    GPT-4V(ision) system card

    OpenAI, 2023, “GPT-4V(ision) system card,”

  13. [13]

    Gemini API: Models - Gemini 1.0 Pro Vision

    Google AI, 2024, “Gemini API: Models - Gemini 1.0 Pro Vision,” https://ai.google.dev/gemini-api/docs/models/gemini

  14. [14]

    Claude 3 Model Card

    Anthropic, 2024, “Claude 3 Model Card,” https://www.anthropic.com/claude-3-model-card

  15. [15]

    LLaVA-v1.5-13B

    Liu, H., Li, C., Li, Y., and Lee, Y. J., 2023, “LLaVA-v1.5-13B,” Hugging Face, https://huggingface.co/liuhaotian/llava-v1.5-13b

  16. [16]

    Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

    Mahdi Abootorabi, M., Zobeiri, A., Dehghani, M., Mohammadkhani, M., Mohammadi, B., Ghahroodi, O., Soleymani Baghshah, M., and Asgari, E., 2025, “Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation,” arXiv e-prints, pp. arXiv–2502

  17. [17]

    Llm agent for fire dynamics simulations

    Xu, L., Mohaddes, D., and Wang, Y., 2024, “Llm agent for fire dynamics simulations,” arXiv preprint arXiv:2412.17146

  18. [18]

    Retrieval augmentation reduces hallucination in conversation

    Shuster, K., Poff, S., Chen, M., Kiela, D., and Weston, J., 2021, “Retrieval augmentation reduces hallucination in conversation,” arXiv preprint arXiv:2104.07567

  19. [19]

    Zero-Shot Anomaly Detection in Laser Powder Bed Fusion Using Multimodal Retrieval-Augmented Generation and Large Language Models

    Khanghah, K. N., Chen, Z., Romeo, L., Yang, Q., Malhotra, R., Imani, F., and Xu, H., 2026, “Zero-Shot Anomaly Detection in Laser Powder Bed Fusion Using Multimodal Retrieval-Augmented Generation and Large Language Models,” Journal of Mechanical Design, 148(7), p. 072001

  20. [20]

    Large language models for extrapolative modeling of manufacturing processes

    Naghavi Khanghah, K., Patel, A., Malhotra, R., and Xu, H., 2025, “Large language models for extrapolative modeling of manufacturing processes,” Journal of Intelligent Manufacturing, pp. 1–29

  21. [21]

    Robust multi model rag pipeline for documents containing text, table & images

    Joshi, P., Gupta, A., Kumar, P., and Sisodia, M., 2024, “Robust multi model rag pipeline for documents containing text, table & images,” 2024 3rd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), IEEE, pp. 993–999

  22. [22]

    Learning transferable visual models from natural language supervision

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al., 2021, “Learning transferable visual models from natural language supervision,” International conference on machine learning, PMLR, pp. 8748–8763

  23. [23]

    Contrastive localized language-image pre-training

    Chen, H.-Y., Lai, Z., Zhang, H., Wang, X., Eichner, M., You, K., Cao, M., Zhang, B., Yang, Y., and Gan, Z., 2024, “Contrastive localized language-image pre-training,” arXiv preprint arXiv:2410.02746

  24. [24]

    Uniclip: Unified framework for contrastive language-image pre-training

    Lee, J., Kim, J., Shon, H., Kim, B., Kim, S. H., Lee, H., and Kim, J., 2022, “Uniclip: Unified framework for contrastive language-image pre-training,” Advances in Neural Information Processing Systems, 35, pp. 1008–1019

  25. [25]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Li, J., Li, D., Xiong, C., and Hoi, S., 2022, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” International conference on machine learning, PMLR, pp. 12888–12900

  26. [26]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Li, J., Li, D., Savarese, S., and Hoi, S., 2023, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” International conference on machine learning, PMLR, pp. 19730–19742

  27. [27]

    MARVEL: unlocking the multi-modal capability of dense retrieval via visual module plugin

    Zhou, T., Mei, S., Li, X., Liu, Z., Xiong, C., Liu, Z., Gu, Y., and Yu, G., 2024, “MARVEL: unlocking the multi-modal capability of dense retrieval via visual module plugin,” Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14608–14624

  28. [28]

    Uniir: Training and benchmarking universal multimodal information retrievers

    Wei, C., Chen, Y., Chen, H., Hu, H., Zhang, G., Fu, J., Ritter, A., and Chen, W., 2024, “Uniir: Training and benchmarking universal multimodal information retrievers,” European Conference on Computer Vision, Springer, pp. 387–404

  29. [29]

    The probabilistic relevance framework: BM25 and beyond

    Robertson, S. and Zaragoza, H., 2009, The probabilistic relevance framework: BM25 and beyond, Vol. 4, Now Publishers Inc

  30. [30]

    M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation

    Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., and Liu, Z., 2024, “M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation,” Findings of the Association for Computational Linguistics ACL 2024, pp. 2318–2335

  31. [31]

    Colbert: Efficient and effective passage search via contextualized late interaction over bert

    Khattab, O. and Zaharia, M., 2020, “Colbert: Efficient and effective passage search via contextualized late interaction over bert,” Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp. 39–48

  32. [32]

    XL-HeadTags: Leveraging multimodal retrieval augmentation for the multilingual generation of news headlines and tags

    Shohan, F. T., Nayeem, M. T., Islam, S., Akash, A. U., and Joty, S., 2024, “XL-HeadTags: Leveraging multimodal retrieval augmentation for the multilingual generation of news headlines and tags,” arXiv preprint arXiv:2406.03776

  33. [33]

    EchoSight: Advancing visual-language models with Wiki knowledge

    Yan, Y. and Xie, W., 2024, “EchoSight: Advancing visual-language models with Wiki knowledge,” arXiv preprint arXiv:2407.12735

  34. [34]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al., 2024, “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191

  35. [35]

    M3docrag: Multi-modal retrieval is what you need for multi-page multi-document understanding

    Cho, J., Mahata, D., Irsoy, O., He, Y., and Bansal, M., 2024, “M3docrag: Multi-modal retrieval is what you need for multi-page multi-document understanding,” arXiv preprint arXiv:2411.04952

  36. [36]

    Systems engineering principles and practice

    Kossiakoff, A., Sweet, W. N., Seymour, S. J., and Biemer, S. M., 2011, Systems engineering principles and practice, Vol. 83, John Wiley & Sons

  37. [37]

    DesAgent: A Multi-Agent Mechanical Design Method Based on Collaborative Large and Small Models

    Zhang, S., Li, X., Yuan, C., Feng, W., and Jiang, Q., 2026, “DesAgent: A Multi-Agent Mechanical Design Method Based on Collaborative Large and Small Models,” Journal of Mechanical Design, 148(5), p. 051706

  38. [38]

    Agentic Large Language Models for Conceptual Systems Engineering and Design

    Massoudi, S. and Fuge, M., 2026, “Agentic Large Language Models for Conceptual Systems Engineering and Design,” Journal of Mechanical Design, 148(5), p. 051405

  39. [39]

    Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering

    Lin, W., Chen, J., Mei, J., Coca, A., and Byrne, B., 2023, “Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering,” Advances in Neural Information Processing Systems, 36, pp. 22820–22840

  40. [40]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al., 2025, “Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,” arXiv preprint arXiv:2502.14786

  41. [41]

    Gemma: Open Models Based on Gemini Research and Technology

    Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., et al., 2024, “Gemma: Open models based on gemini research and technology,” arXiv preprint arXiv:2403.08295

  42. [42]

    Introducing GPT-5

    OpenAI, 2025, “Introducing GPT-5,”

  43. [43]

    Chain-of-thought prompting elicits reasoning in large language models

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al., 2022, “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, 35, pp. 24824–24837

  44. [44]

    A survey of prompt engineering methods in large language models for different nlp tasks

    Vatsal, S. and Dubey, H., 2024, “A survey of prompt engineering methods in large language models for different nlp tasks,” arXiv preprint arXiv:2407.12994

  45. [45]

    Keyword vs Semantic Search for Retrieval-Augmented Generation: A Survey

    Chihaia, T. and Ciobanu, R.-I., 2025, “Keyword vs Semantic Search for Retrieval-Augmented Generation: A Survey,” 2025 25th International Conference on Control Systems and Computer Science (CSCS), IEEE, pp. 169–174

  46. [46]

    An empirical study of the non-determinism of chatgpt in code generation

    Ouyang, S., Zhang, J. M., Harman, M., and Wang, M., 2025, “An empirical study of the non-determinism of chatgpt in code generation,” ACM Transactions on Software Engineering and Methodology, 34(2), pp. 1–28

  47. [47]

    Uncertainty-aware fusion: An ensemble framework for mitigating hallucinations in large language models

    Dey, P., Merugu, S., and Kaveri, S., 2025, “Uncertainty-aware fusion: An ensemble framework for mitigating hallucinations in large language models,” Companion Proceedings of the ACM on Web Conference 2025, pp. 947–951

  48. [48]

    One llm is not enough: Harnessing the power of ensemble learning for medical question answering

    Yang, H., Li, M., Zhou, H., Xiao, Y., Fang, Q., and Zhang, R., 2023, “One llm is not enough: Harnessing the power of ensemble learning for medical question answering,” medRxiv

  49. [49]

    Has GPT-5 Achieved Spatial Intelligence? An Empirical Study

    Cai, Z., Wang, Y., Sun, Q., Wang, R., Gu, C., Yin, W., Lin, Z., Yang, Z., Wei, C., Shi, X., et al., 2025, “Has GPT-5 Achieved Spatial Intelligence? An Empirical Study,” arXiv preprint arXiv:2508.13142

  50. [50]

    A comprehensive review of multimodal large language models: Performance and challenges across different tasks

    Wang, J., Jiang, H., Liu, Y., Ma, C., Zhang, X., Pan, Y., Liu, M., Gu, P., Xia, S., Li, W., et al., 2024, “A comprehensive review of multimodal large language models: Performance and challenges across different tasks,” arXiv preprint arXiv:2408.01319

  51. [51]

    Segment anything

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., et al., 2023, “Segment anything,” Proceedings of the IEEE/CVF international conference on computer vision, pp. 4015–4026

  52. [52]

    StepIdeator: Utilizing Mixed Representations to Support Step-By-Step Design With Generative Artificial Intelligence

    Yao, J., Chen, P., Li, Z., Cai, Y., Wu, Y., You, W., and Sun, L., 2025, “StepIdeator: Utilizing Mixed Representations to Support Step-By-Step Design With Generative Artificial Intelligence,” Journal of Mechanical Design, 147(7), p. 071703

  53. [53]

    Identifying Reliable Evaluation Metrics for Scientific Text Revision

    Jourdan, L., Hernandez, N., Boudin, F., and Dufour, R., 2025, “Identifying Reliable Evaluation Metrics for Scientific Text Revision,” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6731–6756

  54. [54]

    Reconstruction and generation of porous metamaterial units via variational graph autoencoder and large language model

    Naghavi Khanghah, K., Wang, Z., and Xu, H., 2025, “Reconstruction and generation of porous metamaterial units via variational graph autoencoder and large language model,” Journal of Computing and Information Science in Engineering, 25(2), p. 021003

  55. [55]

    LLMs in e-commerce: A comparative analysis of GPT and LLaMA models in product review evaluation

    Roumeliotis, K. I., Tselikas, N. D., and Nasiopoulos, D. K., 2024, “LLMs in e-commerce: A comparative analysis of GPT and LLaMA models in product review evaluation,” Natural Language Processing Journal, 6, p. 100056

  56. [56]

    Language models are few-shot learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al., 2020, “Language models are few-shot learners,” Advances in neural information processing systems, 33, pp. 1877–1901