From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap
Pith reviewed 2026-05-23 19:05 UTC · model grok-4.3
The pith
Production-ready FMware requires solving distinct challenges in model selection, data alignment, prompt engineering, agent orchestration, testing, and deployment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that critical issues arise in FM(s) selection, data and model alignment, prompt engineering, agent orchestration, system testing, and deployment, alongside cross-cutting concerns such as memory management, observability, and feedback integration; these issues are identified through a semi-structured thematic synthesis of industry experience, grey literature, academic publications, OPEA involvement, AIware events, and ISO SPDX work on AI and datasets.
What carries the argument
Semi-structured thematic synthesis that aggregates and thematically codes evidence from the authors' FMArts platform work, Huawei Cloud integration, grey literature, academic papers, OPEA, AIware events, and ISO SPDX SBOM efforts on AI and datasets.
If this is right
- Better methods and criteria for selecting suitable foundation models become necessary for production systems.
- Dedicated techniques for aligning data and models are required to achieve consistent behavior.
- Specialized tooling and practices for prompt engineering and multi-agent orchestration are needed.
- Testing and deployment pipelines must incorporate FM-specific uncertainties and compliance checks.
- Observability, memory management, and continuous feedback mechanisms must be built into the system architecture.
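One way to read the testing bullet above: because an FM can return different outputs for the same input, a test may need to assert a pass rate over repeated runs rather than a single exact match. The sketch below is illustrative only and is not taken from the paper. `pass_rate`, `make_flaky_model`, and the 0.8 threshold are hypothetical names and values, and the stub stands in for a real model call.

```python
import random
from typing import Callable


def pass_rate(call: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], runs: int = 20) -> float:
    """Call a possibly nondeterministic function `runs` times and
    return the fraction of outputs accepted by `check`."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check(call(prompt)))
    return passed / runs


def make_flaky_model(seed: int, error_rate: float) -> Callable[[str], str]:
    """Hypothetical stand-in for an FM call: returns the right answer
    except with probability `error_rate`."""
    rng = random.Random(seed)

    def model(prompt: str) -> str:
        return "unknown" if rng.random() < error_rate else "4"

    return model


# Assumed usage: accept the behavior if at least 80% of runs pass.
model = make_flaky_model(seed=0, error_rate=0.1)
rate = pass_rate(model, "What is 2 + 2?", lambda out: out.strip() == "4", runs=50)
meets_threshold = rate >= 0.8  # hypothetical acceptance threshold
```

A threshold-based check like this trades strictness for stability; the right number of runs and threshold would depend on the system and are not specified by the source.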
Where Pith is reading between the lines
- Widespread adoption of the identified practices could shorten the time from prototype to reliable enterprise deployment.
- Multi-industry efforts may be required to develop the supporting platforms and standards referenced in the roadmap.
- Regulatory and standards bodies could incorporate the listed challenges into future compliance frameworks for AI software.
Load-bearing premise
The sources consulted comprehensively capture the main production challenges across the FMware domain.
What would settle it
A broad survey of deployed production FMware systems that identifies a substantially different primary set of obstacles not surfaced by the thematic synthesis.
Original abstract
The rapid expansion of foundation models (FMs), such as large language models (LLMs), has given rise to FMware, software systems that integrate FM(s) as core components. While building demonstration-level FMware is relatively straightforward, transitioning to production-ready systems presents numerous challenges, including reliability, high implementation costs, scalability, and compliance with privacy regulations. Our paper conducts a semi-structured thematic synthesis to identify key challenges in productionizing FMware across diverse data sources, including our industry experience developing FMArts, a FMware lifecycle engineering platform, and its integration into Huawei Cloud; grey literature; academic publications; hands-on involvement in the Open Platform for Enterprise AI (OPEA); organizing the AIware conference and bootcamp; and co-leading the ISO SPDX SBOM working group on AI and datasets. We identify critical issues in FM(s) selection, data and model alignment, prompt engineering, agent orchestration, system testing, and deployment, alongside cross-cutting concerns such as memory management, observability, and feedback integration. We discuss necessary technologies and strategies to address these challenges and offer guidance to enable the transition from demonstration systems to scalable, production-ready FMware solutions. Our findings underscore the importance of continued research and multi-industry collaboration to advance the development of production-ready FMware.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a semi-structured thematic synthesis drawing on the authors' industry experience with FMArts and Huawei Cloud, grey literature, academic publications, OPEA involvement, AIware events, and ISO SPDX SBOM work identifies core challenges in productionizing FMware (foundation-model software systems). These include FM(s) selection, data and model alignment, prompt engineering, agent orchestration, system testing, deployment, and cross-cutting concerns such as memory management, observability, and feedback integration. The paper discusses necessary technologies and strategies and calls for continued research and multi-industry collaboration.
Significance. If the synthesis holds, the paper would offer a structured, practitioner-oriented roadmap that consolidates emerging challenges in an important area of software engineering. Its grounding in standards work (ISO) and open platforms (OPEA) adds practical relevance and could help focus research efforts on production concerns rather than demonstration-level capabilities.
major comments (1)
- [Abstract / synthesis description] The description of the semi-structured thematic synthesis (abstract and any methods section) names the data sources but provides no detail on the coding scheme, inter-rater reliability, or how conflicts between sources were resolved. Because the central claim rests on the synthesis yielding the enumerated challenges, this omission is load-bearing for assessing the reliability of the findings.
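For context on the inter-rater reliability point: one widely used statistic for two coders assigning categorical labels is Cohen's kappa, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. Neither the paper nor the report names a specific statistic, so the choice of kappa, the function name, and the toy labels below are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(coder_a: Sequence[str], coder_b: Sequence[str]) -> float:
    """Cohen's kappa for two coders labeling the same items."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's label proportions, summed over labels.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy example: four excerpts coded into two hypothetical themes.
coder_a = ["testing", "testing", "deployment", "deployment"]
coder_b = ["testing", "deployment", "deployment", "deployment"]
kappa = cohens_kappa(coder_a, coder_b)  # p_o = 0.75, p_e = 0.5, kappa = 0.5
```

Here the coders agree on three of four items, but half of that agreement is expected by chance, which is why kappa (0.5) is lower than the raw agreement (0.75).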
minor comments (1)
- [Introduction] The coined term 'FMware' is introduced without an explicit, concise definition early in the manuscript; adding one would improve accessibility for readers outside the immediate community.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the methodology description. We address the major comment below and will revise the manuscript to provide greater transparency.
- Referee: The description of the semi-structured thematic synthesis (abstract and any methods section) names the data sources but provides no detail on the coding scheme, inter-rater reliability, or how conflicts between sources were resolved. Because the central claim rests on the synthesis yielding the enumerated challenges, this omission is load-bearing for assessing the reliability of the findings.
  Authors: We agree that additional detail on the synthesis process would strengthen the paper. In the revised version we will insert a dedicated Methods section describing the semi-structured thematic synthesis. It will explain: (1) the iterative coding process, beginning with challenges observed during FMArts development and Huawei Cloud integration, then refined against grey literature, academic publications, OPEA contributions, AIware events, and ISO SPDX work; (2) resolution of conflicts across sources through structured author discussions that weighted empirical production issues more heavily than purely theoretical ones; and (3) the rationale for not computing formal inter-rater reliability statistics, given the expert-driven, single-team nature of the synthesis, with reliability instead supported by triangulation across independent data sources. This addition will allow readers to evaluate the findings more rigorously while preserving the paper's practitioner-oriented focus.
  Revision: yes
Circularity Check
No significant circularity: synthesis paper whose claim is the act of synthesis itself
The paper performs a semi-structured thematic synthesis to enumerate production challenges for FMware. Its central claim is satisfied by reporting the synthesis performed on the listed sources (industry experience, grey literature, academic papers, OPEA, conferences, ISO work). No equations, quantitative predictions, fitted parameters, or uniqueness theorems appear. No step reduces a claimed derivation to its own inputs by construction, self-citation chain, or renaming. The reliance on authors' experience is disclosed as one input among others and does not create a load-bearing loop that forces the enumerated challenges. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Thematic synthesis of the listed practitioner and literature sources yields a complete and unbiased set of production challenges.
invented entities (1)
- FMware: no independent evidence