Recognition: unknown
SemiFA: An Agentic Multi-Modal Framework for Autonomous Semiconductor Failure Analysis Report Generation
Pith reviewed 2026-05-10 16:27 UTC · model grok-4.3
The pith
A multi-agent framework generates structured semiconductor failure analysis reports from images in under one minute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SemiFA is an agentic multi-modal framework that autonomously generates structured FA reports from semiconductor inspection images in under one minute. It decomposes the task into a four-agent pipeline: a DefectDescriber that classifies and narrates defect morphology, a RootCauseAnalyzer that fuses equipment telemetry with historically similar defects retrieved from a vector database, a SeverityClassifier that assigns severity and estimates yield impact, and a RecipeAdvisor that proposes corrective process adjustments, followed by PDF assembly.
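For readers unfamiliar with LangGraph-style orchestration, the sketch below shows how a four-agent chain of this shape is typically wired. It is a minimal illustration only: the node names follow the paper, but the state fields and node bodies are placeholders invented here, not the authors' implementation.

```python
# Hypothetical sketch of a four-agent FA pipeline wired with LangGraph.
# Node bodies are placeholders; only the node names and ordering follow the paper.
from typing import TypedDict
from langgraph.graph import StateGraph, END


class FAState(TypedDict, total=False):
    image_path: str            # inspection image under analysis
    telemetry: dict            # SECS/GEM-style equipment readings
    defect_description: str    # output of the DefectDescriber
    root_cause: str            # output of the RootCauseAnalyzer
    severity: str              # output of the SeverityClassifier
    recipe_advice: str         # output of the RecipeAdvisor
    report_path: str           # final PDF location


def defect_describer(state: FAState) -> FAState:
    # Would call the vision models (e.g. DINOv2 features plus a VLM narration).
    return {"defect_description": "placeholder morphology narrative"}


def root_cause_analyzer(state: FAState) -> FAState:
    # Would fuse telemetry with similar historical defects retrieved from a vector DB.
    return {"root_cause": "placeholder root-cause hypothesis"}


def severity_classifier(state: FAState) -> FAState:
    return {"severity": "placeholder severity and yield-impact estimate"}


def recipe_advisor(state: FAState) -> FAState:
    return {"recipe_advice": "placeholder corrective process adjustments"}


def report_builder(state: FAState) -> FAState:
    # Would assemble the structured PDF report (e.g. with ReportLab).
    return {"report_path": "report.pdf"}


graph = StateGraph(FAState)
for name, fn in [
    ("defect_describer", defect_describer),
    ("root_cause_analyzer", root_cause_analyzer),
    ("severity_classifier", severity_classifier),
    ("recipe_advisor", recipe_advisor),
    ("report_builder", report_builder),
]:
    graph.add_node(name, fn)

graph.set_entry_point("defect_describer")
graph.add_edge("defect_describer", "root_cause_analyzer")
graph.add_edge("root_cause_analyzer", "severity_classifier")
graph.add_edge("severity_classifier", "recipe_advisor")
graph.add_edge("recipe_advisor", "report_builder")
graph.add_edge("report_builder", END)

app = graph.compile()
result = app.invoke({"image_path": "wafer_001.png", "telemetry": {}})
```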
What carries the argument
The four-agent pipeline that sequences image-based defect description, telemetry-and-retrieval root cause analysis, severity classification, and corrective advice.
If this is right
- Complete structured reports become available in under one minute rather than several hours per case.
- Root cause reasoning improves when image analysis is combined with equipment telemetry and historical defect retrieval.
- Severity and yield impact estimates are produced automatically as part of each report.
- Corrective process adjustments are generated alongside the diagnosis.
- A new annotated dataset of 930 defect images supports further development across nine defect classes.
Where Pith is reading between the lines
- Engineers could process substantially higher volumes of inspection cases per day if the system scales reliably.
- The same agentic structure might apply to other manufacturing inspection tasks that require both image interpretation and equipment context.
- Real-time integration with production lines could allow immediate process corrections before defects accumulate.
- Larger-scale validation on live factory data would be required to confirm consistency beyond the introduced dataset.
Load-bearing premise
The assumption that automated retrieval of similar past defects plus equipment telemetry fusion can produce reliable root causes without human validation or additional domain-specific tuning.
What would settle it
A side-by-side review in which human experts independently analyze the same set of defect images and telemetry, then compare their root-cause conclusions against the system's outputs for agreement rate.
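A minimal sketch of what such a side-by-side review reduces to once each expert and system root cause is mapped to a categorical label; the label names and data below are invented for illustration, and scikit-learn's cohen_kappa_score is used only as one possible chance-corrected agreement measure.

```python
# Hypothetical agreement check between expert and system root-cause labels.
# The categories and data are illustrative, not from the paper.
from sklearn.metrics import cohen_kappa_score

expert_labels = ["cmp_slurry", "litho_focus", "etch_plasma", "cmp_slurry", "handling"]
system_labels = ["cmp_slurry", "litho_focus", "cmp_slurry", "cmp_slurry", "handling"]

# Raw agreement rate: fraction of cases where the two conclusions match.
agreement_rate = sum(e == s for e, s in zip(expert_labels, system_labels)) / len(expert_labels)

# Chance-corrected agreement.
kappa = cohen_kappa_score(expert_labels, system_labels)

print(f"raw agreement = {agreement_rate:.2f}, Cohen's kappa = {kappa:.2f}")
```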
Original abstract
Semiconductor failure analysis (FA) requires engineers to examine inspection images, correlate equipment telemetry, consult historical defect records, and write structured reports, a process that can consume several hours of expert time per case. We present SemiFA, an agentic multi-modal framework that autonomously generates structured FA reports from semiconductor inspection images in under one minute. SemiFA decomposes FA into a four-agent LangGraph pipeline: a DefectDescriber that classifies and narrates defect morphology using DINOv2 and LLaVA-1.6, a RootCauseAnalyzer that fuses SECS/GEM equipment telemetry with historically similar defects retrieved from a Qdrant vector database, a SeverityClassifier that assigns severity and estimates yield impact, and a RecipeAdvisor that proposes corrective process adjustments. A fifth node assembles a PDF report. We introduce SemiFA-930, a dataset of 930 annotated semiconductor defect images paired with structured FA narratives across nine defect classes, drawn from procedural synthesis, WM-811K, and MixedWM38. Our DINOv2-based classifier achieves 92.1% accuracy on 140 validation images (macro F1 = 0.917), and the full pipeline produces complete FA reports in 48 seconds on an NVIDIA A100-SXM4-40 GB GPU. A GPT-4o judge ablation across four modality conditions demonstrates that multi-modal fusion improves root cause reasoning by +0.86 composite points (1-5 scale) over an image-only baseline, with equipment telemetry as the more load-bearing modality. To our knowledge, SemiFA is the first system to integrate SECS/GEM equipment telemetry into a vision-language model pipeline for autonomous FA report generation.
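The abstract reports 92.1% accuracy and macro F1 = 0.917 for the DINOv2-based classifier, but this page does not reproduce the classifier head. The sketch below assumes a frozen DINOv2 ViT-B/14 backbone with a simple linear probe, one common way such numbers are obtained; the preprocessing and probe choice are assumptions, not the authors' exact setup.

```python
# Sketch of a frozen-DINOv2 linear probe and its validation metrics.
# The probe head and preprocessing are assumptions; only the backbone and the
# accuracy / macro-F1 metrics come from the paper's abstract.
import torch
from PIL import Image
from torchvision import transforms
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Public DINOv2 ViT-B/14 weights from torch.hub (downloads on first use).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),  # 224 is a multiple of the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


@torch.no_grad()
def embed(paths: list[str]) -> torch.Tensor:
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return backbone(batch)  # one CLS embedding per image


def evaluate_probe(train_paths, train_labels, val_paths, val_labels) -> dict:
    # Linear probe on frozen features; the head choice is an assumption.
    clf = LogisticRegression(max_iter=1000).fit(embed(train_paths).numpy(), train_labels)
    pred = clf.predict(embed(val_paths).numpy())
    return {
        "accuracy": accuracy_score(val_labels, pred),
        "macro_f1": f1_score(val_labels, pred, average="macro"),
    }
```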
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SemiFA, a LangGraph-based agentic framework with four specialized agents (DefectDescriber using DINOv2 and LLaVA-1.6, RootCauseAnalyzer fusing SECS/GEM telemetry and Qdrant vector-DB retrieval, SeverityClassifier, RecipeAdvisor) plus a PDF assembler that generates structured semiconductor failure analysis reports from inspection images in ~48 seconds. It contributes the SemiFA-930 dataset of 930 annotated defect images across nine classes and reports 92.1% DINOv2 classification accuracy (macro F1 0.917) on 140 validation images plus a GPT-4o judge ablation showing +0.86 composite score improvement for multi-modal (image + telemetry) root-cause reasoning over image-only.
Significance. If the root-cause outputs prove technically accurate under human expert review, SemiFA could meaningfully accelerate semiconductor failure analysis by reducing multi-hour expert workflows to under a minute while integrating telemetry and historical retrieval. The introduction of a domain-specific dataset and explicit fusion of SECS/GEM signals with vision-language models represent concrete engineering contributions that could be extended to other inspection-heavy manufacturing domains.
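For reference, root-cause quality in the summarized ablation is scored by a GPT-4o judge on a 1-5 composite scale across four modality conditions. A hedged sketch of how such a judge loop is commonly written with the OpenAI Python client follows; the rubric wording and report handling here are invented, and only the use of a GPT-4o judge comes from the paper.

```python
# Hypothetical LLM-judge loop for scoring root-cause sections on a 1-5 scale.
# The rubric text is invented; the judge model name follows the paper.
from statistics import mean
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Rate the following root-cause analysis for a semiconductor defect on a 1-5 "
    "scale for plausibility, specificity, and consistency with the evidence. "
    "Reply with a single integer."
)


def judge_score(root_cause_text: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": root_cause_text},
        ],
    )
    return int(resp.choices[0].message.content.strip())


def compare_conditions(reports_by_condition: dict[str, list[str]]) -> dict[str, float]:
    # e.g. {"image_only": [...], "image_plus_telemetry": [...]} -> mean judge score
    return {
        condition: mean(judge_score(report) for report in reports)
        for condition, reports in reports_by_condition.items()
    }
```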
major comments (3)
- [Results / Evaluation] Abstract and implied results section: Root-cause reasoning quality and overall report reliability rest exclusively on a GPT-4o judge ablation (+0.86 composite points on a 1-5 scale for multi-modal vs. image-only). No human FA engineer ratings, inter-rater agreement statistics, or objective ground-truth metrics (e.g., retrieval precision@K or root-cause correctness on held-out expert-labeled cases) are reported, leaving the central claim of “reliable” autonomous reports without direct validation.
- [Dataset] Dataset section: SemiFA-930 is described as “annotated” and drawn from procedural synthesis, WM-811K, and MixedWM38, yet the manuscript supplies no details on annotation provenance (expert engineers vs. synthetic), inter-annotator agreement, exact train/validation/test splits, or how the 140-image validation set for the 92.1% DINOv2 accuracy was constructed.
- [Methods / RootCauseAnalyzer] Methods, RootCauseAnalyzer description: The fusion of SECS/GEM telemetry with Qdrant vector-DB retrieval is presented as the key mechanism for accurate root-cause identification, but the paper omits concrete implementation details such as the embedding model, similarity metric, number of retrieved neighbors, prompt templates for the analyzer agent, and any ablation measuring retrieval quality independently of the GPT-4o judge (a hypothetical sketch of such a retrieval step follows this list).
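To make the missing choices concrete, here is a hypothetical version of the retrieval step this comment asks about, with the unspecified parameters stated explicitly as assumptions (embedding source, similarity metric, K = 5, and the collection name are illustrative); it follows the public qdrant-client API but is not the authors' code.

```python
# Hypothetical retrieval step for a RootCauseAnalyzer-style agent. The collection
# name, K, and payload fields are assumptions the paper does not state.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
K = 5  # number of historical defects to retrieve (illustrative)


def retrieve_similar_defects(defect_embedding: list[float]) -> list[dict]:
    hits = client.search(
        collection_name="historical_defects",  # hypothetical collection
        query_vector=defect_embedding,
        limit=K,
        with_payload=True,
    )
    # Each payload would carry the past defect's description, root cause, and fix.
    return [{"score": h.score, **(h.payload or {})} for h in hits]


def build_analyzer_context(defect_embedding: list[float], telemetry: dict) -> str:
    # Fuse equipment telemetry with retrieved neighbors into one analyzer prompt.
    neighbors = retrieve_similar_defects(defect_embedding)
    lines = [f"Equipment telemetry: {telemetry}"]
    lines += [
        f"Similar past defect (score {n['score']:.2f}): {n.get('root_cause', 'n/a')}"
        for n in neighbors
    ]
    return "\n".join(lines)
```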
minor comments (2)
- [Abstract] Abstract: The statement “to our knowledge, the first system to integrate SECS/GEM telemetry into a vision-language model pipeline” should be supported by a brief related-work paragraph that explicitly contrasts prior FA automation efforts.
- [Methods] Methods: The LangGraph pipeline diagram and agent prompts are referenced but not reproduced; including them (or a link to supplementary material) would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.
read point-by-point responses
Referee: [Results / Evaluation] Abstract and implied results section: Root-cause reasoning quality and overall report reliability rest exclusively on a GPT-4o judge ablation (+0.86 composite points on a 1-5 scale for multi-modal vs. image-only). No human FA engineer ratings, inter-rater agreement statistics, or objective ground-truth metrics (e.g., retrieval precision@K or root-cause correctness on held-out expert-labeled cases) are reported, leaving the central claim of “reliable” autonomous reports without direct validation.
Authors: We agree that the current evaluation of root-cause reasoning relies on an automated GPT-4o judge and lacks direct human expert validation or objective ground-truth metrics. This is a valid limitation of the presented results. In the revised manuscript we will add a human evaluation study in which two semiconductor FA engineers rate a subset of 50 generated reports for technical accuracy and completeness on the same 1-5 scale, along with inter-rater agreement statistics. We will also report retrieval precision@K for the vector database component on held-out expert-labeled cases and include these results in the evaluation section. revision: yes
Referee: [Dataset] Dataset section: SemiFA-930 is described as “annotated” and drawn from procedural synthesis, WM-811K, and MixedWM38, yet the manuscript supplies no details on annotation provenance (expert engineers vs. synthetic), inter-annotator agreement, exact train/validation/test splits, or how the 140-image validation set for the 92.1% DINOv2 accuracy was constructed.
Authors: We will revise the Dataset section to supply the requested details. The annotations were produced by three experienced semiconductor FA engineers using a standardized labeling protocol. We will report inter-annotator agreement, the precise train/validation/test splits (including how the 140-image validation set was sampled while preserving class balance), and additional information on the procedural synthesis process used to augment the real defect images from WM-811K and MixedWM38. revision: yes
Referee: [Methods / RootCauseAnalyzer] Methods, RootCauseAnalyzer description: The fusion of SECS/GEM telemetry with Qdrant vector-DB retrieval is presented as the key mechanism for accurate root-cause identification, but the paper omits concrete implementation details such as the embedding model, similarity metric, number of retrieved neighbors, prompt templates for the analyzer agent, and any ablation measuring retrieval quality independently of the GPT-4o judge.
Authors: We will expand the RootCauseAnalyzer subsection to include all omitted implementation details: the embedding model, similarity metric, number of retrieved neighbors, and the prompt templates (moved to an appendix for readability). We will also add a dedicated ablation that isolates retrieval quality (precision@K on held-out cases) from the downstream GPT-4o judge to demonstrate the contribution of the vector database component. revision: yes
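A minimal sketch of the promised precision@K computation, under the assumption that a retrieved historical defect counts as relevant when it shares the expert-labeled root-cause category; the case data below are illustrative, not from the paper.

```python
# Minimal precision@K for the retrieval component. Relevance is assumed to mean
# that the retrieved historical defect shares the expert-labeled root-cause category.
def precision_at_k(retrieved_categories: list[str], true_category: str, k: int) -> float:
    top_k = retrieved_categories[:k]
    return sum(c == true_category for c in top_k) / k


# Illustrative held-out cases: expert-labeled root cause plus the categories of
# the top retrieved neighbors, in rank order.
cases = [
    {"true": "cmp_slurry", "retrieved": ["cmp_slurry", "cmp_slurry", "litho_focus", "cmp_slurry", "handling"]},
    {"true": "etch_plasma", "retrieved": ["etch_plasma", "handling", "etch_plasma", "etch_plasma", "cmp_slurry"]},
]

for k in (1, 3, 5):
    mean_p = sum(precision_at_k(c["retrieved"], c["true"], k) for c in cases) / len(cases)
    print(f"precision@{k} = {mean_p:.2f}")
```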
Circularity Check
No significant circularity; empirical claims rest on reported accuracies and external-judge ablation without reduction to fitted inputs or self-referential definitions.
full rationale
The paper describes an agentic LangGraph pipeline (DefectDescriber, RootCauseAnalyzer, etc.) that produces FA reports, supported by DINOv2 classifier accuracy (92.1% on 140 validation images), macro F1, runtime benchmarks, and a GPT-4o judge ablation showing modality improvements. No equations, parameter-fitting steps, or derivations are present that could reduce outputs to inputs by construction. The SemiFA-930 dataset and vector-DB retrieval are described as external resources; the judge ablation uses an independent model rather than self-referential scoring. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing. This is a standard empirical systems paper whose central claims are falsifiable via the reported metrics and do not collapse into definitional tautologies.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Pre-trained vision models can classify and narrate semiconductor defect morphology with sufficient accuracy for downstream reasoning.
- Domain assumption: Retrieval from a vector database of historical defects, combined with equipment telemetry, yields reliable root-cause hypotheses.
Reference graph
Works this paper leans on
- [1] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2024, pp. 34892–34916.
- [2] H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee, “LLaVA-NeXT: Improved reasoning, OCR, and world knowledge,” Jan. 2024. [Online]. Available: https://llava-vl.github.io/blog/2024-01-30-llava-next/
- [3] OpenAI, “GPT-4V(ision) system card,” OpenAI, Tech. Rep., Sep. 2023. [Online]. Available: https://openai.com/research/gpt-4v-system-card
- [4] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “InstructBLIP: Towards general-purpose vision-language models with instruction tuning,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2024.
- [5] LangChain, Inc., “LangGraph: Multi-actor applications with LLMs.”
- [6] [Online]. Available: https://github.com/langchain-ai/langgraph
- [7] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. Awadallah, R. W. White, D. Burger, and C. Wang, “AutoGen: Enabling next-gen LLM applications via multi-agent conversation,” arXiv preprint arXiv:2308.08155, 2023.
- [8] J. Moura, “CrewAI: Framework for orchestrating role-playing, autonomous AI agents,” 2024. [Online]. Available: https://github.com/crewaiinc/crewAI
- [9] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “DINOv2: Learning robust visual features without supervision,” 2024.
- [10] M.-J. Wu, J.-S. R. Jang, and J.-L. Chen, “Wafer map failure pattern recognition and similarity ranking for large-scale data sets,” IEEE Trans. Semiconductor Manufacturing, vol. 28, no. 1, pp. 1–12, Feb. 2015.
- [11] S. M. Saqlain, M. Jargalsaikhan, and J. Y. Lee, “A voting ensemble classifier for wafer map defect patterns identification in semiconductor manufacturing,” IEEE Trans. Semiconductor Manufacturing, vol. 32, no. 4, pp. 423–432, 2019.
- [12] J. Shim, S. Kang, and S. Cho, “Active learning of convolutional neural network for cost-effective wafer map pattern classification,” IEEE Trans. Semiconductor Manufacturing, vol. 33, no. 2, pp. 258–266, 2020.
- [13] R. Wang, “MixedWM38: A wafer map dataset with mixed-type defect patterns,” 2020. [Online]. Available: https://github.com/Junliangwangdhu/WaferMap
- [14] T. Nakazawa and D. V. Kulkarni, “Wafer map defect pattern classification and image retrieval using convolutional neural network,” IEEE Trans. Semiconductor Manufacturing, vol. 31, no. 2, pp. 309–314, 2018.
- [15] J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, “WinCLIP: Zero-/few-shot anomaly classification and segmentation,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 19606–19616.
- [16] J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in Proc. International Conference on Machine Learning (ICML), vol. 202, 2023, pp. 19730–19742.
- [17] Z. Gu, B. Zhu, G. Zhu, Y. Chen, M. Tang, and J. Wang, “AnomalyGPT: Detecting industrial anomalies using large vision-language models,” in Proc. AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 1932–1940.
- [18] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “MVTec AD – A comprehensive real-world dataset for unsupervised anomaly detection,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9592–9600.
- [19] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. International Conference on Machine Learning (ICML), vol. 139, 2021, pp. 8748–8763.
- [20] LangChain, Inc., “LangChain: Building applications with LLMs through composability,” 2024. [Online]. Available: https://github.com/langchain-ai/langchain
- [21] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291, 2023.
- [22] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapila, K. Hausman, and M. Pavone, “Foundation models in robotics: Applications, challenges, and the future,” arXiv preprint arXiv:2312.07843, 2023.
- [23] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,” arXiv preprint arXiv:2307.16789, 2023.
- [24] SEMI, “SEMI E5 – SEMI Equipment Communications Standard 2 Message Content (SECS-II),” SEMI Standards, 2023.
- [25] SEMI, “SEMI E37 – High-Speed SECS Message Services (HSMS) Generic Services,” SEMI Standards, 2023.
- [26] KLA Corporation, “KLA-Tencor Launches KLARITY® Defect Analysis System,” KLA Investor Relations, 2023. [Online]. Available: https://ir.kla.com/news-events/press-releases/detail/250/kla-tencor-launches-klarity-led-defect-analysis-system
- [27] Applied Materials, “AIx: Actionable Insight Accelerator,” 2023. [Online]. Available: https://www.appliedmaterials.com/us/en/semiconductor/solutions-and-software/ai-x.html
- [28] Infinita Lab, “Semiconductor Failure Analysis: Techniques and Standards,” 2023. [Online]. Available: https://infinitalab.com/blog/semiconductor-failure-analysis-techniques-standards/
- [29] AZoM, “Inspection and Failure Analysis as Strategic Investments in Semiconductor Fabs,” 2023. [Online]. Available: https://www.azom.com/article.aspx?ArticleID=25029
- [30] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2024.
- [31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- [32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. International Conference on Learning Representations (ICLR), 2015.
- [33] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-judge with MT-bench and chatbot arena,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2024.
- [34] Qdrant, “Qdrant: Vector similarity search engine,” 2024. [Online]. Available: https://qdrant.tech
- [35] ReportLab, Inc., “ReportLab: PDF generation in Python,” 2024. [Online]. Available: https://www.reportlab.com
- [36] S. Ramírez, “FastAPI: Modern, fast (high-performance) web framework for building APIs,” 2024. [Online]. Available: https://fastapi.tiangolo.com
- [37] OpenAI, “GPT-4o system card,” OpenAI, Tech. Rep., May 2024. [Online]. Available: https://openai.com/index/gpt-4o-system-card
- [38] C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, “LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2024.
- [39] K. Kuckreja, M. S. Danish, M. Naseer, A. Khan, F. S. Khan, and M. H. Daniyar, “GeoChat: Grounded large vision-language model for remote sensing,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27831–27840.