Bridging the Sim-to-Real Gap in Semiconductor Visual Program Synthesis via Input Binarization
Pith reviewed 2026-06-28 14:12 UTC · model grok-4.3
The pith
Binarizing SEM images lets a vision-language model trained only on synthetic data convert real inspection photos into accurate editable DSL code for circuit geometries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A Vision-Language Model trained solely on synthetic DSL-rendered data can convert real SEM inspection images into editable DSL code describing circuit geometries when the inputs are first binarized to remove texture and noise, as shown by an increase in mean Dice coefficient from 0.4393 to 0.5256 on the MIIC dataset.
What carries the argument
Input binarization strategy that strips SEM-specific texture and noise so the model focuses on geometric structure.
Load-bearing premise
Binarization removes only irrelevant texture and noise while retaining every geometric feature the VLM needs to produce accurate editable DSL code.
What would settle it
Finding a subset of MIIC images where binarized inputs yield lower Dice scores or visibly incorrect DSL geometry descriptions than the raw-input baseline would falsify the claim that binarization substantially mitigates the gap.
Figures
read the original abstract
Precise parametric control over circuit geometry is essential for semiconductor inspection, yet obtaining sufficient real training data remains costly. Although generative models such as diffusion models and Generative Adversarial Networks (GANs) can augment training data, they cannot guarantee the nanometer-scale geometric accuracy required for metrology tasks. We propose a visual program synthesis framework in which a Vision-Language Model (VLM) converts inspection images into editable Domain-Specific Language (DSL) code describing circuit geometries, enabling controlled generation of training data with exact parameter manipulation. Because the VLM is trained solely on synthetic DSL-rendered data, a domain gap arises when processing real Scanning Electron Microscope (SEM) images. We bridge this gap with an input binarization strategy that strips SEM-specific texture and noise, letting the model focus on geometric structure. On the MIIC dataset, binarized inputs improve the mean Dice coefficient from 0.4393 to 0.5256 over the raw-input baseline, demonstrating that simple texture abstraction substantially mitigates the sim-to-real gap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a visual program synthesis framework in which a VLM converts SEM inspection images into editable DSL code for circuit geometries. Training occurs exclusively on synthetic DSL-rendered data, creating a sim-to-real gap; the authors propose input binarization to strip texture and noise while preserving geometry. On the MIIC dataset this yields a mean Dice coefficient increase from 0.4393 (raw inputs) to 0.5256 (binarized inputs), which the abstract presents as evidence that texture abstraction substantially mitigates the domain gap.
Significance. If the binarization step reliably preserves the geometric parameters needed for correct DSL emission and the Dice gain is statistically robust, the work offers a low-cost, parameter-free preprocessing technique that could reduce dependence on scarce real metrology data. The approach is simple and directly testable, but its significance for the stated goal of “precise parametric control” remains provisional until the link between mask overlap and DSL parameter fidelity is demonstrated.
major comments (2)
- [Abstract] Abstract: the central claim equates the reported Dice improvement with successful mitigation of the sim-to-real gap for visual program synthesis, yet supplies no information on the number of test images, variance across runs, or statistical testing. Without these details the 0.0863 absolute gain cannot be evaluated for reliability.
- [Abstract] Abstract: Dice is an image-overlap metric computed against ground-truth geometries, but the manuscript provides no evidence that the metric is evaluated on masks rendered from the emitted DSL code, nor any parameter-level error statistics (line width, spacing, etc.) on the MIIC annotations. Because thresholding is lossy, the Dice gain alone does not establish that binarization leaves every geometric feature required for accurate, editable DSL intact.
minor comments (1)
- [Abstract] Abstract: the MIIC dataset is referenced without citation, size, or annotation protocol, hindering reproducibility assessment.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We appreciate the feedback on the abstract and will revise the manuscript to address the concerns regarding statistical details and evaluation of the Dice metric.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim equates the reported Dice improvement with successful mitigation of the sim-to-real gap for visual program synthesis, yet supplies no information on the number of test images, variance across runs, or statistical testing. Without these details the 0.0863 absolute gain cannot be evaluated for reliability.
Authors: We agree that the abstract lacks these details. We will revise the manuscript to include the number of test images, variance across runs, and results of statistical testing to support the reliability of the Dice improvement. revision: yes
-
Referee: [Abstract] Abstract: Dice is an image-overlap metric computed against ground-truth geometries, but the manuscript provides no evidence that the metric is evaluated on masks rendered from the emitted DSL code, nor any parameter-level error statistics (line width, spacing, etc.) on the MIIC annotations. Because thresholding is lossy, the Dice gain alone does not establish that binarization leaves every geometric feature required for accurate, editable DSL intact.
Authors: We acknowledge that the manuscript does not provide explicit evidence on the computation of Dice via DSL-rendered masks or parameter-level statistics. We will revise to clarify the evaluation method, add parameter-level error statistics, and address the potential lossiness of binarization to strengthen the connection to DSL fidelity. revision: yes
Circularity Check
No circularity: empirical Dice gain measured on external held-out ground truth
full rationale
The paper's central result is an empirical comparison: binarized inputs raise mean Dice from 0.4393 to 0.5256 on the MIIC dataset relative to a raw-input baseline. Dice is a standard overlap metric computed against independent ground-truth geometries; it is not defined in terms of any fitted parameter, threshold, or output of the proposed method. No equations, self-citations, or uniqueness claims appear in the supplied text that would reduce the reported improvement to a definitional identity or fitted-input prediction. The derivation therefore remains self-contained against an external benchmark.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
High-resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
2022
-
[2]
Generative adversarial nets,
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 27, 2014
2014
-
[3]
ChartCoder: Advancing multimodal large lan- guage model for chart-to-code generation,
X. Zhaoet al., “ChartCoder: Advancing multimodal large lan- guage model for chart-to-code generation,” inProceedings of the Annual Meeting of the Association for Computational Linguis- tics (ACL), 2025
2025
-
[4]
pix2code: Generating code from a graphical user interface screenshot,
T. Beltramelli, “pix2code: Generating code from a graphical user interface screenshot,” inProceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems (EICS), 2018
2018
-
[5]
Im2Vec: Synthesizing vector graphics without vector supervision,
P. Reddy, M. Gharbi, M. Lukac, and N. J. Mitra, “Im2Vec: Synthesizing vector graphics without vector supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
2021
-
[6]
Draw- ing2CAD: Sequence-to-sequence learning for CAD generation from vectorized drawings,
F. Qin, S. Lu, J. Hou, C. Wang, M. Fang, and L. Liu, “Draw- ing2CAD: Sequence-to-sequence learning for CAD generation from vectorized drawings,”arXiv preprint arXiv:2508.18733, 2025
-
[7]
H. Xie and F. Ju, “Text-to-CadQuery: A new paradigm for CAD generation with scalable large model capabilities,”arXiv preprint arXiv:2505.06507, 2025
-
[8]
Document image binarization with fully convolutional neural networks,
C. Tensmeyer and T. Martinez, “Document image binarization with fully convolutional neural networks,” inProceedings of the International Conference on Document Analysis and Recogni- tion (ICDAR), 2017
2017
-
[9]
S. Baiet al., “Qwen3-VL technical report,”arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Application of advanced image processing techniques to automatic kikuchi lines detection,
R. Fraczek and T. Zielinski, “Application of advanced image processing techniques to automatic kikuchi lines detection,” inProceedings of the European Signal Processing Conference (EUSIPCO), 2006
2006
-
[11]
Joint anomaly detection and inpainting for mi- croscopy images via deep self-supervised learning,
L. Huanget al., “Joint anomaly detection and inpainting for mi- croscopy images via deep self-supervised learning,” inProceed- ings of the IEEE International Conference on Image Processing (ICIP), 2021
2021
-
[12]
What is a good evaluation measure for semantic segmentation?,
G. Csurka, D. Larlus, and F. Perronnin, “What is a good evaluation measure for semantic segmentation?,” inProceedings of the British Machine Vision Conference (BMVC), 2013
2013
-
[13]
Boundary enhanced semantic segmentation for high resolution electron microscope images,
M. Pollach, F. Schiegg, M. Ludwig, A.-C. Bette, and A. Knoll, “Boundary enhanced semantic segmentation for high resolution electron microscope images,” inProceedings of the European Signal Processing Conference (EUSIPCO), 2022
2022
-
[14]
RoadTracer: Automatic extraction of road networks from aerial images,
F. Bastaniet al., “RoadTracer: Automatic extraction of road networks from aerial images,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.