Bridging the Sim-to-Real Gap in Semiconductor Visual Program Synthesis via Input Binarization

Koichiro Yawata; Koki Takeshita; Kota Dohi; Tatsuya Sasaki; Yusuke Ohtsubo

arxiv: 2606.02434 · v1 · pith:OKFLDGCBnew · submitted 2026-06-01 · 💻 cs.AI


Pith reviewed 2026-06-28 14:12 UTC · model grok-4.3

classification 💻 cs.AI

keywords: visual program synthesis, sim-to-real gap, input binarization, semiconductor metrology, vision-language model, domain-specific language, SEM images, Dice coefficient


The pith

Binarizing SEM images lets a vision-language model trained only on synthetic data convert real inspection photos into accurate editable DSL code for circuit geometries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a straightforward input binarization step substantially reduces the domain gap when a VLM, trained exclusively on synthetic DSL-rendered images, processes real SEM photographs to produce editable code describing semiconductor circuit shapes. This matters because collecting enough real annotated data for metrology training is expensive, and existing generative models cannot guarantee the nanometer-scale geometric fidelity required. Binarization removes texture and noise so that the model attends to geometry alone; the authors quantify the effect as a rise in mean Dice coefficient from 0.4393 to 0.5256 on the MIIC dataset. If the claim holds, synthetic data becomes usable for real-world tasks without sacrificing precision in the resulting programs.

Core claim

A Vision-Language Model trained solely on synthetic DSL-rendered data can convert real SEM inspection images into editable DSL code describing circuit geometries when the inputs are first binarized to remove texture and noise, as shown by an increase in mean Dice coefficient from 0.4393 to 0.5256 on the MIIC dataset.

What carries the argument

Input binarization strategy that strips SEM-specific texture and noise so the model focuses on geometric structure.
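
The source does not say which thresholding rule the authors use, so the following is only a minimal sketch of what an input binarization step could look like. It assumes an 8-bit grayscale SEM image and substitutes Otsu's global threshold as a stand-in; the function names `otsu_threshold` and `binarize`, and the synthetic test image, are hypothetical and not taken from the paper.

```python
import numpy as np


def otsu_threshold(image: np.ndarray) -> int:
    """Return Otsu's global threshold for an 8-bit grayscale image."""
    hist = np.bincount(image.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    omega = np.cumsum(prob)          # weight of the dark class at each cut
    mu = np.cumsum(prob * levels)    # cumulative intensity mean
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    between = np.zeros(256)
    valid = denom > 0                # skip cuts that leave one class empty
    between[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(between))   # cut with the largest between-class variance


def binarize(image: np.ndarray) -> np.ndarray:
    """Map an 8-bit grayscale image to a {0, 1} mask (assumed preprocessing)."""
    if image.dtype != np.uint8:
        raise TypeError("expected an 8-bit grayscale image")
    return (image > otsu_threshold(image)).astype(np.uint8)


# Synthetic stand-in for an SEM crop: one bright vertical line on a dark,
# noisy background. These values are illustrative, not MIIC data.
rng = np.random.default_rng(0)
geometry = np.zeros((64, 64), dtype=np.uint8)
geometry[:, 16:32] = 1
noisy = np.clip(60 + 120 * geometry + rng.normal(0, 10, geometry.shape), 0, 255)
mask = binarize(noisy.astype(np.uint8))
```

A global threshold is the simplest choice; real SEM images with uneven charging or contrast might need a local or adaptive rule, which is exactly the kind of detail the load-bearing premise depends on.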

Load-bearing premise

Binarization removes only irrelevant texture and noise while retaining every geometric feature the VLM needs to produce accurate editable DSL code.

What would settle it

Finding a subset of MIIC images where binarized inputs yield lower Dice scores or visibly incorrect DSL geometry descriptions than the raw-input baseline would falsify the claim that binarization substantially mitigates the gap.
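
One way to run that check, sketched under the assumption that a per-image Dice score is available for both the raw-input and the binarized-input runs. The helper `find_regressions`, the `margin` parameter, and the scores below are hypothetical placeholders, not results from the paper.

```python
import numpy as np


def find_regressions(raw_dice, binarized_dice, margin=0.0):
    """Return indices of images where binarization scores worse than raw input.

    `margin` ignores differences smaller than the given Dice amount.
    """
    raw = np.asarray(raw_dice, dtype=float)
    binarized = np.asarray(binarized_dice, dtype=float)
    if raw.shape != binarized.shape:
        raise ValueError("score arrays must have the same shape")
    return np.flatnonzero(binarized < raw - margin)


# Placeholder per-image scores for illustration only.
raw_scores = [0.40, 0.55, 0.30, 0.62]
bin_scores = [0.52, 0.50, 0.45, 0.61]
worse = find_regressions(raw_scores, bin_scores, margin=0.02)
```

The images flagged this way are the ones worth inspecting by eye for lost or distorted geometry in the emitted DSL.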

Figures

Figures reproduced from arXiv: 2606.02434 by Koichiro Yawata, Koki Takeshita, Kota Dohi, Tatsuya Sasaki, Yusuke Ohtsubo.

**Figure 2.** Examples of the proposed DSL and corresponding rendered

**Figure 3.** Qualitative comparison of reconstruction results. Columns

**Figure 4.** Impact of structural complexity on reconstruction fidelity

Original abstract

Precise parametric control over circuit geometry is essential for semiconductor inspection, yet obtaining sufficient real training data remains costly. Although generative models such as diffusion models and Generative Adversarial Networks (GANs) can augment training data, they cannot guarantee the nanometer-scale geometric accuracy required for metrology tasks. We propose a visual program synthesis framework in which a Vision-Language Model (VLM) converts inspection images into editable Domain-Specific Language (DSL) code describing circuit geometries, enabling controlled generation of training data with exact parameter manipulation. Because the VLM is trained solely on synthetic DSL-rendered data, a domain gap arises when processing real Scanning Electron Microscope (SEM) images. We bridge this gap with an input binarization strategy that strips SEM-specific texture and noise, letting the model focus on geometric structure. On the MIIC dataset, binarized inputs improve the mean Dice coefficient from 0.4393 to 0.5256 over the raw-input baseline, demonstrating that simple texture abstraction substantially mitigates the sim-to-real gap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Binarization lifts Dice on MIIC from 0.4393 to 0.5256 but the abstract gives no evidence that the DSL parameters stay accurate.


The main takeaway is that feeding binarized SEM images into the VLM improves mean Dice by about 0.086 on the MIIC set compared with raw inputs. That is the only quantitative result reported.
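
For reference, the arithmetic behind that figure, using only the two means quoted in the abstract:

```latex
\Delta = 0.5256 - 0.4393 = 0.0863, \qquad \frac{\Delta}{0.4393} \approx 0.196
```

The absolute gain of about 0.086 is therefore a relative improvement of roughly 20% over the raw-input baseline, while the binarized mean itself stays near 0.53 on a scale whose maximum is 1.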

The paper takes an existing VLM-to-DSL pipeline, trains it on synthetic circuit renders, and applies a preprocessing step to handle real microscope images. The motivation is clear: generative models cannot deliver the nanometer precision needed for metrology, so the authors turn to program synthesis instead. Binarization is a reasonable, low-cost attempt to strip texture and noise while leaving edges intact.

The experiment is straightforward and the reported gain is concrete. For readers who already work on semiconductor inspection or domain adaptation for structured output, the setup is easy to understand.

The soft spot is the gap between the metric and the actual goal. Dice measures mask overlap, yet the claim is that the VLM now produces editable DSL code with correct geometry. Thresholding can move boundaries, and nothing in the abstract shows parameter-level error on line width, spacing, or position, nor confirms that Dice was computed on rendered DSL output versus ground truth. No test-set size, variance, or statistical test is mentioned either.
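
To make the gap between mask overlap and parameter accuracy concrete, here is a small sketch of the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), applied to a toy case. The function name `dice_coefficient` and the 32×32 line masks are illustrative assumptions; the source does not describe the authors' evaluation code.

```python
import numpy as np


def dice_coefficient(pred, target) -> float:
    """Dice overlap between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, target).sum() / total)


# Toy case: the reference line is 8 px wide, the predicted line is 6 px wide
# and centred on it, i.e. a 25% error in the line-width parameter.
target = np.zeros((32, 32), dtype=np.uint8)
target[:, 10:18] = 1
pred = np.zeros((32, 32), dtype=np.uint8)
pred[:, 11:17] = 1

score = dice_coefficient(pred, target)  # 2*6 / (6 + 8) = 12/14
```

In this toy case a 25% line-width error still scores about 0.86, which is why a Dice gain on its own says little about parameter-level fidelity.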

This is for people in industrial vision or metrology who need controllable synthetic data. A reader already working on VLM program synthesis might pick up the binarization trick and test it themselves.

I would send it for peer review. The empirical observation is specific enough that referees can check whether the full paper supplies the missing parameter evaluation.

Referee Report

2 major / 1 minor

Summary. The paper presents a visual program synthesis framework in which a VLM converts SEM inspection images into editable DSL code for circuit geometries. Training occurs exclusively on synthetic DSL-rendered data, creating a sim-to-real gap; the authors propose input binarization to strip texture and noise while preserving geometry. On the MIIC dataset this yields a mean Dice coefficient increase from 0.4393 (raw inputs) to 0.5256 (binarized inputs), which the abstract presents as evidence that texture abstraction substantially mitigates the domain gap.

Significance. If the binarization step reliably preserves the geometric parameters needed for correct DSL emission and the Dice gain is statistically robust, the work offers a low-cost, parameter-free preprocessing technique that could reduce dependence on scarce real metrology data. The approach is simple and directly testable, but its significance for the stated goal of “precise parametric control” remains provisional until the link between mask overlap and DSL parameter fidelity is demonstrated.

major comments (2)

[Abstract] The central claim equates the reported Dice improvement with successful mitigation of the sim-to-real gap for visual program synthesis, yet supplies no information on the number of test images, variance across runs, or statistical testing. Without these details the 0.0863 absolute gain cannot be evaluated for reliability.
[Abstract] Dice is an image-overlap metric computed against ground-truth geometries, but the manuscript provides no evidence that the metric is evaluated on masks rendered from the emitted DSL code, nor any parameter-level error statistics (line width, spacing, etc.) on the MIIC annotations. Because thresholding is lossy, the Dice gain alone does not establish that binarization leaves every geometric feature required for accurate, editable DSL intact.
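
As an illustration of the kind of statistical check the first comment asks for, the sketch below computes a paired bootstrap confidence interval for the mean per-image Dice difference. It assumes per-image scores exist for both conditions on the same images; the function name `paired_bootstrap_ci` and the scores are hypothetical, not data from the paper.

```python
import numpy as np


def paired_bootstrap_ci(raw_dice, binarized_dice, n_resamples=10_000,
                        alpha=0.05, seed=0):
    """Return (mean difference, lower, upper) for binarized minus raw Dice.

    Images are resampled with replacement; pairing is preserved because
    only the per-image differences are resampled.
    """
    raw = np.asarray(raw_dice, dtype=float)
    binarized = np.asarray(binarized_dice, dtype=float)
    if raw.ndim != 1 or raw.shape != binarized.shape:
        raise ValueError("expected two 1-D arrays of equal length")
    diffs = binarized - raw
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diffs.size, size=(n_resamples, diffs.size))
    boot_means = diffs[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lower), float(upper)


# Placeholder scores for three images, for illustration only.
mean_gain, ci_low, ci_high = paired_bootstrap_ci([0.2, 0.4, 0.6],
                                                 [0.3, 0.6, 0.7])
```

An interval that excludes zero would support the reported gain; the second comment would additionally need parameter-level errors, which this sketch does not cover.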

minor comments (1)

[Abstract] The MIIC dataset is referenced without citation, size, or annotation protocol, hindering reproducibility assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We appreciate the feedback on the abstract and will revise the manuscript to address the concerns regarding statistical details and evaluation of the Dice metric.


Referee: [Abstract] The central claim equates the reported Dice improvement with successful mitigation of the sim-to-real gap for visual program synthesis, yet supplies no information on the number of test images, variance across runs, or statistical testing. Without these details the 0.0863 absolute gain cannot be evaluated for reliability.

Authors: We agree that the abstract lacks these details. We will revise the manuscript to include the number of test images, variance across runs, and results of statistical testing to support the reliability of the Dice improvement.
revision: yes

Referee: [Abstract] Dice is an image-overlap metric computed against ground-truth geometries, but the manuscript provides no evidence that the metric is evaluated on masks rendered from the emitted DSL code, nor any parameter-level error statistics (line width, spacing, etc.) on the MIIC annotations. Because thresholding is lossy, the Dice gain alone does not establish that binarization leaves every geometric feature required for accurate, editable DSL intact.

Authors: We acknowledge that the manuscript does not provide explicit evidence on the computation of Dice via DSL-rendered masks or parameter-level statistics. We will revise to clarify the evaluation method, add parameter-level error statistics, and address the potential lossiness of binarization to strengthen the connection to DSL fidelity.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical Dice gain measured on external held-out ground truth


The paper's central result is an empirical comparison: binarized inputs raise mean Dice from 0.4393 to 0.5256 on the MIIC dataset relative to a raw-input baseline. Dice is a standard overlap metric computed against independent ground-truth geometries; it is not defined in terms of any fitted parameter, threshold, or output of the proposed method. No equations, self-citations, or uniqueness claims appear in the supplied text that would reduce the reported improvement to a definitional identity or fitted-input prediction. The derivation therefore remains self-contained against an external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced or required by the abstract description.



