VLM-Aware Meta-Optic Front-End Design for Frozen Vision-Language Models
Pith reviewed 2026-06-29 00:26 UTC · model grok-4.3
The pith
Optimizing meta-optics directly against a frozen CLIP loss raises simulated zero-shot accuracy from 53.75% to 65.41% on ImageNet-100.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CODA optimizes a continuous-density meta-optic front-end for frozen-model recognition using differentiable image formation and adjoint-gradient updates of Maxwell-based simulations. It directly optimizes the cross-entropy loss of a fixed zero-shot CLIP classifier without learned reconstruction, image signal processing, or image-fidelity auxiliary objectives, improving CLIP ViT-L/14 zero-shot accuracy from 53.75 ± 3.57% to 65.41 ± 3.99% on ImageNet-100. The resulting optics transfer without re-optimization to SigLIP and DINOv2 on ImageNet-100, CIFAR-100, and Food-101.
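The objective named above can be illustrated with a minimal sketch of a zero-shot cross-entropy computed from cosine similarities between one image embedding and a set of class-prompt embeddings. This is not the paper's code: the function names, the 3-dimensional embeddings, and the `logit_scale` value are assumptions made for illustration, and a real pipeline would take the embeddings from frozen CLIP encoders.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_cross_entropy(image_emb, text_embs, label, logit_scale=100.0):
    """Cross-entropy of a zero-shot classifier built from cosine similarities.

    image_emb: embedding of the optically formed image from a frozen encoder.
    text_embs: one frozen text embedding per class prompt.
    logit_scale: temperature multiplier; 100.0 is an illustrative assumption.
    """
    img = normalize(image_emb)
    logits = [logit_scale * sum(a * b for a, b in zip(img, normalize(t)))
              for t in text_embs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    predicted = max(range(len(logits)), key=lambda c: logits[c])
    return log_z - logits[label], predicted

# Hypothetical embeddings: one well aligned with class 0, one degraded.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sharp_loss, sharp_pred = zero_shot_cross_entropy([0.9, 0.1, 0.0], text_embs, label=0)
blurry_loss, blurry_pred = zero_shot_cross_entropy([0.4, 0.5, 0.1], text_embs, label=0)
```

In this toy case the degraded embedding is misclassified and incurs a much larger loss, which is the signal an optics optimizer would try to reduce.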
What carries the argument
The CODA co-design loop, which treats the meta-optic parameters as differentiable variables and updates them by backpropagating the frozen classifier's cross-entropy loss through a Maxwell-based image-formation model.
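A toy version of such a loop can be sketched as follows. Everything here is a stand-in chosen for illustration: a softmax-parameterized blur kernel replaces the Maxwell-based image formation, a fixed linear classifier `W` replaces the frozen CLIP model, and a hand-derived chain rule replaces the adjoint solver. Only the structure (frozen classifier, trainable optics, cross-entropy as the sole objective) mirrors the description above.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def blur(x, h):
    """Circular convolution of signal x with kernel h (toy image formation)."""
    n = len(x)
    return [sum(h[k] * x[(i - k) % n] for k in range(len(h))) for i in range(n)]

def loss_and_grad(theta, data, W):
    """Mean cross-entropy of the frozen classifier W and its gradient w.r.t. theta."""
    h = softmax(theta)
    total, grad_h = 0.0, [0.0] * len(theta)
    for x, y in data:
        n = len(x)
        img = blur(x, h)
        probs = softmax([sum(w * p for w, p in zip(row, img)) for row in W])
        total += -math.log(probs[y])
        # dL/dlogits = probs - onehot(y); dL/dimg = W^T dL/dlogits
        dlog = [p - (1.0 if c == y else 0.0) for c, p in enumerate(probs)]
        dimg = [sum(W[c][i] * dlog[c] for c in range(len(W))) for i in range(n)]
        # img_i = sum_k h_k x[(i-k) % n], so dimg_i/dh_k = x[(i-k) % n]
        for k in range(len(theta)):
            grad_h[k] += sum(dimg[i] * x[(i - k) % n] for i in range(n))
    grad_h = [g / len(data) for g in grad_h]
    # Softmax Jacobian: dL/dtheta_j = h_j * (g_j - sum_k h_k g_k)
    dot = sum(hk * gk for hk, gk in zip(h, grad_h))
    return total / len(data), [hj * (gj - dot) for hj, gj in zip(h, grad_h)]

def optimize(theta, data, W, lr=0.5, steps=200):
    """Plain gradient descent on the optics parameters; W is never updated."""
    history = []
    for _ in range(steps):
        loss, g = loss_and_grad(theta, data, W)
        history.append(loss)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta, history

def peak(n, pos):
    return [1.0 if i == pos else 0.0 for i in range(n)]

# Hypothetical two-class problem: the classifier keys on peaks at 2 and 5.
N = 8
data = [(peak(N, 2), 0), (peak(N, 5), 1)]
W = [[5.0 * v for v in peak(N, 2)], [5.0 * v for v in peak(N, 5)]]
theta_opt, history = optimize([0.0, 0.0, 0.0], data, W)
```

Starting from a uniform blur, the kernel concentrates on its first tap because that is what the frozen classifier rewards; no image-fidelity term appears anywhere in the loss.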
If this is right
- The same optic design works across CLIP, SigLIP, and DINOv2 without further optimization.
- Gains appear on CIFAR-100 and Food-101 as well as ImageNet-100.
- Recognition under meta-optic constraints improves when optical design is aligned with the model loss rather than image fidelity.
- No auxiliary reconstruction or perceptual losses are required for the improvement.
Where Pith is reading between the lines
- Optics optimized for human-interpretable images may be systematically suboptimal for downstream machine perception.
- Similar end-to-end optical co-design could be applied to other constrained sensors such as thermal or event cameras.
- Task-specific optics might become practical if the simulation-to-hardware gap can be closed.
Load-bearing premise
The Maxwell solver and adjoint gradients produce designs whose performance will match real fabricated hardware.
What would settle it
Fabricate the optimized meta-optic, capture real images of ImageNet objects through it, and measure whether zero-shot CLIP accuracy matches or exceeds the simulated 65.41%.
Original abstract
Conventional machine-vision pipelines typically rely on high-quality optics that produce clean, human-interpretable images, and optical design has therefore been driven by image-level criteria such as resolution, aberration correction, and pixel fidelity. However, such optics are often impractical for size-, cost-, or form-factor-constrained applications, where compact meta-optics offer an attractive alternative but operate under strict physical efficiency limits. We propose CODA, a co-design framework that optimizes a continuous-density meta-optic front-end for frozen-model recognition using differentiable image formation and adjoint-gradient updates of Maxwell-based simulations. CODA directly optimizes the cross-entropy loss of a fixed zero-shot CLIP classifier without learned reconstruction, image signal processing, or image-fidelity auxiliary objectives. In a two-dimensional simulated imaging benchmark on ImageNet-100, CODA improves CLIP ViT-L/14 zero-shot accuracy from 53.75 ± 3.57% with a focal-concentration baseline to 65.41 ± 3.99%. The optimized optics further transfer without re-optimization across CLIP, SigLIP, and DINOv2 on ImageNet-100, CIFAR-100, and Food-101. These results demonstrate that, under constrained meta-optic imaging, downstream recognition can be improved by aligning optical design with frozen vision-model objectives rather than conventional image-formation criteria.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CODA, a co-design framework that uses differentiable Maxwell-based image formation and adjoint-gradient optimization to tune a continuous-density meta-optic front-end directly against a frozen zero-shot CLIP classifier, minimizing cross-entropy loss without reconstruction or image-fidelity auxiliaries. In 2D simulated imaging on ImageNet-100 it reports lifting CLIP ViT-L/14 accuracy from 53.75 ± 3.57% (focal-concentration baseline) to 65.41 ± 3.99%; the same optics are reported to transfer to SigLIP and DINOv2, and to CIFAR-100 and Food-101, without re-optimization.
Significance. If the 2D simulator faithfully predicts fabricated 3D meta-optic behavior, the result would demonstrate that task-specific optical co-design can outperform conventional image-quality criteria under severe physical constraints, opening a route to compact, model-aware front-ends for edge vision. The absence of learned reconstruction or auxiliary losses is a methodological strength.
Major comments (2)
- [Abstract and §4 (experimental results)] All reported accuracy gains and cross-model transfer are obtained inside a 2D Maxwell simulator; the manuscript supplies no quantitative validation of simulator fidelity against fabricated devices, 3D vectorial effects, material dispersion, or sensor noise, which directly undermines the central claim that the designs “transfer without re-optimization” to real constrained meta-optics.
- [§3.2 (image-formation model)] The adjoint-gradient updates rest on the assumption that the 2D continuous-density parameterization accurately captures the physical degrees of freedom of a real 3D meta-optic; no sensitivity analysis or fabrication-error model is provided to bound the expected performance drop.
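One way a sensitivity analysis of this kind is often set up in topology optimization is sketched below: a smoothed tanh threshold projection maps continuous densities toward binary values, and shifting the threshold gives eroded and dilated variants as a proxy for over- and under-etching. Whether CODA uses this projection is not stated in the source; the function names and the values of `beta` and `delta` are assumptions for illustration.

```python
import math

def project(rho, beta=8.0, eta=0.5):
    """Smoothed threshold projection of a density rho in [0, 1].

    beta controls steepness and eta is the threshold; larger beta pushes
    the output closer to 0 or 1.
    """
    num = math.tanh(beta * eta) + math.tanh(beta * (rho - eta))
    den = math.tanh(beta * eta) + math.tanh(beta * (1.0 - eta))
    return num / den

def design_variants(density, beta=8.0, delta=0.1):
    """Return eroded, nominal, and dilated projections of one density field.

    Shifting the threshold by +/- delta mimics uniform over- and
    under-etching; delta = 0.1 is an illustrative assumption.
    """
    return {
        "eroded": [project(r, beta, 0.5 + delta) for r in density],
        "nominal": [project(r, beta, 0.5) for r in density],
        "dilated": [project(r, beta, 0.5 - delta) for r in density],
    }

density = [0.0, 0.3, 0.5, 0.7, 1.0]  # hypothetical design variables
variants = design_variants(density)
```

Evaluating the classifier loss on all three variants, rather than on the nominal design alone, would give a first bound on how much accuracy a uniform fabrication bias could cost.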
Minor comments (2)
- [Table 1, or equivalent results table] Report the number of random seeds and exact data splits used for the ±3.57% / ±3.99% intervals so that statistical significance of the 11.66-point gain can be assessed.
- [§3.1] Notation: the continuous-density variable and its projection onto the binary fabrication constraint should be defined with an equation number in §3.1.
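The seed-count request in the first minor comment matters because the same means and spreads give very different test statistics depending on the number of runs. The sketch below computes the 11.66-point gain and Welch's t statistic for a few hypothetical seed counts. It assumes the reported ± values are sample standard deviations over independent seeds, which the source does not confirm; if they are standard errors or confidence half-widths, the numbers change.

```python
import math

def welch_t(mean_a, sd_a, mean_b, sd_b, n_a, n_b):
    """Welch's t statistic for the difference of two independent sample means."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_b - mean_a) / standard_error

baseline_mean, baseline_sd = 53.75, 3.57   # focal-concentration baseline
coda_mean, coda_sd = 65.41, 3.99           # CODA-optimized optics
gain = coda_mean - baseline_mean

# The seed count is not given in the source, so several values are tried.
t_by_seeds = {n: welch_t(baseline_mean, baseline_sd, coda_mean, coda_sd, n, n)
              for n in (3, 5, 10)}

for n, t in sorted(t_by_seeds.items()):
    print(f"seeds per arm = {n:2d}, Welch t = {t:.2f}")
```

The statistic grows with the square root of the seed count, so a gain that looks decisive at ten seeds is considerably weaker evidence at three.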
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the simulation-only nature of our study. We address each major point below and will revise the manuscript to clarify scope and limitations without overstating applicability to physical devices.
- Referee: [Abstract and §4 (experimental results)] All reported accuracy gains and cross-model transfer are obtained inside a 2D Maxwell simulator; the manuscript supplies no quantitative validation of simulator fidelity against fabricated devices, 3D vectorial effects, material dispersion, or sensor noise, which directly undermines the central claim that the designs “transfer without re-optimization” to real constrained meta-optics.
  Authors: We agree that the reported gains (53.75% to 65.41% on ImageNet-100) and cross-model transfers are obtained exclusively within the 2D Maxwell simulator, with no fabricated-device validation, 3D effects, dispersion, or noise modeling included. The work is a simulation study; we will revise the abstract and §4 to explicitly qualify all claims as applying to the 2D simulated environment and remove language implying direct transfer to physical meta-optics. This is a scope clarification.
  Revision: yes
- Referee: [§3.2 (image-formation model)] The adjoint-gradient updates rest on the assumption that the 2D continuous-density parameterization accurately captures the physical degrees of freedom of a real 3D meta-optic; no sensitivity analysis or fabrication-error model is provided to bound the expected performance drop.
  Authors: The 2D continuous-density model is a standard approximation in the meta-optics literature for adjoint optimization. We will add a limitations paragraph in §3.2 that notes this assumption, cites 2D-to-3D discrepancy studies, and states that real-device performance may degrade, although we do not provide quantitative bounds on that degradation. A full fabrication-error model lies outside the current simulation-focused scope.
  Revision: partial
- Quantitative validation of simulator fidelity against fabricated 3D meta-optic devices, including 3D vectorial effects, material dispersion, and sensor noise
Circularity Check
No circularity: direct simulation-based optimization on fixed model loss
The paper's central derivation optimizes meta-optic density parameters via adjoint gradients on a differentiable 2D Maxwell image-formation model to minimize the cross-entropy of a frozen CLIP classifier. Reported accuracy gains (53.75% to 65.41% on ImageNet-100) and cross-model transfer are produced entirely inside this simulation; no parameter is fitted to a data subset and then renamed as a prediction, no quantity is defined in terms of itself, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The derivation chain therefore remains self-contained against external benchmarks and does not reduce to its inputs by construction.
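The adjoint-gradient step in that chain can be illustrated on a deliberately small problem. The sketch below uses a 2×2 linear system A(p) u = b as a stand-in for a discretized Maxwell solve and an objective J = cᵀu; the matrices, vectors, and function names are hypothetical. The point is only the mechanism: one extra solve with the transposed operator yields dJ/dp = −λᵀ (∂A/∂p) u, however many design parameters there are.

```python
def solve2(A, b):
    """Solve a 2x2 linear system A u = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

def system(p):
    """Toy 'physics' operator A(p) = A0 + p * A1 (hypothetical values)."""
    A0 = [[4.0, 1.0], [2.0, 3.0]]
    A1 = [[1.0, 0.0], [0.5, 2.0]]
    A = [[A0[i][j] + p * A1[i][j] for j in range(2)] for i in range(2)]
    return A, A1  # A1 is also dA/dp

def objective(p, b, c):
    """Forward evaluation only: J(p) = c^T u with A(p) u = b."""
    A, _ = system(p)
    u = solve2(A, b)
    return c[0] * u[0] + c[1] * u[1]

def objective_and_adjoint_gradient(p, b, c):
    A, dA_dp = system(p)
    u = solve2(A, b)                                 # forward solve: A u = b
    J = c[0] * u[0] + c[1] * u[1]
    A_T = [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]
    lam = solve2(A_T, c)                             # adjoint solve: A^T lam = c
    dA_u = [dA_dp[i][0] * u[0] + dA_dp[i][1] * u[1] for i in range(2)]
    grad = -(lam[0] * dA_u[0] + lam[1] * dA_u[1])    # dJ/dp = -lam^T (dA/dp) u
    return J, grad

b, c = [1.0, 2.0], [0.5, -1.0]
J, grad = objective_and_adjoint_gradient(0.7, b, c)
```

A central finite difference on `objective` is a convenient check that the adjoint gradient is implemented correctly before it is trusted inside an optimization loop.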