DxPTA: An Architecture Design Space Exploration with Optical Dataflow-guided Strategy for HW/SW Co-Design of Photonic Transformer Accelerators

Mahmoud Rasras; Muhammad Shafique; Rachmad Vidya Wicaksana Putra; Solomon Micheal Serunjogi

arxiv: 2606.06515 · v1 · pith:EUPR3HM3new · submitted 2026-06-02 · 💻 cs.AR · cs.AI· cs.DC· cs.ET· cs.LG

DxPTA: An Architecture Design Space Exploration with Optical Dataflow-guided Strategy for HW/SW Co-Design of Photonic Transformer Accelerators

Rachmad Vidya Wicaksana Putra , Solomon Micheal Serunjogi , Mahmoud Rasras , Muhammad Shafique This is my paper

Pith reviewed 2026-06-28 08:10 UTC · model grok-4.3

classification 💻 cs.AR cs.AIcs.DCcs.ETcs.LG

keywords photonic transformer acceleratorsdesign space explorationhardware software co-designoptical dataflowconstraint-aware searchDeiTBERTaccelerator architecture

0 comments

The pith

DxPTA identifies PTA architecture parameters from coherent optical dataflow, analyzes their impact, and uses the results to run a constraint-aware search that finds suitable designs for transformer models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DxPTA as a methodology to automate hardware/software co-design of photonic transformer accelerators. It first extracts key architecture parameters using the coherent optical dataflow, then determines which parameters matter most for area, power, energy, and latency. This analysis feeds a search algorithm that selects architectures meeting all given constraints for models such as DeiT-T/S/B and BERT-B/L. The approach delivers designs within 26 mm² area, 4.8 W power, 39 mJ energy, and 6 ms latency under the stated limits, while cutting search time by a factor of 15.2 compared with exhaustive enumeration.

Core claim

By grounding PTA parameter selection in coherent optical dataflow analysis and then applying the resulting significance ranking inside a constraint-aware search procedure, DxPTA produces photonic transformer accelerator architectures that simultaneously satisfy area, power, energy, and latency bounds for multiple transformer models, and does so more than fifteen times faster than exhaustive enumeration.

What carries the argument

The constraint-aware architecture search algorithm that ranks and prunes PTA design choices according to the impact analysis derived from coherent optical dataflow.

If this is right

PTA architectures can be generated automatically for any new transformer model once its workload characteristics are supplied.
Design time for photonic accelerators shrinks from days of manual tuning to minutes of automated search.
The same parameter-analysis step can be reused when additional constraints such as thermal limits or fabrication yield are added.
Hardware/software partitioning decisions become repeatable across different transformer sizes rather than ad-hoc.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dataflow-guided ranking could be applied to other photonic linear-algebra accelerators beyond transformers.
If the optical dataflow model is extended to include crosstalk or wavelength-dependent loss, the search could incorporate those effects without changing the overall procedure.
The reported 15.2x speedup suggests that larger design spaces, such as those arising from multi-chip photonic systems, become searchable with modest compute.

Load-bearing premise

The impact and significance ranking of PTA architecture parameters obtained from coherent optical dataflow analysis is accurate enough that the search procedure will locate a valid design whenever one exists.

What would settle it

Run the exhaustive enumeration on the same design space and constraints; if it returns a design that satisfies all limits yet DxPTA misses it, the method fails.

Figures

Figures reproduced from arXiv: 2606.06515 by Mahmoud Rasras, Muhammad Shafique, Rachmad Vidya Wicaksana Putra, Solomon Micheal Serunjogi.

**Figure 1.** Figure 1: (a) Transformer-based models typically can improve the performance at the cost of larger memory size (i.e., higher number of parameters); based on the data from [5]. (b) Experimental results of running Data-efficient Image Transformer Base (DeiT-B) [3] with different compute platforms: CPU, GPU, CMOS-based accelerators (i.e., AutoViT-4bit [25] and HeatViT-8bit [26]) and photonic-based Lightening-Transforme… view at source ↗

**Figure 2.** Figure 2: Experimental results considering different configurations of architec [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Our novel contributions in this work. II. PRELIMINARIES A. Transformer-based Networks A transformer-based network typically consists of multiple identical blocks, known as encoder and decoder blocks. Each block consists of a multi-head self-attention (MHA) module, a feed-forward network (FFN), shortcut connections, as well as a layer normalization (LN) [24]. Furthermore, the decoder block also has cross-at… view at source ↗

**Figure 5.** Figure 5: – An Nh×Nv DDot array is employed to perform multiplications, which defines the parallelism level in a core; see B in [PITH_FULL_IMAGE:figures/full_fig_p003_5.png] view at source ↗

**Figure 4.** Figure 4: Our novel DxPTA methodology with its key steps: (1) identifying of the architecture parameters of PTA; (2) analyzing the impact of architecture [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Architecture of the LT accelerator; based on [24]. [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗

**Figure 7.** Figure 7: (a) Power and area for different configurations of parameters. (b) Energy consumption and latency for different configurations of parameters when running DeiT-B. We observe similar trends for different workloads (DeiT-T/S and BERT-B/L). Algorithm 1 Observing the significance of architecture parameters INPUT: Maximum number of observations (J=10); OUTPUT: Significance score of each parameter on area (SA) a… view at source ↗

**Figure 8.** Figure 8: Experimental setup used in this work. power, energy, and latency profiles; see Alg. 2: lines 11-14. • We evaluate if these profiles meet all constraints and if the energy-delay product (EDP) is lower than the recorded one. If so, then the configuration is saved; see Alg. 2: lines 15- 22. Here, EDP is the metric for reflecting both performance and energy efficiency. • When the DSE process is finished, the l… view at source ↗

**Figure 9.** Figure 9: Experimental results of (a) power vs. area, (b) energy vs. area, (c) latency vs. area, and (d) EDP vs. area for different samples of configurations during DSE process, across different workloads: DeiT-T, DeiT-S, DeiT-B, BERT-B, and BERT-L. 0 25 50 75 100 125 Micro_comb Buffer Adder Core TIA ADC MZM DAC Laser 0 5 10 15 20 25 30 Thousands Buffer Adder Photodetector TIA ADC MZM DAC Laser Area [mm2] (a) (b)Pow… view at source ↗

**Figure 12.** Figure 12: Searching time profiles of the exhaustive search approach and the [PITH_FULL_IMAGE:figures/full_fig_p007_12.png] view at source ↗

**Figure 11.** Figure 11: Experimental results of (a) performance and (b) energy consumption of DeiT-B processing using CPU, GPU, electronic accelerators (i.e., AutoViT4b [25] and HeatViT-8b [26]), state-of-the-art PTAs (i.e., LT-Base and LTLarge [24]), and DxPTA. to 26mm2 area, 4.8W power, 39mJ energy, and 6ms latency, for constraints of 50mm2 area, 5W power, 50mJ energy, and 10ms latency; with 15.2x faster search time than the… view at source ↗

read the original abstract

Transformer-based networks have emerged as prominent AI models with state-of-the-art performance, which potentially pave the way toward artificial general intelligence (AGI). However, their large sizes still hinder their efficient implementation, thus highlighting the need for alternate solutions to enable their energy-efficient acceleration. Recently, state-of-the-art works propose photonic transformer accelerators (PTAs) with significant speedup and energy efficiency improvements over the conventional electronic accelerators. However, their PTA architectures are developed without considering the application constraints (e.g., area, power, energy, and latency). Moreover, their manual design approach also requires huge design time to determine a suitable architecture for the targeted application, hence making this approach not scalable. To address these limitations, we propose DxPTA, a novel design space exploration methodology for enabling efficient hardware/software co-design of the appropriate PTA architecture that meets all constraints. It is achieved by (1) identifying the PTA architecture parameters based on the coherent optical dataflow; (2) analyzing the impact/significance of the parameters; and (3) leveraging this analysis for devising a constraint-aware architecture search algorithm. Experimental results show that, our DxPTA can find the appropriate PTA architectures for different transformer-based models (i.e., DeiT-T/S/B and BERT-B/L). It achieves up to 26mm^2 area, 4.8W power, 39mJ energy, and 6ms latency, for constraints of 50mm^2 area, 5W power, 50mJ energy, and 10ms latency; with 15.2x faster searching time than the exhaustive approach. These results demonstrate the potential of DxPTA methodology for enabling efficient PTA designs for diverse AGI-based applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DxPTA gives a dataflow-guided DSE flow for photonic transformer accelerators that claims a 15x search speedup while meeting area/power/energy/latency bounds, but the parameter significance step may miss interactions and leave the optimality claim unproven.

read the letter

The main point is that DxPTA identifies PTA parameters from coherent optical dataflow, ranks their impact, and feeds that into a constraint-aware search to produce architectures for DeiT-T/S/B and BERT-B/L models. The reported designs hit 26 mm² area, 4.8 W power, 39 mJ energy, and 6 ms latency under the stated limits, all while running 15.2 times faster than exhaustive enumeration.

The work does a clean job of naming the real problems in prior PTA papers: manual tuning that ignores application constraints and does not scale. The three-step structure is a direct response, and the concrete numbers plus the speedup give a practical sense of what the method delivers for those specific models.

The soft spot sits in the middle step. The analysis of parameter significance from the optical dataflow is the load-bearing piece, yet nothing in the abstract shows checks for higher-order couplings (waveguide count with modulator density and thermal effects under varying batch sizes, for example). If those are missed, the search can return feasible but non-optimal points or even designs that later violate a constraint when the model changes. The only baseline shown is exhaustive search; no comparison to other DSE heuristics appears, so the advance over existing electronic-accelerator methods is hard to gauge.

This paper is for people already working on photonic hardware for transformers or on constraint-driven DSE in emerging technologies. A reader who needs a starting template for co-design in that niche can pull useful structure from it.

It deserves a serious referee. The claims are testable once the full analysis and search algorithm are examined, and the topic is timely enough that feedback on the parameter-ranking step would be valuable.

Referee Report

3 major / 2 minor

Summary. The paper proposes DxPTA, a design space exploration (DSE) methodology for photonic transformer accelerators (PTAs) that identifies architecture parameters from coherent optical dataflow, analyzes their impact and significance, and uses this to construct a constraint-aware search algorithm for HW/SW co-design. It reports that the method finds suitable PTA architectures for DeiT-T/S/B and BERT-B/L models that satisfy area (≤50 mm²), power (≤5 W), energy (≤50 mJ), and latency (≤10 ms) constraints, achieving up to 26 mm², 4.8 W, 39 mJ, and 6 ms respectively, while providing 15.2× faster search than exhaustive enumeration.

Significance. If the parameter-impact analysis is shown to be complete and the derived search is proven to reliably locate feasible points without systematic omissions, the work would provide a scalable alternative to manual PTA design, directly addressing the scalability bottleneck for constraint-driven photonic accelerators targeting transformer models. The reported constraint satisfaction and search speedup would be a concrete contribution to HW/SW co-design in the photonic-accelerator literature.

major comments (3)

[§3–4] §3–4 (Parameter identification and impact/significance analysis): The central claim that the optical-dataflow-guided analysis yields a search algorithm capable of reliably meeting all four constraints without overlooking superior designs rests on the assumption that first-order parameter impacts are sufficient; no evidence is provided that higher-order couplings (e.g., between waveguide count, modulator density, and thermal tuning across batch sizes) have been quantified or bounded, leaving open the possibility that the reported 26 mm² / 4.8 W / 39 mJ / 6 ms points are not the best feasible solutions.
[§5] §5 (Constraint-aware search algorithm): The manuscript states that the search meets the stated bounds for the five evaluated models, yet provides no formal argument or exhaustive cross-check that every feasible architecture satisfying the four constraints is either found or provably dominated by the returned solution; the 15.2× speedup claim is therefore only relative to exhaustive search and does not establish optimality or completeness of the pruned space.
[§6] §6 (Experimental results): Validation is limited to comparison against exhaustive search; no additional baselines (e.g., random search, Bayesian optimization, or prior PTA DSE methods) are reported, making it impossible to assess whether the observed speed/quality trade-off is competitive or merely an artifact of the chosen pruning heuristic.

minor comments (2)

[§3] Notation for optical parameters (e.g., waveguide count, modulator density) should be introduced once with consistent symbols and units before being used in the impact tables.
[§6] Figure captions for the search-time and constraint-satisfaction plots should explicitly state the number of evaluated architectures and the exact constraint vector used in each experiment.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and indicate planned revisions.

read point-by-point responses

Referee: [§3–4] §3–4 (Parameter identification and impact/significance analysis): The central claim that the optical-dataflow-guided analysis yields a search algorithm capable of reliably meeting all four constraints without overlooking superior designs rests on the assumption that first-order parameter impacts are sufficient; no evidence is provided that higher-order couplings (e.g., between waveguide count, modulator density, and thermal tuning across batch sizes) have been quantified or bounded, leaving open the possibility that the reported 26 mm² / 4.8 W / 39 mJ / 6 ms points are not the best feasible solutions.

Authors: We acknowledge that §§3–4 focus on first-order impacts derived from coherent optical dataflow analysis. Higher-order couplings (e.g., waveguide-modulator-thermal interactions across batch sizes) were not quantified or bounded. We will add a limitations discussion in the revised manuscript noting this assumption and its implications, while retaining the empirical evidence that the identified architectures satisfy all constraints. revision: partial
Referee: [§5] §5 (Constraint-aware search algorithm): The manuscript states that the search meets the stated bounds for the five evaluated models, yet provides no formal argument or exhaustive cross-check that every feasible architecture satisfying the four constraints is either found or provably dominated by the returned solution; the 15.2× speedup claim is therefore only relative to exhaustive search and does not establish optimality or completeness of the pruned space.

Authors: DxPTA employs a heuristic search guided by the significance analysis; we make no claim of formal optimality or completeness. The 15.2× speedup is presented strictly relative to exhaustive enumeration. We will revise §5 to explicitly state the heuristic character and that feasible solutions are found in the evaluated cases without a guarantee of global optimality. revision: partial
Referee: [§6] §6 (Experimental results): Validation is limited to comparison against exhaustive search; no additional baselines (e.g., random search, Bayesian optimization, or prior PTA DSE methods) are reported, making it impossible to assess whether the observed speed/quality trade-off is competitive or merely an artifact of the chosen pruning heuristic.

Authors: We agree that additional baselines are needed for a stronger evaluation. We will extend §6 with comparisons to random search and Bayesian optimization under identical constraints and design space to demonstrate the relative effectiveness of the dataflow-guided pruning. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in DxPTA derivation

full rationale

The paper presents a three-step methodology: (1) identify PTA architecture parameters from coherent optical dataflow, (2) analyze parameter impact/significance, and (3) leverage the analysis for a constraint-aware search algorithm. No equations, fitted parameters, or self-citations are described that reduce by construction to the inputs (e.g., no self-definitional loops where a prediction is the fit itself, no uniqueness theorems imported from prior author work, and no ansatz smuggled via citation). The reported results (architectures meeting area/power/energy/latency bounds for DeiT/BERT models) are outcomes of the search process rather than tautological renamings or forced predictions. This is a standard empirical DSE flow whose central claims remain independent of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The methodology depends on the domain assumption about optical dataflow modeling and the effectiveness of the derived search algorithm; no explicit free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Coherent optical dataflow provides a reliable basis for determining the significance of architecture parameters in PTAs
This is invoked in the first step of the methodology to identify and analyze parameters.

pith-pipeline@v0.9.1-grok · 5888 in / 1502 out tokens · 41784 ms · 2026-06-28T08:10:56.158496+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 4 canonical work pages

[1]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez et al., “Attention is all you need,”Advances in Neural Information Processing Systems (NIPS), vol. 30, no. 1, pp. 261–272, 2017. 0.01 0.1 1 10 100 DeiT-T DeiT-S DeiT-B BERT-B BERT-L Search Time Normalized to Exhaustive DeiT-T (log scale) Exhaustive DxPTA faster faster faster faster fas...

2017
[2]

An image is worth 16x16 words: Trans- formers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Trans- formers for image recognition at scale,” inInternational Conference on Learning Representations (ICLR), 2021

2021
[3]

Training data-efficient image transformers & distillation through attention,

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. J ´egou, “Training data-efficient image transformers & distillation through attention,” inInternational Conference on Machine Learning (ICML). PMLR, 2021, pp. 10 347–10 357

2021
[4]

Transformers in vision: A survey,

S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,”ACM Computing Surveys (CSUR), vol. 54, no. 10s, pp. 1–41, 2022

2022
[5]

A survey on vision transformer,

K. Han, Y . Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y . Tang, A. Xiao, C. Xu, Y . Xu, Z. Yang, Y . Zhang, and D. Tao, “A survey on vision transformer,”IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 45, no. 1, pp. 87–110, 2023

2023
[6]

Large language models for artificial general intelligence (agi): A survey of foundational principles and approaches,

A. Mumuni and F. Mumuni, “Large language models for artificial general intelligence (agi): A survey of foundational principles and approaches,”arXiv preprint arXiv:2501.03151, 2025

work page arXiv 2025
[7]

Artificial general intelligence: Advancements, challenges, and future directions in agi research,

G. Yenduri, R. Murugan, P. Kumar Reddy Maddikunta, S. Bhattacharya, D. Sudheer, and B. Bhushan Savarala, “Artificial general intelligence: Advancements, challenges, and future directions in agi research,”IEEE Access, vol. 13, pp. 134 325–134 356, 2025

2025
[8]

Hardware accelerator for multi-head attention and position-wise feed-forward in the trans- former,

S. Lu, M. Wang, S. Liang, J. Lin, and Z. Wang, “Hardware accelerator for multi-head attention and position-wise feed-forward in the trans- former,” in2020 IEEE 33rd International System-on-Chip Conference (SOCC). IEEE, 2020, pp. 84–89

2020
[9]

Accelerating framework of transformer by hardware design and model compression co-optimization,

P. Qi, E. H.-M. Sha, Q. Zhuge, H. Peng, S. Huang, Z. Kong, Y . Song, and B. Li, “Accelerating framework of transformer by hardware design and model compression co-optimization,” in2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 2021, pp. 1–9

2021
[10]

Spatten: Efficient sparse attention architecture with cascade token and head pruning,

H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse attention architecture with cascade token and head pruning,” in2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 97–110

2021
[11]

Vaqf: Fully automatic software-hardware co-design frame- work for low-bit vision transformer,

M. Sun, H. Ma, G. Kang, Y . Jiang, T. Chen, X. Ma, Z. Wang, and Y . Wang, “Vaqf: Fully automatic software-hardware co-design frame- work for low-bit vision transformer,”arXiv preprint arXiv:2201.06618, 2022. 8

work page arXiv 2022
[12]

Transpim: A memory- based acceleration via software-hardware co-design for transformer,

M. Zhou, W. Xu, J. Kang, and T. Rosing, “Transpim: A memory- based acceleration via software-hardware co-design for transformer,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 1071–1085

2022
[13]

Tpu v4: An optically reconfigurable supercom- puter for machine learning with hardware support for embeddings,

N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles, C. Young, X. Zhou, Z. Zhou, and D. A. Patterson, “Tpu v4: An optically reconfigurable supercom- puter for machine learning with hardware support for embeddings,” in Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA), 2023

2023
[14]

Vitcod: Vision transformer acceleration via dedicated algorithm and accelerator co-design,

H. You, Z. Sun, H. Shi, Z. Yu, Y . Zhao, Y . Zhang, C. Li, B. Li, and Y . Lin, “Vitcod: Vision transformer acceleration via dedicated algorithm and accelerator co-design,” in2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 273–286

2023
[15]

A survey on silicon photonics for deep learning,

F. P. Sunny, E. Taheri, M. Nikdast, and S. Pasricha, “A survey on silicon photonics for deep learning,”ACM Journal of Emerging Technologies in Computing System (JETC), vol. 17, no. 4, pp. 1–57, 2021

2021
[16]

Albireo: Energy- efficient acceleration of convolutional neural networks via silicon pho- tonics,

K. Shiflett, A. Karanth, R. Bunescu, and A. Louri, “Albireo: Energy- efficient acceleration of convolutional neural networks via silicon pho- tonics,” in2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 860–873

2021
[17]

Photonics for artificial intelligence and neuromorphic computing,

B. J. Shastri, A. N. Tait, T. Ferreira de Lima, W. H. Pernice, H. Bhaskaran, C. D. Wright, and P. R. Prucnal, “Photonics for artificial intelligence and neuromorphic computing,”Nature Photonics, vol. 15, no. 2, pp. 102–114, 2021

2021
[18]

Light in ai: toward efficient neurocomputing with optical neural networks—a tutorial,

J. Gu, C. Feng, H. Zhu, R. T. Chen, and D. Z. Pan, “Light in ai: toward efficient neurocomputing with optical neural networks—a tutorial,”IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 6, pp. 2581–2585, 2022

2022
[19]

Simphony: A device-circuit-architecture cross-layer modeling and simulation frame- work for heterogeneous electronic-photonic ai system,

Z. Yin, M. Zhang, N. Gangi, R. Huang, J. Zhang, and J. Gu, “Simphony: A device-circuit-architecture cross-layer modeling and simulation frame- work for heterogeneous electronic-photonic ai system,” in2025 62nd ACM/IEEE Design Automation Conference (DAC). IEEE, 2025, pp. 1–7

2025
[20]

Deep learning with coherent nanophotonic circuits,

Y . Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englundet al., “Deep learning with coherent nanophotonic circuits,”Nature photonics, vol. 11, no. 7, pp. 441–446, 2017

2017
[21]

Neuromorphic photonic networks using silicon photonic weight banks,

A. N. Tait, T. F. De Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Neuromorphic photonic networks using silicon photonic weight banks,”Scientific Reports, vol. 7, no. 1, p. 7430, 2017

2017
[22]

Crosslight: A cross- layer optimized silicon photonic neural network accelerator,

F. Sunny, A. Mirza, M. Nikdast, and S. Pasricha, “Crosslight: A cross- layer optimized silicon photonic neural network accelerator,” in2021 58th ACM/IEEE design automation conference (DAC). IEEE, 2021, pp. 1069–1074

2021
[23]

Parallel convolutional processing using an integrated photonic tensor core,

J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stap- pers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Rajaet al., “Parallel convolutional processing using an integrated photonic tensor core,” Nature, vol. 589, no. 7840, pp. 52–58, 2021

2021
[24]

Lightening-transformer: A dynamically- operated optically-interconnected photonic transformer accelerator,

H. Zhu, J. Gu, H. Wang, Z. Jiang, Z. Zhang, R. Tang, C. Feng, S. Han, R. T. Chen, and D. Z. Pan, “Lightening-transformer: A dynamically- operated optically-interconnected photonic transformer accelerator,” in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024, pp. 686–703

2024
[25]

Auto-vit-acc: An fpga-aware automatic acceleration framework for vision transformer with mixed-scheme quantization,

Z. Li, M. Sun, A. Lu, H. Ma, G. Yuan, Y . Xie, H. Tang, Y . Li, M. Leeser, Z. Wanget al., “Auto-vit-acc: An fpga-aware automatic acceleration framework for vision transformer with mixed-scheme quantization,” in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 109–116

2022
[26]

Heatvit: Hardware-efficient adaptive token pruning for vision transformers,

P. Dong, M. Sun, A. Lu, Y . Xie, K. Liu, Z. Kong, X. Meng, Z. Li, X. Lin, Z. Fang, and Y . Wang, “Heatvit: Hardware-efficient adaptive token pruning for vision transformers,” in2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 442–455

2023
[27]

Sprint: A high-performance, energy- efficient, and scalable chiplet-based accelerator with photonic intercon- nects for cnn inference,

Y . Li, A. Louri, and A. Karanth, “Sprint: A high-performance, energy- efficient, and scalable chiplet-based accelerator with photonic intercon- nects for cnn inference,”IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 33, no. 10, pp. 2332–2345, 2022

2022
[28]

Spacx: Silicon photonics-based scalable chiplet accelerator for dnn inference,

——, “Spacx: Silicon photonics-based scalable chiplet accelerator for dnn inference,” in2022 IEEE International Symposium on High- Performance Computer Architecture (HPCA), 2022, pp. 831–845

2022
[29]

Tron: Transformer neural network acceleration with non-coherent silicon photonics,

S. Afifi, F. Sunny, M. Nikdast, and S. Pasricha, “Tron: Transformer neural network acceleration with non-coherent silicon photonics,” in Great Lakes Symposium on VLSI (GSVLSI) 2023, 2023, pp. 15–21

2023
[30]

A light-speed large language model accelerator with optical stochastic computing,

S. Afifi, O. Alo, I. Thakkar, and S. Pasricha, “A light-speed large language model accelerator with optical stochastic computing,” inGreat Lakes Symposium on VLSI (GLSVLSI) 2025, 2025, pp. 922–928

2025
[31]

Astra: A stochastic transformer neural network accelerator with silicon photonics,

——, “Astra: A stochastic transformer neural network accelerator with silicon photonics,”ACM Transactions on Embedded Computing Systems (TECS), 2025

2025
[32]

Merit: A sustainable dnn accelerator design with photonic phase-change memory,

Y . Li, A. Louri, and A. Karanth, “Merit: A sustainable dnn accelerator design with photonic phase-change memory,”IEEE Transactions on Sustainable Computing (TSUSC), vol. 10, no. 4, pp. 705–716, 2025

2025
[33]

Hyatten: Hybrid photonic-digital architec- ture for accelerating attention mechanism,

H. Li, D. Chen, and T. Mitra, “Hyatten: Hybrid photonic-digital architec- ture for accelerating attention mechanism,” in2025 Design, Automation & Test in Europe Conference (DATE), 2025, pp. 1–7

2025
[34]

P-dac: Power-efficient photonic accelerators for llm inference,

W.-T. Chang, C.-F. Wu, and Y .-C. Lo, “P-dac: Power-efficient photonic accelerators for llm inference,” in2025 62nd ACM/IEEE Design Au- tomation Conference (DAC), 2025, pp. 1–7

2025
[35]

En- lighten: Lighten the transformer, enable efficient optical acceleration,

H. Zhu, Z. Zhou, S. Ning, X. Wu, R. Chen, Y . Wan, and D. Pan, “En- lighten: Lighten the transformer, enable efficient optical acceleration,” arXiv preprint arXiv:2510.01673, 2025

work page arXiv 2025
[36]

Drmap: A generic dram data mapping policy for energy-efficient processing of convolu- tional neural networks,

R. V . W. Putra, M. A. Hanif, and M. Shafique, “Drmap: A generic dram data mapping policy for energy-efficient processing of convolu- tional neural networks,” in2020 57th ACM/IEEE Design Automation Conference (DAC), 2020, pp. 1–6

2020
[37]

Romanet: Fine-grained reuse-driven off-chip memory access management and data organization for deep neural network acceler- ators,

——, “Romanet: Fine-grained reuse-driven off-chip memory access management and data organization for deep neural network acceler- ators,”IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 29, no. 4, pp. 702–715, 2021

2021
[38]

Pendram: Enabling high-performance and energy-efficient pro- cessing of deep neural networks through a generalized dram data mapping policy,

——, “Pendram: Enabling high-performance and energy-efficient pro- cessing of deep neural networks through a generalized dram data mapping policy,”arXiv preprint arXiv:2408.02412, 2024

work page arXiv 2024

[1] [1]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez et al., “Attention is all you need,”Advances in Neural Information Processing Systems (NIPS), vol. 30, no. 1, pp. 261–272, 2017. 0.01 0.1 1 10 100 DeiT-T DeiT-S DeiT-B BERT-B BERT-L Search Time Normalized to Exhaustive DeiT-T (log scale) Exhaustive DxPTA faster faster faster faster fas...

2017

[2] [2]

An image is worth 16x16 words: Trans- formers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Trans- formers for image recognition at scale,” inInternational Conference on Learning Representations (ICLR), 2021

2021

[3] [3]

Training data-efficient image transformers & distillation through attention,

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. J ´egou, “Training data-efficient image transformers & distillation through attention,” inInternational Conference on Machine Learning (ICML). PMLR, 2021, pp. 10 347–10 357

2021

[4] [4]

Transformers in vision: A survey,

S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,”ACM Computing Surveys (CSUR), vol. 54, no. 10s, pp. 1–41, 2022

2022

[5] [5]

A survey on vision transformer,

K. Han, Y . Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y . Tang, A. Xiao, C. Xu, Y . Xu, Z. Yang, Y . Zhang, and D. Tao, “A survey on vision transformer,”IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 45, no. 1, pp. 87–110, 2023

2023

[6] [6]

Large language models for artificial general intelligence (agi): A survey of foundational principles and approaches,

A. Mumuni and F. Mumuni, “Large language models for artificial general intelligence (agi): A survey of foundational principles and approaches,”arXiv preprint arXiv:2501.03151, 2025

work page arXiv 2025

[7] [7]

Artificial general intelligence: Advancements, challenges, and future directions in agi research,

G. Yenduri, R. Murugan, P. Kumar Reddy Maddikunta, S. Bhattacharya, D. Sudheer, and B. Bhushan Savarala, “Artificial general intelligence: Advancements, challenges, and future directions in agi research,”IEEE Access, vol. 13, pp. 134 325–134 356, 2025

2025

[8] [8]

Hardware accelerator for multi-head attention and position-wise feed-forward in the trans- former,

S. Lu, M. Wang, S. Liang, J. Lin, and Z. Wang, “Hardware accelerator for multi-head attention and position-wise feed-forward in the trans- former,” in2020 IEEE 33rd International System-on-Chip Conference (SOCC). IEEE, 2020, pp. 84–89

2020

[9] [9]

Accelerating framework of transformer by hardware design and model compression co-optimization,

P. Qi, E. H.-M. Sha, Q. Zhuge, H. Peng, S. Huang, Z. Kong, Y . Song, and B. Li, “Accelerating framework of transformer by hardware design and model compression co-optimization,” in2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 2021, pp. 1–9

2021

[10] [10]

Spatten: Efficient sparse attention architecture with cascade token and head pruning,

H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse attention architecture with cascade token and head pruning,” in2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 97–110

2021

[11] [11]

Vaqf: Fully automatic software-hardware co-design frame- work for low-bit vision transformer,

M. Sun, H. Ma, G. Kang, Y . Jiang, T. Chen, X. Ma, Z. Wang, and Y . Wang, “Vaqf: Fully automatic software-hardware co-design frame- work for low-bit vision transformer,”arXiv preprint arXiv:2201.06618, 2022. 8

work page arXiv 2022

[12] [12]

Transpim: A memory- based acceleration via software-hardware co-design for transformer,

M. Zhou, W. Xu, J. Kang, and T. Rosing, “Transpim: A memory- based acceleration via software-hardware co-design for transformer,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 1071–1085

2022

[13] [13]

Tpu v4: An optically reconfigurable supercom- puter for machine learning with hardware support for embeddings,

N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles, C. Young, X. Zhou, Z. Zhou, and D. A. Patterson, “Tpu v4: An optically reconfigurable supercom- puter for machine learning with hardware support for embeddings,” in Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA), 2023

2023

[14] [14]

Vitcod: Vision transformer acceleration via dedicated algorithm and accelerator co-design,

H. You, Z. Sun, H. Shi, Z. Yu, Y . Zhao, Y . Zhang, C. Li, B. Li, and Y . Lin, “Vitcod: Vision transformer acceleration via dedicated algorithm and accelerator co-design,” in2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 273–286

2023

[15] [15]

A survey on silicon photonics for deep learning,

F. P. Sunny, E. Taheri, M. Nikdast, and S. Pasricha, “A survey on silicon photonics for deep learning,”ACM Journal of Emerging Technologies in Computing System (JETC), vol. 17, no. 4, pp. 1–57, 2021

2021

[16] [16]

Albireo: Energy- efficient acceleration of convolutional neural networks via silicon pho- tonics,

K. Shiflett, A. Karanth, R. Bunescu, and A. Louri, “Albireo: Energy- efficient acceleration of convolutional neural networks via silicon pho- tonics,” in2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 860–873

2021

[17] [17]

Photonics for artificial intelligence and neuromorphic computing,

B. J. Shastri, A. N. Tait, T. Ferreira de Lima, W. H. Pernice, H. Bhaskaran, C. D. Wright, and P. R. Prucnal, “Photonics for artificial intelligence and neuromorphic computing,”Nature Photonics, vol. 15, no. 2, pp. 102–114, 2021

2021

[18] [18]

Light in ai: toward efficient neurocomputing with optical neural networks—a tutorial,

J. Gu, C. Feng, H. Zhu, R. T. Chen, and D. Z. Pan, “Light in ai: toward efficient neurocomputing with optical neural networks—a tutorial,”IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 6, pp. 2581–2585, 2022

2022

[19] [19]

Simphony: A device-circuit-architecture cross-layer modeling and simulation frame- work for heterogeneous electronic-photonic ai system,

Z. Yin, M. Zhang, N. Gangi, R. Huang, J. Zhang, and J. Gu, “Simphony: A device-circuit-architecture cross-layer modeling and simulation frame- work for heterogeneous electronic-photonic ai system,” in2025 62nd ACM/IEEE Design Automation Conference (DAC). IEEE, 2025, pp. 1–7

2025

[20] [20]

Deep learning with coherent nanophotonic circuits,

Y . Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englundet al., “Deep learning with coherent nanophotonic circuits,”Nature photonics, vol. 11, no. 7, pp. 441–446, 2017

2017

[21] [21]

Neuromorphic photonic networks using silicon photonic weight banks,

A. N. Tait, T. F. De Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Neuromorphic photonic networks using silicon photonic weight banks,”Scientific Reports, vol. 7, no. 1, p. 7430, 2017

2017

[22] [22]

Crosslight: A cross- layer optimized silicon photonic neural network accelerator,

F. Sunny, A. Mirza, M. Nikdast, and S. Pasricha, “Crosslight: A cross- layer optimized silicon photonic neural network accelerator,” in2021 58th ACM/IEEE design automation conference (DAC). IEEE, 2021, pp. 1069–1074

2021

[23] [23]

Parallel convolutional processing using an integrated photonic tensor core,

J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stap- pers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Rajaet al., “Parallel convolutional processing using an integrated photonic tensor core,” Nature, vol. 589, no. 7840, pp. 52–58, 2021

2021

[24] [24]

Lightening-transformer: A dynamically- operated optically-interconnected photonic transformer accelerator,

H. Zhu, J. Gu, H. Wang, Z. Jiang, Z. Zhang, R. Tang, C. Feng, S. Han, R. T. Chen, and D. Z. Pan, “Lightening-transformer: A dynamically- operated optically-interconnected photonic transformer accelerator,” in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024, pp. 686–703

2024

[25] [25]

Auto-vit-acc: An fpga-aware automatic acceleration framework for vision transformer with mixed-scheme quantization,

Z. Li, M. Sun, A. Lu, H. Ma, G. Yuan, Y . Xie, H. Tang, Y . Li, M. Leeser, Z. Wanget al., “Auto-vit-acc: An fpga-aware automatic acceleration framework for vision transformer with mixed-scheme quantization,” in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 109–116

2022

[26] [26]

Heatvit: Hardware-efficient adaptive token pruning for vision transformers,

P. Dong, M. Sun, A. Lu, Y . Xie, K. Liu, Z. Kong, X. Meng, Z. Li, X. Lin, Z. Fang, and Y . Wang, “Heatvit: Hardware-efficient adaptive token pruning for vision transformers,” in2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 442–455

2023

[27] [27]

Sprint: A high-performance, energy- efficient, and scalable chiplet-based accelerator with photonic intercon- nects for cnn inference,

Y . Li, A. Louri, and A. Karanth, “Sprint: A high-performance, energy- efficient, and scalable chiplet-based accelerator with photonic intercon- nects for cnn inference,”IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 33, no. 10, pp. 2332–2345, 2022

2022

[28] [28]

Spacx: Silicon photonics-based scalable chiplet accelerator for dnn inference,

——, “Spacx: Silicon photonics-based scalable chiplet accelerator for dnn inference,” in2022 IEEE International Symposium on High- Performance Computer Architecture (HPCA), 2022, pp. 831–845

2022

[29] [29]

Tron: Transformer neural network acceleration with non-coherent silicon photonics,

S. Afifi, F. Sunny, M. Nikdast, and S. Pasricha, “Tron: Transformer neural network acceleration with non-coherent silicon photonics,” in Great Lakes Symposium on VLSI (GSVLSI) 2023, 2023, pp. 15–21

2023

[30] [30]

A light-speed large language model accelerator with optical stochastic computing,

S. Afifi, O. Alo, I. Thakkar, and S. Pasricha, “A light-speed large language model accelerator with optical stochastic computing,” inGreat Lakes Symposium on VLSI (GLSVLSI) 2025, 2025, pp. 922–928

2025

[31] [31]

Astra: A stochastic transformer neural network accelerator with silicon photonics,

——, “Astra: A stochastic transformer neural network accelerator with silicon photonics,”ACM Transactions on Embedded Computing Systems (TECS), 2025

2025

[32] [32]

Merit: A sustainable dnn accelerator design with photonic phase-change memory,

Y . Li, A. Louri, and A. Karanth, “Merit: A sustainable dnn accelerator design with photonic phase-change memory,”IEEE Transactions on Sustainable Computing (TSUSC), vol. 10, no. 4, pp. 705–716, 2025

2025

[33] [33]

Hyatten: Hybrid photonic-digital architec- ture for accelerating attention mechanism,

H. Li, D. Chen, and T. Mitra, “Hyatten: Hybrid photonic-digital architec- ture for accelerating attention mechanism,” in2025 Design, Automation & Test in Europe Conference (DATE), 2025, pp. 1–7

2025

[34] [34]

P-dac: Power-efficient photonic accelerators for llm inference,

W.-T. Chang, C.-F. Wu, and Y .-C. Lo, “P-dac: Power-efficient photonic accelerators for llm inference,” in2025 62nd ACM/IEEE Design Au- tomation Conference (DAC), 2025, pp. 1–7

2025

[35] [35]

En- lighten: Lighten the transformer, enable efficient optical acceleration,

H. Zhu, Z. Zhou, S. Ning, X. Wu, R. Chen, Y . Wan, and D. Pan, “En- lighten: Lighten the transformer, enable efficient optical acceleration,” arXiv preprint arXiv:2510.01673, 2025

work page arXiv 2025

[36] [36]

Drmap: A generic dram data mapping policy for energy-efficient processing of convolu- tional neural networks,

R. V . W. Putra, M. A. Hanif, and M. Shafique, “Drmap: A generic dram data mapping policy for energy-efficient processing of convolu- tional neural networks,” in2020 57th ACM/IEEE Design Automation Conference (DAC), 2020, pp. 1–6

2020

[37] [37]

Romanet: Fine-grained reuse-driven off-chip memory access management and data organization for deep neural network acceler- ators,

——, “Romanet: Fine-grained reuse-driven off-chip memory access management and data organization for deep neural network acceler- ators,”IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 29, no. 4, pp. 702–715, 2021

2021

[38] [38]

Pendram: Enabling high-performance and energy-efficient pro- cessing of deep neural networks through a generalized dram data mapping policy,

——, “Pendram: Enabling high-performance and energy-efficient pro- cessing of deep neural networks through a generalized dram data mapping policy,”arXiv preprint arXiv:2408.02412, 2024

work page arXiv 2024