SpaDA: A Spatial Dataflow Architecture Programming Language

Lukas Gianinazzi; Tal Ben-Nun; Torsten Hoefler

arxiv: 2511.09447 · v2 · submitted 2025-11-12 · 💻 cs.DC · cs.PL

Pith reviewed 2026-05-17 22:14 UTC · model grok-4.3

keywords: spatial dataflow architecture · programming language · compiler · data placement · parallel patterns · stencil computation · linear algebra · high performance computing

The pith

SpaDA is a programming language for spatial dataflow architectures that expresses complex parallel patterns in far fewer lines than low-level CSL while delivering high performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpaDA to simplify programming of architectures that distribute scratchpad memory across hundreds of thousands of processing elements without any shared memory. It gives precise control over data placement, dataflow patterns, and asynchronous operations while hiding most low-level configuration details. The authors built a compiler that lowers SpaDA through multiple levels to a target architecture and adds specialized optimization passes. This approach is shown to support both direct use and serving as an intermediate representation for domain-specific languages, with concrete gains in code length and measured performance on real hardware.

Core claim

SpaDA enables concise expression of operations with complex parallel patterns including pipelined collective operations, multi-dimensional stencils, and dense linear algebra in 14.09x fewer lines than CSL, achieving over 260 TFlop/s across 730,000 PEs on a single device. It functions as a high-level programming interface and an intermediate representation for domain-specific languages, with a compiler that uses multi-level lowering and unique optimization passes targeting the low-level architecture.
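As a quick sanity check on the scale of these figures, the aggregate throughput can be divided by the PE count. The sketch below uses only the two numbers quoted above, treats "over 260 TFlop/s" as exactly 260 TFlop/s (so the result is a lower bound), and assumes the work is spread evenly across PEs. The function name is illustrative and does not come from the paper.

```python
def per_pe_flops(total_flops: float, num_pes: int) -> float:
    """Average Flop/s per PE, assuming a perfectly even load (an assumption)."""
    if num_pes <= 0:
        raise ValueError("num_pes must be positive")
    return total_flops / num_pes


TOTAL_FLOPS = 260e12  # "over 260 TFlop/s", taken here as a lower bound
NUM_PES = 730_000     # PE count quoted in the claim

avg = per_pe_flops(TOTAL_FLOPS, NUM_PES)
print(f"{avg / 1e6:.0f} MFlop/s per PE (lower bound)")
```

Under these assumptions the average works out to roughly 356 MFlop/s per PE, which shows that the headline number comes from the PE count rather than from any single fast core.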

What carries the argument

Multi-level lowering compiler passes that translate high-level SpaDA descriptions of data placement and asynchronous dataflow into target code while applying optimizations that preserve the intended patterns.

If this is right

Pipelined collective operations can be written without manually managing every asynchronous task.
Multi-dimensional stencils become expressible at a higher level while still mapping efficiently to the distributed memory layout.
Dense linear algebra kernels benefit from the same reduction in code size and retain the architecture's performance potential.
Domain-specific languages can target SpaDA as an intermediate form rather than writing directly to the low-level interface.
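To make the stencil point concrete, here is a minimal sketch of what "mapping to the distributed memory layout" involves: each PE owns a block of the domain and needs one halo value from each neighbor before it can update its block. This is a plain Python simulation written for illustration, not SpaDA or CSL syntax, and the three-point averaging stencil and function names are assumptions chosen for brevity.

```python
def stencil_global(u):
    """3-point average on interior points; boundary values are copied."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def stencil_distributed(u, num_pes):
    """Same stencil, but each simulated PE reads only its block plus halos."""
    n = len(u)
    if n % num_pes != 0:
        raise ValueError("domain size must be divisible by num_pes")
    block = n // num_pes
    # Each PE's private "scratchpad": a contiguous block of the domain.
    local = [u[p * block:(p + 1) * block] for p in range(num_pes)]
    result = []
    for p in range(num_pes):
        # Halo exchange: one value from each neighbor (None at domain edges).
        left = local[p - 1][-1] if p > 0 else None
        right = local[p + 1][0] if p < num_pes - 1 else None
        padded = [left] + local[p] + [right]
        for j in range(1, block + 1):
            if padded[j - 1] is None or padded[j + 1] is None:
                result.append(padded[j])  # global boundary: copy unchanged
            else:
                result.append((padded[j - 1] + padded[j] + padded[j + 1]) / 3.0)
    return result


field = [float(i * i) for i in range(12)]
print(stencil_distributed(field, num_pes=4) == stencil_global(field))
```

On real hardware the halo exchange is explicit communication between neighboring PEs, and that is the bookkeeping a higher-level language would be expected to generate.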

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Broader developer access to these architectures could follow if the language reduces the expertise barrier for data placement decisions.
The same lowering strategy might apply to other systems that combine many processing elements with local scratchpads and explicit communication.
Additional domain-specific languages could be layered on top to further expand the set of supported workloads.

Load-bearing premise

The compiler's multi-level lowering and optimization passes preserve the original program semantics and introduce no hidden overheads or bugs for the demonstrated workloads.

What would settle it

An experiment that implements the same multi-dimensional stencil or dense linear algebra kernel in both SpaDA and CSL, then compares numerical results and measured runtime on identical hardware to check for mismatches or unexpected slowdowns.
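The shape of such an experiment can be sketched as a small differential-testing harness: run a reference kernel and a candidate kernel on identical inputs, compare the outputs within a tolerance, and record wall-clock time for each. Everything below is a hypothetical stand-in. The two kernels are ordinary Python functions rather than SpaDA or CSL programs, and the tolerances are arbitrary illustrative values.

```python
import math
import time


def compare_kernels(reference, candidate, inputs, rel_tol=1e-6, abs_tol=1e-9):
    """Run both kernels on the same inputs; report agreement and wall times."""
    t0 = time.perf_counter()
    ref_out = reference(inputs)
    t1 = time.perf_counter()
    cand_out = candidate(inputs)
    t2 = time.perf_counter()
    if len(ref_out) != len(cand_out):
        raise ValueError("output lengths differ")
    pairs = list(zip(ref_out, cand_out))
    return {
        "matches": all(math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
                       for r, c in pairs),
        "max_abs_err": max((abs(r - c) for r, c in pairs), default=0.0),
        "ref_seconds": t1 - t0,
        "cand_seconds": t2 - t1,
    }


def kernel_a(u):
    """Stand-in for the hand-written version: divide by 3."""
    return [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]


def kernel_b(u):
    """Stand-in for the generated version: multiply by 1/3 (rounds differently)."""
    third = 1.0 / 3.0
    return [(u[i - 1] + u[i] + u[i + 1]) * third for i in range(1, len(u) - 1)]


report = compare_kernels(kernel_a, kernel_b, [0.5 * i for i in range(100)])
print(report["matches"], report["max_abs_err"])
```

A tolerance-based comparison is used instead of exact equality because two correct implementations can legitimately differ in floating-point rounding, as the two stand-ins do here.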

Figures

Figures reproduced from arXiv: 2511.09447 by Lukas Gianinazzi, Tal Ben-Nun, Torsten Hoefler.

**Figure 2.** SPADA-to-CSL Task Assignment Pipeline. Excerpt from the accompanying text: "… (b) a data task is always active, but can be blocked (thus having one predecessor). We write a set of passes to create virtual nodes and local tasks to reduce the in-degree of nodes in the post/wait graph accordingly. In the next step (Figure 2d), the post/wait graph is coarsened to tasks based on feasibility, and edges from statements to successor tasks…"

**Figure 3.** Speedup of Pipelined over Vectorized 1D Reduction

**Figure 5.** Stencil Total Flop/s Performance Scaling (K=80)

**Figure 6.** Stencil Flop/s for Fixed Horizontal Domain (512…

**Figure 7.** Stencil Runtime vs. Horizontal Domain Size (K=80)
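The Figure 2 excerpt mentions passes that create virtual nodes to reduce the in-degree of nodes in the post/wait graph. The general idea can be sketched as follows, with the caveat that this is a generic join-tree construction written for illustration and not the authors' pass. The `max_in` bound, the `virt` naming scheme, and the predecessor-map representation are all assumptions, and the sketch assumes no existing node is already named `virtN`.

```python
def reduce_in_degree(preds, max_in=2):
    """Return a predecessor map in which no node has more than max_in predecessors.

    preds maps each node to a list of its predecessors. A node with too many
    predecessors gets virtual nodes inserted in front of it, forming a tree
    of joins, so the original dependencies are preserved transitively.
    """
    if max_in < 2:
        raise ValueError("max_in must be at least 2 to build a join tree")
    out = {}
    counter = 0
    for node, ps in preds.items():
        current = list(ps)
        while len(current) > max_in:
            grouped = []
            for i in range(0, len(current), max_in):
                chunk = current[i:i + max_in]
                if len(chunk) == 1:
                    grouped.append(chunk[0])  # a lone leftover needs no join
                else:
                    virt = f"virt{counter}"   # hypothetical naming scheme
                    counter += 1
                    out[virt] = chunk
                    grouped.append(virt)
            current = grouped
        out[node] = current
    return out


graph = {"a": [], "b": [], "c": [], "d": [], "e": [],
         "t": ["a", "b", "c", "d", "e"]}
lowered = reduce_in_degree(graph, max_in=2)
print(max(len(p) for p in lowered.values()))
```

Each round shrinks the predecessor list by roughly a factor of `max_in`, so a node with k predecessors gains a join tree of depth about log k. That depth is the kind of overhead a reviewer would want to see accounted for in the generated code.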

Original abstract

Spatial dataflow architectures like the Cerebras Wafer-Scale Engine deliver exceptional performance in AI and scientific computing by distributing scratchpad memory across hundreds of thousands of processing elements (PEs). Yet programming these architectures remains difficult: with no shared memory, data movement requires explicit configuration, and asynchronous task management introduces substantial complexity. We present SpaDA, a programming language that offers precise control over data placement, dataflow patterns, and asynchronous operations while abstracting low-level architectural details. We design and implement a compiler targeting Cerebras CSL through multi-level lowering and unique optimization passes. SpaDA functions as a high-level programming interface and an intermediate representation for domain-specific languages (DSLs), demonstrated here with the GT4Py stencil DSL. SpaDA enables concise expression of operations with complex parallel patterns -- including pipelined collective operations, multi-dimensional stencils, and dense linear algebra -- in 14.09x fewer lines than CSL, achieving over 260 TFlop/s across 730,000 PEs on a single device.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SpaDA gives a new language and compiler for Cerebras-style spatial hardware with real code-size wins on stencils and collectives, but the performance numbers hinge on unverified lowering passes.

Hey, on the SpaDA paper. The core contribution is a language that lets you express data placement, pipelined collectives, multi-dimensional stencils, and dense linear algebra with explicit control over asynchrony while targeting CSL. It also serves as an IR for GT4Py. They report writing the same patterns in 14 times fewer lines than raw CSL and hitting over 260 TFlop/s across 730,000 PEs on one device. That combination of abstraction and claimed performance is the part worth paying attention to if you work on large spatial accelerators.

The examples for complex parallel patterns look like they hit a practical middle ground between low-level configuration and higher-level DSLs. The integration story with GT4Py is a concrete plus that shows how the IR could be reused.

The soft spots sit in the compiler. The abstract describes multi-level lowering and unique optimization passes at a high level, but supplies no benchmark details, measurement methodology, error bars, or direct generated-versus-hand-written CSL comparisons. Without those, it is difficult to judge whether the TFlop/s number reflects the language abstractions or just the passes working perfectly on the tested cases. The stress-test concern about possible semantic drift or hidden overheads in the lowering passes lands here, because the central claims rest on those passes preserving correctness for async dataflow without extra cost.

This paper is for people building or using tools for wafer-scale spatial computing in AI and scientific workloads. A reader interested in dataflow compilers or DSL design would get value from the abstractions and the quantitative line-count results. It deserves a serious referee because the problem is timely, the numbers are specific, and the approach is grounded enough to merit closer examination even if revisions will be needed on the experimental and compiler sections. I would send it out for review and specifically ask for expanded material on verification of the lowering passes and full benchmark methodology.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SpaDA, a programming language for spatial dataflow architectures such as the Cerebras Wafer-Scale Engine. It offers abstractions for precise control over data placement, dataflow patterns, and asynchronous operations while hiding low-level details. A compiler is implemented that targets Cerebras CSL via multi-level lowering and specialized optimization passes. SpaDA is demonstrated both as a standalone interface and as an IR for DSLs such as GT4Py, with claims that it expresses complex patterns (pipelined collectives, multi-dimensional stencils, dense linear algebra) in 14.09x fewer lines than CSL while delivering over 260 TFlop/s on 730,000 PEs.

Significance. If the lowering passes are shown to preserve semantics and incur no hidden overhead, the work would meaningfully improve productivity on large-scale spatial architectures by reducing the complexity of explicit data movement and task management. Integration with existing DSLs such as GT4Py suggests utility beyond hand-written CSL code. The reported performance and code-size numbers, if substantiated with rigorous methodology, would indicate that the optimization passes successfully exploit the architecture's distributed scratchpad and PE array.

major comments (2)

Evaluation section: The central performance claim (>260 TFlop/s on 730,000 PEs) and code-size claim (14.09x reduction versus CSL) are presented without benchmark workload details, measurement methodology, error bars, or explicit side-by-side comparison of generated versus hand-written CSL for the pipelined-collective and multi-dimensional stencil cases. This information is required to assess whether the unique optimization passes deliver the stated benefits or introduce unaccounted overheads.
Compiler section: The multi-level lowering passes and unique optimization passes are described at a high level. No formal semantics, machine-checked proofs, or exhaustive test suite for asynchronous dataflow patterns (e.g., synchronization and memory layout for pipelined collectives or stencils) are provided. Because any semantic drift in these passes would directly undermine both the conciseness and TFlop/s claims, this verification gap is load-bearing.
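The reporting the first comment asks for is easy to sketch: throughput from repeated runs summarized as a mean with a sample standard deviation, plus a line-count ratio between two implementations. The code below is a minimal illustration with hypothetical function names. The runtimes, Flop count, and line counts are placeholder values invented for the example, not measurements from the paper.

```python
import statistics


def summarize_runs(runtimes_s, flop_count):
    """Mean Flop/s and sample standard deviation across repeated runs."""
    if len(runtimes_s) < 2:
        raise ValueError("need at least two runs for a standard deviation")
    rates = [flop_count / t for t in runtimes_s]
    return statistics.mean(rates), statistics.stdev(rates)


def loc_ratio(baseline_lines, candidate_lines):
    """How many times longer the baseline is than the candidate."""
    if candidate_lines <= 0:
        raise ValueError("candidate_lines must be positive")
    return baseline_lines / candidate_lines


# Placeholder inputs, NOT data from the paper.
runtimes = [0.0101, 0.0099, 0.0100, 0.0102, 0.0098]  # seconds per run
flops = 2.0e9                                         # Flop per run

mean_rate, std_rate = summarize_runs(runtimes, flops)
print(f"{mean_rate / 1e9:.1f} +/- {std_rate / 1e9:.1f} GFlop/s over {len(runtimes)} runs")
print(f"{loc_ratio(1409, 100):.2f}x fewer lines")
```

Reporting the spread alongside the mean lets a reader tell a real speedup from run-to-run noise, which is the gap the comment identifies.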

minor comments (2)

Abstract: The phrase 'unique optimization passes' is used without a forward reference to the specific section or table that defines or evaluates them.
Notation: Dataflow pattern diagrams or additional inline examples would help readers unfamiliar with spatial architectures follow the mapping from SpaDA constructs to CSL configuration.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have revised the paper to strengthen the presentation of our evaluation and compiler implementation.

Referee: Evaluation section: The central performance claim (>260 TFlop/s on 730,000 PEs) and code-size claim (14.09x reduction versus CSL) are presented without benchmark workload details, measurement methodology, error bars, or explicit side-by-side comparison of generated versus hand-written CSL for the pipelined-collective and multi-dimensional stencil cases. This information is required to assess whether the unique optimization passes deliver the stated benefits or introduce unaccounted overheads.

Authors: We agree that these details are necessary for rigorous assessment. In the revised manuscript we have expanded the Evaluation section with explicit descriptions of the benchmark workloads (including stencil dimensions and collective patterns), the measurement methodology (using Cerebras SDK performance counters with results from repeated executions), error bars, and direct side-by-side code-size and performance comparisons between SpaDA-generated and hand-written CSL implementations. These additions confirm that the reported optimization passes achieve the claimed benefits without hidden overhead.
revision: yes
Referee: Compiler section: The multi-level lowering passes and unique optimization passes are described at a high level. No formal semantics, machine-checked proofs, or exhaustive test suite for asynchronous dataflow patterns (e.g., synchronization and memory layout for pipelined collectives or stencils) are provided. Because any semantic drift in these passes would directly undermine both the conciseness and TFlop/s claims, this verification gap is load-bearing.

Authors: We acknowledge the referee's point on verification. We have added a detailed account of our test suite to the Compiler section, including unit and integration tests that exercise asynchronous dataflow patterns, synchronization, and memory layouts for pipelined collectives and stencils. These tests were used throughout development and support semantic preservation in practice. However, machine-checked proofs and formal semantics lie outside the scope of the current work, which emphasizes language design and empirical results on the target architecture.
revision: partial

standing simulated objections not resolved

Machine-checked proofs or formal semantics for the multi-level lowering and optimization passes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces SpaDA as a new high-level programming language and compiler for spatial dataflow architectures, with claims of conciseness (14.09x fewer lines than CSL) and performance (over 260 TFlop/s across 730k PEs) resting on empirical implementation results and hardware benchmarks rather than any mathematical derivation, equations, or fitted parameters. No self-definitional steps, fitted inputs called predictions, or load-bearing self-citation chains appear; the multi-level lowering passes and optimization passes are presented as novel contributions whose correctness and efficiency are demonstrated through evaluation on the target architecture, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering language and compiler paper; it introduces no mathematical free parameters, domain axioms, or new physical entities. All claims rest on the correctness of the implemented lowering passes and optimizations.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Stencil Computations on Cerebras Wafer-Scale Engine
cs.DC 2026-05 unverdicted novelty 6.0

CStencil on the WSE-3 achieves up to 342x speedup for 2D stencils versus an adapted single-precision GPU solver and saturates both compute and on-chip memory bandwidth.
