Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System
4 Pith papers cite this work, all from 2026.
Citing papers
- Apple Neural Engine: Architecture, Programming, and Performance
  The paper documents, through reverse engineering, the Apple Neural Engine's architecture, dispatch mechanisms, weight compression, and roofline performance, based on measurements from M1 and M5 chips and analysis of private runtime components.
- Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization
  KernelPro combines LLM code generation, roofline-guided tool orchestration, and domain-adapted MCTS to produce GPU kernels that outperform prior automated and some hand-tuned baselines on KernelBench and VeOmni workloads.
- COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC
  COMPASS formalizes HPC configuration questions as ML tasks on traces, quantifies recommendation trustworthiness, and delivers 65.93% lower average job turnaround time and 80.93% lower node usage than prior methods in simulator tests.
- Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures
  Profiling of Med-DDPM shows that cuDNN kernels dominate training; enabling TF32 Tensor Cores and a 3D channels-last layout reduces SM cycles by up to 100x and raises Tensor Core utilization on A100 without quality loss.