archive
Every paper Pith has read.
493 papers in cs.AR · page 7
-
Spatial heat patterns dominate power-grid lifetime over averages
EMSpice 3: Full-chip Temperature-Aware Multiphysics Electromigration and IR-Drop Analysis
-
Octree islands cut PCN feature fetching by 55-94 percent
L-PCN: A Point Cloud Accelerator Exploiting Spatial Locality through Octree-based Islandization
-
Two-stage mining extracts accurate message flows from SoC traces
AutoFlows++: Hierarchical Message Flow Mining for System on Chip Designs
-
BFP NPU hits near-DMR reliability at 3.55% overhead
From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU
-
Re-partitioned NPU catches and fixes faults in under a microsecond
Strix: Re-thinking NPU Reliability from a System Perspective
-
LLM training resists low GPU fault rates but fails in key paths
LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training
-
Chip renders 3D Gaussian Splatting at 129 FPS in full HD
A 129FPS Full HD Real-Time Accelerator for 3D Gaussian Splatting
-
Wave-aware model picks near-optimal GPU kernel settings fast
WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning
-
Sparse measurements predict latency at every CPU-GPU frequency
Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge
-
FlexVector speeds GCN inference 3.78x with flexible registers
FlexVector: A SpMM Vector Processor with Flexible VRF for GCNs on Varying-Sparsity Graphs
-
Open framework speeds SystemC-FPGA co-emulation up to 2500x
Late Breaking Results: CHESSY: Coupled Hybrid Emulation with SystemC-FPGA Synchronization
-
Microcontroller runs full SNN simulation at 20 mW
Full Feature Spiking Neural Network Simulation on Micro-Controllers for Neuromorphic Applications at the Edge
-
DNN-resilient voltage scaling cuts aging degradation up to 46%
Aging Aware Adaptive Voltage Scaling for Reliable and Efficient AI Accelerators
-
Photonic accelerator speeds transformers 7.6x with lower energy
Sustainable Transformer Neural Network Acceleration with Stochastic Photonic Computing
-
0.5V encoder maps voltages to spikes within 5.6 percent linearity
A 0.5-V Linear Neuromorphic Voltage-to-Spike Encoder Using a Bulk-Driven Transconductor
-
MATCHA cuts DNN inference latency up to 35% on heterogeneous edge SoCs
MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs
-
Diffusion models cut energy 36% by tolerating controlled faults
DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference
-
Key signals cut RTL assertion needs by two thirds
From Indiscriminate to Targeted: Efficient RTL Verification via Functionally Key Signal-Driven LLM Assertion Generation
-
Neuromorphic chips hit new memory wall from on-chip storage
Memory Wall is not gone: A Critical Outlook on Memory Architecture in Digital Neuromorphic Computing
-
Profile labels cut memory dependence checks 79% on small cores
PG-MDP: Profile-Guided Memory Dependence Prediction for Area-Constrained Cores
-
Energy-efficient GPUs deliver better value under budget limits
Wattlytics: A Web Platform for Co-Optimizing Performance, Energy, and TCO in HPC Clusters
-
ATLAS models 3D-DRAM LLM accelerators to within 8.57% of silicon
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
-
Mamba-3 raises edge latency up to 48% to favor cloud GPUs
The Hyperscale Lottery: How State-Space Models Have Sacrificed Edge Efficiency
-
Faster 32-bit constant division on 64-bit CPUs
Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets
-
Integrated panels give orbital AI 100 kW per ton
Reduced-Mass Orbital AI Inference via Integrated Solar, Compute, and Radiator Panels
-
TrilinearCIM runs Transformer attention in NVM without reprogramming
Trilinear Compute-in-Memory Architecture for Energy-Efficient Transformer Acceleration
-
RL agent designs ASIC chips for AI that adapt across 7 process nodes
From LLM to Silicon: RL-Driven ASIC Architecture Exploration for On-Device AI Inference
-
FILCO reconfigures DNN accelerators on the fly for 1.3x-5x gains
FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration
-
Symbolic analysis estimates energy for loop nests independent of size
Symbolic Polyhedral-Based Energy Analysis for Nested Loop Programs
-
Onboard EO processing delivers sub-3m burnt-area maps
Assessing the Added Value of Onboard Earth Observation Processing with the IRIDE HEO Service Segment
-
GQA models cut peak memory 2.72x versus MHA on embedded hardware
TRAPTI: Time-Resolved Analysis for SRAM Banking and Power Gating Optimization in Embedded Transformer Inference
-
New chip runs annealing and reservoir tasks at 25-54x efficiency
CBM-Dual: A 65-nm Fully Connected Chaotic Boltzmann Machine Processor for Dual Function Simulated Annealing and Reservoir Computing
-
SHIELD cuts eDRAM refresh energy 35% for edge LLM inference
SHIELD: A Segmented Hierarchical Memory Architecture for Energy-Efficient LLM Inference on Edge NPUs
-
SwarmIO emulates 40M IOPS SSDs for GPUs with 300x speedup
SwarmIO: Towards 100 Million IOPS SSD Emulation for Next-generation GPU-centric Storage Systems
-
One DC simulation calibrates LLM equations for analog sizing
A Self-Calibrating Framework for Analog Circuit Sizing Using LLM-Derived Analytical Equations
-
Coverage feedback raises assertion coverage 9-15 percent
CoverAssert: Iterative LLM Assertion Generation Driven by Functional Coverage via Syntax-Semantic Representations
-
Dominant interferer nulling cuts CG iterations in massive MU-MIMO
Interference Suppression for Massive MU-MIMO Long-Term Beamforming with Matrix Inversion Approximation
-
Power reconstruction shows 79% energy cut from mixed precision on Frontier
Fine-Grained Power and Energy Attribution on AMD GPU/APU-Based Exascale Nodes
-
PHAROS finds more deadline-meeting accelerator designs
PHAROS: Pipelined Heterogeneous Accelerators for Real-time Safety-critical Systems With Deadline Compliance
-
Prime power moduli simplify RNS integer division hardware
Direct Integer Division in RNS and its Hardware Solutions
-
KV cache choice depends on memory limits and request load
Comparative Characterization of KV Cache Management Strategies for LLM Inference
-
GPU boosts encrypted LLM nonlinear layers by up to 17 times
GPU Acceleration of TFHE-Based High-Precision Nonlinear Layers for Encrypted LLM Inference
-
DRAM PIM techniques create bursty power demands that stress delivery networks
A comparative study on power delivery aspects of compute-in/near-memory approaches using DRAM
-
Tool explores 250 trillion 3D AI accelerator designs 100,000 times faster
DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators
-
Neuromorphic hardware could break CMOS energy limits for AI
Neuromorphic Computing for Low-Power Artificial Intelligence
-
GPIR lifts GPU PIR speed by up to 297 times
GPIR: Enabling Practical Private Information Retrieval with GPUs
-
CGRA sharing with migration cuts workload time by 70%
Mestra: Exploring Migration on Virtualized CGRAs
-
Packed LUTs deliver 1.82x speedup for DNN inference on DRAM-PIM
LOCALUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM
-
Bit partitioning lets one PE run FP8 or dual FP4 with 60% less area
DHFP-PE: Dual-Precision Hybrid Floating Point Processing Element for AI Acceleration
-
Hardware cuts real-time interrupt latency by 50x
Enabling Deterministic User-Level Interrupts in Real-Time Processors via Hardware Extension