archive
Every paper Pith has read.
493 papers in cs.AR · page 9
-
NTT design detects Trojan control and timing faults in PQC
Trojan-Resilient NTT: Protecting Against Control Flow and Timing Faults on Reconfigurable Platforms
-
First NPU designed for diffusion language model inference
NPU Design for Diffusion Language Model Inference
-
Hypergraphs cut spike traffic in neuromorphic SNN mappings
A Case for Hypergraphs to Model and Map SNNs on Neuromorphic Hardware
-
Hybrid model cuts GPU kernel prediction error by 6.7x
PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction
-
NSF urged to fund AI for faster chip design cycles
Report for NSF Workshop on AI for Electronic Design Automation
-
Compact RISC-V core fits biomedical control in 708 LUTs
Bio-RV: Low-Power Resource-Efficient RISC-V Processor for Biomedical Applications
-
Timing windows detect microcontroller ageing via frequency shifts
Ageing Monitoring for Commercial Microcontrollers Based on Timing Windows
-
Tool spots bit-flip faults in LLMs for fast fixes
BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs
-
Dynamic buckets lift LLM cache use 19% on LPDDR chips
ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
-
Pipelined NN training sets delays by layer depth and reconstructs old weights with moving averages
LayerPipe2: Multistage Pipelining and Weight Recompute via Improved Exponential Moving Average for Training Neural Networks
-
FPGA accelerator speeds graph classification 6.85× with 3.4% accuracy gain
Efficient and Accurate Graph Classification with Hyperdimensional Computing on FPGA
-
Generative transformer cuts circuit delay 30% and gates 50%
GTAC: A Generative Transformer for Approximate Circuits
-
Models emulate NVIDIA Tensor Core behavior in low precision
Accurate Models of NVIDIA Tensor Cores
-
Co-design framework accelerates domains up to 15x with low overhead
Aquas: Enhancing Domain Specialization through Holistic Hardware-Software Co-Optimization based on MLIR
-
Voxel traits let Spira skip kernel-map overhead for 3x faster point-cloud convolution
Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks
-
Round-trip LLM translation catches hallucinations in hardware design
Mitigating hallucinations and omissions in LLMs for invertible problems: An application to hardware logic design automation
-
AmpereOne adds memory tagging with zero capacity overhead
Optimized Memory Tagging on AmpereOne Processors
-
Digital in-memory design reaches 3.59 TOPS/W for AI matrix math
DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format
-
Fused unit runs mixed-precision dot products in four cycles
Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores
-
Joint data-compute tuning speeds ML kernels on PIM up to 13x
DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures
-
8x faster linearity testing for 16-bit SAR ADCs
Advanced Strategies for Uncertainty-Guided Live Measurement Sequencing in Fast, Robust SAR ADC Linearity Testing
-
Adaptive EKF sequencing cuts SAR ADC linearity test time
Uncertainty-Guided Live Measurement Sequencing for Fast SAR ADC Linearity Testing
-
Closed-loop tests yield first bit-accurate models for ten GPU matrix units
Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy
-
Thermal imbalance creates stragglers that slow multi-GPU nodes
Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs
-
Hybrid formats give 4.9× faster edge LLM inference on PIM
P3-LLM: An Integrated NPU-PIM Accelerator for Edge LLM Inference Using Hybrid Numerical Formats
-
DMA offloads close 4.5x gap for latency-bound ML collectives
DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication
-
Five-minute rule shrinks to seconds for AI systems
Five-Minute Rule 40 Years Later: A First-Principles Revisit for Modern Memory Hierarchy
-
SnapStream cuts KV cache memory by 4x for 128k LLM inference
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
-
Two-ion traps beat larger designs for surface-code trapped-ion computers
Architecting Scalable Trapped Ion Quantum Computers using Surface Codes
-
SilentZNS slashes ZNS SSD write amplification by 92%
Eliminating the Hidden Cost of Zone Management in ZNS SSDs
-
Search tunes allocators to cut heap use by 4 percent
GreenMalloc: Allocator Optimisation for Industrial Workloads
-
Fixed configs make Ramulator 2.0 match real memory performance
Cleaning up the Mess: Re-Evaluating the Real-System Modeling Accuracy of Ramulator 2.0
-
Dynamic pruning cuts vision transformer ops by 61 percent
Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow
-
Two-stage adaptation hits 93% compression on CIM chips
Computing-In-Memory Aware Model Adaption For Edge Devices
-
Switch cache lifts HDFS metadata throughput up to 181%
Fletch: File-System Metadata Caching in Programmable Switches
-
Framework measures real makespans from abstract graphs on CPU-GPU-FPGA hardware
Evaluating Rapid Makespan Predictions for Heterogeneous Systems with Programmable Logic
-
Profiling uncovers patterns that speed up large MoE inference 6.6x
Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference
-
Extended precision cuts large Max-Cut solve times
A Hardware Accelerator for the Goemans-Williamson Algorithm
-
Chiplet RISC-V SoC achieves 40% efficiency gain for edge AI
Chiplet-Based RISC-V SoC with Modular AI Acceleration
-
Flattened arrays and quantization break LLM memory walls
Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
-
FASE runs multi-thread benchmarks on FPGA before SoC integration
FASE: FPGA-Assisted Syscall Emulation for Rapid End-to-End Processor Performance Validation
-
Lifetime variation enables 14.5x carbon reduction in disposable smart items
Lifetime-Aware Design for Item-Level Intelligence at the Extreme Edge
-
Diffusion model optimizes all VLSI macros at once
DiffPlace: A Conditional Diffusion Framework for Simultaneous VLSI Placement Beyond Sequential Paradigms
-
Resource estimates find feasible setups for distributed quantum computers
Architecting Distributed Quantum Computers: Design Insights from Resource Estimation
-
Linear GNN tags jets under 60 ns on FPGAs
JEDI-linear: Fast and Efficient Graph Neural Networks for Jet Tagging on FPGAs
-
Memory reads turn into stochastic multiplies for matrix work
OISMA: On-the-fly In-memory Stochastic Multiplication Architecture for Matrix-Multiplication Workloads
-
Expert-sharded KV storage cuts memory use in MoE inference
PiKV: KV Cache Management System for Mixture of Experts
-
Retrieval lifts LLM success on RTL test fixes 7.72 times
VeriRAG: A Retrieval-Augmented Framework for Automated RTL Testability Repair
-
RL trains LLMs to output efficient Verilog designs
ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning
-
Distributed arithmetic cuts FPGA neural net resources by a third
da4ml: Distributed Arithmetic for Real-time Neural Networks on FPGAs