archive
Every paper Pith has read. Search by title, abstract, or pith.
493 papers in cs.AR · page 1
-
DORA keeps DNN accelerator efficiency steady across 6× workload variation
DORA: Dataflow-Instruction Orchestration Architecture for DNN Acceleration
-
UniSpike bundles spikes to cut neuromorphic traffic 1.93 times
UniSpike: Accelerating Spiking Neural Networks on Neuromorphic Systems via Eliminating Address Redundancy
-
Overlays beat custom designs for frequent model switches in self-driving
To Overlay or to Customize? Revisiting Architectural Choices in Heterogeneous Systems
-
Explicit decoupling gives HLS 10-79x speedups on complex memory patterns
DAE4HLS: Exposing Memory-Level Parallelism for High-Level Synthesis using Explicit Decoupling
-
3D NAND fuses MoE selection and compute for 114x faster inference
NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference
-
Stage-wise precision cuts masked diffusion compute by up to 16x
MASQ: Accelerating Masked Diffusion via Stage-Wise Multi-Precision Quantization
-
ACALSim reaches 14x speedup over SST on large GPU simulations
ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration
-
Prior outputs double token cuts in video diffusion for 4.5x speedup
ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration
-
Co-design speeds vector search up to 8.4 times over CPU
NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search with DIMM-Based Near-Data Processing
-
Memory technologies reviewed for room and cryogenic use
Emerging memory technologies at room/cryogenic temperature
-
Component-level GPU control yields 10% energy savings
CompPow: A Case for Component-level GPU Power Management
-
Dynamic control flow speeds up reconfigurable processors
Supporting Dynamic Control-Flow Execution for Runtime Reconfigurable Processors
-
Roadside perception services turn on only when vehicles approach
Cloud-Native Operation of Roadside Infrastructure Enabling Demand-Driven Collective Perception via V2X
-
Telesistors provide noise-protected Clifford gates for quantum computing
Towards transistor-based quantum computing
-
ELSA gives spiking networks 3.4x faster inference than top accelerators
ELSA: An ELastic SNN Inference Architecture for Efficient Neuromorphic Computing
-
ReRAM macro reaches 419 TOPS/W for edge neural inference
E-ReCON: An Energy- and Resource-Efficient Precision-Configurable Sparse nvCIM Macro for Conventional and Spiking Neural Edge Inference
-
Multi-rank PIM beats CPUs on AES and SHA-256
Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM
-
Hardware latch enables 452 nA quiescent drain in sensors
A Hardware-Based Multi-Stage Dynamic Power Management Architecture for Autonomous Low-Light Operation
-
Digital near-memory design accelerates GNNs up to 230x
A complete discussion on fully reconfigurable, digital, scalable, graph and sparsity-aware near-memory accelerator for graph neural networks
-
Only two of five LLMs finish valid SoC co-design
HSCO-Bench: An Agent-Driven End-to-End Hardware-Software Co-design Benchmark for Systems-on-Chip
-
Software scheduling predicts optical thermal drift early
Predictive Software Scheduling as an Early-Warning Hint Layer for Optical Engine Thermal Drift in Heterogeneous SoIC Packaging
-
Input flips extend multiplier life under NBTI aging
Building Reliable Arithmetic Multipliers Under NBTI Aging and Process Variations
-
Hybrid cluster cuts HTTP response time by over 40%
iHAC: A Hybrid Cluster Architecture for Enhanced Performance and Resilience
-
Hybrid radio matches dedicated performance with far less setup
Enabling Agile Ambient IoT Networking via a Parameterized Hybrid Radio
-
JSON IR and compiler checks lift LLM circuit correctness
CPPL: A Circuit Prompt Programming Language
-
ROA bricks stabilize SHIL signals for Ising machines under variations
ROA-Based Subharmonic Injection Locking for Oscillator-Based Ising Machines
-
Direct AIE links enable 0.93 μs DNN inference on ACAP
μ-ORCA: Optimizing Acceleration for Microsecond-Scale Deep Neural Network Inference on ACAP
-
Compressed KV cache yields full accuracy at 4x throughput
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
-
Workload traces cut early PDN metal area by up to 33%
Workload-Aware Early-Stage Power Delivery Network Optimization via Architectural Power Traces
-
Near-cache accelerator speeds sparse ILP 15x with 152x less energy
A comprehensive study on ILP acceleration accounting for sparsity, area, energy, data movement using near-memory architecture
-
Traversal stack guides precise prefetching for faster ray tracing
TTP: A Hardware-Efficient Design for Precise Prefetching in Ray Tracing
-
6T SRAM sorting cuts latency by 3.4x versus memristor methods
ADS-IMC: Accelerating Data Sorting with In-Memory Computation
-
SRAM engine halves routing for binary neural nets
SRAM Based Digital Custom Compute Engine for Improved Area Efficiency of AI Hardware
-
Certificate-aware PDR solves six more instances with smaller proofs
Certificate-Aware Property-Directed Reachability
-
Instruction correlation prefetcher beats prior art by 14% with 2 KB storage
ICP: Exploiting Instruction Correlation for Prefetching Irregular Memory Accesses
-
Intra-thread duplication catches 39% more defective servers
ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions
-
Cache reorganization lifts GPU speedups for 28-qubit simulations on laptops
Accelerating State-Vector Quantum Simulation on Integrated GPUs via Cache Locality Optimization: A Cross-Architecture Evaluation
-
Agentic AI automates full accelerator design from scientific applications
A3D: Agentic AI flow for autonomous Accelerator Design
-
Time-domain near-memory MAC reaches 7.62 TOPS/W
Time Domain Near Memory Computing Engine
-
ViTs reach 84% accuracy by replacing layer norm with evolved scalars
Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation
-
End-to-end DVS-memristor system is the missing piece for low-power vision
Memristor Technologies for Dynamic Vision Sensors: A Critical Assessment and Research Roadmap
-
AI agents drop 37-58% on hardware vs software tasks
Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench
-
FPGA accelerator skips sparse beams for 2x faster MIMO localization
Efficient Implementation of an Adaptive Transformer Accelerator for Massive MIMO Outdoor Localization
-
7B model surpasses 671B baselines on SVA generation
Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation
-
FPGA lock agents boost OLTP throughput 51x over CPUs
FPGA-Accelerated Lock Management and Transaction Processing: Architecture, Optimization, and Design Space Exploration
-
PoisonCap gives CHERI strict use-after-free at zero overhead
PoisonCap: Efficient Hierarchical Temporal Safety for CHERI
-
GenAI workflow maps RISC-V supply chains for risk analysis
GenAI-Driven Approach to RISC-V Supply Chain Exploration
-
Block-scale search cuts quantization error 27% in BFP
Search Your Block Floating Point Scales!
-
Joint TLB-cache tweaks boost instruction prefetching 8.7%
Enhancing Instruction Prefetching via Cache and TLB Management
-
FPGA SoC matches silicon SNN accuracy for neuromorphic edge tasks
Heterogeneous SoC Integrating an Open-Source Recurrent SNN Accelerator for Neuromorphic Edge Computing on FPGA