archive

Every paper Pith has read.

493 papers in cs.AR · page 9

cs.CR 2026-01-30 reviewed

NTT design detects Trojan control and timing faults in PQC
Trojan-Resilient NTT: Protecting Against Control Flow and Timing Faults on Reconfigurable Platforms

Rourab Paul +2
cs.AR 2026-01-28 reviewed

First NPU designed for diffusion language model inference
NPU Design for Diffusion Language Model Inference

Binglei Lou +11
cs.AR 2026-01-22 reviewed

Hypergraphs cut spike traffic in neuromorphic SNN mappings
A Case for Hypergraphs to Model and Map SNNs on Neuromorphic Hardware

Marco Ronzani +1
cs.PF 2026-01-21 reviewed

Hybrid model cuts GPU kernel prediction error by 6.7x
PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction

Kaixuan Zhang +10
cs.LG 2026-01-20 reviewed

NSF urged to fund AI for faster chip design cycles
Report for NSF Workshop on AI for Electronic Design Automation

Deming Chen +9
eess.SP 2026-01-13 reviewed

Compact RISC-V core fits biomedical control in 708 LUTs
Bio-RV: Low-Power Resource-Efficient RISC-V Processor for Biomedical Applications

Vijay Pratap Sharma +4
cs.AR 2026-01-05 reviewed

Timing windows detect microcontroller ageing via frequency shifts
Ageing Monitoring for Commercial Microcontrollers Based on Timing Windows

Leandro Lanzieri +4
cs.DC 2025-12-18 reviewed

Tool spots bit-flip faults in LLMs for fast fixes
BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs

Muhammad Zeeshan Karamat +2
cs.AR 2025-12-10 reviewed

Dynamic buckets lift LLM cache use 19% on LPDDR chips
ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators

Guoqiang Zou +4
cs.LG 2025-12-09 reviewed

Pipelined NN training sets delays by layer depth and reconstructs old weights with moving averages
LayerPipe2: Multistage Pipelining and Weight Recompute via Improved Exponential Moving Average for Training Neural Networks

Nanda K. Unnikrishnan +1
cs.AR 2025-12-08 reviewed

FPGA accelerator speeds graph classification 6.85× with 3.4% accuracy gain
Efficient and Accurate Graph Classification with Hyperdimensional Computing on FPGA

Jebacyril Arockiaraj +2
cs.AR 2025-12-08 reviewed

Generative transformer cuts circuit delay 30% and gates 50%
GTAC: A Generative Transformer for Approximate Circuits

Jingxin Wang +6
cs.MS 2025-12-07 reviewed

Models emulate NVIDIA Tensor Core behavior in low precision
Accurate Models of NVIDIA Tensor Cores

Faizan A. Khattak +1
cs.AR 2025-11-27 reviewed

Co-design framework accelerates domains up to 15x with low overhead
Aquas: Enhancing Domain Specialization through Holistic Hardware-Software Co-Optimization based on MLIR

Yuyang Zou +8
cs.DC 2025-11-25 reviewed

Voxel traits let Spira skip kernel-map overhead for 3x faster point-cloud convolution
Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks

Dionysios Adamopoulos +3
cs.LG 2025-11-25 reviewed

Round-trip LLM translation catches hallucinations in hardware design
Mitigating hallucinations and omissions in LLMs for invertible problems: An application to hardware logic design automation

Andrew S. Cassidy +6
cs.AR 2025-11-21 reviewed

AmpereOne adds memory tagging with zero capacity overhead
Optimized Memory Tagging on AmpereOne Processors

Shivnandan Kaushik +16
cs.AR 2025-11-21 reviewed

Digital in-memory design reaches 3.59 TOPS/W for AI matrix math
DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format

Shady Agwa +3
cs.AR 2025-11-19 reviewed

Fused unit runs mixed-precision dot products in four cycles
Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores

Nikhil Rout +1
cs.AR 2025-11-19 reviewed

Joint data-compute tuning speeds ML kernels on PIM up to 13x
DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures

Peiming Yang +6
cs.AR 2025-11-14 reviewed

8x faster linearity testing for 16-bit SAR ADCs
Advanced Strategies for Uncertainty-Guided Live Measurement Sequencing in Fast, Robust SAR ADC Linearity Testing

Thorben Schey +3
cs.AR 2025-11-14 reviewed

Adaptive EKF sequencing cuts SAR ADC linearity test time
Uncertainty-Guided Live Measurement Sequencing for Fast SAR ADC Linearity Testing

Thorben Schey +3
cs.AR 2025-11-14 reviewed

Closed-loop tests yield first bit-accurate models for ten GPU matrix units
Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy

Peichen Xie +4
cs.DC 2025-11-13 reviewed

Thermal imbalance creates stragglers that slow multi-GPU nodes
Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs

Marco Kurzynski +2
cs.AR 2025-11-10 reviewed

Hybrid formats give 4.9× faster edge LLM inference on PIM
P3-LLM: An Integrated NPU-PIM Accelerator for Edge LLM Inference Using Hybrid Numerical Formats

Yuzong Chen +6
cs.DC 2025-11-10 reviewed

DMA offloads close 4.5x gap for latency-bound ML collectives
DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication

Suchita Pati +5
cs.AR 2025-11-06 reviewed

Five-minute rule shrinks to seconds for AI systems
Five-Minute Rule 40 Years Later: A First-Principles Revisit for Modern Memory Hierarchy

Tong Zhang +9
cs.AI 2025-11-05 reviewed

SnapStream cuts KV cache memory by 4x for 128k LLM inference
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

Jonathan Li +21
quant-ph 2025-10-27 reviewed

Two-ion traps beat larger designs for surface-code trapped-ion computers
Architecting Scalable Trapped Ion Quantum Computers using Surface Codes

Scott Jones +1
cs.AR 2025-10-24 reviewed

SilentZNS slashes ZNS SSD write amplification by 92%
Eliminating the Hidden Cost of Zone Management in ZNS SSDs

Teona Bagashvili +3
cs.SE 2025-10-24 reviewed

Search tunes allocators to cut heap use by 4 percent
GreenMalloc: Allocator Optimisation for Industrial Workloads

Aidan Dakhama +3
cs.AR 2025-10-17 reviewed

Fixed configs make Ramulator 2.0 match real memory performance
Cleaning up the Mess: Re-Evaluating the Real-System Modeling Accuracy of Ramulator 2.0

F. Nisa Bostanci +6
cs.AR 2025-10-16 reviewed

Dynamic pruning cuts vision transformer ops by 61 percent
Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow

Ching-Lin Hsiung +1
cs.AR 2025-10-16 reviewed

Two-stage adaptation hits 93% compression on CIM chips
Computing-In-Memory Aware Model Adaption For Edge Devices

Ming-Han Lin +1
cs.AR 2025-10-09 reviewed

Switch cache lifts HDFS metadata throughput up to 181%
Fletch: File-System Metadata Caching in Programmable Switches

Qingxiu Liu +6
cs.DC 2025-10-08 reviewed

Framework measures real makespans from abstract graphs on CPU-GPU-FPGA hardware
Evaluating Rapid Makespan Predictions for Heterogeneous Systems with Programmable Logic

Martin Wilhelm +3
cs.DC 2025-10-07 reviewed

Profiling uncovers patterns that speed up large MoE inference 6.6x
Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference

Zhongkai Yu +8
cs.AR 2025-10-03 reviewed

Extended precision cuts large Max-Cut solve times
A Hardware Accelerator for the Goemans-Williamson Algorithm

D. A. Herrera-Martí +2
cs.AR 2025-09-22 reviewed

Chiplet RISC-V SoC achieves 40% efficiency gain for edge AI
Chiplet-Based RISC-V SoC with Modular AI Acceleration

Suhas Suresh Bharadwaj +1
cs.AR 2025-09-11 reviewed

Flattened arrays and quantization break LLM memory walls
Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

Haoran Wu +17
cs.AR 2025-09-10 reviewed

FASE runs multi-thread benchmarks on FPGA before SoC integration
FASE: FPGA-Assisted Syscall Emulation for Rapid End-to-End Processor Performance Validation

Chengzhen Meng +5
cs.AR 2025-09-09 reviewed

Lifetime variation enables 14.5X carbon reduction in disposable smart items
Lifetime-Aware Design for Item-Level Intelligence at the Extreme Edge

Shvetank Prakash +15
cs.AR 2025-09-09 reviewed

Diffusion model optimizes all VLSI macros at once
DiffPlace: A Conditional Diffusion Framework for Simultaneous VLSI Placement Beyond Sequential Paradigms

Kien Le Trung +1
quant-ph 2025-08-26 reviewed

Resource estimates find feasible setups for distributed quantum computers
Architecting Distributed Quantum Computers: Design Insights from Resource Estimation

Dmitry Filippov +2
hep-ex 2025-08-21 reviewed

Linear GNN tags jets under 60 ns on FPGAs
JEDI-linear: Fast and Efficient Graph Neural Networks for Jet Tagging on FPGAs

Zhiqiang Que +10
cs.AR 2025-08-12 reviewed

Memory reads turn into stochastic multiplies for matrix work
OISMA: On-the-fly In-memory Stochastic Multiplication Architecture for Matrix-Multiplication Workloads

Shady Agwa +3
cs.DC 2025-08-02 reviewed

Expert-sharded KV storage cuts memory use in MoE inference
PiKV: KV Cache Management System for Mixture of Experts

Dong Liu +3
cs.AR 2025-07-21 reviewed

Retrieval lifts LLM success on RTL test fixes 7.72 times
VeriRAG: A Retrieval-Augmented Framework for Automated RTL Testability Repair

Haomin Qi +5
cs.AI 2025-07-07 reviewed

RL trains LLMs to output efficient Verilog designs
ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning

Zhirong Chen +10
cs.AR 2025-07-06 reviewed

Distributed arithmetic cuts FPGA neural net resources by a third
da4ml: Distributed Arithmetic for Real-time Neural Networks on FPGAs

Chang Sun +4