archive

Every paper Pith has read. Search by title, abstract, or pith.

493 papers in cs.AR · page 3

cs.AR 2026-05-07 reviewed

DySHARP speeds MoE models 1.79x with dynamic in-switch computing
Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs

Qijun Zhang +12
cs.AR 2026-05-06 reviewed

Reconfigurable arrays nearly double GPU energy efficiency
DICE: Enabling Efficient General-Purpose SIMT Execution with Statically Scheduled Coarse-Grained Reconfigurable Arrays

Jiayi Wang +3
cs.AR 2026-05-06 reviewed

Two policies cut mean IPC loss 13.6 times
Beyond Static Policies: Exploring Dynamic Policy Selection for Single-Thread Performance Optimization

Yanxin Zhang +5
cs.LG 2026-05-06 reviewed

Joint training cuts AI multiplier power by up to 27 percent
TRAM: Training Approximate Multiplier Structures for Low-Power AI Accelerators

Chang Meng +5
cs.AR 2026-05-06 reviewed

Flow automatically converts flip-flops to two-phase latches
An Open-Source Flow for Single-Phase, Edge-Triggered to Two-Phase, Non-Overlapping Clocking Conversion

Paolo Pedroso +2
cs.AR 2026-05-06 reviewed

Multicore design achieves 3.1x speedup with four cores
REPTILES: Repeated Tiles of Sargantana, a RISC-V multicore based on OpenPiton

Noelia Oliete-Escuín +28
cs.AR 2026-05-06 reviewed

Agent builds TurboQuant accelerator in 80 hours
Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

The Verkor Team: Ravi Krishna +2
cs.AR 2026-05-06 reviewed

Commercial 3D NAND chips run over a billion bitwise ops error-free
MCFlash: Bulk Bitwise Processing in 3D NAND with Dynamic Sensing and Multi-level Encoding

Habib Ur Rahman +3
cs.AR 2026-05-06 reviewed

Data corruption dominates transient faults in RISC-V vectors
Not All Faults Are Equal: Transient-Fault Sensitivity Characterization of an Open-Source RISC-V Vector Cluster

Maoyuan Cai +3
cs.LG 2026-05-06 reviewed

Approximate multipliers allow full ResNet MoE recovery after retraining
AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures

Omkar B Shende +2
cs.AR 2026-05-06 reviewed

LLM framework builds UVM testbenches in 4.5 hours at 95.65% coverage
UVMarvel: an Automated LLM-aided UVM Machine for Subsystem-level RTL Verification

Junhao Ye +9
cs.AR 2026-05-06 reviewed

SDM circuit switching cuts NoC power by 38 percent
Ultra Low-Power SDM-based Circuit-Switching for Networks-on-Chip

Meysam Zaeemi +1
cs.AR 2026-05-06 reviewed

RangeGuard corrects 64+ bit flips using 16-bit parity in DNNs
RangeGuard: Efficient, Bounded Approximate Error Correction for Reliable DNNs

Hanum Ko +3
cs.AR 2026-05-05 reviewed

GPU silent errors rarely produce NaN or infinity values
The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance

Chung-Hsuan Tung +7
cs.DC 2026-05-05 reviewed

Microbenchmark models predict GPU performance with 1% error
Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures

Aaron Jarmusch +1
cs.AR 2026-05-05 reviewed

ISA-level model defines safe behaviors for programmable caches
täkōFormal: Enabling Robust Software for Programmable Memory Hierarchies (Extended Version)

Pranav Srinivasan +2
cs.CR 2026-05-05 reviewed

LIPPEN is a hardware-software co-design that encrypts the full 64-bit pointer in place
LIPPEN: A Lightweight In-Place Pointer Encryption Architecture for Pointer Integrity

Erfan Iravani +6
cs.AR 2026-05-05 reviewed

SPEC CPU2026 increases instruction volume and cache pressure
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison

RuiHao Li +3
cs.AR 2026-05-05 reviewed

4-5 workloads preserve 96-99% of SPEC CPU2026 behavior
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison

RuiHao Li +3
cs.AR 2026-05-05 reviewed

FPGA BNN YOLO detector matches ONNX at 0.999964 correlation
Design and Implementation of BNN-Based Object Detection on FPGA

Xuyu Zhao +7
cs.AR 2026-05-05 reviewed

FPGA runs BNN object detector matching software at 0.999964 correlation
Design and Implementation of BNN-Based Object Detection on FPGA

Xuyu Zhao +7
cs.AR 2026-05-04 reviewed

Narrow final layer cuts LGN FPGA use by 28%
Resource Utilization of Differentiable Logic Gate Networks Deployed on FPGAs

Stephen Wormald +3
quant-ph 2026-05-04 reviewed

Automated predecoders cut quantum decoder use by up to 4000 times
Mitigating Classical Resource Costs in Quantum Error Correction via Generalized qLDPC Predecoding

Alexander Knapen +8
eess.SP 2026-05-04 reviewed

Beamspace low-rank preconditioner cuts CG iterations by two to three
Low-rank Preconditioning in Beamspace Domain For Massive MU-MIMO Long-Term Beamforming

Amirreza Kiani +3
cs.AR 2026-05-04 reviewed

MRDIMMs raise server memory bandwidth 41% with 30% energy savings
Performance and Energy Benefits of MRDIMMs

Pau Díaz +9
cs.AR 2026-05-04 reviewed

Single encoding unifies device
Cerberus: Cross-Layer ECC Co-Design for Robust and Efficient Memory Protection

Junhwan Kim +4
cs.AR 2026-05-04 reviewed

Single encoding reused across DRAM ECC layers
Cerberus: Cross-Layer ECC Co-Design for Robust and Efficient Memory Protection

Junhwan Kim +4
cs.NI 2026-05-04 reviewed

One NIC data path runs TCP and RoCE at line rate
A Protocol-Independent Transport Architecture

Kimiya Mohammadtaheri +10
cs.AR 2026-05-03 reviewed

3D stacking cuts NCL circuit area by 44%
Monolithic 3D Integration for Null Convention Logic (NCL)-Based Asynchronous Circuits

Xiameng Zhang +3
cs.LG 2026-05-03 reviewed

The paper surveys neural architecture search methods through the lens of efficiency
HERCULES: Hardware-Efficient, Robust, Continual Learning Neural Architecture Search

Matteo Gambella +2
cs.AR 2026-05-03 reviewed

The paper introduces ViM-Q, a co-design of quantization techniques and custom FPGA…
ViM-Q: Scalable Algorithm-Hardware Co-Design for Vision Mamba Model Inference on FPGA

Shengzhe Lyu +4
cs.IT 2026-05-03 reviewed

SwiftChannel pairs a compressed deep learning model for reconstructing 5G channel…
SwiftChannel: Algorithm-Hardware Co-Design for Deep Learning-Based 5G Channel Estimation

Shengzhe Lyu +7
cs.AR 2026-05-03 reviewed

RISC-V pipeline at 8 stages triples frequency and lifts throughput 71 percent
RV-IM100: Quantifying ISA Extension, Datapath Width, and Pipeline Depth Trade-offs in RISC-V Microarchitectures

Hyunwoo Kang
cs.AR 2026-05-03 reviewed

IR-level register tweaks cut delay
PipeRTL: Timing-Aware Pipeline Optimization at IR-Level for RTL Generation

Shuo Yin +8
cs.PF 2026-05-02 reviewed

SPEC CPU 2026 standardizes mixed-workload CPU benchmarking
SPEC CPU: The Next Generation

Mahesh Madhav +33
cs.AR 2026-05-02 reviewed

FPGA accelerator speeds SVD for PCA 22x over GPU
MANOJAVAM: A Scalable, Unified FPGA Accelerator for Matrix Multiplication and Singular Value Decomposition in Principal Component Analysis

Srivaths Ramasubramanian +7
cs.AR 2026-05-02 reviewed

Gem5 call stacks reveal what stats miss in simulated CPUs
Understanding Simulated Architecture via gem5 Call-Stack Profiling

Johan Söderström +2
cs.AR 2026-05-02 reviewed

AMSnet-q converts schematic images of analog and mixed-signal circuits into a fully…
AMSnet-q: Unsupervised Circuit Identification and Performance Labeling for AMS Circuits

Ze Zhang +8
eess.IV 2026-05-02 reviewed

Blackwell NVENC UHQ gains quality at 400% latency cost
Evolution of NVENC Efficiency: A Longitudinal Analysis of HQ and UHQ Tuning Efficiency, Latency and Energy Trade-offs

Kasidis Arunruangsirilert +1
cs.AR 2026-05-01 reviewed

Simulator models FlashAttention-3 pipelines to 5.7% error
Sim-FA: A GPGPU Simulator Framework for Fine-Grained FlashAttention Pipeline Analysis

Zhongchun Zhou +4
cs.DC 2026-05-01 reviewed

Fixed-core approach yields 211x higher efficiency for edge GEMM
Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

M. Grailoo +1
cs.PF 2026-05-01 reviewed

Apple Silicon runs 80B LLMs at 23x Nvidia energy efficiency
Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

Abdurrahman Javat +1
cs.AR 2026-05-01 reviewed

Prototype chip runs 3B ternary LLM at 72 tokens per second
VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices

Zi-Wei Lin +1
cs.AR 2026-05-01 reviewed

Subthreshold SRAM CIM hits 1181 TOPS/W for spiking networks
A PVT-Resilient Subthreshold SRAM-Based In-Memory Computing Accelerator with In-Situ Regulation for Energy-Efficient Spiking Neural Networks

Shih-Hang Kao +9
cs.AR 2026-04-30 reviewed

DPU-GPU split cuts CNN latency up to 3.37 times versus GPU alone
DPU or GPU for Accelerating Neural Networks Inference -- Why not both? Split CNN Inference

Ali Emre Oztas +3
cs.CY 2026-04-30 reviewed

AI trust can be measured via pillars and agentic interfaces
I hope we don't do to trust what advertising has done to love

Jade Alglave
cs.CY 2026-04-30 reviewed

AI trust needs pillars and vectors to stay meaningful
I hope we don't do to trust what advertising has done to love

Jade Alglave
cs.AR 2026-04-30 reviewed

Ring topology on FPGAs runs cortical circuit faster than real time
NeuroRing: Scaling Spiking Neural Networks via Multi-FPGA Bidirectional Ring Topologies and Stream-Dataflow Architectures

Muhammad Ihsan Al Hafiz +1
cs.OS 2026-04-30 reviewed

Affinity hints give 12% throughput boost on chiplet servers
Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale

Jin Xin Ng +9
cs.AR 2026-04-30 reviewed

Memory chips run matrix math at 14.9 GFLOP/s
AME-PIM: Can Memory be Your Next Tensor Accelerator?

Emanuele Venieri +5