archive

Every paper Pith has read.

493 papers in cs.AR

cs.AR 2026-04-30 reviewed

Grammar masking creates scalable benchmarks for RTL code completion
RuC: HDL-Agnostic Rule Completion Benchmark Generation

Arnau Ayguadé Domingo +7
cs.AR 2026-04-30 reviewed

Hybrid engine generates UVM testbenches via LLM plans and fixed templates
HAVEN: Hybrid Automated Verification ENgine for UVM Testbench Synthesis with LLMs

Chang-Chih Meng +5
cs.AR 2026-04-30 reviewed

Type recovery lifts 99.98% of GPU binaries to LLVM IR
CuLifter: Lifting GPU Binaries to Typed IR

Jisheng Zhao +4
cs.AR 2026-04-30 reviewed

Ternary LLM accelerator hits 70 tokens/s in 0.223 mm² chip
VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

Zi-Wei Lin +1
cs.AR 2026-04-30 reviewed

RCW scheme cuts LLM prefill latency nearly in half on digital CIM
RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write

Yan-Cheng Guo +2
cs.AR 2026-04-30 reviewed

Agents convert DRAM specs to formal DRAMPyML
Autoformalizing Memory Specifications with Agents

Jan Ole Ernst +8
cs.CR 2026-04-29 reviewed

SafeTune filters poisoned RTL training data for secure LLM fine-tuning
SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation

Mahshid Rezakhani +3
eess.SP 2026-04-29 reviewed

This paper reviews recent advancements in mm-wave oscillators below 100 GHz and…
Recent Advances in mm-Wave and Sub-THz/THz Oscillators for FutureG Technologies

Baktash Behmanesh +1
cs.AR 2026-04-29 reviewed

The paper introduces Voxel, a compiler-aware simulation framework for studying the…
Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel

Yiqi Liu +4
cs.AR 2026-04-29 reviewed

More dense PEs outperform sparse hardware for pruned networks
Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators

Hyunsung Yoon +2
cs.AR 2026-04-29 reviewed

V&V loop unifies UVM, FPGA and CI/CD for RISC-V chips
Verification and Validation (V&V)-in-the-Loop for RISC-V Design: The Holistic Vision of BZL

Sajjad Ahmed +23
cs.AR 2026-04-29 reviewed

EMiX emulates 64-core RISC-V across eight FPGAs
EMiX: Emulating Beyond Single-FPGA Limits

Alexander Kropotov +2
cs.DC 2026-04-29 reviewed

Pipelined sharding speeds client xLM inference up to 30x with 10x less VRAM
Efficient, VRAM-Constrained xLM Inference on Clients

Aditya Ukarande +3
cs.AR 2026-04-28 reviewed

The paper proposes RKHS, a method that combines retrieval-augmented generation with…
RAG-Enhanced Kernel-Based Heuristic Synthesis (RKHS): A Structured Methodology Using Large Language Models for Hardware Design

Shiva Ahir +1
cs.AR 2026-04-28 reviewed

Memory-centric chiplets cut attention latency 15 times
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

Zhongkai Yu +11
cs.AR 2026-04-28 reviewed

FPGA CNN classifies heart vibrations at 8.55 mW
At the Edge of the Heart: ULP FPGA-Based CNN for On-Device Cardiac Feature Extraction in Smart Health Sensors for Astronauts

Kazi Mohammad Abidur Rahman +4
cs.AR 2026-04-28 reviewed

Randomness in BP decoding boosts quantum accuracy by 2-8 orders of magnitude
Lottery BP: Unlocking Quantum Error Decoding at Scale

Yanzhang Zhu +4
cs.AR 2026-04-28 reviewed

3D NAND flash runs LLM feed-forward math for 38x edge speedup
NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

Mingbo Hao +6
cs.AR 2026-04-28 reviewed

4-state MTJ crossbar runs MNIST inference at 94.48% accuracy
Multibit neural inference in a N-ary crossbar architecture

Anatole Moureaux +2
quant-ph 2026-04-28 reviewed

Scheduler boosts FTQC multiprogramming by 3.1x
No Tile Left Behind: Multiprogramming for Surface-Code Architectures

Archisman Ghosh +2
cs.AR 2026-04-28 reviewed

Adaptive windows speed multi-macro CIM CNN mapping 1.3x
TetrisG-SDK: Efficient Convolutional Layer Mapping with Adaptive Windows and Grouped Convolutions for Fast In-Memory Computing

Ke Dong +3
cs.AR 2026-04-28 reviewed

Frequency remapping cuts recsys inference latency 81% on flash
RecFlash: Fast Recommendation System on In-Storage Computing with Frequency-Based Data Mapping

Jangho Baik +4
cs.AR 2026-04-28 reviewed

Mobile NPU-PIM design speeds LLM drafts 4.2 times
AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices

Ma Zirui +6
cs.AR 2026-04-28 reviewed

FusionCIM cuts LLM energy use by up to 3.86x
FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture

Zihao Xuan +6
cs.AR 2026-04-28 reviewed

RL matches expert chip placement by learning rewards from final layouts
How Can Reinforcement Learning Achieve Expert-level Placement?

Ruo-Tong Chen +9
cs.AR 2026-04-28 reviewed

LUT accelerators deliver 2.2x area cut for 1.58-bit LLMs
Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference

Robin Geens +3
cs.AI 2026-04-28 reviewed

LLM agent evolves cache policies that beat LRU by 6 percent IPC
Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization

Alexander Blasberg +2
cs.AR 2026-04-27 reviewed

Signature tree skips repeated checks to speed FPGA packing 3.7x on average
Déjà Vu Packing: Optimizing FPGA Logic Clustering Runtime via Pattern Memoization

Milo Liebster +2
cs.AR 2026-04-27 reviewed

ASIC accelerator achieves 3.5x throughput for long-context attention
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

Wang Fan +7
cs.AR 2026-04-27 reviewed

VTA compiler now automates large CNN compilation
Compilation and Execution of an Embeddable YOLO-NAS on the VTA

Anthony Faure-Gignoux +3
cs.CR 2026-04-27 reviewed

RowHammer refreshes drop 95-99.99% with vulnerability tracking
RowHammer Vulnerability Counter (RVC): Redefining RowHammer Detection with Victim-Centric Tracking

Lavi Jain +1
cs.AR 2026-04-27 reviewed

Atomic coherence lets 3D CNNs classify video at 125000 fps
Opto-Atomic Spatio-Temporal Holographic Correlators for High-Speed 3D CNNs

Xi Shen +3
cs.AR 2026-04-26 reviewed

Edge AI can pass accuracy checks but fail timing on shared hardware
Architectural Isolation as a Timing Safety Primitive for Edge AI Medical Devices: Controlled Experimental Evidence on a Shared-Silicon Platform

Akul Mallayya Swami
cs.AR 2026-04-26 reviewed

Flow matching produces overlap-free chip placements 10-50x faster
FlowPlace: Flow Matching for Chip Placement

Peng Xie +8
cs.AR 2026-04-26 reviewed

Exact normalization preserved in 14x smaller Softmax and LayerNorm
Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices

Dawon Choi +2
cs.AR 2026-04-26 reviewed

Retrieval-augmented LLM forecasts timing slacks from Verilog
TimingLLM: A Two-Stage Retrieval-Augmented Framework for Pre-Synthesis Timing Prediction from Verilog

Armin Abdollahi +3
cs.AR 2026-04-26 reviewed

Mixed-radix CORDIC cuts FPGA sigmoid to 835 slices
Hardware-Efficient FPGA Implementation of Sigmoid Function Using Mixed-Radix Hyperbolic Rotation CORDIC

Chintan Panchal +2
cs.LG 2026-04-25 reviewed

The paper describes a hybrid runtime that mixes Just-In-Time compilation with CUDA Graphs…
Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference

Divakar Kumar Yadav +1
cs.LG 2026-04-25 reviewed

CuTile hits 1007 TFLOP/s attention on B200 in 60 Python lines
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Divakar Kumar Yadav +2
cs.CR 2026-04-25 reviewed

Tessera hides AES decryption behind DRAM fetches on edge chips
Tessera: Secure, Near-Line-Rate Weight Streaming for UMA Edge Accelerators

Animan Naskar
cs.LG 2026-04-25 reviewed

Cosine similarity and NAS tested for vector-quantized model compression
Efficient VQ-QAT and Mixed Vector/Linear quantized Neural Networks

Terry Gou +1
cs.LG 2026-04-25 reviewed

Activation patterns cut multi-node MoE communication up to 20x
Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

Abhimanyu Bambhaniya +8
cs.ET 2026-04-25 reviewed

MTJ memory integrates stochastic computing to skip external random generators
Maximizing Memory-Level Parallelism via Integrated Stochastic Logic-in-Memory Architectures

Farzad Razi +4
cs.CR 2026-04-25 reviewed

Agent with solver feedback achieves 82% correct hardware assertions
From Language to Logic: Bridging LLMs & Formal Representations for RTL Assertion Generation

Nowfel Mashnoor +2
cs.AR 2026-04-24 reviewed

Accelerators improve LLM speed on edge single-board computers
Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers

Harri Renney +3
cs.DC 2026-04-24 reviewed

MPS gains or loses 30% in GPU sharing depending on memory contention
A comprehensive evaluation of spatial co-execution on GPUs using MPS and MIG technologies

Jorge Villarrubia +3
cs.AR 2026-04-24 reviewed

Vector processor optimizations yield 1.33x speedup without extra bandwidth
Microarchitectural Co-Optimization for Sustained Throughput of RISC-V Multi-Lane Chaining Vector Processors

Weiying Wang +1
cs.DC 2026-04-24 reviewed

Top-K method speeds sparse decode 1.88x on Blackwell
Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

Long Cheng +9
cs.AR 2026-04-24 reviewed

Polyhedral analysis uncovers hidden mmuls for CGRA speedups up to 9.1x
Exploiting pre-optimized kernels with polyhedral transformations for CGRA compilation

Yuxuan Wang +5
cs.AR 2026-04-24 reviewed

HGQ-LUT trains LUT neural nets 100x faster on GPUs
HGQ-LUT: Fast LUT-Aware Training and Efficient Architectures for DNN Inference

Chang Sun +6