archive
Every paper Pith has read. Search by title, abstract, or pith.
493 papers in cs.AR · page 4
-
Grammar masking creates scalable benchmarks for RTL code completion
RuC: HDL-Agnostic Rule Completion Benchmark Generation
-
Hybrid engine generates UVM testbenches via LLM plans and fixed templates
HAVEN: Hybrid Automated Verification ENgine for UVM Testbench Synthesis with LLMs
-
Type recovery lifts 99.98% of GPU binaries to LLVM IR
CuLifter: Lifting GPU Binaries to Typed IR
-
Ternary LLM accelerator hits 70 tokens/s in 0.223 mm² chip
VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling
-
RCW scheme cuts LLM prefill latency nearly in half on digital CIM
RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write
-
Agents convert DRAM specs to formal DRAMPyML
Autoformalizing Memory Specifications with Agents
-
SafeTune filters poisoned RTL training data for secure LLM fine-tuning
SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation
-
This paper reviews recent advancements in mm-wave oscillators below 100 GHz and…
Recent Advances in mm-Wave and Sub-THz/THz Oscillators for FutureG Technologies
-
The paper introduces Voxel, a compiler-aware simulation framework for studying the…
Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
-
More dense PEs outperform sparse hardware for pruned networks
Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators
-
V&V loop unifies UVM, FPGA and CI/CD for RISC-V chips
Verification and Validation (V&V)-in-the-Loop for RISC-V Design: The Holistic Vision of BZL
-
EMiX emulates 64-core RISC-V across eight FPGAs
EMiX: Emulating Beyond Single-FPGA Limits
-
Pipelined sharding speeds client xLM inference up to 30x with 10x less VRAM
Efficient, VRAM-Constrained xLM Inference on Clients
-
The paper proposes RKHS, a method that combines retrieval-augmented generation with…
RAG-Enhanced Kernel-Based Heuristic Synthesis (RKHS): A Structured Methodology Using Large Language Models for Hardware Design
-
Memory-centric chiplets cut attention latency 15 times
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
-
FPGA CNN classifies heart vibrations at 8.55 mW
At the Edge of the Heart: ULP FPGA-Based CNN for On-Device Cardiac Feature Extraction in Smart Health Sensors for Astronauts
-
Randomness in BP decoding boosts quantum accuracy by 2-8 orders of magnitude
Lottery BP: Unlocking Quantum Error Decoding at Scale
-
3D NAND flash runs LLM feed-forward math for 38x edge speedup
NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference
-
4-state MTJ crossbar runs MNIST inference at 94.48% accuracy
Multibit neural inference in a N-ary crossbar architecture
-
Scheduler boosts FTQC multiprogramming by 3.1x
No Tile Left Behind: Multiprogramming for Surface-Code Architectures
-
Adaptive windows speed multi-macro CIM CNN mapping 1.3x
TetrisG-SDK: Efficient Convolutional Layer Mapping with Adaptive Windows and Grouped Convolutions for Fast In-Memory Computing
-
Frequency remapping cuts recsys inference latency 81% on flash
RecFlash: Fast Recommendation System on In-Storage Computing with Frequency-Based Data Mapping
-
Mobile NPU-PIM design speeds LLM drafts 4.2 times
AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices
-
FusionCIM cuts LLM energy use by up to 3.86x
FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture
-
RL matches expert chip placement by learning rewards from final layouts
How Can Reinforcement Learning Achieve Expert-level Placement?
-
LUT accelerators deliver 2.2x area cut for 1.58-bit LLMs
Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference
-
LLM agent evolves cache policies that beat LRU by 6 percent IPC
Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization
-
Signature tree skips repeated checks to speed FPGA packing 3.7x on average
Déjà Vu Packing: Optimizing FPGA Logic Clustering Runtime via Pattern Memoization
-
ASIC accelerator achieves 3.5x throughput for long-context attention
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
-
VTA compiler now automates large CNN compilation
Compilation and Execution of an Embeddable YOLO-NAS on the VTA
-
RowHammer refreshes drop 95-99.99% with vulnerability tracking
RowHammer Vulnerability Counter (RVC): Redefining RowHammer Detection with Victim-Centric Tracking
-
Atomic coherence lets 3D CNNs classify video at 125000 fps
Opto-Atomic Spatio-Temporal Holographic Correlators for High-Speed 3D CNNs
-
Edge AI can pass accuracy checks but fail timing on shared hardware
Architectural Isolation as a Timing Safety Primitive for Edge AI Medical Devices: Controlled Experimental Evidence on a Shared-Silicon Platform
-
Flow matching produces overlap-free chip placements 10-50x faster
FlowPlace: Flow Matching for Chip Placement
-
Exact normalization preserved in 14x smaller Softmax and LayerNorm
Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices
-
Retrieval-augmented LLM forecasts timing slacks from Verilog
TimingLLM: A Two-Stage Retrieval-Augmented Framework for Pre-Synthesis Timing Prediction from Verilog
-
Mixed-radix CORDIC cuts FPGA sigmoid to 835 slices
Hardware-Efficient FPGA Implementation of Sigmoid Function Using Mixed-Radix Hyperbolic Rotation CORDIC
-
The paper describes a hybrid runtime that mixes Just-In-Time compilation with CUDA Graphs…
Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference
-
CuTile hits 1007 TFLOP/s attention on B200 in 60 Python lines
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
-
Tessera hides AES decryption behind DRAM fetches on edge chips
Tessera: Secure, Near-Line-Rate Weight Streaming for UMA Edge Accelerators
-
Cosine similarity and NAS tested for vector-quantized model compression
Efficient VQ-QAT and Mixed Vector/Linear quantized Neural Networks
-
Activation patterns cut multi-node MoE communication up to 20x
Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
-
MTJ memory integrates stochastic computing to skip external random generators
Maximizing Memory-Level Parallelism via Integrated Stochastic Logic-in-Memory Architectures
-
Agent with solver feedback achieves 82% correct hardware assertions
From Language to Logic: Bridging LLMs & Formal Representations for RTL Assertion Generation
-
Accelerators improve LLM speed on edge single-board computers
Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers
-
MPS gains or loses 30% in GPU sharing depending on memory contention
A comprehensive evaluation of spatial co-execution on GPUs using MPS and MIG technologies
-
Vector processor optimizations yield 1.33x speedup without extra bandwidth
Microarchitectural Co-Optimization for Sustained Throughput of RISC-V Multi-Lane Chaining Vector Processors
-
Top-K method speeds sparse decode 1.88x on Blackwell
Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation
-
Polyhedral analysis uncovers hidden mmuls for CGRA speedups up to 9.1x
Exploiting pre-optimized kernels with polyhedral transformations for CGRA compilation
-
HGQ-LUT trains LUT neural nets 100x faster on GPUs
HGQ-LUT: Fast LUT-Aware Training and Efficient Architectures for DNN Inference