archive
Every paper Pith has read.
493 papers in cs.AR · page 3
-
DySHARP speeds MoE models 1.79x with dynamic in-switch computing
Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
-
Reconfigurable arrays nearly double GPU energy efficiency
DICE: Enabling Efficient General-Purpose SIMT Execution with Statically Scheduled Coarse-Grained Reconfigurable Arrays
-
Two policies cut mean IPC loss 13.6 times
Beyond Static Policies: Exploring Dynamic Policy Selection for Single-Thread Performance Optimization
-
Joint training cuts AI multiplier power by up to 27 percent
TRAM: Training Approximate Multiplier Structures for Low-Power AI Accelerators
-
Flow automatically converts flip-flops to two-phase latches
An Open-Source Flow for Single-Phase, Edge-Triggered to Two-Phase, Non-Overlapping Clocking Conversion
-
Multicore design achieves 3.1x speedup with four cores
REPTILES: Repeated Tiles of Sargantana, a RISC-V multicore based on OpenPiton
-
Agent builds TurboQuant accelerator in 80 hours
Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours
-
Commercial 3D NAND chips run over a billion bitwise ops error-free
MCFlash: Bulk Bitwise Processing in 3D NAND with Dynamic Sensing and Multi-level Encoding
-
Data corruption dominates transient faults in RISC-V vectors
Not All Faults Are Equal: Transient-Fault Sensitivity Characterization of an Open-Source RISC-V Vector Cluster
-
Approximate multipliers allow full ResNet MoE recovery after retraining
AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures
-
LLM framework builds UVM testbenches in 4.5 hours at 95.65% coverage
UVMarvel: an Automated LLM-aided UVM Machine for Subsystem-level RTL Verification
-
SDM circuit switching cuts NoC power by 38 percent
Ultra Low-Power SDM-based Circuit-Switching for Networks-on-Chip
-
RangeGuard corrects 64+ bit flips using 16-bit parity in DNNs
RangeGuard: Efficient, Bounded Approximate Error Correction for Reliable DNNs
-
GPU silent errors rarely produce NaN or infinity values
The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance
-
Microbenchmark models predict GPU performance with 1% error
Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures
-
ISA-level model defines safe behaviors for programmable caches
täkōFormal: Enabling Robust Software for Programmable Memory Hierarchies (Extended Version)
-
LIPPEN is a hardware-software co-design that encrypts the full 64-bit pointer in place
LIPPEN: A Lightweight In-Place Pointer Encryption Architecture for Pointer Integrity
-
SPEC CPU2026 increases instruction volume and cache pressure
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison
-
4-5 workloads preserve 96-99% of SPEC CPU2026 behavior
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison
-
FPGA BNN YOLO detector matches ONNX at 0.999964 correlation
Design and Implementation of BNN-Based Object Detection on FPGA
-
FPGA runs BNN object detector matching software at 0.999964 correlation
Design and Implementation of BNN-Based Object Detection on FPGA
-
Narrow final layer cuts LGN FPGA use by 28%
Resource Utilization of Differentiable Logic Gate Networks Deployed on FPGAs
-
Automated predecoders cut quantum decoder use by up to 4000 times
Mitigating Classical Resource Costs in Quantum Error Correction via Generalized qLDPC Predecoding
-
Beamspace low-rank preconditioner cuts CG iterations by two to three
Low-rank Preconditioning in Beamspace Domain For Massive MU-MIMO Long-Term Beamforming
-
MRDIMMs raise server memory bandwidth 41% with 30% energy savings
Performance and Energy Benefits of MRDIMMs
-
Single encoding unifies device
Cerberus: Cross-Layer ECC Co-Design for Robust and Efficient Memory Protection
-
Single encoding reused across DRAM ECC layers
Cerberus: Cross-Layer ECC Co-Design for Robust and Efficient Memory Protection
-
One NIC data path runs TCP and RoCE at line rate
A Protocol-Independent Transport Architecture
-
3D stacking cuts NCL circuit area by 44%
Monolithic 3D Integration for Null Convention Logic (NCL)-Based Asynchronous Circuits
-
The paper surveys neural architecture search methods through the lens of efficiency
HERCULES: Hardware-Efficient, Robust, Continual Learning Neural Architecture Search
-
The paper introduces ViM-Q, a co-design of quantization techniques and custom FPGA…
ViM-Q: Scalable Algorithm-Hardware Co-Design for Vision Mamba Model Inference on FPGA
-
SwiftChannel pairs a compressed deep learning model for reconstructing 5G channel…
SwiftChannel: Algorithm-Hardware Co-Design for Deep Learning-Based 5G Channel Estimation
-
RISC-V pipeline at 8 stages triples frequency and lifts throughput 71 percent
RV-IM100: Quantifying ISA Extension, Datapath Width, and Pipeline Depth Trade-offs in RISC-V Microarchitectures
-
IR-level register tweaks cut delay
PipeRTL: Timing-Aware Pipeline Optimization at IR-Level for RTL Generation
-
SPEC CPU 2026 standardizes mixed-workload CPU benchmarking
SPEC CPU: The Next Generation
-
FPGA accelerator speeds SVD for PCA 22x over GPU
MANOJAVAM: A Scalable, Unified FPGA Accelerator for Matrix Multiplication and Singular Value Decomposition in Principal Component Analysis
-
Gem5 call stacks reveal what stats miss in simulated CPUs
Understanding Simulated Architecture via gem5 Call-Stack Profiling
-
AMSnet-q converts schematic images of analog and mixed-signal circuits into a fully…
AMSnet-q: Unsupervised Circuit Identification and Performance Labeling for AMS Circuits
-
Blackwell NVENC UHQ gains quality at 400% latency cost
Evolution of NVENC Efficiency: A Longitudinal Analysis of HQ and UHQ Tuning Efficiency, Latency and Energy Trade-offs
-
Simulator models FlashAttention-3 pipelines to 5.7% error
Sim-FA: A GPGPU Simulator Framework for Fine-Grained FlashAttention Pipeline Analysis
-
Fixed-core approach yields 211x higher efficiency for edge GEMM
Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
-
Apple Silicon runs 80B LLMs at 23x Nvidia energy efficiency
Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference
-
Prototype chip runs 3B ternary LLM at 72 tokens per second
VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices
-
Subthreshold SRAM CIM hits 1181 TOPS/W for spiking networks
A PVT-Resilient Subthreshold SRAM-Based In-Memory Computing Accelerator with In-Situ Regulation for Energy-Efficient Spiking Neural Networks
-
DPU-GPU split cuts CNN latency up to 3.37 times versus GPU alone
DPU or GPU for Accelerating Neural Networks Inference -- Why not both? Split CNN Inference
-
AI trust can be measured via pillars and agentic interfaces
I hope we don't do to trust what advertising has done to love
-
AI trust needs pillars and vectors to stay meaningful
I hope we don't do to trust what advertising has done to love
-
Ring topology on FPGAs runs cortical circuit faster than real time
NeuroRing: Scaling Spiking Neural Networks via Multi-FPGA Bidirectional Ring Topologies and Stream-Dataflow Architectures
-
Affinity hints give 12% throughput boost on chiplet servers
Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale
-
Memory chips run matrix math at 14.9 GFLOP/s
AME-PIM: Can Memory be Your Next Tensor Accelerator?