archive

Every paper Pith has read.

493 papers in cs.AR · page 8

cs.AR 2026-04-04 reviewed

Einsum fusion cuts Mamba traffic for 4.9x prefill speedup
Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models

Toluwanimi O. Odemuyiwa +3
cs.AR 2026-04-03 reviewed

Matrix encoding speeds attention dataflow optimization by 64-343x
Fast Cross-Operator Optimization of Attention Dataflow

Haodong Chang +7
cs.NE 2026-04-03 reviewed

FPGA SNN accelerator scales inference near-linearly with sparsity
YANA: Bridging the Neuromorphic Simulation-to-Hardware Gap

Brian Pachideh +7
cs.AR 2026-04-03 reviewed

Error-driven training puts 32B model at top of industrial code benchmarks
InCoder-32B-Thinking: Industrial Code World Model for Thinking

Jian Yang +24
cs.AR 2026-04-03 reviewed

Graph coloring speeds SPICE up to 45x on 64 cores
EEspice: A Modular Circuit Simulation Platform with Parallel Device Model Evaluation via Graph Coloring

Xuanhao Bao +1
cs.AR 2026-04-03 reviewed

Multi-agent LLMs generate hardware assertions at 96% functional accuracy
ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs

Lik Tung Fu +8
cs.LG 2026-04-03 reviewed

SRAM reads attention scores from quantized KV indices without dequantizing
AXELRAM: Quantize Once, Never Dequantize

Yasushi Nishida
cs.LG 2026-04-02 reviewed

Shared memory speeds NF4 dequantization 2x
Fast NF4 Dequantization Kernels for Large Language Model Inference

Xiangbo Qi +2
cs.DC 2026-04-02 reviewed

Cold TLB misses slow small GPU collectives up to 1.4x
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

Amel Fatima +2
cs.AR 2026-04-02 reviewed

TensorBoard plugin surfaces hidden fairness gaps during training
InsightBoard: An Interactive Multi-Metric Visualization and Fairness Analysis Plugin for TensorBoard

Ray Zeyao Chen +1
cs.AR 2026-04-02 reviewed

3DGS blending reformulated for Tensor Cores yields 1.42x speedup
GEMM-GS: Accelerating 3D Gaussian Splatting on Tensor Cores with GEMM-Compatible Blending

Haomin Li +6
cs.AR 2026-03-31 reviewed

Automated engines can design computer chips faster than human teams
Computer Architecture's AlphaZero Moment: Automated Discovery in an Encircled World

Karthikeyan Sankaralingam
cs.AR 2026-03-31 reviewed

Fixed Edge AI loses reliability or breaks budgets as conditions change
Position Paper: From Edge AI to Adaptive Edge AI

Fabrizio Pittorino +1
cs.LG 2026-03-30 reviewed

Circuit generator hits 99.9% validity with 8 simulations
ARCS: Autoregressive Circuit Synthesis with Topology-Aware Graph Attention and Spec Conditioning

Tushar Dhananjay Pathak
cs.AR 2026-03-30 reviewed

Switch-centric network speeds All-Reduce up to 8.7x in LLM inference
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

Aojie Jiang +6
cs.AR 2026-03-28 reviewed

Lossless compressor speeds Ascend NPU inference up to 6.3x
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

Jinwu Yang +19
cs.AR 2026-03-27 reviewed

NoC with direct core access speeds ML collectives 5.3x
A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators

Luca Colagrande +5
cs.AR 2026-03-27 reviewed

Local ChatOps tool hits 0.90 precision on single-hop questions
RAGnaroX: A Secure, Local-Hosted ChatOps Assistant Using Small Language Models

Benedikt Dornauer +1
cs.AR 2026-03-26 reviewed

Simulator verifies accelerator firmware 50x faster than FPGA
FireBridge: Cycle-Accurate Hardware + Firmware Co-Verification for Modern Accelerators

G Abarajithan +3
cs.AR 2026-03-26 reviewed

Review creates unified thermal model for 3D chip stacks
A Review of Multiscale Thermal Modeling in Heterogeneous 3D ICs

Baibhari Priya Barua +2
cs.AI 2026-03-26 reviewed

Ten general agents deliver 8× average HLS speedup
Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

Abhishek Bhandwaldar +3
eess.SP 2026-03-25 reviewed

Exact formulas predict spurs from ADC mismatches
Spectral Impact of Mismatches in Interleaved ADCs

Jérémy Guichemerre +3
quant-ph 2026-03-23 reviewed

FPGA accelerator decodes quantum errors in under 1 microsecond
Low Latency GNN Accelerator for Quantum Error Correction

Alessio Cicero +4
cs.CY 2026-03-21 reviewed

AI data centers raise local land temperatures by 2°C
The data heat island effect: quantifying the impact of AI data centers in a warming world

Andrea Marinoni +8
cs.DC 2026-03-21 reviewed

Updated Amdahl sets specialization threshold at 1-1/R
Modernizing Amdahl's Law: How AI Scaling Laws Shape Computer Architecture

Chien-Ping Lu
cs.AR 2026-03-20 reviewed

COmPOSER automates mm-wave designs 100-300x faster
COmPOSER: Circuit Optimization of mm-wave/RF circuits with Performance-Oriented Synthesis for Efficient Realizations

Subhadip Ghosh +6
cs.CR 2026-03-20 reviewed

CPU replays exact NVIDIA GPU matrix multiplies without precision loss
Hawkeye: Reproducing GPU-Level Non-Determinism

Erez Badash +3
cs.CR 2026-03-19 reviewed

ML-KEM key exchange runs in 35.7 ms on M0+
Benchmarking NIST-Standardised ML-KEM and ML-DSA on ARM Cortex-M0+: Performance, Memory, and Energy on the RP2040

Rojin Chhetri
cs.PL 2026-03-18 reviewed

Hyperedges unify geometric algebra with compiler graphs
The Program Hypergraph: Multi-Way Relational Structure for Geometric Algebra, Spatial Compute, and Physics-Aware Compilation

Houston Haynes
cs.NE 2026-03-18 reviewed

Local hardware updates replace backpropagation for neural nets
A Synthesizable RTL Implementation of Predictive Coding Networks

Timothy Oh
cs.PL 2026-03-17 reviewed

Verilog vectorizer cuts Jasper elaboration time 28% and memory 51%
Vectorization of Verilog Designs and its Effects on Verification and Synthesis

Maria Fernanda Oliveira Guimarães +6
cs.AR 2026-03-11 reviewed

LLM RTL generation splits into three quality regimes under synthesis
Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation: Quality, Reliability, and Failure Modes

Weimin Fu +7
cs.AR 2026-03-10 reviewed

Graph unifies netlist and layout to predict chip congestion early
VeriHGN: Heterogeneous Graph-Based Congestion Prediction for Chip Layout Verification

Runbang Hu +3
cs.LG 2026-03-10 reviewed

MSB proxy skips 88% of CNN multiplications with zero accuracy loss
Hardware Efficient Approximate Convolution with Tunable Error Tolerance for CNNs

Vishal Shashidhar +2
cs.AR 2026-03-06 reviewed

Reasoning tree raises SVA functional correctness by 31 percent
FVRuleLearner: Operator-Level Reasoning Tree (OP-Tree)-Based Rules Learning for Formal Verification

Lily Jiaxin Wan +5
cs.AR 2026-03-03 reviewed

Method localizes 51% of bugs at top rank in sequential hardware
Pecker: Bug Localization Framework for Sequential Designs via Causal Chain Reconstruction

Jiaping Tang +5
physics.optics 2026-02-27 reviewed

Single RTD acts as THz radar sensing 5-micrometer displacements
Micrometer-scale displacement and thickness sensing using a single terahertz resonant-tunneling diode

Li Yi +7
cs.CR 2026-02-26 reviewed

TEE architecture secures continuous attestation against platform control
A TEE-Based Architecture for Confidential and Dependable Process Attestation in Authorship Verification

David Condrey
cs.AR 2026-02-24 reviewed

Softcore loads custom instructions from memory with no frequency overhead
LUTstructions: Self-loading FPGA-based Reconfigurable Instructions

Philippos Papaphilippou
cs.AR 2026-02-24 reviewed

SAM2 extracts accurate SEM contours from only 60 images
SegSEM: Enabling and Enhancing SAM2 for SEM Contour Extraction

Da Chen +7
cs.AR 2026-02-17 reviewed

Hybrid memory design runs full kernels for 59x AES and 40x LLM speedups
DARTH-PUM: A Hybrid Processing-Using-Memory Architecture

Ryan Wong +2
cs.AR 2026-02-16 reviewed

Optimal accelerator mappings found in 17 seconds
The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design

Michael Gilbert +3
cs.AR 2026-02-16 reviewed

FFM finds optimal fused accelerator mappings over 10,000x faster
Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design

Tanner Andrulis +3
cs.AR 2026-02-15 reviewed

Near-memory GPU cuts energy use 6-13x while speeding AI tasks 6-16x
ABI: A tightly integrated, unified, sparsity-aware, reconfigurable, compute near-register file/cache GPU architecture with light-weight softmax for deep learning, linear algebra, and Ising compute

Siddhartha Raman Sundara Raman +1
cs.CY 2026-02-14 reviewed

Offline LLM runs tutoring on legacy hardware without a network connection
Offline-First LLM Architecture for Adaptive Learning in Low-Connectivity Environments

Joseph Walusimbi +3
cs.AR 2026-02-10 reviewed

Bipartite graphs and grammar rules generate valid analog topologies automatically
AnalogToBi: Device-Level Analog Circuit Topology Generation via Bipartite Graph and Grammar Guided Decoding

Seungmin Kim +3
cs.AR 2026-02-05 reviewed

D-Legion architecture reaches 135 TOPS for quantized LLM matrix math
D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs

Ahmed J. Abdelmaksoud +3
cs.AR 2026-02-05 reviewed

On-the-fly predictor boosts FP8 CIM efficiency 2.8x
Balancing FP8 Computation Accuracy and Efficiency on Digital CIM via Shift-Aware On-the-fly Aligned-Mantissa Bitwidth Prediction

Liang Zhao +6
cs.AR 2026-02-04 reviewed

Verilog models show shared and model-specific prompt responses
VeriInteresting: An Empirical Study of Model Prompt Interactions in Verilog Code Generation

Luca Collini +4
cs.AR 2026-02-02 reviewed

KANs reach sub-microsecond online learning on FPGAs via spline locality
Ultrafast On-chip Online Learning via Spline Locality in Kolmogorov-Arnold Networks

Duc Hoang +2