archive
Every paper Pith has read.
493 papers in cs.AR · page 8
-
Einsum fusion cuts Mamba traffic for 4.9x prefill speedup
Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models
-
Matrix encoding speeds attention dataflow optimization by 64-343x
Fast Cross-Operator Optimization of Attention Dataflow
-
FPGA SNN accelerator scales inference near-linearly with sparsity
YANA: Bridging the Neuromorphic Simulation-to-Hardware Gap
-
Error-driven training puts 32B model at top of industrial code benchmarks
InCoder-32B-Thinking: Industrial Code World Model for Thinking
-
Graph coloring speeds SPICE up to 45x on 64 cores
EEspice: A Modular Circuit Simulation Platform with Parallel Device Model Evaluation via Graph Coloring
-
Multi-agent LLMs generate hardware assertions at 96% functional accuracy
ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs
-
SRAM reads attention scores from quantized KV indices without dequantizing
AXELRAM: Quantize Once, Never Dequantize
-
Shared memory speeds NF4 dequantization 2x
Fast NF4 Dequantization Kernels for Large Language Model Inference
-
Cold TLB misses slow small GPU collectives up to 1.4x
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
-
TensorBoard plugin surfaces hidden fairness gaps during training
InsightBoard: An Interactive Multi-Metric Visualization and Fairness Analysis Plugin for TensorBoard
-
3DGS blending reformulated for Tensor Cores yields 1.42x speedup
GEMM-GS: Accelerating 3D Gaussian Splatting on Tensor Cores with GEMM-Compatible Blending
-
Automated engines can design computer chips faster than human teams
Computer Architecture's AlphaZero Moment: Automated Discovery in an Encircled World
-
Fixed Edge AI loses reliability or breaks budgets as conditions change
Position Paper: From Edge AI to Adaptive Edge AI
-
Circuit generator hits 99.9% validity with 8 simulations
ARCS: Autoregressive Circuit Synthesis with Topology-Aware Graph Attention and Spec Conditioning
-
Switch-centric network speeds All-Reduce up to 8.7x in LLM inference
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
-
Lossless compressor speeds Ascend NPU inference up to 6.3x
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
-
NoC with direct core access speeds ML collectives 5.3x
A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators
-
Local ChatOps tool hits 0.90 precision on single-hop questions
RAGnaroX: A Secure, Local-Hosted ChatOps Assistant Using Small Language Models
-
Simulator verifies accelerator firmware 50x faster than FPGA
FireBridge: Cycle-Accurate Hardware + Firmware Co-Verification for Modern Accelerators
-
Review creates unified thermal model for 3D chip stacks
A Review of Multiscale Thermal Modeling in Heterogeneous 3D ICs
-
Ten general agents deliver 8× average HLS speedup
Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?
-
Exact formulas predict spurs from ADC mismatches
Spectral Impact of Mismatches in Interleaved ADCs
-
FPGA accelerator decodes quantum errors in under 1 microsecond
Low Latency GNN Accelerator for Quantum Error Correction
-
AI data centers raise local land temperatures by 2°C
The data heat island effect: quantifying the impact of AI data centers in a warming world
-
Updated Amdahl sets specialization threshold at 1-1/R
Modernizing Amdahl's Law: How AI Scaling Laws Shape Computer Architecture
-
COmPOSER automates mm-wave designs 100-300x faster
COmPOSER: Circuit Optimization of mm-wave/RF circuits with Performance-Oriented Synthesis for Efficient Realizations
-
CPU replays exact NVIDIA GPU matrix multiplies without precision loss
Hawkeye: Reproducing GPU-Level Non-Determinism
-
ML-KEM key exchange runs in 35.7 ms on M0+
Benchmarking NIST-Standardised ML-KEM and ML-DSA on ARM Cortex-M0+: Performance, Memory, and Energy on the RP2040
-
Hyperedges unify geometric algebra with compiler graphs
The Program Hypergraph: Multi-Way Relational Structure for Geometric Algebra, Spatial Compute, and Physics-Aware Compilation
-
Local hardware updates replace backpropagation for neural nets
A Synthesizable RTL Implementation of Predictive Coding Networks
-
Verilog vectorizer cuts Jasper elaboration time 28% and memory 51%
Vectorization of Verilog Designs and its Effects on Verification and Synthesis
-
LLM RTL generation splits into three quality regimes under synthesis
Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation: Quality, Reliability, and Failure Modes
-
Graph unifies netlist and layout to predict chip congestion early
VeriHGN: Heterogeneous Graph-Based Congestion Prediction for Chip Layout Verification
-
MSB proxy skips 88% of CNN multiplications with zero accuracy loss
Hardware Efficient Approximate Convolution with Tunable Error Tolerance for CNNs
-
Reasoning tree raises SVA functional correctness by 31 percent
FVRuleLearner: Operator-Level Reasoning Tree (OP-Tree)-Based Rules Learning for Formal Verification
-
Method localizes 51% of bugs at top rank in sequential hardware
Pecker: Bug Localization Framework for Sequential Designs via Causal Chain Reconstruction
-
Single RTD forms a THz radar that senses 5-micrometer displacements
Micrometer-scale displacement and thickness sensing using a single terahertz resonant-tunneling diode
-
TEE architecture secures continuous attestation against platform control
A TEE-Based Architecture for Confidential and Dependable Process Attestation in Authorship Verification
-
Softcore loads custom instructions from memory with no frequency overhead
LUTstructions: Self-loading FPGA-based Reconfigurable Instructions
-
SAM2 extracts accurate SEM contours from only 60 images
SegSEM: Enabling and Enhancing SAM2 for SEM Contour Extraction
-
Hybrid memory design runs full kernels for 59x AES and 40x LLM speedups
DARTH-PUM: A Hybrid Processing-Using-Memory Architecture
-
Optimal accelerator mappings found in 17 seconds
The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design
-
FFM finds optimal fused accelerator mappings over 10,000x faster
Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design
-
Near-memory GPU cuts energy use 6-13x while speeding AI tasks 6-16x
ABI: A tightly integrated, unified, sparsity-aware, reconfigurable, compute near-register file/cache GPU architecture with light-weight softmax for deep learning, linear algebra, and Ising compute
-
Offline LLM runs tutoring on legacy hardware without internet access
Offline-First LLM Architecture for Adaptive Learning in Low-Connectivity Environments
-
Bipartite graphs and grammar rules generate valid analog topologies automatically
AnalogToBi: Device-Level Analog Circuit Topology Generation via Bipartite Graph and Grammar Guided Decoding
-
D-Legion architecture reaches 135 TOPS for quantized LLM matrix math
D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs
-
On-the-fly predictor boosts FP8 CIM efficiency 2.8x
Balancing FP8 Computation Accuracy and Efficiency on Digital CIM via Shift-Aware On-the-fly Aligned-Mantissa Bitwidth Prediction
-
Verilog models show shared and model-specific prompt responses
VeriInteresting: An Empirical Study of Model Prompt Interactions in Verilog Code Generation
-
KANs reach sub-microsecond online learning on FPGAs via spline locality
Ultrafast On-chip Online Learning via Spline Locality in Kolmogorov-Arnold Networks