archive
Every paper Pith has read.
493 papers in cs.AR · page 6
-
Hierarchical sparsity speeds LLM attention 4.57 times
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
-
Genetic search finds shift-add CNNs for 33% faster TinyML on FPGA
Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition
-
Real traces show congestion from HPC collectives
Characterization of Real Communication Patterns and Congestion Dynamics in HPC Interconnection Networks
-
MemExplorer auto-designs memory for agentic NPUs
MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
-
MLIR unifies equivalence checking from algorithms to netlists
EquivFusion: Unifying Hardware Equivalence Checking from Algorithms to Netlists via MLIR
-
SRAM CIM accelerator hits 26.1 TOPS/W for attention
CIMple: Standard-cell SRAM-based CIM with LUT-based split softmax for attention acceleration
-
SRAM PUF with Hamming codes keeps IoT auth errors below 1%
Secure Authentication in Wireless IoT: Hamming Code Assisted SRAM PUF as Device Fingerprint
-
Specialized agents close hardware coverage with 4-13x fewer tokens
Understanding Inference-Time Token Allocation and Coverage Limits in Agentic Hardware Verification
-
Annealing step stabilizes LLM-generated RTL designs
HYPERHEURIST: A Simulated Annealing-Based Control Framework for LLM-Driven Code Generation in Optimized Hardware Design
-
Overmind hits 8.1 TOPS/W on neuro-symbolic workloads
Overmind NSA: A Unified Neuro-Symbolic Computing Architecture with Approximate Nonlinear Activations and Preemptive Memory Bypass
-
LLM agent closes hardware coverage gaps automatically
Spec2Cov: An Agentic Framework for Code Coverage Closure of Digital Hardware Designs
-
LLM agent reaches 100% hardware coverage on simple designs
Spec2Cov: An Agentic Framework for Code Coverage Closure of Digital Hardware Designs
-
Symmetric grids lift photonic AI use by 6x
Towards Topology-Aware Very Large-Scale Photonic AI Accelerators
-
Rack storage tames millisecond GPU power swings
EasyRider: Mitigating Power Transients in Datacenter-Scale Training Workloads
-
Microcontroller fixes timing for real-time photoacoustic imaging
Democratization of Real-time Multi-Spectral Photoacoustic Imaging: Open-Sourced System Architecture for OPOTEK Phocus & Verasonics Vantage Combination
-
SCENIC hits 200G SmartNIC speed with programmable stream units
SCENIC: Stream Computation-Enhanced SmartNIC
-
LLM agents evolve the ABC synthesis tool to higher QoR
Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC
-
Agentic AI improves RTL timing by 21 percent on real designs
Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement
-
CRONet runs fully on-chip on AIE-ML for 2.49x latency gain
Accelerating CRONet on AMD Versal AIE-ML Engines
-
Unary encoding boosts parallelism in photonic tensor cores
Scaling Photonic Tensor Cores with Unary and Homodyne Designs
-
Multi-agent testbenches match SOTA Verilog generation with less data
Exploring LLM-based Verilog Code Generation with Data-Efficient Fine-Tuning and Testbench Automation
-
MoE serving gains 6.6x speedup via elastic self-speculation on 3D stacks
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
-
L4 GPU delivers up to 4.4x inference throughput over T4
DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance
-
Knowledge graph guides LLMs to build correct RISC-V hardware
VeriGraphi: A Multi-Agent Framework of Hierarchical RTL Generation for Large Hardware Designs
-
Chiplet tasks cut LLM decode latency on multi-die GPUs
Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs
-
Embeddings detect line-level CWEs in Verilog at 89% precision
VeriCWEty: Embedding enabled Line-Level CWE Detection in Verilog
-
ASIC emulates oscillators to solve max-cut and coloring at 97-100% accuracy
An ASIC Emulated Oscillator Ising/Potts Machine Solving Combinatorial Optimization Problems
-
Memory stack runs full matrix math inside the chip
GEM3D CIM General Purpose Matrix Computation Using 3D Integrated SRAM eDRAM Hybrid Compute In Memory on Memory Architecture
-
LSTM accelerator spots gait issues 4x faster on tiny ASIC
Cross-Layer Co-Optimized LSTM Accelerator for Real-Time Gait Analysis
-
Pipeline lifts bit-level accelerator code to tensor ISA specs
ATLAAS: Automatic Tensor-Level Abstraction of Accelerator Semantics
-
Full biosignal model tuning runs under 50mW on edge chips
BioTrain: Sub-MB, Sub-50mW On-Device Fine-Tuning for Edge-AI on Biosignals
-
Hardware unit reorganizes data on the fly for ideal CPU cache locality
Tensor Memory Engine: On-the-fly Data Reorganization for Ideal Locality
-
TCL tunes tensor programs 16x faster across CPU and GPU
TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning
-
EPAC RISC-V chip with three tiles taped out in 22nm
EPAC: The Last Dance
-
CODO compiler speeds FPGA dataflow designs up to 33x on DNNs
CODO: An Automated Compiler for Comprehensive Dataflow Optimization
-
Passive optical elements classify images by embedded phase patterns
Photonic AI: A Hybrid Diffractive Holographic Neural System for Passive Optical Real-Time Image Classification
-
Hadamard patterns cut RRAM read noise impact in neural nets
HARP: Hadamard-Domain Write-and-Verify for Noise-Robust RRAM Programming
-
Compiler cuts NPU transformer energy use by up to 41%
Forge-UGC: FX optimization and register-graph engine for universal graph compiler
-
Reference-based replication creates AI agents in constant time
Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents
-
Imitation learning yields thermal-safe LFM schedules on 3D many-cores
Active Imitation Learning for Thermal- and Kernel-Aware LFM Inference on 3D S-NUCA Many-Cores
-
Decoupled matrix units deliver up to 2.31x AI speedups on CPUs
CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead
-
Neural model sequences shape operations for better mask correction
MorphOPC: Advancing Mask Optimization with Multi-scale Hierarchical Morphological Learning
-
CIM design runs 1B-4B models at 336 tokens/s with 49x energy gain
EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models
-
New dataset trains ML models on 61k chip layout windows for capacitance
CapBench: A Multi-PDK Dataset for Machine-Learning-Based Post-Layout Capacitance Extraction
-
High-bandwidth storage enables interactive 13B model inference on mobiles
Technology solutions targeting the performance of gen-AI inference in resource constrained platforms
-
Specialized LLM matches syntax but raises SVA semantic accuracy by 23 points
Automated SVA Generation with LLMs
-
Pulse sequence moves Rydberg excitation for remote CZ gates
Compiler Framework for Directional Transport in Zoned Neutral Atom Systems with AOD Assistance: A Hybrid Remote CZ Approach
-
Heterogeneous PIM chiplet speeds graph DP 42x over GPU
GEN-Graph: Heterogeneous PIM Accelerator for General Computational Patterns in Graph-based Dynamic Programming
-
Optimal AI accelerator shifts with batch size and model scale
The xPU-athalon: Quantifying the Competition of AI Acceleration
-
Photonics scales AI past transistor density limits
Harnessing Photonics for Machine Intelligence