archive
Every paper Pith has read.
493 papers in cs.AR · page 10
-
Specialized LLMs raise HLS debugging success by 32 percent
ChatHLS: Towards Systematic Design Automation and Optimization for High-Level Synthesis
-
Sparse NN linearizes RF amps on FPGA at 241 mW with -59 dBc ACPR
SparseDPD: A Sparse Neural Network-based Digital Predistortion FPGA Accelerator for RF Power Amplifier Linearization
-
Microcanonical annealing cuts random-number use in parallel spin-glass sims
Microcanonical simulated annealing: Massively parallel Monte Carlo simulations with sporadic random-number generation
-
RISC-V calibration lifts CIM compute SNR by 25-45 percent
Acore-CIM: build accurate and reliable mixed-signal CIM cores with RISC-V controlled self-calibration
-
System predicts lane changes 3-4 seconds ahead in real-world tests
Real-World Deployment of a Lane Change Prediction Architecture Based on Knowledge Graph Embeddings and Bayesian Inference
-
MLA cuts bandwidth use in attention and stabilizes hardware performance
Hardware-Centric Analysis of DeepSeek's Multi-Head Latent Attention
-
PIM co-design cuts energy and time for genomics workloads
Processing-in-memory for genomics workloads
-
GreenCache trims LLM carbon 15% by trading storage against compute
Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving
-
60k code pairs train models for 88% accurate CUDA to HIP translation
CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
-
Seamless switching boosts CPU LLM serving speed by 2x
Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving
-
Co-optimized Iceberg gadgets raise QAOA success from 44% to 65%
Iceberg Beyond the Tip: Co-Compilation of a Quantum Error Detection Code and a Quantum Algorithm
-
LLM automates UVM testbench creation for RTL designs
From Concept to Practice: an Automated LLM-aided UVM Machine for RTL Verification
-
Fusion-aware design speeds SSM accelerators 1.78x at fixed area
Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration
-
Simulator explores LLM configs without 40K cloud costs
MIST: A Co-Design Framework for Heterogeneous, Multi-Stage LLM Inference
-
Memristor arrays solve XOR-CNF SAT problems 10 times faster
Accelerating Hybrid XOR-CNF Boolean Satisfiability Problems Natively with In-Memory Computing
-
71.2 μW accelerator runs real-time speech recognition
A 71.2-μW Speech Recognition Accelerator with Recurrent Spiking Neural Network
-
Edge criteria halve MACs for 8K super-resolution at 30 FPS
ESSR: An 8K@30FPS Super-Resolution Accelerator With Edge Selective Network
-
Benchmark shows 51 percent area cut for 3D chip designs
Open3DBench: Open-Source Benchmark for 3D-IC Backend Implementation and PPA Evaluation
-
Hardware co-design checks all feasible QAP moves in one step
Hardware-Compatible Single-Shot Feasible-Space Heuristics for Solving the Quadratic Assignment Problem
-
90% of Linux radiation failures route through one eMMC path
Where Linux Breaks Under Radiation: A Cross-Architecture Kernel-Level Characterization of Proton-Induced Failures in COTS SoCs
-
Quantization method raises 4-bit SAM mAP 15.2% on COCO
AHCQ-SAM: Toward Accurate and Hardware-Compatible Post-Training Segment Anything Model Quantization
-
Taxonomy maps 25 years of FPGA neuromorphic architectures
A Quarter of a Century of Neuromorphic Architectures on FPGAs -- an Overview
-
Posits shrink wearable hardware 38% and cut power 42%
Increasing the Energy-Efficiency of Wearables Using Low-Precision Posit Arithmetic with PHEE
-
Framework enables any-cycle preemption for FPGA tasks in clouds
EPOCH: Enabling Preemption Operation for Context Saving in Heterogeneous FPGA Systems
-
Taylor softmax cuts FPGA resources 14% at 0.2% accuracy cost
A Quantitative Evaluation of Approximate Softmax Functions for Deep Neural Networks
-
Octopus sparse links save 3-5.4% server costs in CXL pods
Octopus: Enhancing CXL Memory Pods via Sparse Topology
-
Compiler aligns HE workloads with TPU matrix engines
Leveraging ASIC AI Chips for Homomorphic Encryption
-
Hybrid federated method boosts hotspot detection accuracy
Federated Knowledge Distillation for Multi-Model Architectures Lithography Hotspot Detection
-
Filter turns AI-generated PCIe traces into usable simulation data
The Phantom of PCIe: Constraining Generative Artificial Intelligences for Practical Peripherals Trace Synthesizing
-
Async pipeline training on analog hardware matches digital SGD rate
On the Convergence Theory of Pipeline Gradient-based Analog In-memory Training
-
SSD MobileNet V1 minimizes latency and energy but not accuracy on edge devices
A Comprehensive Evaluation of Deep Learning Object Detection Models on Heterogeneous Edge Devices
-
FPGA idle-waiting extends DL accelerator life 12x vs powering off
Idle is the New Sleep: Configuration-Aware Alternative to Powering Off FPGA-Based DL Accelerators During Inactivity
-
Two-level scheduler cuts quantum decoder hardware by 10-40%
Managing Classical Processing Requirements for Quantum Error Correction
-
Weight shuffling restores 83.5% accuracy in resistive crossbar DNNs
WAGONN: Weight Bit Agglomeration in Crossbar Arrays for Reduced Impact of Interconnect Resistance on DNN Inference Accuracy
-
Accelerator switches dataflows per layer at 6% extra area
FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching
-
SparrowSNN cuts ECG energy by 20-100x at full accuracy
SparrowSNN: A Hardware/software Co-design for Energy Efficient ECG Classification
-
Cache-coherent eFPGAs cut processor-accelerator latency by 82%
Duet: Creating Harmony between Processors and Embedded FPGAs
-
Hundreds of thousands of qubits needed for practical quantum advantage
Assessing requirements to scale to practical quantum advantage
-
QDI adder comparison in 32nm CMOS identifies low-power options
Performance Comparison of Quasi-Delay-Insensitive Asynchronous Adders
-
Memristor-CMOS multiplier reconfigures for multiple bit widths
Reconfigurable multiplier architecture based on memristor-cmos with higher flexibility
-
PPAC runs neural nets and crypto inside memory arrays
PPAC: A Versatile In-Memory Accelerator for Matrix-Vector-Product-Like Operations
-
RL scheduler adapts multicore memory access for 20% CPI gain
CADS: Core-Aware Dynamic Scheduler for Multicore Memory Controllers
-
History yields conditions for coprocessor long-term success
Coprocessors: failures and successes
-
RM-CAM plus TMR repairs NRAM defects with fewer resources at high error rates
A Range Matching CAM for Hierarchical Defect Tolerance Technique in NRAM Structures
-
RTL FPGA accelerator matches Caffe-CPU for CNN inference
FusionAccel: A General Re-configurable Deep Learning Inference Accelerator on FPGA for Convolutional Neural Networks
-
TicToc speeds hybrid memory 10% using 34KB SRAM
TicToc: Enabling Bandwidth-Efficient DRAM Caching for both Hits and Misses in Hybrid Memory Systems
-
One line per region tracks reuse to speed DRAM caches 18%
To Update or Not To Update?: Bandwidth-Efficient Intelligent Replacement Policies for DRAM Caches
-
Hardware scheduler delivers 12x speedup on accelerator systems
HTS: A Hardware Task Scheduler for Heterogeneous Systems
-
FPGA speeds Tucker decomposition up to 30x on heart MRI
Tucker Tensor Decomposition on FPGA
-
Bit-partitioned dot products share A/D converters via charge accumulation
Mixed-Signal Charge-Domain Acceleration of Deep Neural networks through Interleaved Bit-Partitioned Arithmetic