archive

Every paper Pith has read. Search by title, abstract, or pith.

493 papers in cs.AR · page 4

  1. cs.AR 2026-04-30 reviewed
    Grammar masking creates scalable benchmarks for RTL code completion

    RuC: HDL-Agnostic Rule Completion Benchmark Generation

    Arnau Ayguadé Domingo +7

  2. cs.AR 2026-04-30 reviewed
    Hybrid engine generates UVM testbenches via LLM plans and fixed templates

    HAVEN: Hybrid Automated Verification ENgine for UVM Testbench Synthesis with LLMs

    Chang-Chih Meng +5

  3. cs.AR 2026-04-30 reviewed
    Type recovery lifts 99.98% of GPU binaries to LLVM IR

    CuLifter: Lifting GPU Binaries to Typed IR

    Jisheng Zhao +4

  4. cs.AR 2026-04-30 reviewed
    Ternary LLM accelerator hits 70 tokens/s in 0.223 mm² chip

    VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

    Zi-Wei Lin +1

  5. cs.AR 2026-04-30 reviewed
    RCW scheme cuts LLM prefill latency nearly in half on digital CIM

    RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write

    Yan-Cheng Guo +2

  6. cs.AR 2026-04-30 reviewed
    Agents convert DRAM specs to formal DRAMPyML

    Autoformalizing Memory Specifications with Agents

    Jan Ole Ernst +8

  7. cs.CR 2026-04-29 reviewed
    SafeTune filters poisoned RTL training data for secure LLM fine-tuning

    SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation

    Mahshid Rezakhani +3

  8. eess.SP 2026-04-29 reviewed
    This paper reviews recent advancements in mm-wave oscillators below 100 GHz and…

    Recent Advances in mm-Wave and Sub-THz/THz Oscillators for FutureG Technologies

    Baktash Behmanesh +1

  9. cs.AR 2026-04-29 reviewed
    The paper introduces Voxel, a compiler-aware simulation framework for studying the…

    Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel

    Yiqi Liu +4

  10. cs.AR 2026-04-29 reviewed
    More dense PEs outperform sparse hardware for pruned networks

    Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators

    Hyunsung Yoon +2

  11. cs.AR 2026-04-29 reviewed
    V&V loop unifies UVM, FPGA and CI/CD for RISC-V chips

    Verification and Validation (V&V)-in-the-Loop for RISC-V Design: The Holistic Vision of BZL

    Sajjad Ahmed +23

  12. cs.AR 2026-04-29 reviewed
    EMiX emulates 64-core RISC-V across eight FPGAs

    EMiX: Emulating Beyond Single-FPGA Limits

    Alexander Kropotov +2

  13. cs.DC 2026-04-29 reviewed
    Pipelined sharding speeds client xLM inference up to 30x with 10x less VRAM

    Efficient, VRAM-Constrained xLM Inference on Clients

    Aditya Ukarande +3

  14. cs.AR 2026-04-28 reviewed
    The paper proposes RKHS, a method that combines retrieval-augmented generation with…

    RAG-Enhanced Kernel-Based Heuristic Synthesis (RKHS): A Structured Methodology Using Large Language Models for Hardware Design

    Shiva Ahir +1

  15. cs.AR 2026-04-28 reviewed
    Memory-centric chiplets cut attention latency 15 times

    AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

    Zhongkai Yu +11

  16. cs.AR 2026-04-28 reviewed
    FPGA CNN classifies heart vibrations at 8.55 mW

    At the Edge of the Heart: ULP FPGA-Based CNN for On-Device Cardiac Feature Extraction in Smart Health Sensors for Astronauts

    Kazi Mohammad Abidur Rahman +4

  17. cs.AR 2026-04-28 reviewed
    Randomness in BP decoding boosts quantum accuracy by 2-8 orders of magnitude

    Lottery BP: Unlocking Quantum Error Decoding at Scale

    Yanzhang Zhu +4

  18. cs.AR 2026-04-28 reviewed
    3D NAND flash runs LLM feed-forward math for 38x edge speedup

    NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

    Mingbo Hao +6

  19. cs.AR 2026-04-28 reviewed
    4-state MTJ crossbar runs MNIST inference at 94.48% accuracy

    Multibit neural inference in a N-ary crossbar architecture

    Anatole Moureaux +2

  20. quant-ph 2026-04-28 reviewed
    Scheduler boosts FTQC multiprogramming by 3.1x

    No Tile Left Behind: Multiprogramming for Surface-Code Architectures

    Archisman Ghosh +2

  21. cs.AR 2026-04-28 reviewed
    Adaptive windows speed multi-macro CIM CNN mapping 1.3x

    TetrisG-SDK: Efficient Convolutional Layer Mapping with Adaptive Windows and Grouped Convolutions for Fast In-Memory Computing

    Ke Dong +3

  22. cs.AR 2026-04-28 reviewed
    Frequency remapping cuts recsys inference latency 81% on flash

    RecFlash: Fast Recommendation System on In-Storage Computing with Frequency-Based Data Mapping

    Jangho Baik +4

  23. cs.AR 2026-04-28 reviewed
    Mobile NPU-PIM design speeds LLM drafts 4.2 times

    AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices

    Ma Zirui +6

  24. cs.AR 2026-04-28 reviewed
    FusionCIM cuts LLM energy use by up to 3.86x

    FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture

    Zihao Xuan +6

  25. cs.AR 2026-04-28 reviewed
    RL matches expert chip placement by learning rewards from final layouts

    How Can Reinforcement Learning Achieve Expert-level Placement?

    Ruo-Tong Chen +9

  26. cs.AR 2026-04-28 reviewed
    LUT accelerators deliver 2.2x area cut for 1.58-bit LLMs

    Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference

    Robin Geens +3

  27. cs.AI 2026-04-28 reviewed
    LLM agent evolves cache policies that beat LRU by 6 percent IPC

    Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization

    Alexander Blasberg +2

  28. cs.AR 2026-04-27 reviewed
    Signature tree skips repeated checks to speed FPGA packing 3.7x on average

    Déjà Vu Packing: Optimizing FPGA Logic Clustering Runtime via Pattern Memoization

    Milo Liebster +2

  29. cs.AR 2026-04-27 reviewed
    ASIC accelerator achieves 3.5x throughput for long-context attention

    Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

    Wang Fan +7

  30. cs.AR 2026-04-27 reviewed
    VTA compiler now automates large CNN compilation

    Compilation and Execution of an Embeddable YOLO-NAS on the VTA

    Anthony Faure-Gignoux +3

  31. cs.CR 2026-04-27 reviewed
    RowHammer refreshes drop 95-99.99% with vulnerability tracking

    RowHammer Vulnerability Counter (RVC): Redefining RowHammer Detection with Victim-Centric Tracking

    Lavi Jain +1

  32. cs.AR 2026-04-27 reviewed
    Atomic coherence lets 3D CNNs classify video at 125000 fps

    Opto-Atomic Spatio-Temporal Holographic Correlators for High-Speed 3D CNNs

    Xi Shen +3

  33. cs.AR 2026-04-26 reviewed
    Edge AI can pass accuracy checks but fail timing on shared hardware

    Architectural Isolation as a Timing Safety Primitive for Edge AI Medical Devices: Controlled Experimental Evidence on a Shared-Silicon Platform

    Akul Mallayya Swami

  34. cs.AR 2026-04-26 reviewed
    Flow matching produces overlap-free chip placements 10-50x faster

    FlowPlace: Flow Matching for Chip Placement

    Peng Xie +8

  35. cs.AR 2026-04-26 reviewed
    Exact normalization preserved in 14x smaller Softmax and LayerNorm

    Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices

    Dawon Choi +2

  36. cs.AR 2026-04-26 reviewed
    Retrieval-augmented LLM forecasts timing slacks from Verilog

    TimingLLM: A Two-Stage Retrieval-Augmented Framework for Pre-Synthesis Timing Prediction from Verilog

    Armin Abdollahi +3

  37. cs.AR 2026-04-26 reviewed
    Mixed-radix CORDIC cuts FPGA sigmoid to 835 slices

    Hardware-Efficient FPGA Implementation of Sigmoid Function Using Mixed-Radix Hyperbolic Rotation CORDIC

    Chintan Panchal +2

  38. cs.LG 2026-04-25 reviewed
    The paper describes a hybrid runtime that mixes Just-In-Time compilation with CUDA Graphs…

    Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference

    Divakar Kumar Yadav +1

  39. cs.LG 2026-04-25 reviewed
    CuTile hits 1007 TFLOP/s attention on B200 in 60 Python lines

    Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

    Divakar Kumar Yadav +2

  40. cs.CR 2026-04-25 reviewed
    Tessera hides AES decryption behind DRAM fetches on edge chips

    Tessera: Secure, Near-Line-Rate Weight Streaming for UMA Edge Accelerators

    Animan Naskar

  41. cs.LG 2026-04-25 reviewed
    Cosine similarity and NAS tested for vector-quantized model compression

    Efficient VQ-QAT and Mixed Vector/Linear quantized Neural Networks

    Terry Gou +1

  42. cs.LG 2026-04-25 reviewed
    Activation patterns cut multi-node MoE communication up to 20x

    Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

    Abhimanyu Bambhaniya +8

  43. cs.ET 2026-04-25 reviewed
    MTJ memory integrates stochastic computing to skip external random generators

    Maximizing Memory-Level Parallelism via Integrated Stochastic Logic-in-Memory Architectures

    Farzad Razi +4

  44. cs.CR 2026-04-25 reviewed
    Agent with solver feedback achieves 82% correct hardware assertions

    From Language to Logic: Bridging LLMs & Formal Representations for RTL Assertion Generation

    Nowfel Mashnoor +2

  45. cs.AR 2026-04-24 reviewed
    Accelerators improve LLM speed on edge single-board computers

    Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers

    Harri Renney +3

  46. cs.DC 2026-04-24 reviewed
    MPS gains or loses 30% in GPU sharing depending on memory contention

    A comprehensive evaluation of spatial co-execution on GPUs using MPS and MIG technologies

    Jorge Villarrubia +3

  47. cs.AR 2026-04-24 reviewed
    Vector processor optimizations yield 1.33x speedup without extra bandwidth

    Microarchitectural Co-Optimization for Sustained Throughput of RISC-V Multi-Lane Chaining Vector Processors

    Weiying Wang +1

  48. cs.DC 2026-04-24 reviewed
    Top-K method speeds sparse decode 1.88x on Blackwell

    Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

    Long Cheng +9

  49. cs.AR 2026-04-24 reviewed
    Polyhedral analysis uncovers hidden mmuls for CGRA speedups up to 9.1x

    Exploiting pre-optimized kernels with polyhedral transformations for CGRA compilation

    Yuxuan Wang +5

  50. cs.AR 2026-04-24 reviewed
    HGQ-LUT trains LUT neural nets 100x faster on GPUs

    HGQ-LUT: Fast LUT-Aware Training and Efficient Architectures for DNN Inference

    Chang Sun +6