pith.

archive

Every paper Pith has read. Search by title, abstract, or pith.

225 papers in cs.PF · page 2

  1. cs.AR 2026-05-07 reviewed
    LLMs automate FPGA accelerator design space exploration

    LLM-Driven Design Space Exploration of FPGA-based Accelerators

    Vinamra Sharma +3

  2. cs.PF 2026-05-07 reviewed
    Int4 KV cache outruns fp16 on Apple Silicon

    When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

    Mohamed Amine Bergach

  3. cs.LG 2026-05-06 reviewed
    Task category predicts LLM kernel success far better than generation method

    KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

    Han Wang +5

  4. cs.LG 2026-05-06 reviewed
    Task category explains 3x more variance than method in LLM kernel correctness

    KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

    Han Wang +5

  5. cs.GR 2026-05-06 reviewed
    Algebraic coarsening delivers 3x speedup in GPU contact solves

    AGIPC: Adaptive In-Solve Algebraic Coarsening for GPU IPC

    Xuan Wang +4

  6. cs.PF 2026-05-06 reviewed
    LLM agents turn GPU profiles into optimization advice

    KEET: Explaining Performance of GPU Kernels Using LLM Agents

    Joshua H. Davis +7

  7. cs.GT 2026-05-05 reviewed
    Light storage limits turn content-provider competition into a potential game

    Decentralized Edge Caching under Budget and Storage Constraints: A Game-Theoretic Approach

    Hamta Sedghani +3

  8. cs.AR 2026-05-05 reviewed
    SPEC CPU2026 increases instruction volume and cache pressure

    SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison

    RuiHao Li +3

  9. cs.AR 2026-05-05 reviewed
    4-5 workloads preserve 96-99% of SPEC CPU2026 behavior

    SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison

    RuiHao Li +3

  10. cs.DC 2026-05-05 reviewed
    GPU layer speeds exascale trace analysis by up to 314x

    Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics

    Dragana Grbic +1

  11. cs.DC 2026-05-05 reviewed
    GPU speeds exascale trace analysis by 314 times

    Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics

    Dragana Grbic +1

  12. cs.PF 2026-05-04 reviewed
    Same model name yields different speed

    When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    Haorui Li +9

  13. cs.PF 2026-05-04 reviewed
    Same LLM name produces different services by host

    When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    Haorui Li +9

  14. cs.LG 2026-05-04 reviewed
    Streaming top-k runs CSA indexer to 1M tokens on 6 GB

    StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    Jaber Jaber +1

  15. cs.CR 2026-05-04 reviewed
    Two post-quantum signatures pass Australia's payment speed test

    Post-Quantum Cryptography Migration in Australian Real-Time Payment Infrastructure: A Monte Carlo Simulation Study of the New Payments Platform

    Nazmus Salehin Sammo

  16. cs.PF 2026-05-02 reviewed
    SPEC CPU 2026 standardizes mixed-workload CPU benchmarking

    SPEC CPU: The Next Generation

    Mahesh Madhav +33

  17. cs.PF 2026-05-02 reviewed
    Response time distributions derived for priority queues with preemption overhead

    Priority Scheduling in the M/G/1 with Preemption Overhead

    Shefali Ramakrishna +2

  18. cs.PL 2026-05-01 reviewed
    Compiler splits recursive datatypes into separate field buffers

    SoCal: A Language for Memory-Layout Factorization of Recursive Datatypes

    Vidush Singhal +5

  19. cs.DC 2026-05-01 reviewed
    Fixed-core approach yields 211x higher efficiency for edge GEMM

    Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

    M. Grailoo +1

  20. cs.PF 2026-05-01 reviewed
    Apple Silicon runs 80B LLMs at 23x Nvidia energy efficiency

    Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

    Abdurrahman Javat +1

  21. stat.ME 2026-05-01 reviewed
    Workflow turns raw measurements into defensible ECE/CS results

    How to Do Statistical Evaluations in ECE/CS Papers: A Practical Playbook for Defensible Results

    Bhaskar Krishnamachari

  22. cs.AI 2026-05-01 reviewed
    Same model accuracy varies 12 points by endpoint

    Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

    Yuxuan Gao +2

  23. cs.MA 2026-04-29 reviewed
    C++ engine hits 33 million steps per second on POMDP tasks

    A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations

    Timothy Flavin +1

  24. cs.LG 2026-04-29 reviewed
    Compiler automates sequence parallelism for 2.7x longer LLM contexts

    AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism

    Ahan Gupta +5

  25. cs.PF 2026-04-29 reviewed
    Watchpoint recovers full NVIDIA driver command streams

    Revealing NVIDIA Closed-Source Driver Command Streams for CPU-GPU Runtime Behavior Insight

    Yuang Yan +2

  26. cs.SE 2026-04-29 reviewed
    RAPL tools add up to 47% time overhead at 1 kHz polling

    What Is the Cost of Energy Monitoring? An Empirical Study on the Overhead of RAPL-Based Tools

    Jeremy Diamond +1

  27. cs.DC 2026-04-29 reviewed
    Agentic workflow turns PyTorch graphs into faster CUTLASS kernels

    FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

    Sina Heidari +1

  28. cs.DC 2026-04-29 reviewed
    Dual-path KV offload cuts edge LLM latency up to 42%

    DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

    Bodon Jeong +6

  29. cs.DC 2026-04-27 reviewed
    Fixed-input lock keeps Spark policy outputs identical under repartitioning

    Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark

    Zeyu Bai

  30. cs.NI 2026-04-27 reviewed
    Reprofiling flows cuts bandwidth for delay guarantees in multi-hop nets

    On the Benefits of Traffic "Reprofiling" -- The Multiple Hops Case -- Part II

    Jiaming Qiu +1

  31. cs.PF 2026-04-26 reviewed
    Optimas automates GPU code optimization with 100% correctness

    Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization

    Mohammad Zaeed +2

  32. cs.LG 2026-04-25 reviewed
    Two-block Hadamard rotations match uniform ones on coordinates but not overall

    Approximating Uniform Random Rotations by Two-Block Structured Hadamard Rotations in High Dimensions

    Tomer Zilca +1

  33. cs.PF 2026-04-24 reviewed
    COMPASS cuts HPC job turnaround time by 66% with trace ML

    COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC

    Ankur Lahiry +4

  34. astro-ph.IM 2026-04-24 reviewed
    Tool shows solar storms trigger Starlink orbit decay and 10 Mbps drops

    CosmicDancePro -- Measuring LEO satellite's orbital decay and network connectivity implications during solar storms

    Suvam Basak +2

  35. cs.AR 2026-04-24 reviewed
    Accelerators improve LLM speed on edge single-board computers

    Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers

    Harri Renney +3

  36. cs.DC 2026-04-24 reviewed
    Top-K method speeds sparse decode 1.88x on Blackwell

    Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

    Long Cheng +9

  37. cs.LG 2026-04-23 reviewed
    Parallel task split makes large-scale NN search run at medium-scale cost

    Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask

    Ashley N. Abraham +4

  38. cs.NI 2026-04-23 reviewed
    Server-driven adaptive sampling cuts wireless iBCI power by 40 mW

    An Efficient Wireless iBCI Headstage with Adaptive ADC Sample Rate

    Hongyao Liu +3

  39. cs.NI 2026-04-23 reviewed
    SparKV cuts on-device LLM first-token time by 1.3x-5.1x

    SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

    Hongyao Liu +3

  40. cs.LG 2026-04-22 reviewed
    Joint optimizations cut multi-agent edge latency by 62 percent at 200 agents

    A Delta-Aware Orchestration Framework for Scalable Multi-Agent Edge Computing

    Samaresh Kumar Singh +1

  41. cs.DC 2026-04-21 reviewed
    Slicing traces GPU stall roots for 1.8x speedups across vendors

    LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing

    Yuning Xia +1

  42. cs.PF 2026-04-20 reviewed
    CPU-GPU hybrid speeds long-context LLM inference 1.41x-3.2x

    HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing

    Mao Lin +4

  43. cs.NI 2026-04-20 reviewed
    Lagrange heuristic lowers age of updates from mixed sensors

    Lagrange Index based Scheduling for Minimizing Age of Updates from Heterogeneous Sources

    Aniket Mukherjee +2

  44. cs.LG 2026-04-19 reviewed
    Crash-aware tuner spends fixed budget more consistently on LLM serving

    SLO-Guard: Crash-Aware, Budget-Consistent Autotuning for SLO-Constrained LLM Serving

    Christian Lysenstøen

  45. cs.AR 2026-04-19 reviewed
    Multi-tier KV cache cuts LLM inference costs by 47%

    Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

    Sanjeev Rao Ganjihal

  46. cs.DC 2026-04-19 reviewed
    Active inference learns edge AI routing without offline training

    Active Inference-Based Adaptive Routing for Heterogeneous Edge AI Services

    Zihang Wang +2

  47. cs.DB 2026-04-19 reviewed
    Branchable databases slow reads up to 4000x as agent branches deepen

    BranchBench: Aligning Database Branching with Agentic Demands

    Elaine Ang +5

  48. cs.LG 2026-04-17 reviewed
    Precision modeling cuts training time prediction error to 9.8 percent

    Training Time Prediction for Mixed Precision-based Distributed Training

    Minchul Kang +7

  49. cs.CV 2026-04-17 reviewed
    CPU optimizations boost 3D biomechanics pipeline 2.47x

    CPU Optimization of a Monocular 3D Biomechanics Pipeline for Low-Resource Deployment

    Yan Zhang +1

  50. cs.PF 2026-04-16 reviewed
    Ragged Paged Attention (RPA) is a flexible, high-performance LLM inference kernel for TPU

    Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

    Jevin Jiang +4