archive
Every paper Pith has read.
225 papers in cs.PF · page 2
-
LLMs automate FPGA accelerator design space exploration
LLM-Driven Design Space Exploration of FPGA-based Accelerators
-
Int4 KV cache outruns fp16 on Apple Silicon
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
-
Task category predicts LLM kernel success far better than generation method
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
-
Task category explains 3x more variance than method in LLM kernel correctness
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
-
Algebraic coarsening delivers 3x speedup in GPU contact solves
AGIPC: Adaptive In-Solve Algebraic Coarsening for GPU IPC
-
LLM agents turn GPU profiles into optimization advice
KEET: Explaining Performance of GPU Kernels Using LLM Agents
-
Light storage limits turn content-provider competition into a potential game
Decentralized Edge Caching under Budget and Storage Constraints: A Game-Theoretic Approach
-
SPEC CPU2026 increases instruction volume and cache pressure
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison
-
4-5 workloads preserve 96-99% of SPEC CPU2026 behavior
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison
-
GPU layer speeds exascale trace analysis by up to 314x
Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics
-
GPU speeds exascale trace analysis by 314 times
Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics
-
Same model name yields different speeds across hosted APIs
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
-
Same LLM name produces different services by host
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
-
Streaming top-k runs CSA indexer to 1M tokens on 6 GB
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
-
Two post-quantum signatures pass Australia's payment speed test
Post-Quantum Cryptography Migration in Australian Real-Time Payment Infrastructure: A Monte Carlo Simulation Study of the New Payments Platform
-
SPEC CPU 2026 standardizes mixed-workload CPU benchmarking
SPEC CPU: The Next Generation
-
Response time distributions derived for priority queues with preemption overhead
Priority Scheduling in the M/G/1 with Preemption Overhead
-
Compiler splits recursive datatypes into separate field buffers
SoCal: A Language for Memory-Layout Factorization of Recursive Datatypes
-
Fixed-core approach yields 211x higher efficiency for edge GEMM
Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
-
Apple Silicon runs 80B LLMs at 23x Nvidia energy efficiency
Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference
-
Workflow turns raw measurements into defensible ECE/CS results
How to Do Statistical Evaluations in ECE/CS Papers: A Practical Playbook for Defensible Results
-
Same model accuracy varies 12 points by endpoint
Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
-
C++ engine hits 33 million steps per second on POMDP tasks
A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations
-
Compiler automates sequence parallelism for 2.7x longer LLM contexts
AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
-
Watchpoint recovers full NVIDIA driver command streams
Revealing NVIDIA Closed-Source Driver Command Streams for CPU-GPU Runtime Behavior Insight
-
RAPL tools add up to 47% time overhead at 1 kHz polling
What Is the Cost of Energy Monitoring? An Empirical Study on the Overhead of RAPL-Based Tools
-
Agentic workflow turns PyTorch graphs into faster CUTLASS kernels
FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow
-
Dual-path KV offload cuts edge LLM latency up to 42%
DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
-
Fixed-input lock keeps Spark policy outputs identical under repartitioning
Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark
-
Reprofiling flows cuts bandwidth for delay guarantees in multi-hop nets
On the Benefits of Traffic "Reprofiling" -- The Multiple Hops Case -- Part II
-
Optimas automates GPU code optimization with 100% correctness
Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization
-
Two-block Hadamard rotations match uniform ones on coordinates but not overall
Approximating Uniform Random Rotations by Two-Block Structured Hadamard Rotations in High Dimensions
-
COMPASS cuts HPC job turnaround time by 66% with trace ML
COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC
-
Tool shows solar storms trigger Starlink orbit decay and 10 Mbps drops
CosmicDancePro -- Measuring LEO satellite's orbital decay and network connectivity implications during solar storms
-
Accelerators improve LLM speed on edge single-board computers
Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers
-
Top-K method speeds sparse decode 1.88x on Blackwell
Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation
-
Parallel task split makes large-scale NN search run at medium-scale cost
Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask
-
Server-driven adaptive sampling cuts wireless iBCI power by 40 mW
An Efficient Wireless iBCI Headstage with Adaptive ADC Sample Rate
-
SparKV cuts on-device LLM first-token time by 1.3x-5.1x
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
-
Joint optimizations cut multi-agent edge latency by 62 percent at 200 agents
A Delta-Aware Orchestration Framework for Scalable Multi-Agent Edge Computing
-
Slicing traces GPU stall roots for 1.8x speedups across vendors
LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing
-
CPU-GPU hybrid speeds long-context LLM inference 1.41x-3.2x
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
-
Lagrange heuristic lowers age of updates from mixed sensors
Lagrange Index based Scheduling for Minimizing Age of Updates from Heterogeneous Sources
-
Crash-aware tuner spends fixed budget more consistently on LLM serving
SLO-Guard: Crash-Aware, Budget-Consistent Autotuning for SLO-Constrained LLM Serving
-
Multi-tier KV cache cuts LLM inference costs by 47%
Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
-
Active inference learns edge AI routing without offline training
Active Inference-Based Adaptive Routing for Heterogeneous Edge AI Services
-
Branchable databases slow reads up to 4000x as agent branches deepen
BranchBench: Aligning Database Branching with Agentic Demands
-
Precision modeling cuts training time prediction error to 9.8 percent
Training Time Prediction for Mixed Precision-based Distributed Training
-
CPU optimizations boost 3D biomechanics pipeline 2.47x
CPU Optimization of a Monocular 3D Biomechanics Pipeline for Low-Resource Deployment
-
Ragged Paged Attention (RPA) introduced as an LLM inference kernel for TPU
Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU