archive

Every paper Pith has read.

225 papers in cs.PF · page 3

cs.DC 2026-04-16 reviewed

Block placement and cache rules cut LLM serving latency
Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving

Tingyang Sun +2
cs.PF 2026-04-16 reviewed

L4 GPU delivers up to 4.4x inference throughput over T4
DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance

Kathiravan Palaniappan
eess.SY 2026-04-15 reviewed

State-based scheduler maps full polytope of feasible worst-case schedules
Exploiting Scheduling Flexibility via State-Based Scheduling When Guaranteeing Worst-Case Services

Yike Xu +1
cs.PL 2026-04-14 reviewed

Virtual machine speeds array programs 147x on GPUs
Towards a Linear-Algebraic Hypervisor

Breandan Considine
cs.NI 2026-04-14 reviewed

IPFS achieves 70% success on decentralized NAT traversal
Large-Scale Measurement of NAT Traversal for the Decentralized Web: A Case Study of DCUtR in IPFS

Dennis Trautwein +4
cs.CR 2026-04-13 reviewed

Sparse FHE matmul on GPUs runs up to 3x faster than CPU
GPU Acceleration of Sparse Fully Homomorphic Encrypted DNNs

Lara D'Agata +9
quant-ph 2026-04-13 reviewed

Transpiler maps OpenQASM 3.0 dynamic circuits to CUDA-Q kernels
Efficient Transpilation of OpenQASM 3.0 Dynamic Circuits to CUDA-Q: Performance and Expressiveness Advantages

Vinooth Kulkarni +6
cs.PF 2026-04-13 reviewed

H200 outperforms H100 for memory-bound tasks when power-capped
Architectural Trade-offs in the Energy-Efficient Era: A Comparative Study of power-capping NVIDIA H100 and H200

Aditya Ujeniya +3
cs.DC 2026-04-13 reviewed

Hierarchical search tunes GPU apps better and faster
Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

Daniel Nichols +5
physics.flu-dyn 2026-04-13 reviewed

Julia model for particle flows hits 18x GPU speedup
LCS.jl: A High-Performance, Multi-Platform Computational Model in Julia for Turbulent Particle-Laden Flows

Taketo Tominaga (Institute of Science Tokyo) +1
eess.SY 2026-04-12 reviewed

AI workload mix smooths power variability but keeps fast ramps
Workload composition smooths aggregate power demand while sustaining short-horizon ramps in AI data centers

Subir Majumder +2
physics.app-ph 2026-04-12 reviewed

Adaptive beta tuning curbs dominance in AI resource allocation
Computable Fairness: Boltzmann-Softmax Control for AI Resource Allocation

Ji-Won Park +1
cs.LG 2026-04-12 reviewed

MoEITS prunes experts in LLMs to reduce compute while preserving accuracy
MoEITS: A Green AI approach for simplifying MoE-LLMs

Luis Balderas +2
cs.PF 2026-04-11 reviewed

Wave-aware model picks near-optimal GPU kernel settings fast
WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning

Kaixuan Zhang +8
cs.PF 2026-04-11 reviewed

Mosaic clusters KVCache for faster streaming video VLMs
Mosaic: Cross-Modal Clustering for Efficient Video Understanding

Tuowei Wang +4
cs.DC 2026-04-09 reviewed

Energy-efficient GPUs deliver better value under budget limits
Wattlytics: A Web Platform for Co-Optimizing Performance, Energy, and TCO in HPC Clusters

Ayesha Afzal +2
cs.DC 2026-04-08 reviewed

CPU-free LLM serving cuts P99 latency up to 8x
Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC

Mohammad Siavashi +4
cs.DC 2026-04-08 reviewed

Client scheduler hits 100% LLM deadlines at 4.2 requests per second
Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale

Renzhong Yuan +5
cs.DC 2026-04-07 reviewed

Go runtime outperforms Python and Node.js for OpenFaaS on Kubernetes
Optimizing OpenFaaS on Kubernetes: Comparative Analysis of Language Runtimes and Cluster Distributions

Ehsan Ataie +2
cs.PF 2026-04-07 reviewed

PTE metric predicts LLM tool-use latency better than token counts
Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

Qisheng Su +5
cs.PL 2026-04-06 reviewed

AutoLALA produces symbolic reuse-distance formulas for loop nests
AutoLALA: Automatic Loop Algebraic Locality Analysis for AI and HPC Kernels

Yifan Zhu +3
cs.AI 2026-04-06 reviewed

Three metrics separate AI adaptation from data shifts
Learning, Potential, and Retention: An Approach for Evaluating Adaptive AI-Enabled Medical Devices

Alexis Burgon +4
cs.DC 2026-04-06 reviewed

Execution-idle wastes 10.7% of GPU cluster energy
The Energy Cost of Execution-Idle in GPU Clusters

Yiran Lei +6
cs.DC 2026-04-06 reviewed

Satellite emulators tested against real data show clear gaps
An experimental evaluation of satellite constellation emulators

Victor Cionca +3
eess.SP 2026-04-06 reviewed

UAV flights fit polynomial and ML models to 5G KPIs
Modeling and Analysis of Air-to-Ground Cellular KPIs in a 5G Testbed using Android Smartphones

Simran Singh +7
cs.PF 2026-04-06 reviewed

Half the DCT coefficients train a transformer to near baseline loss
Training Transformers in Cosine Coefficient Space

Mohamed Amine Bergach
cs.AI 2026-04-06 reviewed

Merging experts beats pruning in MoE LLMs
REAM: Merging Improves Pruning of Experts in LLMs

Saurav Jha +5
cs.CR 2026-04-05 reviewed

Container testbed automates reproducible cybersecurity datasets
NetSecBed: A Container-Native Testbed for Reproducible Cybersecurity Experimentation

Leonardo Bitzki +6
cs.PF 2026-04-03 reviewed

Bridges link blockchains but usage lags behind
The Price of Interoperability: Exploring Cross-Chain Bridges and Their Economic Consequences

Yiyue Cao +4
cs.LG 2026-04-02 reviewed

Shared memory speeds NF4 dequantization 2x
Fast NF4 Dequantization Kernels for Large Language Model Inference

Xiangbo Qi +2
cs.DC 2026-03-31 reviewed

Multi-agent LLM workflow maps service text to KVI intervals
KPI2KVI: A Multi Agent Workflow for Calculating Key Value Indicators from Service Descriptions

Masoud Shokrnezhad +3
cs.DC 2026-03-26 reviewed

Erasure coding reduces LLM checkpoint latency 2.7x
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

Shakya Jayakody +3
physics.plasm-ph 2026-03-25 reviewed

Hybrid MPI+OpenMP scales PIC Monte Carlo to 16,000 GPUs
Multi-GPU Hybrid Particle-in-Cell Monte Carlo Simulations for Exascale Computing Systems

Jeremy J. Williams +15
cs.CR 2026-03-19 reviewed

ML-KEM key exchange runs in 35.7 ms on M0+
Benchmarking NIST-Standardised ML-KEM and ML-DSA on ARM Cortex-M0+: Performance, Memory, and Energy on the RP2040

Rojin Chhetri
cs.NI 2026-03-14 reviewed

CATS transport cuts first paint time by 78% in worst-case web load
A Case for CATS: A Conductor-driven Asymmetric Transport Scheme for Semantic Prioritization

Syed Muhammad Aqdas Rizvi
cs.DC 2026-03-10 reviewed

FP64 tensor cores speed finite-element kernels 2x
Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores

Jiqun Tu +6
cs.DC 2026-03-04 reviewed

Fixed encoding decodes data 9-213× faster than Protocol Buffers
Simplicity Scales

Andrew Sampson (6OVER3 Institute) +2
cs.NI 2026-02-23 reviewed

Dynamic routing across LLMs beats any single model
Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey

Yasmin Moslem +1
cs.DC 2026-02-19 reviewed

SwapLess cuts Edge TPU latency up to 77% via CPU-TPU partitioning
Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs

Nathan Ng +7
cs.LG 2026-02-09 reviewed

WebGPU dispatch overhead is 24-36 μs on Vulkan
Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers

Jędrzej Maczan
cs.AI 2026-02-05 reviewed

LLM energy minima occur at moderate input and output lengths
SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference

Hiari Pizzini Cavagna +7
cs.CR 2026-01-30 reviewed

PQC algorithms add manageable delay to enterprise Wi-Fi logins
Assessing the Real-World Impact of Post-Quantum Cryptography on WPA-Enterprise Networks

Lukas Köder +5
cs.PF 2026-01-21 reviewed

Hybrid model cuts GPU kernel prediction error by 6.7x
PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction

Kaixuan Zhang +10
cs.DC 2026-01-15 reviewed

Beta metric delivers 96.5% optimal edge AI performance
Mitigating GIL Bottlenecks in Edge AI Systems

Mridankan Mandal +1
cs.LG 2026-01-06 reviewed

Sparse kernels factor forest proximities exactly
Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels

Adrien Aumon +3
cs.DC 2025-12-23 reviewed

SHIRO delivers 221x SpMM speedup on 128 GPUs via sparsity-aware transfers
SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication

Chen Zhuang +7
cs.DC 2025-12-18 reviewed

Multipath routing lifts host-GPU bandwidth 4.6x
MultiPath Memory Access: Breaking Host-GPU Bandwidth Bottlenecks in LLM Services

Lingfeng Tang +8
cs.DC 2025-12-17 reviewed

Data movement bottlenecks sit outside the network core
Reexamining Paradigms of End-to-End Data Movement

Chin Fang +3
cs.DC 2025-12-15 reviewed

Framework links SKA imaging quality to energy and cost metrics
astroCAMP: A Community Benchmark and Co-Design Framework for Sustainable SKA-Scale Radio Imaging

Denisa-Andreea Constantinescu +9
cs.SE 2025-12-13 reviewed

Async Kafka rules shift availability forecasts by 0.001 points or less
Evaluating Asynchronous Semantics in Trace-Discovered Resilience Models: A Case Study on the OpenTelemetry Demo

Anatoly A. Krasnovsky