archive
Every paper Pith has read.
225 papers in cs.PF · page 3
-
Block placement and cache rules cut LLM serving latency
Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving
-
L4 GPU delivers up to 4.4x inference throughput over T4
DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance
-
State-based scheduler maps full polytope of feasible worst-case schedules
Exploiting Scheduling Flexibility via State-Based Scheduling When Guaranteeing Worst-Case Services
-
Virtual machine speeds array programs 147x on GPUs
Towards a Linear-Algebraic Hypervisor
-
IPFS achieves 70% success on decentralized NAT traversal
Large-Scale Measurement of NAT Traversal for the Decentralized Web: A Case Study of DCUtR in IPFS
-
Sparse FHE matmul on GPUs runs up to 3x faster than CPU
GPU Acceleration of Sparse Fully Homomorphic Encrypted DNNs
-
Transpiler maps OpenQASM 3.0 dynamic circuits to CUDA-Q kernels
Efficient Transpilation of OpenQASM 3.0 Dynamic Circuits to CUDA-Q: Performance and Expressiveness Advantages
-
H200 outperforms H100 for memory-bound tasks when power-capped
Architectural Trade-offs in the Energy-Efficient Era: A Comparative Study of power-capping NVIDIA H100 and H200
-
Hierarchical search tunes GPU apps better and faster
Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search
-
Julia model for particle flows hits 18x GPU speedup
LCS.jl: A High-Performance, Multi-Platform Computational Model in Julia for Turbulent Particle-Laden Flows
-
AI workload mix smooths power variability but keeps fast ramps
Workload composition smooths aggregate power demand while sustaining short-horizon ramps in AI data centers
-
Adaptive beta tuning curbs dominance in AI resource allocation
Computable Fairness: Boltzmann-Softmax Control for AI Resource Allocation
-
MoEITS prunes experts in LLMs to reduce compute while preserving accuracy
MoEITS: A Green AI approach for simplifying MoE-LLMs
-
Wave-aware model picks near-optimal GPU kernel settings fast
WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning
-
Mosaic clusters KVCache for faster streaming video VLMs
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
-
Energy-efficient GPUs deliver better value under budget limits
Wattlytics: A Web Platform for Co-Optimizing Performance, Energy, and TCO in HPC Clusters
-
CPU-free LLM serving cuts P99 latency up to 8x
Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
-
Client scheduler hits 100% LLM deadlines at 4.2 requests per second
Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale
-
Go runtime outperforms Python and Node.js for OpenFaaS on Kubernetes
Optimizing OpenFaaS on Kubernetes: Comparative Analysis of Language Runtimes and Cluster Distributions
-
PTE metric predicts LLM tool-use latency better than token counts
Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning
-
AutoLALA produces symbolic reuse-distance formulas for loop nests
AutoLALA: Automatic Loop Algebraic Locality Analysis for AI and HPC Kernels
-
Three metrics separate AI adaptation from data shifts
Learning, Potential, and Retention: An Approach for Evaluating Adaptive AI-Enabled Medical Devices
-
Execution-idle wastes 10.7% of GPU cluster energy
The Energy Cost of Execution-Idle in GPU Clusters
-
Satellite emulators tested against real data show clear gaps
An experimental evaluation of satellite constellation emulators
-
UAV flights fit polynomial and ML models to 5G KPIs
Modeling and Analysis of Air-to-Ground Cellular KPIs in a 5G Testbed using Android Smartphones
-
Half the DCT coefficients train a transformer to near baseline loss
Training Transformers in Cosine Coefficient Space
-
Merging experts beats pruning in MoE LLMs
REAM: Merging Improves Pruning of Experts in LLMs
-
Container testbed automates reproducible cybersecurity datasets
NetSecBed: A Container-Native Testbed for Reproducible Cybersecurity Experimentation
-
Bridges link blockchains but usage lags behind
The Price of Interoperability: Exploring Cross-Chain Bridges and Their Economic Consequences
-
Shared memory speeds NF4 dequantization 2x
Fast NF4 Dequantization Kernels for Large Language Model Inference
-
Multi-agent LLM workflow maps service text to KVI intervals
KPI2KVI: A Multi Agent Workflow for Calculating Key Value Indicators from Service Descriptions
-
Erasure coding reduces LLM checkpoint latency 2.7x
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
-
Hybrid MPI+OpenMP scales PIC Monte Carlo to 16,000 GPUs
Multi-GPU Hybrid Particle-in-Cell Monte Carlo Simulations for Exascale Computing Systems
-
ML-KEM key exchange runs in 35.7 ms on M0+
Benchmarking NIST-Standardised ML-KEM and ML-DSA on ARM Cortex-M0+: Performance, Memory, and Energy on the RP2040
-
CATS transport cuts first paint time by 78% in worst-case web load
A Case for CATS: A Conductor-driven Asymmetric Transport Scheme for Semantic Prioritization
-
FP64 tensor cores speed finite-element kernels 2x
Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores
-
Fixed encoding decodes data 9-213× faster than Protocol Buffers
Simplicity Scales
-
Dynamic routing across LLMs beats any single model
Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey
-
SwapLess cuts Edge TPU latency up to 77% via CPU-TPU partitioning
Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs
-
WebGPU dispatch overhead is 24-36 μs on Vulkan
Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
-
LLM energy minima at moderate input and output lengths
SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference
-
PQC algorithms add manageable delay to enterprise Wi-Fi logins
Assessing the Real-World Impact of Post-Quantum Cryptography on WPA-Enterprise Networks
-
Hybrid model cuts GPU kernel prediction error by 6.7x
PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction
-
Beta metric delivers 96.5% optimal edge AI performance
Mitigating GIL Bottlenecks in Edge AI Systems
-
Sparse kernels factor forest proximities exactly
Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels
-
SHIRO delivers 221x SpMM speedup on 128 GPUs via sparsity-aware transfers
SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication
-
Multipath routing lifts host-GPU bandwidth 4.6x
MultiPath Memory Access: Breaking Host-GPU Bandwidth Bottlenecks in LLM Services
-
Data movement bottlenecks sit outside the network core
Reexamining Paradigms of End-to-End Data Movement
-
Framework links SKA imaging quality to energy and cost metrics
astroCAMP: A Community Benchmark and Co-Design Framework for Sustainable SKA-Scale Radio Imaging
-
Async Kafka rules shift availability forecasts by 0.001 points or less
Evaluating Asynchronous Semantics in Trace-Discovered Resilience Models: A Case Study on the OpenTelemetry Demo