archive

Every paper Pith has read. Search by title, abstract, or pith.

225 papers in cs.PF · page 1

cs.LG 2026-05-22 reviewed

Meta-learning yields model performance scores on unlabeled data
Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

Trinh Pham +4
cs.LG 2026-05-21 reviewed

Controller routes LLM requests to best mode for 2x speedup
ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

Aman Sunesh +2
cs.AR 2026-05-21 reviewed

ACALSim reaches 14x speedup over SST on large GPU simulations
ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration

Wei-Fen Lin +7
cs.LG 2026-05-21 reviewed

Separate physical pools for KV and SSM caches cut OOMs 7.6% and raise throughput up to 13x
Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference

An Xuan Nguyen
cs.AI 2026-05-20 reviewed

Agentic AI uses 4.33x more energy per successful goal than linear baselines
Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems

Deepak Panigrahy +1
cs.PF 2026-05-20 reviewed

Discretization produces throughput-optimal policies for continuous MRJ
Throughput-Optimal Multiresource-Job Scheduling with Continuous Requirement Distribution

Heyuan Yao +2
cs.LG 2026-05-19 reviewed

Krylov approximation unlearns data 48x faster than retraining
Causal Unlearning in Collaborative Optimization: Exact and Approximate Influence Reversal under Adversarial Contributions

Ali Mahdavi +3
cs.CV 2026-05-19 reviewed

Billion-scale 3D Gaussians train on one 24 GB GPU
TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization

Chonghao Zhong +6
cs.SE 2026-05-19 reviewed

Agent skills from expert methods beat docs for PostgreSQL tuning
A Case for Agentic Tuning: From Documentation to Action in PostgreSQL

Hongyu Lin +6
cs.DC 2026-05-19 reviewed

Reasoning LLMs trap data parallelism in KV-cache limits
Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles

Moiz Arif +3
cs.PF 2026-05-18 reviewed

Geo-distributed AI training optimizes at 10-100 km distances
Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training

Ioannis Papavasileiou +3
cs.PF 2026-05-18 reviewed

Hybrid model cuts medical tourist waits from 13.7 to 2.4 days
Reducing Waiting Time for Medical Tourists Through Hybrid Agent-Based and Discrete-Event Simulation: A Hospital Case Study

Melika Baghi +1
cs.LO 2026-05-18 reviewed

Unified calculus and lattice language reduce CS problems to performance evaluation
On Generalized Performance Evaluation and Generalized Controller Synthesis

Zining Cao
cs.LG 2026-05-18 reviewed

Boundary protection recovers 69-90% quality at 13% KV retention
Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction

Gabriel Garcia
cs.LG 2026-05-18 reviewed

Covariance rotations keep 2-bit KV caches accurate
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Zhongzhu Zhou +6
eess.IV 2026-05-16 reviewed

Legacy GPUs power real-time 8K60 for connected vehicles
Sustainable Real-Time 8K60 HEVC Encoding for V2X: Repurposing Legacy NVENC Hardware at the Vehicular Edge

Kasidis Arunruangsirilert +1
cs.PF 2026-05-15 reviewed

Heuristic merges HPC traces to extend hardware counter coverage
Heuristic-Based Merging of HPC Traces to Extend Hardware Counter Coverage

Júlia Orteu Aubach +3
cs.LG 2026-05-15 reviewed

Closed-form linear operator fixes layer-pruned LLMs
Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs

Vincent-Daniel Yun +3
quant-ph 2026-05-14 reviewed

Cache reorganization lifts GPU speedups for 28-qubit simulations on laptops
Accelerating State-Vector Quantum Simulation on Integrated GPUs via Cache Locality Optimization: A Cross-Architecture Evaluation

Gabriel Fernandes Thomaz +4
cs.OS 2026-05-14 reviewed

LLM tunes Linux knobs for 72 percent stable gain over defaults
SemaTune: Semantic-Aware Online OS Tuning with Large Language Models

Georgios Liargkovas +3
cs.DC 2026-05-13 reviewed

Heterogeneous solvers up to 32% faster than GPU-only for big matrices
Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL

Tim Thüring +2
cs.LG 2026-05-12 reviewed

Block-scale search cuts quantization error 27% in BFP
Search Your Block Floating Point Scales!

Tanmaey Gupta +12
cs.PF 2026-05-12 reviewed

Adaptive packed layouts enable efficient VLA ML code
Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation

Ege Beysel +2
cs.PF 2026-05-12 reviewed

Packed layouts enable scalable vector ML code
Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation

Ege Beysel +2
cs.AR 2026-05-12 reviewed

Joint TLB-cache tweaks boost instruction prefetching 8.7%
Enhancing Instruction Prefetching via Cache and TLB Management

Alexandre Valentin Jamet +4
cs.IT 2026-05-12 reviewed

Node failures scale wireless capacity and delay with sqrt of reliable nodes
On Capacity and Delay of Wireless Networks with Node Failures

Wei Li +3
cs.DC 2026-05-12 reviewed

Power capping leaves LLM decode energy untouched
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

Bole Ma +3
cs.DC 2026-05-11 reviewed

Chakra standardizes graph traces for AI workload benchmarking
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Srinivas Sridharan +28
cs.DC 2026-05-11 reviewed

Open traces standardize ML workload benchmarking
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Srinivas Sridharan +28
cs.LG 2026-05-11 reviewed

DMI-Lib cuts LLM internal observability overhead to 0.4-6.8 percent
Enabling Performant and Flexible Model-Internal Observability for LLM Inference

Nengneng Yu +4
cs.DC 2026-05-11 reviewed

Edge micro-agent fixes failures safely with no destructive actions
An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum

Suvi De Silva +4
cs.GR 2026-05-11 reviewed

Inverted culling speeds dynamic LiDAR ray tracing
Geometrically Approximated Modeling for Emitter-Centric Ray-Triangle Filtering in Arbitrarily Dynamic LiDAR Simulation

Rabin Gajmer +2
cs.CR 2026-05-11 reviewed

KEM-IES upgrades ECIES with PQC KEM and Ascon
Key Encapsulation Mechanism-Based Integrated Encryption Scheme (KEM-IES)

Abel C. H. Chen
cs.RO 2026-05-11 reviewed

Caching reuses diffusion steps for 4.6x faster robot plans
Muninn: Your Trajectory Diffusion Model But Faster

Gokul Puthumanaillam +6
cs.CR 2026-05-11 reviewed

Mamba-2 classifies network bursts directly from raw bytes
MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining

Gayan K. Kulatilleke +3
cs.DC 2026-05-10 reviewed

Cloud trace decomposition predicts performance at 2% error
Cloud Performance Decomposition for Long-Term Performance Engineering: A Case Study

Shimul Debnath +4
cs.DC 2026-05-10 reviewed

Adaptive DNN splits cut energy by 27-36% on real edge-cloud hardware
Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum

Akuen Akoi Deng +3
cs.LG 2026-05-09 reviewed

Apple MPS shows 21x latency spikes in narrow decoding ranges
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

Willy Fitra Hendria
cs.LG 2026-05-09 reviewed

MPS decoding latency spikes up to 21x in narrow ranges
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

Willy Fitra Hendria
cs.PF 2026-05-09 reviewed

GPU speedups reach 10x despite 1.85x bandwidth limit in quantum simulation
A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture

Gyan Pratipat
cs.PF 2026-05-09 reviewed

4.46× jump in quantum sim time at 29 qubits on M4 Pro
A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture

Gyan Pratipat
cs.PF 2026-05-09 reviewed

Single-thread JPEG benchmarks misrank decoders for DataLoaders
Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders

Vladimir Iglovikov +1
cs.PF 2026-05-09 reviewed

DataLoader benchmarks reorder JPEG decoder rankings
Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders

Vladimir Iglovikov +1
cs.AR 2026-05-09 reviewed

DDR5 single sub-channel matches cache lines but loses 40-60% bandwidth
Single 32-bit Sub-Channel DDR5 DIMMs: Architecture, Performance Bounds, and Standardisation

Chih-Hua Ke
cs.LG 2026-05-08 reviewed

Cyclic tuning raises RAG quality by up to 54 percent
CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG

Pengzhou Chen +1
cs.LG 2026-05-08 reviewed

Unified runtime delivers 2.55x decode speedup for low-rank transformers
FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast

Wenhao Wu +7
cs.LG 2026-05-08 reviewed

Fluxion speeds long-context inference 1.5x-3.7x via CPU-GPU hybrid sparse attention
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

Feiyu Yao +5
cs.LG 2026-05-08 reviewed

First benchmark supplies real data for LLM hyperparameter tuning
LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems

Siyu Wu +5
cs.DC 2026-05-07 reviewed

AD replaces finite differences in INLA for 4-8x gradient speedups
ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations

Afif Boudaoud +8
cs.AR 2026-05-07 reviewed

Pipeline speeds power-of-two DNNs on edge FPGAs by up to 3.6x
PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs

Rappy Saha +4