archive
Every paper Pith has read.
225 papers in cs.PF · page 1
-
Meta-learning yields model performance scores on unlabeled data
Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning
-
Controller routes LLM requests to best mode for 2x speedup
ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU
-
ACALSim reaches 14x speedup over SST on large GPU simulations
ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration
-
Separate physical pools for KV and SSM caches cut OOMs 7.6% and raise throughput up to 13x
Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference
-
Agentic AI uses 4.33x more energy per successful goal than linear baselines
Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems
-
Discretization produces throughput-optimal policies for continuous MRJ
Throughput-Optimal Multiresource-Job Scheduling with Continuous Requirement Distribution
-
Krylov approximation unlearns data 48x faster than retraining
Causal Unlearning in Collaborative Optimization: Exact and Approximate Influence Reversal under Adversarial Contributions
-
Billion-scale 3D Gaussians train on one 24 GB GPU
TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization
-
Agent skills from expert methods beat docs for PostgreSQL tuning
A Case for Agentic Tuning: From Documentation to Action in PostgreSQL
-
Reasoning LLMs trap data parallelism in KV-cache limits
Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles
-
Geo-distributed AI training optimizes at 10-100 km distances
Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training
-
Hybrid model cuts medical tourist waits from 13.7 to 2.4 days
Reducing Waiting Time for Medical Tourists Through Hybrid Agent-Based and Discrete-Event Simulation: A Hospital Case Study
-
Unified calculus and lattice language reduce CS problems to performance evaluation
On Generalized Performance Evaluation and Generalized Controller Synthesis
-
Boundary protection recovers 69-90% quality at 13% KV retention
Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction
-
Covariance rotations keep 2-bit KV caches accurate
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
-
Legacy GPUs power real-time 8K60 for connected vehicles
Sustainable Real-Time 8K60 HEVC Encoding for V2X: Repurposing Legacy NVENC Hardware at the Vehicular Edge
-
Heuristic merges HPC traces to extend hardware counter coverage
Heuristic-Based Merging of HPC Traces to Extend Hardware Counter Coverage
-
Closed-form linear operator fixes layer-pruned LLMs
Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs
-
Cache reorganization lifts GPU speedups for 28-qubit simulations on laptops
Accelerating State-Vector Quantum Simulation on Integrated GPUs via Cache Locality Optimization: A Cross-Architecture Evaluation
-
LLM tunes Linux knobs for 72 percent stable gain over defaults
SemaTune: Semantic-Aware Online OS Tuning with Large Language Models
-
Heterogeneous solvers up to 32% faster than GPU-only for big matrices
Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL
-
Block-scale search cuts quantization error 27% in BFP
Search Your Block Floating Point Scales!
-
Adaptive packed layouts enable efficient VLA ML code
Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation
-
Packed layouts enable scalable vector ML code
Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation
-
Joint TLB-cache tweaks boost instruction prefetching 8.7%
Enhancing Instruction Prefetching via Cache and TLB Management
-
Node failures scale wireless capacity and delay with sqrt of reliable nodes
On Capacity and Delay of Wireless Networks with Node Failures
-
Power capping leaves LLM decode energy untouched
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
-
Chakra standardizes graph traces for AI workload benchmarking
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
-
Open traces standardize ML workload benchmarking
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
-
DMI-Lib cuts LLM internal observability overhead to 0.4-6.8 percent
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
-
Edge micro-agent fixes failures safely with no destructive actions
An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum
-
Inverted culling speeds dynamic LiDAR ray tracing
Geometrically Approximated Modeling for Emitter-Centric Ray-Triangle Filtering in Arbitrarily Dynamic LiDAR Simulation
-
KEM-IES upgrades ECIES with PQC KEM and Ascon
Key Encapsulation Mechanism-Based Integrated Encryption Scheme (KEM-IES)
-
Caching reuses diffusion steps for 4.6x faster robot plans
Muninn: Your Trajectory Diffusion Model But Faster
-
Mamba-2 classifies network bursts directly from raw bytes
MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining
-
Cloud trace decomposition predicts performance at 2% error
Cloud Performance Decomposition for Long-Term Performance Engineering: A Case Study
-
Adaptive DNN splits cut energy by 27-36% on real edge-cloud hardware
Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum
-
Apple MPS shows 21x latency spikes in narrow decoding ranges
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
-
MPS decoding latency spikes up to 21x in narrow ranges
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
-
GPU speedups reach 10x despite 1.85x bandwidth limit in quantum simulation
A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture
-
4.46x jump in quantum sim time at 29 qubits on M4 Pro
A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture
-
Single-thread JPEG benchmarks misrank decoders for DataLoaders
Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders
-
DataLoader benchmarks reorder JPEG decoder rankings
Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders
-
DDR5 single sub-channel matches cache lines but loses 40-60% bandwidth
Single 32-bit Sub-Channel DDR5 DIMMs: Architecture, Performance Bounds, and Standardisation
-
Cyclic tuning raises RAG quality by up to 54 percent
CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG
-
Unified runtime delivers 2.55x decode speedup for low-rank transformers
FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast
-
Fluxion speeds long-context inference 1.5x-3.7x via CPU-GPU hybrid sparse attention
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
-
First benchmark supplies real data for LLM hyperparameter tuning
LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems
-
AD replaces finite differences in INLA for 4-8x gradient speedups
ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations
-
Pipeline speeds power-of-two DNNs on edge FPGAs by up to 3.6x
PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs