archive
Every paper Pith has read.
225 papers in cs.PF · page 4
-
Voxel traits let Spira skip kernel-map overhead for 3x faster point-cloud convolution
Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks
-
Digital in-memory design reaches 3.59 TOPS/W for AI matrix math
DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format
-
Joint data-compute tuning speeds ML kernels on PIM up to 13x
DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures
-
Neural decider skips 93% of iterations to lift LLM reasoning
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models
-
Dataframe libraries differ in energy use within GPU DL pipelines
Energy Consumption of Dataframe Libraries for End-to-End Deep Learning Pipelines: A Comparative Analysis
-
Domain decomposition scales Monte Carlo to 16384 cores
Scalable Domain-decomposed Monte Carlo Neutral Transport for Nuclear Fusion
-
PyTorch compiler turns plain attention code into fast kernels
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
-
RL congestion control underperforms in LEO dynamic tests
Evaluating Learning Congestion control Schemes for LEO Constellations
-
Search tunes allocators to cut heap use by 4 percent
GreenMalloc: Allocator Optimisation for Industrial Workloads
-
Enhanced power-down saves energy in supercomputer Ethernet networks
On the Power Saving in High-Speed Ethernet-based Networks for Supercomputers and Data Centers
-
Fixed configs make Ramulator 2.0 match real memory performance
Cleaning up the Mess: Re-Evaluating the Real-System Modeling Accuracy of Ramulator 2.0
-
LLMs lag humans on real Java performance fixes and show high volatility
Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software
-
Leave-one-out technique tightens 1/(1-ρ) bound for G/G/n queues
A new $1/(1-\rho)$-scaling bound for multiserver queues via a leave-one-out technique
-
GPU data-movement cuts lower both time and energy for large sparse solves
On the energy efficiency of sparse matrix computations on multi-GPU clusters
-
Hybrid tile sparsity speeds LLMs up to 1.38x with higher accuracy
PATCH: Learnable Tile-level Hybrid Sparsity for LLMs
-
NetCAS boosts remote storage speed 174% via dynamic I/O splits
NetCAS: Dynamic Cache and Backend Device Management in Networked Environments
-
denet profiles CPU, memory and I/O for processes and children
denet, A lightweight command-line tool for process monitoring in benchmarking and beyond
-
Shared-memory views double speed of parallel R tasks
Memshare: Memory Sharing for Multicore Computation in R with an Application to Feature Selection by Mutual Information using PDE
-
MAC-based PRNG produces passwords passing NIST randomness tests
Secure Password Generator Based on Secure Pseudo-Random Number Generator
-
Default collectives up to 5x slower than tuned choices
PICO: Performance Insights for Collective Operations
-
NPU pilot compute cuts CPU/GPU needs for on-device LLM attention
ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
-
Engine cuts mixed-precision LLM latency by up to 61 percent
LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind
-
Memory reads turn into stochastic multiplies for matrix work
OISMA: On-the-fly In-memory Stochastic Multiplication Architecture for Matrix-Multiplication Workloads
-
Dual-stream model cuts microservice latency prediction error 15-26%
Reliable Microservice Tail Latency Prediction via Decoupled Dual-Stream Learning and Gradient Modulation
-
Toroidal multigrid solver beats block Jacobi on stellarator tests
Fast solvers for Tokamak fluid models with PETSC
-
GPUs speed up logic model searches for gene networks up to 19 times
GPU-accelerated Modeling of Biological Regulatory Networks
-
LLMs optimize large Java apps better than compilers
SysLLMatic: Large Language Models are Software System Optimizers
-
Seizure detectors reach only 32% F1 on unseen patients
Quantifying the Generalization Gap in Seizure Detection: A Large-Scale Empirical Benchmark via the SzCORE Challenge
-
Bounded flexibility forces geometric queue decay in growing networks
Geometric lower bounds for the steady-state occupancy of processing networks with limited connectivity
-
Two-stage dispatching improves mean response times
"Two-Stagification": Job Dispatching in Large-Scale Clusters via a Two-Stage Architecture
-
Grover search recovers Boolean logic in 5-protein brain network
Identifying Protein Co-regulatory Network Logic by Solving B-SAT Problems through Gate-based Quantum Computing
-
Quantization and pruning lower LLM energy use while boosting performance
Energy-Aware LLMs: A step towards sustainable AI for downstream applications
-
LLMs match PyTorch kernels in under 20% of ML cases
KernelBench: Can LLMs Write Efficient GPU Kernels?
-
GP 2 programs match imperative speeds for connectivity and shortest paths
Rule-Based Graph Programs Matching the Time Complexity of Imperative Algorithms
-
TrainMover resumes ML jobs in 20 seconds after interruptions
TrainMover: An Interruption-Resilient Runtime for ML Training
-
FlexAttention turns PyTorch code into fast attention kernels
Flex Attention: A Programming Model for Generating Optimized Attention Kernels
-
Hybrid tuner speeds CVD model training while raising accuracy
Time-Efficient Hybrid Hyperparameter Tuning Approach for Cardiovascular Disease Classification
-
Interleaved CPU-GPU optimizer updates cut LLM training time by 2.5×
Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading
-
Survey categorizes GPU communication options that limit CPU role
The Landscape of GPU-Centric Communication
-
ProTrain automates memory tuning to lift LLM training speed 1.43-2.71x
ProTrain: Efficient LLM Training via Memory-Aware Techniques
-
Quantum switch blocking depends only on mean attempt and calibration times
An on-demand resource allocation algorithm for a quantum network hub and its performance analysis
-
Patch pipeline reuses stale maps to speed DiT inference
PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference
-
Distributed MPK with RACE blocking achieves 4x speedup on 832 cores
Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels
-
2-bit KV cache method cuts LLM peak memory by 2.6x
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
-
Clusters and block sharing cut livestream bandwidth use
Bandwidth Efficient Livestreaming in Mobile Wireless Networks: A Peer-to-Peer ACIDE Solution
-
VMT19937 vectorizes Mersenne Twister for linear SIMD gains
VMT19937: A SIMD-Friendly Pseudo Random Number Generator based on Mersenne Twister 19937
-
16-bit training matches 32-bit accuracy at higher speed
Revisiting 16-bit Neural Network Training: A Practical Approach for Resource-Limited Learning
-
FSDP matches DDP speed for much larger models with near-linear TFLOPS scaling
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
-
Random forest repairs offspring to speed multi-objective evolution
Enhanced Innovized Repair Operator for Evolutionary Multi- and Many-objective Optimization
-
OLAP engines waste 25-82% of CPU cycles on stalls
Micro-architectural Analysis of OLAP: Limitations and Opportunities