archive

Every paper Pith has read.

225 papers in cs.PF · page 4

cs.DC 2025-11-25 reviewed

Voxel traits let Spira skip kernel-map overhead for 3x faster point-cloud convolution
Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks

Dionysios Adamopoulos +3
cs.AR 2025-11-21 reviewed

Digital in-memory design reaches 3.59 TOPS/W for AI matrix math
DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format

Shady Agwa +3
cs.AR 2025-11-19 reviewed

Joint data-compute tuning speeds ML kernels on PIM up to 13x
DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures

Peiming Yang +6
cs.CL 2025-11-11 reviewed

Neural decider skips 93% of iterations to lift LLM reasoning
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Tianyu Fu +5
cs.SE 2025-11-10 reviewed

Dataframe libraries differ in energy use within GPU DL pipelines
Energy Consumption of Dataframe Libraries for End-to-End Deep Learning Pipelines: A Comparative Analysis

Punit Kumar +2
physics.comp-ph 2025-11-06 reviewed

Domain decomposition scales Monte Carlo to 16384 cores
Scalable Domain-decomposed Monte Carlo Neutral Transport for Nuclear Fusion

Oskar Lappi +5
cs.LG 2025-11-03 reviewed

PyTorch compiler turns plain attention code into fast kernels
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Bozhi You +7
cs.NI 2025-10-29 reviewed

RL congestion control underperforms in LEO dynamic tests
Evaluating Learning Congestion control Schemes for LEO Constellations

Mihai Mazilu +2
cs.SE 2025-10-24 reviewed

Search tunes allocators to cut heap use by 4 percent
GreenMalloc: Allocator Optimisation for Industrial Workloads

Aidan Dakhama +3
cs.NI 2025-10-22 reviewed

Enhanced power-down saves energy in supercomputer Ethernet networks
On the Power Saving in High-Speed Ethernet-based Networks for Supercomputers and Data Centers

Miguel Sánchez de La Rosa +4
cs.AR 2025-10-17 reviewed

Fixed configs make Ramulator 2.0 match real memory performance
Cleaning up the Mess: Re-Evaluating the Real-System Modeling Accuracy of Ramulator 2.0

F. Nisa Bostanci +6
cs.SE 2025-10-17 reviewed

LLMs lag humans on real Java performance fixes with high volatility
Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software

Lirong Yi +2
math.PR 2025-10-13 reviewed

Leave-one-out technique tightens 1/(1-ρ) bound for G/G/n queues
A new $1/(1-\rho)$-scaling bound for multiserver queues via a leave-one-out technique

Yige Hong
cs.DC 2025-10-03 reviewed

GPU data-movement cuts lower both time and energy for large sparse solves
On the energy efficiency of sparse matrix computations on multi-GPU clusters

Massimo Bernaschi +3
cs.LG 2025-09-27 reviewed

Hybrid tile sparsity speeds LLMs up to 1.38x with higher accuracy
PATCH: Learnable Tile-level Hybrid Sparsity for LLMs

Younes Hourri +2
cs.OS 2025-09-25 reviewed

NetCAS boosts remote storage speed 174% via dynamic I/O splits
NetCAS: Dynamic Cache and Backend Device Management in Networked Environments

Joon Yong Hwang +2
cs.PF 2025-09-24 reviewed

denet profiles CPU, memory and I/O for processes and children
denet, A lightweight command-line tool for process monitoring in benchmarking and beyond

Ben Carrillo +1
cs.PF 2025-09-10 reviewed

Shared-memory views double speed of parallel R tasks
Memshare: Memory Sharing for Multicore Computation in R with an Application to Feature Selection by Mutual Information using PDE

Michael C. Thrun +1
cs.CR 2025-08-25 reviewed

MAC-based PRNG produces passwords passing NIST randomness tests
Secure Password Generator Based on Secure Pseudo-Random Number Generator

Abel C. H. Chen
cs.DC 2025-08-22 reviewed

Default collectives up to 5x slower than tuned choices
PICO: Performance Insights for Collective Operations

Saverio Pasqualoni +5
cs.PF 2025-08-22 reviewed

NPU pilot compute cuts CPU/GPU needs for on-device LLM attention
ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

Wangsong Yin +4
cs.DC 2025-08-21 reviewed

Engine cuts mixed-precision LLM latency by up to 61 percent
LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind

Li Zhang +8
cs.AR 2025-08-12 reviewed

Memory reads turn into stochastic multiplies for matrix work
OISMA: On-the-fly In-memory Stochastic Multiplication Architecture for Matrix-Multiplication Workloads

Shady Agwa +3
cs.LG 2025-08-03 reviewed

Dual-stream model cuts microservice latency prediction error 15-26%
Reliable Microservice Tail Latency Prediction via Decoupled Dual-Stream Learning and Gradient Modulation

Wenzhuo Qian +9
physics.plasm-ph 2025-06-20 reviewed

Toroidal multigrid solver beats block Jacobi on stellarator tests
Fast solvers for Tokamak fluid models with PETSC

Mark F. Adams +2
q-bio.MN 2025-06-10 reviewed

GPUs speed up logic model searches for gene networks up to 19 times
GPU-accelerated Modeling of Biological Regulatory Networks

Joyce Reimer +6
cs.SE 2025-06-02 reviewed

LLMs optimize large Java apps better than compilers
SysLLMatic: Large Language Models are Software System Optimizers

Huiyun Peng +9
eess.SP 2025-05-19 reviewed

Seizure detectors reach only 32% F1 on unseen patients
Quantifying the Generalization Gap in Seizure Detection: A Large-Scale Empirical Benchmark via the SzCORE Challenge

Jonathan Dan +3
math.PR 2025-05-13 reviewed

Bounded flexibility forces geometric queue decay in growing networks
Geometric lower bounds for the steady-state occupancy of processing networks with limited connectivity

Diego Goldsztajn +1
cs.DC 2025-05-05 reviewed

Two-stage dispatching improves mean response times
"Two-Stagification": Job Dispatching in Large-Scale Clusters via a Two-Stage Architecture

Mert Yildiz +2
quant-ph 2025-04-12 reviewed

Grover search recovers Boolean logic in 5-protein brain network
Identifying Protein Co-regulatory Network Logic by Solving B-SAT Problems through Gate-based Quantum Computing

Aspen Erlandsson Brisebois +5
cs.PF 2025-03-22 reviewed

Quantization and pruning lower LLM energy use while boosting performance
Energy-Aware LLMs: A step towards sustainable AI for downstream applications

Nguyen Phuc Tran +2
cs.LG 2025-02-14 reviewed

LLMs match PyTorch kernels in under 20% of ML cases
KernelBench: Can LLMs Write Efficient GPU Kernels?

Anne Ouyang +6
cs.PL 2025-01-15 reviewed

GP 2 programs match imperative speeds for connectivity and shortest paths
Rule-Based Graph Programs Matching the Time Complexity of Imperative Algorithms

Ziad Ismaili Alaoui +1
cs.DC 2024-12-17 reviewed

TrainMover resumes ML jobs in 20 seconds after interruptions
TrainMover: An Interruption-Resilient Runtime for ML Training

ChonLam Lao +15
cs.LG 2024-12-07 reviewed

FlexAttention turns PyTorch code into fast attention kernels
Flex Attention: A Programming Model for Generating Optimized Attention Kernels

Juechu Dong +4
cs.LG 2024-11-27 reviewed

Hybrid tuner speeds CVD model training while raising accuracy
Time-Efficient Hybrid Hyperparameter Tuning Approach for Cardiovascular Disease Classification

Abhay Kumar Pathak +2
cs.LG 2024-10-26 reviewed

Interleaved CPU-GPU optimizer updates cut LLM training time by 2.5×
Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Avinash Maurya +4
cs.DC 2024-09-15 reviewed

Survey categorizes GPU communication options that limit CPU role
The Landscape of GPU-Centric Communication

Didem Unat +6
cs.DC 2024-06-12 reviewed

ProTrain automates memory tuning to lift LLM training speed 1.43-2.71x
ProTrain: Efficient LLM Training via Memory-Aware Techniques

Hanmei Yang +6
quant-ph 2024-05-28 reviewed

Quantum switch blocking depends only on mean attempt and calibration times
An on-demand resource allocation algorithm for a quantum network hub and its performance analysis

Scarlett Gauthier +2
cs.CV 2024-05-23 reviewed

Patch pipeline reuses stale maps to speed DiT inference
PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference

Jiarui Fang +4
cs.DC 2024-05-21 reviewed

Distributed MPK with RACE blocking achieves 4x speedup on 832 cores
Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

Dane C. Lacey +5
cs.CL 2024-02-05 reviewed

2-bit KV cache method cuts LLM peak memory 2.6 times
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Zirui Liu +7
cs.NI 2023-10-22 reviewed

Clusters and block sharing cut livestream bandwidth use
Bandwidth Efficient Livestreaming in Mobile Wireless Networks: A Peer-to-Peer ACIDE Solution

Andrei Negulescu +1
cs.DC 2023-08-02 reviewed

VMT19937 vectorizes Mersenne Twister for linear SIMD gains
VMT19937: A SIMD-Friendly Pseudo Random Number Generator based on Mersenne Twister 19937

Fabio Cannizzo
cs.LG 2023-05-18 reviewed

16-bit training matches 32-bit accuracy at higher speed
Revisiting 16-bit Neural Network Training: A Practical Approach for Resource-Limited Learning

Juyoung Yun +4
cs.DC 2023-04-21 reviewed

FSDP matches DDP speed for much larger models with near-linear TFLOPS scaling
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao +17
cs.NE 2020-11-21 reviewed

Random forest repairs offspring to speed multi-objective evolution
Enhanced Innovized Repair Operator for Evolutionary Multi- and Many-objective Optimization

Sukrit Mittal +3
cs.DB 2019-08-13 reviewed

OLAP engines waste 25-82% of CPU cycles on stalls
Micro-architectural Analysis of OLAP: Limitations and Opportunities

Utku Sirin +1