archive

Every paper Pith has read.

225 papers in cs.PF · page 4

  1. cs.DC 2025-11-25 reviewed
    Voxel traits let Spira skip kernel-map overhead for 3x faster point-cloud convolution

    Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks

    Dionysios Adamopoulos +3

  2. cs.AR 2025-11-21 reviewed
    Digital in-memory design reaches 3.59 TOPS/W for AI matrix math

    DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format

    Shady Agwa +3

  3. cs.AR 2025-11-19 reviewed
    Joint data-compute tuning speeds ML kernels on PIM up to 13x

    DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures

    Peiming Yang +6

  4. cs.CL 2025-11-11 reviewed
    Neural decider skips 93% iterations to lift LLM reasoning

    Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

    Tianyu Fu +5

  5. cs.SE 2025-11-10 reviewed
    Dataframe libraries differ in energy use within GPU DL pipelines

    Energy Consumption of Dataframe Libraries for End-to-End Deep Learning Pipelines: A Comparative Analysis

    Punit Kumar +2

  6. physics.comp-ph 2025-11-06 reviewed
    Domain decomposition scales Monte Carlo to 16384 cores

    Scalable Domain-decomposed Monte Carlo Neutral Transport for Nuclear Fusion

    Oskar Lappi +5

  7. cs.LG 2025-11-03 reviewed
    PyTorch compiler turns plain attention code into fast kernels

    Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

    Bozhi You +7

  8. cs.NI 2025-10-29 reviewed
    RL congestion control underperforms in LEO dynamic tests

    Evaluating Learning Congestion control Schemes for LEO Constellations

    Mihai Mazilu +2

  9. cs.SE 2025-10-24 reviewed
    Search tunes allocators to cut heap use by 4 percent

    GreenMalloc: Allocator Optimisation for Industrial Workloads

    Aidan Dakhama +3

  10. cs.NI 2025-10-22 reviewed
    Enhanced power-down saves energy in supercomputer Ethernet networks

    On the Power Saving in High-Speed Ethernet-based Networks for Supercomputers and Data Centers

    Miguel Sánchez de La Rosa +4

  11. cs.AR 2025-10-17 reviewed
    Fixed configs make Ramulator 2.0 match real memory performance

    Cleaning up the Mess: Re-Evaluating the Real-System Modeling Accuracy of Ramulator 2.0

    F. Nisa Bostanci +6

  12. cs.SE 2025-10-17 reviewed
    LLMs lag humans on real Java performance fixes with high volatility

    Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software

    Lirong Yi +2

  13. math.PR 2025-10-13 reviewed
    Leave-one-out technique tightens 1/(1-ρ) bound for G/G/n queues

    A new $1/(1-\rho)$-scaling bound for multiserver queues via a leave-one-out technique

    Yige Hong

  14. cs.DC 2025-10-03 reviewed
    GPU data-movement cuts lower both time and energy for large sparse solves

    On the energy efficiency of sparse matrix computations on multi-GPU clusters

    Massimo Bernaschi +3

  15. cs.LG 2025-09-27 reviewed
    Hybrid tile sparsity speeds LLMs up to 1.38x with higher accuracy

    PATCH: Learnable Tile-level Hybrid Sparsity for LLMs

    Younes Hourri +2

  16. cs.OS 2025-09-25 reviewed
    NetCAS boosts remote storage speed 174% via dynamic I/O splits

    NetCAS: Dynamic Cache and Backend Device Management in Networked Environments

    Joon Yong Hwang +2

  17. cs.PF 2025-09-24 reviewed
    denet profiles CPU, memory and I/O for processes and children

    denet, A lightweight command-line tool for process monitoring in benchmarking and beyond

    Ben Carrillo +1

  18. cs.PF 2025-09-10 reviewed
    Shared-memory views double speed of parallel R tasks

    Memshare: Memory Sharing for Multicore Computation in R with an Application to Feature Selection by Mutual Information using PDE

    Michael C. Thrun +1

  19. cs.CR 2025-08-25 reviewed
    MAC-based PRNG produces passwords passing NIST randomness tests

    Secure Password Generator Based on Secure Pseudo-Random Number Generator

    Abel C. H. Chen

  20. cs.DC 2025-08-22 reviewed
    Default collectives up to 5x slower than tuned choices

    PICO: Performance Insights for Collective Operations

    Saverio Pasqualoni +5

  21. cs.PF 2025-08-22 reviewed
    NPU pilot compute cuts CPU/GPU needs for on-device LLM attention

    ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

    Wangsong Yin +4

  22. cs.DC 2025-08-21 reviewed
    Engine cuts mixed-precision LLM latency by up to 61 percent

    LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind

    Li Zhang +8

  23. cs.AR 2025-08-12 reviewed
    Memory reads turn into stochastic multiplies for matrix work

    OISMA: On-the-fly In-memory Stochastic Multiplication Architecture for Matrix-Multiplication Workloads

    Shady Agwa +3

  24. cs.LG 2025-08-03 reviewed
    Dual-stream model cuts microservice latency prediction error 15-26%

    Reliable Microservice Tail Latency Prediction via Decoupled Dual-Stream Learning and Gradient Modulation

    Wenzhuo Qian +9

  25. physics.plasm-ph 2025-06-20 reviewed
    Toroidal multigrid solver beats block Jacobi on stellarator tests

    Fast solvers for Tokamak fluid models with PETSC

    Mark F. Adams +2

  26. q-bio.MN 2025-06-10 reviewed
    GPUs speed up logic model searches for gene networks up to 19 times

    GPU-accelerated Modeling of Biological Regulatory Networks

    Joyce Reimer +6

  27. cs.SE 2025-06-02 reviewed
    LLMs optimize large Java apps better than compilers

    SysLLMatic: Large Language Models are Software System Optimizers

    Huiyun Peng +9

  28. eess.SP 2025-05-19 reviewed
    Seizure detectors reach only 32% F1 on unseen patients

    Quantifying the Generalization Gap in Seizure Detection: A Large-Scale Empirical Benchmark via the SzCORE Challenge

    Jonathan Dan +3

  29. math.PR 2025-05-13 reviewed
    Bounded flexibility forces geometric queue decay in growing networks

    Geometric lower bounds for the steady-state occupancy of processing networks with limited connectivity

    Diego Goldsztajn +1

  30. cs.DC 2025-05-05 reviewed
    Two-stage dispatching improves mean response times

    "Two-Stagification": Job Dispatching in Large-Scale Clusters via a Two-Stage Architecture

    Mert Yildiz +2

  31. quant-ph 2025-04-12 reviewed
    Grover search recovers Boolean logic in 5-protein brain network

    Identifying Protein Co-regulatory Network Logic by Solving B-SAT Problems through Gate-based Quantum Computing

    Aspen Erlandsson Brisebois +5

  32. cs.PF 2025-03-22 reviewed
    Quantization and pruning lower LLM energy use while boosting performance

    Energy-Aware LLMs: A step towards sustainable AI for downstream applications

    Nguyen Phuc Tran +2

  33. cs.LG 2025-02-14 reviewed
    LLMs match PyTorch kernels in under 20% of ML cases

    KernelBench: Can LLMs Write Efficient GPU Kernels?

    Anne Ouyang +6

  34. cs.PL 2025-01-15 reviewed
    GP 2 programs match imperative speeds for connectivity and shortest paths

    Rule-Based Graph Programs Matching the Time Complexity of Imperative Algorithms

    Ziad Ismaili Alaoui +1

  35. cs.DC 2024-12-17 reviewed
    TrainMover resumes ML jobs in 20 seconds after interruptions

    TrainMover: An Interruption-Resilient Runtime for ML Training

    ChonLam Lao +15

  36. cs.LG 2024-12-07 reviewed
    FlexAttention turns PyTorch code into fast attention kernels

    Flex Attention: A Programming Model for Generating Optimized Attention Kernels

    Juechu Dong +4

  37. cs.LG 2024-11-27 reviewed
    Hybrid tuner speeds CVD model training while raising accuracy

    Time-Efficient Hybrid Hyperparameter Tuning Approach for Cardiovascular Disease Classification

    Abhay Kumar Pathak +2

  38. cs.LG 2024-10-26 reviewed
    Interleaved CPU-GPU optimizer updates cut LLM training time by 2.5×

    Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

    Avinash Maurya +4

  39. cs.DC 2024-09-15 reviewed
    Survey categorizes GPU communication options that limit CPU role

    The Landscape of GPU-Centric Communication

    Didem Unat +6

  40. cs.DC 2024-06-12 reviewed
    ProTrain automates memory tuning to lift LLM training speed 1.43-2.71x

    ProTrain: Efficient LLM Training via Memory-Aware Techniques

    Hanmei Yang +6

  41. quant-ph 2024-05-28 reviewed
    Quantum switch blocking depends only on mean attempt and calibration times

    An on-demand resource allocation algorithm for a quantum network hub and its performance analysis

    Scarlett Gauthier +2

  42. cs.CV 2024-05-23 reviewed
    Patch pipeline reuses stale maps to speed DiT inference

    PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference

    Jiarui Fang +4

  43. cs.DC 2024-05-21 reviewed
    Distributed MPK with RACE blocking achieves 4x speedup on 832 cores

    Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels

    Dane C. Lacey +5

  44. cs.CL 2024-02-05 reviewed
    2-bit KV cache method cuts LLM peak memory 2.6 times

    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    Zirui Liu +7

  45. cs.NI 2023-10-22 reviewed
    Clusters and block sharing cut livestream bandwidth use

    Bandwidth Efficient Livestreaming in Mobile Wireless Networks: A Peer-to-Peer ACIDE Solution

    Andrei Negulescu +1

  46. cs.DC 2023-08-02 reviewed
    VMT19937 vectorizes Mersenne Twister for linear SIMD gains

    VMT19937: A SIMD-Friendly Pseudo Random Number Generator based on Mersenne Twister 19937

    Fabio Cannizzo

  47. cs.LG 2023-05-18 reviewed
    16-bit training matches 32-bit accuracy at higher speed

    Revisiting 16-bit Neural Network Training: A Practical Approach for Resource-Limited Learning

    Juyoung Yun +4

  48. cs.DC 2023-04-21 reviewed
    FSDP matches DDP speed for much larger models with near-linear TFLOPS scaling

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao +17

  49. cs.NE 2020-11-21 reviewed
    Random forest repairs offspring to speed multi-objective evolution

    Enhanced Innovized Repair Operator for Evolutionary Multi- and Many-objective Optimization

    Sukrit Mittal +3

  50. cs.DB 2019-08-13 reviewed
    OLAP engines waste 25-82% of CPU cycles on stalls

    Micro-architectural Analysis of OLAP: Limitations and Opportunities

    Utku Sirin +1