archive
Every paper Pith has read. Search by title, abstract, or pith.
1164 papers in cs.DC · page 7
-
Tensor lifting maps OpenMP loops to AI Engines
Lifting to tensors when compiling scientific computing workloads for AI Engines
-
GPU layer speeds exascale trace analysis by up to 314x
Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics
-
GPU speeds exascale trace analysis by 314 times
Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics
-
MPC with limited machines needs higher local exponents for superlinear tasks
On Solving Problems of Substantially Super-linear Complexity in $N^{o(1)}$ Rounds in the MPC Model
-
Decoupled virtual cores lift LLM GPU throughput 24% on average
VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU
-
Pact maps choreographic protocols to formal games
Pact: A Choreographic Language for Agentic Ecosystems
-
AI Data Centers Break Grid Load Diversity
From Barrier to Bridge: The Case for AI Data Center/Power Grid Co-Design
-
Draft signals let SpecKV adapt gamma for 56% faster speculative decoding
SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection
-
Workflow templates speed sensor app prototyping for non-experts
From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications
-
AI reuses sensor workflow template to cut dev time to 1-2 days
(POSTER) From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications
-
Parallel HSOM cuts training time for intrusion detection
parHSOM: A novel parallel Hierarchical Self-Organizing Map implementation
-
Optimal configuration found for N-body sims on RISC-V accelerators
Assessing Performance and Porting Strategies for Gravitational $N$-Body Simulations on the RISC-V-Based Tenstorrent Wormhole™
-
Global optimization cuts distributed quantum costs most
Distributed Quantum Circuit Optimisation: Evaluating Global and Local encodings
-
Global optimization minimizes distributed quantum circuit costs
Distributed Quantum Circuit Optimisation: Evaluating Global and Local encodings
-
Bayesian optimization lifts Fabric TPS by 12%
Caliper-in-the-Loop: Black-Box Optimization for Hyperledger Fabric Performance Tuning
-
Sign-Muon reaches O(1/sqrt(T)) rate with 32x bandwidth cut
SignMuon: Communication-Efficient Distributed Muon Optimization
-
Partial layer training matches full federated accuracy with 82 percent fewer parameters
FedPLT: Scalable, Resource-Efficient, and Heterogeneity-Aware Federated Learning via Partial Layer Training
-
Kairos raises LLM SLO attainment by up to 34%
Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
-
Each CAV spots sensor faults using distributed observers
Distributed Observer-based Fault Detection over Intelligent Networked Multi-Vehicle Systems
-
Raspberry Pi clusters teach undergrads practical supercomputing
Leveraging Teaching on Demand: Approaching HPC to Undergrads
-
ZKP wrapper secures federated learning at 94 percent accuracy under attack
Privacy-Preserving Federated Learning: Integrating Zero-Knowledge Proofs in Scalable Distributed Architectures
-
IO500 logs reveal storage patterns missed by scores
A Treasure Trove of Performance: Analyzing the IO500 Submission Data
-
Pipeline offloading lifts offline LLM throughput up to 2.51x
PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers
-
One-shot diffusion and model fusion reach 33.4% mAP for private surveillance
Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
-
Privacy-preserving detection hits 33.4% mAP across cameras
Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
-
AAFLOW speeds agentic AI pipelines 4.64x via zero-copy data flows
AAFLOW: Scalable Patterns for Agentic AI Workflows
-
Smaller idle models speed large LLM serving by more than double
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
-
Tail models accelerate large LLM inference by 2.28x as remote drafters
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
-
Queue predictions speed federated learning by 20 percent on HPC
FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training
-
Queue predictions stabilize federated learning across HPC sites
FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training
-
Random circuits distort quantum partitioning benchmarks
On the Distortion of Partitioning Performance by Random Quantum Circuits
-
This paper finds that random quantum circuits used to test hypergraph partitioning for…
On the Distortion of Partitioning Performance by Random Quantum Circuits
-
Data movement and overlap govern energy use in multimodal training
Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips
-
Decentralized geohash sampling cuts geospatial stream latency
Decentralized Stratified Sampling for Low-Latency Approximate Geospatial Data Stream Processing in Edge-Cloud Architectures
-
Sparse value sampling speeds attention 1.5x at long contexts
Stochastic Sparse Attention for Memory-Bound Inference
-
Declarative framework cuts RAG tuning code changes by 95%
AutoRAGTuner: A Declarative Framework for Automatic Optimization of RAG Pipelines
-
nvPAX three-phase method reaches 98.92% power satisfaction
nvPAX: Constrained Optimization for Dynamic Power Allocation in Hierarchical and Multi-Tenant Systems
-
Joint time-structure model improves microservice fault detection
Joint Temporal-Structural Representation Learning for Distributed Fault Discrimination in Microservice Architectures
-
SplitZip speeds KV cache transfers by 1.32x with lossless GPU coding
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
-
SplitZip compresses KV caches at 613 GB/s for faster LLM transfers
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
-
CvxCluster uses a two-stage convex optimization approach to allocate resources across…
CvxCluster: Solving Large, Complex, Granular Resource Allocation Problems 100-1000x Faster
-
FPGA accelerator speeds SVD for PCA 22x over GPU
MANOJAVAM: A Scalable, Unified FPGA Accelerator for Matrix Multiplication and Singular Value Decomposition in Principal Component Analysis
-
Turing machine extension defines context-awareness
On defining and modeling context-awareness
-
VUDA delivers 85% higher throughput via CUDA-Vulkan spatial sharing
VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU
-
Complex analysis cuts cloud VM flapping by 94%
Intelligent Autonomous Orchestration for Distributed Cloud Resources using Complex-Stability Analysis
-
LLM serving needs math models over generic heuristics
Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics
-
DDD simulator runs same microservice code under multiple consistency models
A Domain-Driven Design Simulator for Business Logic-Rich Microservice Systems
-
Interference flips scheduler rankings in 28% of edge cases
ncsim: A Lightweight Simulator for Networked Edge Computing with Wireless Interference Modeling
-
FPTC codec reaches 3.6x compression for power signals
FPTC: A Fast Parallel Transform-based Codec for Efficient Asymmetric Signal Compression
-
Streaming GPU encoding matches batch speed with 12x less memory
SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data