archive
Every paper Pith has read. Search by title, abstract, or pith.
1164 papers in cs.DC · page 11
-
Post-correction keeps particle clusters intact after lossy compression
Preserving Clusters in Error-Bounded Lossy Compression of Particle Data
-
CPU-GPU hybrid speeds long-context LLM inference 1.41x-3.2x
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
-
Resilient MPI key-value store hits limits with current ULFM and RMA
User Experiences with MPI RMA and ULFM in a Resilient Key-Value Store Implementation
-
Digital twin tests BFT systems against timing attacks
Trust, but Verify: ByzTwin-Range, a Digital Twin Cyber-Range for Byzantine Faults
-
Memory quantile models cut cluster under-allocations from 4.17% to 2.89%
Optimizing Memory Allocation in Distributed Clusters with Predictive Modeling
-
Tighter analysis cuts leader election messages to O(n log n)
Toward Optimality: A Tighter Analysis of Message Complexity for Leader Election in Diameter-Two Networks
-
Fused CUDA kernel speeds 3D SIMP optimization 4.6-7.3x
Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels
-
One frozen LLM runs many tasks with 4-6x better speed and memory on phones
Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM
-
Persistent GPU kernel yields 15x speedup for tiny tensor operations
GPUOS: A GPU Operating System Primitive for Transparent Operation Fusion
-
Async GPU kernels speed up sparse matrix multiplies by up to 6x
AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures
-
DeInfer speeds parallel inference of decomposed LLMs
DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models
-
EcoSched cuts multi-GPU energy use by up to 14.8% via per-job GPU counts
Towards Energy Efficient Co-Scheduling in HPC
-
EcoShift gains 6% performance in power-limited CPU-GPU clusters
EcoShift: Performance-Aware Power Management for Power-Constrained Heterogeneous Systems
-
Crash-aware tuner spends fixed budget more consistently on LLM serving
SLO-Guard: Crash-Aware, Budget-Consistent Autotuning for SLO-Constrained LLM Serving
-
Multi-tier KV cache cuts LLM inference costs by 47%
Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
-
Compiler IR enables hardware-free design exploration for distributed ML
Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML
-
Active inference learns edge AI routing without offline training
Active Inference-Based Adaptive Routing for Heterogeneous Edge AI Services
-
Hive reuses logits to speed up multi-agent LLM re-sampling 1.11x-1.76x
Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling
-
Cloud-native systems required to scale large language models
Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
-
Lossless compression speeds GPU communication up to 47%
UCCL-Zip: Lossless Compression Supercharged GPU Communication
-
Proxy borrows OS scheduling to stop LLM agents from crashing APIs
HiveMind: OS-Inspired Scheduling for Concurrent LLM Agent Workloads
-
Tensor fingerprinting cuts AI model hub storage
TStore: Rethinking AI Model Hub with Tensor-Centric Compression
-
Standard Podman with added layers matches specialized HPC containers
Sarus Suite: Cloud-native Containers for HPC
-
Pipeline predicts airspace sectors and lets aircraft coordinate entries
Predictive Sectorization and Bayesian Optimized Consensus for Admission Control in Autonomous Airspace Operations
-
Quick intuition tops slow reasoning for edge AI in DAOs
The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus
-
Three axioms force AMM orbits to weighted geometric means
From Swap Axioms to Weighted Geometric Means: A Characterization of AMMs
-
Hierarchical sparsity speeds LLM attention 4.57x
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
-
Flipped indexing delivers 6.5x lower GPU query latency with dynamic updates
FliX: Flipped-Indexing for Scalable GPU Queries and Updates
-
Adaptive framework trains graph transformers 6x faster on 8 GPUs
Scalable and Adaptive Parallel Training of Graph Transformer on Large Graphs
-
Agent context tracking cuts power use 27% in AI serving
KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
-
GreenPeas is a C++/CUDA tool that compiles quantum error-correction decoding hypergraphs…
GreenPeas: Unlocking Adaptive Quantum Error Correction with Just-in-Time Decoding Hypergraphs
-
Precision modeling cuts training time prediction error to 9.8%
Training Time Prediction for Mixed Precision-based Distributed Training
-
Any amoebot shape breaks into O(holes) convex pieces in log time
Logarithmic-Time Geodesically Convex Decomposition in Programmable Matter
-
Compositional operators let verified swarms be reused safely
Compositional Design, Implementation, and Verification of Swarms (Technical Report)
-
Availability weighting fixes unfair sampling in federated learning
Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure
-
Dynamic grouping plus TEE cuts blockchain consensus messages
T-RBFT: A Scalable and Efficient Byzantine Consensus Based on Trusted Execution Environment for Consortium Blockchain
-
SYCL implementations vary in memory and kernel behavior
Evaluating SYCL as a Unified Programming Model for Heterogeneous Systems
-
Automated pipeline adds continuous benchmarking to HPC
Continuous benchmarking: Keeping pace with an evolving ecosystem of models and technologies
-
Second-gen serverless drops warm latency from 40 ms to 10 ms
New Kids: An Architecture and Performance Investigation of Second-Generation Serverless Platforms
-
Exascale system trains billion-parameter interatomic potentials in hours
Breaking the Training Barrier of Billion-Parameter Universal Machine Learning Interatomic Potentials
-
On-orbit aggregation reduces satellite federated learning energy by 6x
CroSatFL: Energy-Efficient Federated Learning with Cross-Aggregation for Satellite Edge Computing
-
GPU framework speeds NNQS configuration selection 2.32x on 64 GPUs
A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States
-
Sequential memory proof caps ASIC speed at DRAM latency
PoSME: Proof of Sequential Memory Execution via Latency-Bound Pointer Chasing with Causal Hash Binding
-
Accuracy drives speed in long-context LLM serving
Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving
-
Raft cluster inside blockchain nodes boosts scale and uptime
BlockRaFT: A Distributed Framework for Fault-Tolerant and Scalable Blockchain Nodes
-
DataCenterGym simulates multi-objective data center scheduling with physics-grounded models
DataCenterGym: A Physics-Grounded Simulator for Multi-Objective Data Center Scheduling
-
Mixing matrix design speeds SGP convergence in broadcast DFL
Optimizing Stochastic Gradient Push under Broadcast Communications
-
Wave dispatch lets HPC treat quantum fragments as tasks
Wave-Based Dispatch for Circuit Cutting in Hybrid HPC-Quantum Systems
-
Stable per-LLM time shares enable efficient GPU allocation for agentic workflows
Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines