archive
Every paper Pith has read. Search by title, abstract, or pith.
1164 papers in cs.DC · page 3
-
Coding plus sketching cuts distributed ML runtime
Approximate Distributed Coded Computing: Polynomial Codes and Randomized Sketching
-
HexAGenT cuts required SLO scale by 20% for agentic LLM workflows
HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
-
BF16 tensor cores outperform native FP32 SGEMM in speed and accuracy
Exceeding the Numerical and Performance Characteristics of IEEE-754 SGEMM with BFloat16 Tensor Cores on GPUs for Scientific Computing
-
Datacenters should plan for deployable AI power capacity over time
Designing Datacenter Power Delivery Hierarchies for the AI Era
-
Runtime system makes second-order optimizers work for 7B LLMs
Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training
-
GPU system samples causal walks on billion-edge streams in real time
A GPU Accelerated Temporal Window-Based Random Walk Sampler
-
Manufacturing ransomware recovery goes beyond backups
From Backup Restoration to Minimum Viable Factory Recovery: A Systematization of Ransomware Recovery in Manufacturing Systems
-
Diffusion model poisons FL data more stealthily than GANs
PCDM: A Diffusion-Based Data Poisoning Attack Against Federated Learning Systems
-
One GPU runs DG ocean model at speed of 1500 CPU cores
An efficient multi-GPU implementation for the Discontinuous Galerkin ocean model SLIM
-
Parallel code speeds star-M SVD compression for big datasets
High-Performance Star-M SVD for Big Data Compression
-
SNN latency increases 47 times at half a CPU core
Evaluating Container Orchestration for Neuromorphic Workloads in Virtual Edge Environments
-
Online delay tracker holds container SLA violations under 5%
ADAPT: A Self-Calibrating Proactive Autoscaler for Container Orchestration
-
Deep RL scheduler nears optimal for edge serverless containers
Scale: Deep Reinforcement Learning for Container Scheduling in Serverless Edge Computing
-
ParamSpMM adapts SpMM for GNNs to gain 1.92x average speedup
ParamSpMM: Adaptive and Efficient Sparse Matrix-Matrix Multiplication on GPUs for GNNs
-
Emulate 8192-GPU training on a few GPUs
A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM
-
One client inflates its attribution score in distributed ML training
On the Fragility of Data Attribution When Learning Is Distributed
-
OSDF joins U.S
Open Science Data Federation -- operation and monitoring
-
OSDF integration gives BBSO data reliable global access
Using the Open Science Data Federation for data distribution: Big Bear Solar Observatory use case
-
3D satellite clusters scale nodes with cube of radius ratio
Designing Dense Satellite Clusters for Distributed Space-based Datacenters
-
APWA scales agent workflows by parallelizing non-communicating subproblems
APWA: A Distributed Architecture for Parallelizable Agentic Workflows
-
Cache reorganization lifts GPU speedups for 28-qubit simulations on laptops
Accelerating State-Vector Quantum Simulation on Integrated GPUs via Cache Locality Optimization: A Cross-Architecture Evaluation
-
Mat2Boundary turns boundary conditions into SpMV for PDE solvers
Mat2Boundary: Treating User-Defined Boundary Condition as SpMV for Distributed PDE Solvers on Block-Structured Grids
-
Wi-Fi logs build hierarchical mobility models with lower complexity
Analysis of wireless network access logs for a hierarchical characterization of user mobility
-
Unified GPU solver gives exact gradients for stiff heterogeneous soft bodies
DiffPhD: A Unified Differentiable Solver for Projective Heterogeneous Materials in Elastodynamics with Contact-Rich GPU-Acceleration
-
Exploration fails above ceil(k/(n-2))-1 deactivations per round
Semi-Synchronous Exploration in Dynamic Graphs
-
Distributed Sumcheck gives statistical zero-knowledge for graph problems
Distributed Statistical Zero-Knowledge Proofs via Sumcheck
-
EMA cuts model adaptation costs 15-42% in shifting environments
EMA: Efficient Model Adaptation for Learning-based Systems
-
MinT manages million LoRA policies over shared 1T models
MinT: Managed Infrastructure for Training and Serving Millions of LLMs
-
Federated fine-tuning matches centralized LLM training on private data
Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning
-
Adaptive KV compression speeds disaggregated LLM serving up to 9x
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
-
Client committee speeds secure aggregation 4.6x
DisAgg: Distributed Aggregators for Efficient Secure Aggregation in Federated Learning
-
Multi-agent RL cuts LLM carbon by 33% and water by 43%
MARLIN: Multi-Agent Game-Theoretic Reinforcement Learning for Sustainable LLM Inference in Cloud Datacenters
-
Hybrid method cuts graph scheduling violations 45%
Sustainable Graph Analytics Workload Scheduling with Evolutionary Reinforcement Learning in Edge-Cloud Systems
-
Router sends 36% of VLM queries to edge
INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference
-
Rescaled stepsizes remove bias in async SGD
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
-
TurboGR trains 0.2B-param generative recommenders at 54.71% MFU
TurboGR: An Accelerated Training System for Large-Scale Generative Recommendation
-
FPGA lock agents boost OLTP throughput 51x over CPUs
FPGA-Accelerated Lock Management and Transaction Processing: Architecture, Optimization, and Design Space Exploration
-
One rule unifies voting, proposals and constitutional amendment in metric spaces
Constitutional Governance in Metric Spaces
-
Metric-space protocol lets communities self-amend constitutions in polynomial time
Constitutional Governance in Metric Spaces
-
Transformer preconditioner speeds stiff physics 28x
Hierarchical Transformer Preconditioning for Interactive Physics Simulation
-
Hierarchical transformer preconditioner reaches 21 fps on stiff Poisson systems
Hierarchical Transformer Preconditioning for Interactive Physics Simulation
-
Drone swarms adapt composition to deliver lower latency connectivity
Swarm Network-as-a-Service (SNaaS)
-
Pipeline overlap speeds cloud-edge LLM inference up to 2.16x
PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
-
Pipeline speeds cloud-edge LLM inference 1.16-2.16x
PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
-
Heterogeneous solvers up to 32% faster than GPU-only for big matrices
Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL
-
Dynamic pricing stabilizes mempool volume at target capacity
Dynamic Transaction Scheduling and Pricing in the Ethereum Mempool
-
LCL complexity on trees shifts without exact n knowledge
The Distributed Complexity Landscape on Trees Depends on the Knowledge About the Network Size
-
Overdecomposition supported efficiently on mixed GPGPU clusters
Efficient and Portable Support for Overdecomposition on Distributed Memory GPGPU Platforms
-
Parallel training lets RNNs learn from sequences over 10,000 steps
Parallel-in-Time Training of Recurrent Neural Networks for Dynamical Systems Reconstruction
-
Adaptive eviction cuts LLM prefill time 1.4x to 2.7x
Not All Tokens Are Worth Caching: Learning Semantic-Aware Eviction for LLM Prefix Caches