archive
Every paper Pith has read.
1164 papers in cs.DC · page 14
-
GROMACS runs deep-potential MD at scale on multi-GPU systems
Making Room for AI: Multi-GPU Molecular Dynamics with Deep Potentials in GROMACS
-
Disaggregating LoRA triples request rate under latency limits
InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models
-
LLM serving policies rewrite themselves online for 34% gains
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
-
One LLM call compiles web tasks into JSON that runs forever at fixed low cost
Agentic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation
-
Client scheduler hits 100% LLM deadlines at 4.2 requests per second
Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale
-
Nested pipelining gives 3x faster training on 1,500+ accelerators
NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining
-
Output-set tasks solvable under crashes iff inclusion graph connects
On the Decidability of Distributed Tasks with Output Sets under Asynchrony and Any Number of Crashes
-
Priorities and clocks extend CCS to define coherence
Determinacy with Priorities up to Clocks
-
Multi-robot service prototype runs on Aggregate Programming
Exploiting Aggregate Programming in a Multi-Robot Service Prototype
-
Effpi adds branching for external choice and timeouts
Branching Out: Existential External Choice in Effpi
-
Layer-by-layer freezing fits private LLM tuning on edge devices
Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge
-
Nexus cuts serverless CPU use 44% by offloading I/O from VMs
Nexus: Transparent I/O Offloading for High-Density Serverless Computing
-
SwarmIO emulates 40M IOPS SSDs for GPUs with 300x speedup
SwarmIO: Towards 100 Million IOPS SSD Emulation for Next-generation GPU-centric Storage Systems
-
Foundry cuts LLM cold-start time from minutes to seconds
Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start
-
SpMM requires structure-specific roofline models for accurate bounds
Sparsity-Aware Roofline Models for Sparse Matrix-Matrix Multiplication
-
DynLP updates graph labels 13x faster on average by limiting propagation to changed sub-
DynLP: Parallel Dynamic Batch Update for Label Propagation in Semi-Supervised Learning
-
Canceled spot requests yield availability signals at near-zero cost
Ding-Dong Ditch: Peeking Into Spot Instance Availability
-
Adaptive sync raises IoT ledger recovery after partitions
Contextual Chain: Single-State Ledger Design for Mobile/IoT Networks with Frequent Partitions
-
Copy-on-write KV cache triples multi-LoRA agent throughput
ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
-
Power reconstruction shows 79% energy cut from mixed precision on Frontier
Fine-Grained Power and Energy Attribution on AMD GPU/APU-Based Exascale Nodes
-
Codec signals triple VLM streaming throughput
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
-
GTaP runtime runs fork-join tasks on GPUs faster than CPU OpenMP
GTaP: A GPU-Resident Fork-Join Task-Parallel Runtime with a Pragma-Based Interface
-
Morton plane trees speed GPU neighbor search by over 10x
JZ-Tree: GPU friendly neighbour search and friends-of-friends with dual tree walks in JAX plus CUDA
-
Linearizable registers force extensive message chains
Communication Requirements for Linearizable Registers
-
Go runtime outperforms Python and Node.js for OpenFaaS on Kubernetes
Optimizing OpenFaaS on Kubernetes: Comparative Analysis of Language Runtimes and Cluster Distributions
-
ALTO speeds LoRA tuning 13.8x via early stops and shared scheduling
ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
-
Persistent Alltoallv cuts MPI runtime up to 44% for large messages
Analyzing Persistent Alltoallv RMA Implementations for High-Performance MPI Communication
-
Single GPU trains 120B-parameter models at full precision
MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
-
Decentralized relayers route cross-chain messages without hubs
Towards Policy-Enabled Multi-Hop Routing for Cross-Chain Message Delivery
-
Tool explores 250 trillion 3D AI accelerator designs 100,000 times faster
DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators
-
RegGuard cuts optimistic rollup settlement failures by over 90%
RegGuard: Legitimacy and Fairness Enforcement for Optimistic Rollups
-
Execution-idle wastes 10.7% of GPU cluster energy
The Energy Cost of Execution-Idle in GPU Clusters
-
Sampling parallelism scales Bayesian training linearly across GPUs
Sampling Parallelism for Fast and Efficient Bayesian Learning
-
Splitting LLMs across LEO satellites cuts delay by 42%
Communication-Efficient Collaborative LLM Inference over LEO Satellite Networks
-
Zero downtime achieved in edge energy service migration
Edge-Oriented Orchestration of Energy Services Using Graph-Driven Swarm Intelligence
-
Single-agent exploration in dynamic graphs needs Omega(m) window
Tight Bounds on Window Size and Time for Single-Agent Graph Exploration under T-Interval Connectivity
-
Layout propagation removes redundant packing in GEMM sequences
LP-GEMM: Integrating Layout Propagation into GEMM Operations
-
Slurm tool simplifies submissions and defers jobs to cut energy use
NBI-Slurm: Simplified submission of Slurm jobs with energy saving mode
-
AI peer review platform detects fake citations over 85% of the time
OpenCLAW-P2P v7.0-P2PCLAW: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review v7.0 -- Mathematical Corrections & Ecosystem Developments Edition
-
AI agents run peer review with 85% fabricated-citation detection
OpenCLAW-P2P v7.0-P2PCLAW: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review v7.0 -- Mathematical Corrections & Ecosystem Developments Edition
-
Satellite emulators tested against real data show clear gaps
An experimental evaluation of satellite constellation emulators
-
Co-serving system raises SLO attainment for mixed diffusion workloads by up to 44%
GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
-
Ledger state serves as shared environment for agent coordination
Ledger-State Stigmergy: A Formal Framework for Indirect Coordination Grounded in Distributed Ledger State
-
Lemonshark cuts async BFT latency up to 65% with early finality
Lemonshark: Asynchronous DAG-BFT With Early Finality
-
SecureAFL detects bad updates and estimates missing ones in async FL
SecureAFL: Secure Asynchronous Federated Learning
-
GPU simulator speeds quantum circuits up to 146x over CPU
GPU-Accelerated Quantum Simulation: Empirical Backend Selection, Gate Fusion, and Adaptive Precision
-
Four-layer middleware adapts hybrid quantum-HPC resources at runtime
Hybrid Quantum-HPC Middleware Systems for Adaptive Resource, Workload and Task Management
-
Hybrid parallelism scales encrypted Transformers across multiple GPUs
AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems
-
Granger causality quantifies noisy-neighbor slowdowns of up to 67%
Causal Inference for Quantifying Noisy Neighbor Effects in Multi-Tenant Cloud Environments
-
Collective KV sharing runs 2.7x more agents in multi-agent LLM serving
TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing