archive
Every paper Pith has read. Search by title, abstract, or pith.
1164 papers in cs.DC · page 4
-
Decoupled compression speeds GPU collectives up to 9.65x
NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
-
Link failures cap LEO capacity scalability at O(1/n)
Capacity Scalability of LEO Constellations With Dynamic Link Failures
-
Per-head adaptive blocks improve sparse attention accuracy by 5.43%
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
-
With node failures, wireless capacity and delay scale with the square root of the number of reliable nodes
On Capacity and Delay of Wireless Networks with Node Failures
-
Power capping leaves LLM decode energy untouched
The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
-
DynaTrain switches 70B model parallelism in under 2 seconds
DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training
-
Overlays trade reliability against overhead for AI agent discovery
Trade-offs in Decentralized Agentic AI Discovery Across the Compute Continuum
-
LLM inference should be measured in joules per token at scale
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
-
GraphFlash hits 127x speedup in serverless graph processing
GraphFlash: Enabling Fast and Elastic Graph Processing on Serverless Infrastructure
-
NAVIS speeds on-SSD vector inserts up to 2.74x
NAVIS: Concurrent Search and Update with Low Position-Seeking Overhead in On-SSD Graph-Based Vector Search
-
Off-chain twins let DeFi agents simulate trades without waiting for blocks
State Twins: An Off-Chain Substrate for Agentic Reasoning over Decentralized Finance Protocols
-
Storage offloading breaks memory wall for full-graph GNN training
GriNNder: Breaking the Memory Capacity Wall in Full-Graph GNN Training with Storage Offloading
-
Task runtime dispatches QIR programs to multiple quantum processors
Classic and Quantum Task-Based Intelligent Runtime for QIRs Running on Multiple QPUs
-
Kairos cuts physical AI task latency by 32-66 percent
Kairos: A Scalable Serving System for Physical AI
-
Chunked prefetching speeds DiT steps up to 1.28x with 49% less GPU memory
ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference
-
Chakra standardizes graph traces for AI workload benchmarking
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
-
Open traces standardize ML workload benchmarking
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
-
Directed graphs support Byzantine consensus only under specific connectivity
Byzantine Consensus in Directed Graphs with Message Authentication
-
ReCoVer keeps microbatch count fixed after GPU failures
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
-
ReCoVer preserves exact training trajectory after GPU losses
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
-
ShardTensor scales SciML to arbitrary spatial resolutions
ShardTensor: Domain Parallelism for Scientific Machine Learning
-
GCC 15 outperforms LLVM 21 on four of six RISC-V vector apps
Closer in the Gap: Towards Portable Performance on RISC-V Vector Processors
-
Edge micro-agent fixes failures safely with no destructive actions
An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum
-
Mutable membership lets MoE survive rank faults without restarts
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
-
This paper performs a structured bidirectional review of peer-reviewed studies on AI and…
SoK: A Systematic Bidirectional Literature Review of AI & DLT Convergence
-
Maestro cuts GPU use by 40% for compound LLM training
Accelerating Compound LLM Training Workloads with Maestro
-
BitTorrent warm-up hides FL update sources from local observers
Privacy-preserving Chunk Scheduling in a BitTorrent Implementation of Federated Learning
-
BitTorrent warm-up bounds FL source attribution to random guessing
Privacy-preserving Chunk Scheduling in a BitTorrent Implementation of Federated Learning
-
Hierarchical RL cuts edge latency 28 percent while saving energy
HiRL: Hierarchical Reinforcement Learning for Coordinated Resource Management in Heterogeneous Edge Computing
-
CPU radix sort reaches 6x bandwidth efficiency on large datasets
FractalSortCPU: Bandwidth-Efficient Compressed Radix Sort on CPU
-
CPU radix sort cuts bandwidth use by 6x on large data
FractalSortCPU: Bandwidth-Efficient Compressed Radix Sort on CPU
-
Small models reach strong edge-agent results when tools match the model
Agentic Performance at the Edge: Insights from Benchmarking
-
Amortized protocol makes async BRB messages linear in size
Amortized Asynchronous Byzantine Reliable Broadcast with Optimal Resilience
-
Amortized BRB reaches O(n|m|) messages in async networks
Amortized Asynchronous Byzantine Reliable Broadcast with Optimal Resilience
-
Autonomous objects resolve over half of scientific data conflicts
Autonomous FAIR Digital Objects: From Passive Assertions to Active Knowledge
-
Block-structured matrix multiplication speeds quantum chemistry by 10x
Accelerating Locality-Driven Integration in Quantum Chemistry with Block-Structured Matrix Multiplication
-
Block-structured matmul speeds DFT integrals up to 10x on GPUs
Accelerating Locality-Driven Integration in Quantum Chemistry with Block-Structured Matrix Multiplication
-
Graph reordering cuts memory pressure in GPU integral evaluation
FusionRCG: Orchestrating Recursive Computation Graphs across GPU Memory Hierarchies
-
Graph orchestration cuts GPU memory use for recursive integrals
FusionRCG: Orchestrating Recursive Computation Graphs across GPU Memory Hierarchies
-
Adaptive clipping lifts private federated LLM accuracy
DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models
-
Adaptive offloading lifts LLM throughput 65% at 47% lower energy
GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference
-
Vehicle screening plus federated segmentation cuts pothole data volume
Edge-Cloud Collaborative Pothole Detection via Onboard Event Screening and Federated Temporal Segmentation
-
Brokerless data plane delivers consistent batches for AI training
BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training
-
Object store delivers atomic batches for 64-GPU model training
BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training
-
Ordered agents let population protocols recognize unambiguous star-free languages
Population Protocols over Ordered Agents
-
Method optimizes server placement for vertical federated learning in dynamic networks
Optimizing Server Placement for Vertical Federated Learning in Dynamic Edge/Fog Networks
-
Cascade labels 8.6M orbital sequences for anomaly detection
Multi-Tier Labeling and Physics-Informed Learning for Orbital Anomaly Detection at Scale
-
Cloud trace decomposition predicts performance at 2% error
Cloud Performance Decomposition for Long-Term Performance Engineering: A Case Study
-
Neural preprocessor lifts H.264 perceptual scores 27 percent on UVG
Kelvin v1.0: A Neural Pre-Encoder for H.264: A standards-compliant learned preprocessor with -27.62% BD-VMAF on UVG