archive
Every paper Pith has read.
1164 papers in cs.DC · page 10
-
GICC cuts GPU coordination latency up to 229 times on Slingshot
GICC: A High-Performance Runtime for GPU-Initiated Communication and Coordination in Modern HPC Systems
-
Fused GPU kernel speeds epidemic sims 217x over CPU
FlashSpread: IO-Aware GPU Simulation of Non-Markovian Epidemic Dynamics via Kernel Fusion
-
Gradient sharding removes the serverless memory ceiling for federated learning
Shard the Gradient, Scale the Model: Serverless Federated Aggregation via Gradient Partitioning
-
N-gram promotion ensembles match neural accuracy at lower cost
Promoting Simple Agents: Ensemble Methods for Event-Log Prediction
-
Restructured big-integer ops deliver 4x SIMD speedups in libraries
Leveraging SIMD for Accelerating Large-number Arithmetic
-
UBRI analysis abstracts blockchain research into deployable design themes
Systematizing Blockchain Research Themes and Design Patterns: Insights from the University Blockchain Research Initiative (UBRI)
-
Risk estimates and hysteresis cut edge server switches 88%
Risk-Aware and Stable Edge Server Selection Under Network Latency SLOs
-
Delta Lake loads fastest, Iceberg saves most space
Research on the efficiency of data loading and storage in Data Lakehouse architectures for the formation of analytical data systems
-
LLM planner cuts latency 20% in WiFi offload networks
A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks
-
One-layer lookahead decouples graph build from update in Vision GNNs
GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA
-
Data pipeline changes cut deep learning training from 22 hours to 3 hours
Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale
-
Dedicated L2 stack needed for AI agent economies
AGNT2: Autonomous Agent Economies on Interaction-Optimized Layer 2 Infrastructure
-
LLM turns natural language into OpenSearch queries under human control
A Cloud-Native Architecture for Human-in-Control LLM-Assisted OpenSearch in Investigative Settings
-
Runtime dispatcher shares Versal AI Engine tiles among mixed-criticality tasks
Enabling Mixed criticality applications for the Versal AI-Engines
-
FPGA level-wise batch search speeds B+ tree lookups 4.9x
Efficient Batch Search Algorithm for B+ Tree Index Structures with Level-Wise Traversal on FPGAs
-
GPU runs 20,000 GWAS phenotypes in 20 minutes
TorchGWAS: GPU-accelerated GWAS for thousands of quantitative phenotypes
-
BloomBee raises decentralized LLM throughput up to 1.76x
Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization
-
Spectral check spots clean clients to fix noisy labels
FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels
-
Exact attention on billion-token sequences runs on single GPU
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
-
Quantum preconditioning prevents exponential failures in high-dim search
Distributed Quantum-Enhanced Optimization: A Topographical Preconditioning Approach for High-Dimensional Search
-
Quantum framework solves 500-variable higher-order problems in 170 seconds
Distributed Quantum Optimization for Large-Scale Higher-Order Problems with Dense Interactions
-
Fine-grained phase management boosts LLM serving throughput by 53%
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
-
RISC-V SG2044 doubles single-core performance in HPC testbed
Monte Cimone v3: Where RISC-V Stands in High-Performance Computing
-
CoVer verifier extended to Fortran with better efficiency than MUST
Extending Contract Verification for Parallel Programming Models to Fortran
-
Mobile app boosts emergency response with phone sensing and cloud
e112: A Context-Aware Mobile Emergency Communication Platform Leveraging Smartphone Sensing and Cloud Services
-
Joint optimizations cut multi-agent edge latency by 62 percent at 200 agents
A Delta-Aware Orchestration Framework for Scalable Multi-Agent Edge Computing
-
Nine quantum-HPC stacks share design patterns for unifying layers
Quantum-HPC Software Stacks and the openQSE Reference Architecture: A Survey
-
Watchdog turns Lambda kills into clean Spark table rollbacks
Characterizing and Fixing Silent Data Loss in Spark-on-AWS-Lambda with Open Table Formats
-
Four dimensions organize blockchain-federated learning systems
Federated Learning over Blockchain-Enabled Cloud Infrastructure
-
Slicing traces GPU stall roots for 1.8x speedups across vendors
LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing
-
Local cost signal lifts satellite goodput 20% and throughput 31%
Equinox: Decentralized Scheduling for Hardware-Aware Orbital Intelligence
-
Predictive autoscaler holds Node.js latency at 26 ms in ramps
Predictive Autoscaling for Node.js on Kubernetes: Lower Latency, Right-Sized Capacity
-
Copy engine enables free intra-node MoE load balancing
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
-
35 watchers prevent double-spends without global consensus
Intercloud: Eventual Consistency for Decentralised Economies via Chilling-Effect Consensus
-
ReaLB speeds multimodal MoE inference 1.29x by runtime precision adjustment
ReaLB: Real-Time Load Balancing for Multimodal MoE Inference
-
ReaLB speeds multimodal MoE inference 1.1-1.32x via per-rank precision cuts
ReaLB: Real-Time Load Balancing for Multimodal MoE Inference
-
CXL single-copy cache yields 5.6x geo-mean speedup
DPC: A Distributed Page Cache over CXL
-
Self-stabilizing algorithms minimize IP risks hierarchically
Minimizing Intellectual Property Risks via Self-Stabilizing Algorithms
-
Satellite FL routing is tractable or NP-hard by case
Optimal Routing for Federated Learning over Dynamic Satellite Networks: Tractable or Not?
-
CROWDio cuts execution time by 57% with adaptive scheduling on phones
CROWDio: A Practical Mobile Crowd Computing Framework with Developer-Oriented Design, Adaptive Scheduling, and Fault Resilience
-
Matrix co-design gives PIC particle phase 10.9x speedup
POLAR-PIC: A Holistic Framework for Matrixized PIC with Co-Designed Compute, Layout, and Communication
-
Tensor cores accelerate PIC mass matrix assembly up to 3x
Mass Matrix Assembly on Tensor Cores for Implicit Particle-In-Cell Methods
-
Uniform trees let FMM scale to 32 billion points on 512 nodes
A Simple Communication Scheme for Distributed Fast Multipole Methods
-
MegaKernels fuse MoE communication and computation for up to 38 percent speedup
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
-
Multi-party protocol aligns data privately without intersection leaks
Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers
-
Multi-party protocol aligns data without revealing shared records
Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers
-
Simulator lets agents control and adapt fog models at runtime
YAIFS: Yet (not) Another Intelligent Fog Simulator: A Framework for Agent-Driven Computing Continuum Modeling & Simulation
-
Heuristic partitioning cuts multi-tenant query P95 latency from 61s to 2s
Heuristic Search Space Partitioning for Low-Latency Multi-Tenant Cloud Queries
-
CHRONOS cuts IoT federated learning latency by 74 percent
CHRONOS: A Hardware-Assisted Phase-Decoupled Framework for Secure Federated Learning in IoT
-
HyperLogLog skips exact counts for faster GPU SpGEMM
Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU