archive
Every paper Pith has read.
1164 papers in cs.DC · page 18
-
Bitcoin subnets stake in L1 BTC to cut tx cost by 23x
Bitcoin-IPC Whitepaper: Scaling Bitcoin with a Network of Proof-of-Stake Subnets
-
Graph-guided LLM fixes cloud incidents with 6x accuracy
PRAXIS: Integrating Program Analysis with Observability for Root-Cause Analysis
-
Modal logic axioms capture distributed protocols
Declarative distributed algorithms as axiomatic theories in three-valued modal logic over semitopologies
-
GPU data structure speeds hypergraph triad counting up to 473x
ESCHER: Efficient and Scalable Hypergraph Evolution Representation with Application to Triad Counting
-
SHIRO delivers 221x SpMM speedup on 128 GPUs via sparsity-aware transfers
SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication
-
ROS 2 real-time support detailed in survey of analyses and enhancements
A Survey of Real-Time Support, Analysis, and Advancements in ROS 2
-
Length groups cut LLM latency by up to 67%
CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
-
Uncertainty scores select compatible peers in decentralized learning
Evidential Trust-Aware Model Personalization in Decentralized Federated Learning for Wearable IoT
-
Tool spots bit-flip faults in LLMs for fast fixes
BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs
-
Federated platform runs full AI lifecycle in open science cloud
AI4EOSC: a Federated Cloud Platform for Artificial Intelligence in Scientific Research
-
Multipath routing lifts host-GPU bandwidth 4.6x
MultiPath Memory Access: Breaking Host-GPU Bandwidth Bottlenecks in LLM Services
-
TileLoom matches vendor libraries on spatial accelerators
TileLoom: Automatic Dataflow Planning for Tile-Based Languages on Spatial Dataflow Accelerators
-
Data movement bottlenecks sit outside the network core
Reexamining Paradigms of End-to-End Data Movement
-
Automated planner boosts any-to-any model goodput up to 6x
Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving
-
Framework links SKA imaging quality to energy and cost metrics
astroCAMP: A Community Benchmark and Co-Design Framework for Sustainable SKA-Scale Radio Imaging
-
Disaggregating attention and experts yields 4.7x MoE inference speedup
Janus: Disaggregating Attention and Experts for Scalable MoE Inference
-
HetRL raises LLM RL throughput up to 9x on mixed GPUs
HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments
-
Edge devices train large models at cloud speeds via GEMM asymmetry
On Harnessing Idle Compute at the Edge for Foundation Model Training
-
Async Kafka rules shift availability forecasts by 0.001 points or less
Evaluating Asynchronous Semantics in Trace-Discovered Resilience Models: A Case Study on the OpenTelemetry Demo
-
DRL model resolves conflicts in computing continuum resources
A Conflict-Aware Resource Management Framework for the Computing Continuum
-
Low-rank LLMs train up to 2.27x faster with new parallelism
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
-
Hybrid noise keeps ML models at 80 percent accuracy on private health data
Differential Privacy for Secure Machine Learning in Healthcare IoT-Cloud Systems
-
SynthPix streams synthetic PIV images on demand at accelerator speed
SynthPix: A lightspeed PIV image generator
-
Local build lets spiking networks scale to thousands of GPUs
Scalable Construction of Spiking Neural Networks using up to thousands of GPUs
-
Prewarming multiple LLMs cuts tail TTFT by 50x
WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving
-
BatANN passes full query state to scale vector search
Passing the Baton: High Throughput Distributed Disk-Based Vector Search with BatANN
-
SHARe-KAN cuts KAN head storage 9.3x at 2-point accuracy cost
SHARe-KAN: Post-Training Vector Quantization for Cache-Resident KAN Inference
-
Bi-level search finds ML shifts that quadruple VM allocation loss
A Performance Analyzer for a Public Cloud's ML-Augmented VM Allocator
-
Vector LUT speeds parallel ultra-low-bit LLM inference up to 4.2×
Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
-
Gradual model growth lets limited clients contribute in federated learning
Breaking the Capacity Bottleneck in Model-Heterogeneous Federated Learning via Gradual Model Restoration
-
Tokenized context speeds edge LLM responses by up to 14%
DisCEdge: Distributed Context Management for Large Language Models at the Edge
-
Diagonal scaling cuts database p95 latency by up to 40%
Diagonal Scaling: A Multi-Dimensional Resource Model and Optimization Framework for Distributed Databases
-
Voxel traits let Spira skip kernel-map overhead for 3x faster point-cloud convolution
Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks
-
LLM agents give PyTorch 2.88x speedup on H100 GPUs
Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems
-
Joint data-compute tuning speeds ML kernels on PIM up to 13x
DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures
-
Adaptive reputation defends federated learning from malicious clients
FLARE: Adaptive Multi-Dimensional Reputation for Robust Client Reliability in Federated Learning
-
Seer speeds LLM RL rollouts up to 2x by learning prompt output patterns
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
-
Regression attributes node energy to individual processes
Learning Process Energy Profiles from Node-Level Power Data
-
Satellite system reduces imagery latency from 51 to 21 minutes
EarthSight: A Distributed Framework for Low-Latency Satellite Intelligence
-
Thermal imbalance creates stragglers that slow multi-GPU nodes
Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs
-
NVRAR cuts multi-node LLM latency up to 3.6x
Understanding and Improving Communication Performance in Multi-node LLM Inference
-
SpaDA expresses parallel patterns in 14x fewer lines
SpaDA: A Spatial Dataflow Architecture Programming Language
-
Local models handle 88.7% of queries at higher intelligence per watt
Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
-
DMA offloads close 4.5x gap for latency-bound ML collectives
DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication
-
Domain decomposition scales Monte Carlo to 16384 cores
Scalable Domain-decomposed Monte Carlo Neutral Transport for Nuclear Fusion
-
Spectral map decides solvability of colorless tasks
Stone Duality Proofs for Colorless Distributed Computability Theorems
-
Simulator reaches 50-qubit universal quantum runs on exascale machine
Universal Quantum Computer Simulation of 50 Qubits on Europe's First Exascale Supercomputer Harnessing Its Heterogeneous CPU-GPU Architecture
-
Unified layout cuts LLM decode time on edge NPUs by up to 3x
UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM
-
Essential agents split global platforms into four classes
Characterising Global Platforms: Centralised, Decentralised, Federated, and Grassroots
-
SnapStream cuts KV cache memory by 4x for 128k LLM inference
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators