archive
Every paper Pith has read.
1164 papers in cs.DC · page 16
-
Grassroots bonds turn community trust into interest-bearing liquidity
Grassroots Bonds as a Foundation for Market Liquidity
-
Token-budget routing cuts LLM GPU fleet 17-39%
Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
-
Engine runs 1,200-node graphs after one agent call
Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol
-
Cornserve boosts any-to-any model serving by 3.81x throughput
Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models
-
Multi-agent RL with graphs beats default Kubernetes scheduler
AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling
-
Batch size cuts energy in LLM workflows but only for certain tasks
Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows
-
NCCLbpf adds verified eBPF policies to NCCL plugins with 130 ns overhead
NCCLbpf: Verified, Composable Policy Execution for GPU Collective Communication
-
Scheduler cuts multi-job federated learning time by 8.3x
FedACT: Concurrent Federated Intelligence across Heterogeneous Data Sources
-
PrefixWall raises LLM cache reuse 70% over isolation
PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems
-
Ozaki-II adapted to FP8 cuts cost of double-precision matrix emulation
Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization
-
Cloud LLM creates and pushes adaptive code to edge devices
LLM-assisted Agentic Edge Intelligence Framework
-
Flash-KMeans runs exact GPU k-means 18x faster
Flash-KMeans: Fast and Memory-Efficient Exact K-Means
-
Sparse gating turns LLM batches into elastic super-trees for 5x speedup
ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios
-
FP64 tensor cores speed finite-element kernels 2x
Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores
-
ArcLight raises CPU LLM throughput by 46% via NUMA control
ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs
-
Graph engine runs LLM agents with zero hallucinations
GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration
-
Potential games and LLM weights optimize UAV networks
Agentic AI-Driven UAV Network Deployment: An LLM-Enhanced Exact Potential Game Approach
-
ML duration predictor trims supercomputer job waits by 11%
Duration-Informed Workload Scheduler
-
Simulator tests failure knobs for large AI clusters
AIReSim: A Discrete Event Simulator for Large-scale AI Cluster Reliability Modeling
-
OMA retains Kubernetes crash evidence past the evidence horizon
Operational Memory Architecture for Kubernetes: Preserving Causal Context Across the Evidence Horizon
-
DMM merges divergent models data-free using normalization stats
Domain-Adaptive Model Merging Across Disconnected Modes
-
Misaligned dimensions keep compressed LLMs from speeding up
Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs
-
Heron beats Eagle in protocol benchmarks for quantum advantage
Benchmarking Quantum Computers via Protocols, Comparing IBM's Heron vs IBM's Eagle
-
Benchmark suite derives efficiency rules for compound AI
Benchmarking Compound AI Applications for Hardware-Software Co-Design
-
Planning system decides satellite vs ground tasks to fit data transfers
Constraint-Aware Execution Planning for Hybrid Space-Ground Compute Workloads
-
Beam search reduces quantum communication costs in circuit partitioning
Efficient Time-Aware Partitioning of Quantum Circuits for Distributed Quantum Computing
-
Unified objects automate IoT edge-cloud apps with 9 nines availability
EdgeWeaver: Accelerating IoT Application Development Across Edge-Cloud Continuum
-
Fixed encoding decodes data 9-213× faster than Protocol Buffers
Simplicity Scales
-
Gate fusion speeds quantum ML simulation by 20 times
Fast and memory-efficient classical simulation of quantum machine learning via forward and backward gate fusion
-
cuNRTO framework introduces two new CUDA-based architectures
cuNRTO: GPU-Accelerated Nonlinear Robust Trajectory Optimization
-
Filecoin reaches 2^{-30} finality in 30 rounds not 900
The Finality Calculator: Analyzing and Quantifying Filecoin's Finality Guarantees
-
SPARe keeps fault-tolerance overhead at 2-3x for 100k GPU LLM training
SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs
-
Perturbed model copies enable private LLM unlearning
MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models
-
Protocol outsources MSM with 300x faster verification
2G2T: Constant-Size, Statistically Sound MSM Outsourcing
-
Shared caching cuts edge LLM first-token time by 93%
Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching
-
CXL memory pool beats InfiniBand on GPU collectives
CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling
-
Flexible sharding lifts FSDP speed by up to 66% at 10k GPUs
veScale-FSDP: Flexible and High-Performance FSDP at Scale
-
GPU hybrid matches top solvers on large multi-depot routing
A GPU-Accelerated Hybrid Method for a Class of Multi-Depot Vehicle Routing Problems
-
Morton curve defined for pyramids in hybrid AMR
A Morton-Type Space-Filling Curve for Pyramid Subdivision and Hybrid Adaptive Mesh Refinement
-
Semantic dependencies resolve data conflicts locally via rebasing
Semantic Conflict Model for Collaborative Data Structures
-
DualScale cuts energy up to 48% in LLM decode phase
DualScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS
-
74% of workflows need no coordination for correctness
When Coordination Is Avoidable: A Monotonicity Analysis of Organizational Tasks
-
GPU memory estimators fail to generalize across hardware
GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations
-
SwapLess cuts Edge TPU latency up to 77% via CPU-TPU partitioning
Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs
-
Prebuilt hypertree removes locks from parallel node generation
Load Balanced Parallel Node Generation for Meshless Numerical Methods
-
Circuit cutting trains QNNs on distributed systems without losing accuracy
DistributedEstimator: Distributed Training of Quantum Neural Networks via Circuit Cutting
-
Cloud inference matches on-device for real-time braking
Cloud Is Closer Than It Appears: Revisiting the Tradeoffs of Distributed Real-Time Inference
-
Baremetal runtime lifts AI efficiency 9x on 10x fewer tiles
AEG: A Baremetal Framework for AI Acceleration via Direct Hardware Access in Heterogeneous Accelerators
-
Direct solvers scale via communication cuts and low-rank compression
Parallel Sparse and Data-Sparse Factorization-based Linear Solvers
-
Energy use shifts from linear to root function as core count rises
The Impact of Process Competition on Energy Consumption: Analysis and Modeling