archive
Every paper Pith has read. Search by title, abstract, or pith.
1164 papers in cs.DC · page 8
-
Local AI agents stop early to cut energy waste 15-20%
AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices
-
Perseus fixes proxy RDMA serialization for 10x multi-node MoE speedup
Eliminating Hidden Serialization in Multi-Node Megakernel Communication
-
Emulator matches vLLM serving within 5 percent error
LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
-
Quantization halves memory for 8B–32B LLM training
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
-
Fixed-core approach yields 211x higher efficiency for edge GEMM
Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
-
Workflow scheduling cuts AI agent task time by 1.64x
SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
-
Ring subnets cut satellite LLM latency threefold
SpaceMoE: Realizing Distributed Mixture-of-Experts Inference over Space Networks
-
IPU scaling boosts CFD AI training throughput fivefold
Adaptation of AI-accelerated CFD Simulations to the IPU platform
-
OrbitBFT scales BFT consensus in LEO satellite networks
OrbitBFT: Enabling Scalable and Robust BFT Consensus in LEO Constellations
-
Architecture shapes convergence in hierarchical federated learning
Hierarchical Federated Learning for Networked AI: From Communication Saving to Architecture-Aware Design
-
Same model accuracy varies 12 points by endpoint
Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
-
Three streaming covariance algorithms agree exactly in exact arithmetic
$2B$ or Not $2B$: A Tale of Three Algorithms for Streaming: Covariance Estimation after Welford and Chan-Golub-LeVeque
-
Replication cuts partitioning costs by 17-65 percent on average
Replication in Graph Partitioning and Scheduling Problems
-
Untwinning removes specific network twins without full rebuild
Network Digital Untwinning: Towards Backward Optimization of Digital Twins
-
Dedicated engine separates models for easier architecture simulation
Akita: A High Usability Simulation Framework for Computer Architecture
-
Ring topology on FPGAs runs cortical circuit faster than real time
NeuroRing: Scaling Spiking Neural Networks via Multi-FPGA Bidirectional Ring Topologies and Stream-Dataflow Architectures
-
Fees linked to pool invariant k make CPMM trades path-independent
Characterizing Path-Independent Fees: A Route to Zero Impermanent Loss in CPMMs
-
Model derives DEX fee floor to keep LPs in gain zone
From Impermanent Loss to Sustainable Gain: Quantifying Profitability Zones for Liquidity Providers on DEX
-
CS-3 runs 90% sparse SpMM 100x faster than CPU
Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3
-
Santa Claus needs sqrt n rounds for any approximation
Distributed Santa Claus via Global Rounding
-
Most arbitrage chances come from one transaction each
The Origins of MEV: Systematic Attribution of Arbitrage Opportunity Creation at Scale
-
Affinity hints give 12% throughput boost on chiplet servers
Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale
-
Design-time traces yield low WCETs that cut waste 36% in mixed-criticality systems
AnTi-MiCS: Analytical Framework for Bounding Time in Embedded Mixed-Criticality Systems
-
AI inference relocates like electricity demand within latency limits
AI Inference as Relocatable Electricity Demand: A Latency-Constrained Energy-Geography Framework
-
Lossless compression speeds up LLM training by up to 1.18x
ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training
-
Traditional methods fail for AI in autonomous system dependability
Autonomous Systems Dependability in the era of AI: Design Challenges in Safety, Security, Reliability and Certification
-
The paper proves that all predicates expressible in monadic Presburger arithmetic can be…
Monadic Presburger Predicates have Robust Population Protocols
-
Consensus-embedded checks give order-execute chains 10.6x throughput
Back to the Future: Rethinking Endorsement in Order-Execute Blockchains
-
Merkle tree pipeline verifies IoT logs at 130k records per second
Lightweight Tamper-Evident Log Integrity Verification for IoT Edge Environments: A Merkle Tree Pipeline with Adaptive Chunking
-
Distributed GPUs train fluid predictors faster than solvers
A Study on the Performance of Distributed Training of Data-driven CFD Simulations
-
Unified API brings dynamic resources to HPC apps via MPI spawning
Towards the Democratization and Standardization of Dynamic Resources with MPI Spawning
-
Jetson AGX Orin runs 25k Monte Carlo AEB samples in 530 ms
Real-Time GPU-Accelerated Monte Carlo Evaluation of Safety-Critical AEB Systems Under Uncertainty
-
Block pipelining lifts Hyperledger Fabric commit throughput 1.9x
End-to-End and Phase-Level Performance Optimization for Hyperledger Fabric
-
Compiler automates sequence parallelism for 2.7x longer LLM contexts
AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
-
Round-robin stage dispatch breaks GPU pipeline bottleneck for LLM training
Efficient Training on Multiple Consumer GPUs with RoundPipe
-
Deterministic nodes adapt only to uniform goals in dynamic networks
Adaptive Self-Organization in Anonymous Dynamic Networks
-
Serverless MoE serving cuts resources below one third for multi-tenant use
FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving
-
Test taxonomy with CI ecosystem improves HPC fault detection
A Test Taxonomy and Continuous Integration Ecosystem for Dynamic Resource Management in HPC
-
The paper introduces Voxel, a compiler-aware simulation framework for studying the…
Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
-
Semantic cache reuses up to 92 percent of quantum circuit results
A Semantic Quantum Circuit Cache for Scalable and Distributed Quantum-Classical Workflows
-
Jointly adapting batch size and parallelism speeds LLM training 4-8%
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
-
Agentic workflow turns PyTorch graphs into faster CUTLASS kernels
FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow
-
DMRlib more than triples data center throughput with easy malleable coding
DMRlib: Easy-coding and Efficient Resource Management for Job Malleability
-
Mobile agents scale through denser individual capabilities and group collaboration
Scaling Mobile Agent Systems: From Capability Density to Collective Intelligence
-
Malleability cuts HPC workload time by 27%
MPI Malleability Validation under Replayed Real-World HPC Conditions
-
Dual-path KV offload cuts edge LLM latency up to 42%
DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
-
FloatSOM trains 1024-node maps on 1B samples in 6 minutes on GPUs
FloatSOM: GPU-Accelerated, Distributed, Topology-Flexible Self-Organizing Maps
-
Progressive encoder cuts VLM latency at 1 Mbps uplink
Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models