ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.
hub Canonical reference
Neutrino Production via $e^-e^+$ Collision at $Z$-boson Peak
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
The production of the three normal neutrinos via $e^-e^+$ collision at the $Z$-boson peak (neutrino production in a Z-factory) is investigated thoroughly. The differences between $\nu_e$-pair production and $\nu_\mu$-pair and $\nu_\tau$-pair production are presented in various aspects. Namely, the total cross sections, the relevant differential cross sections, the forward-backward asymmetry, etc. for these neutrinos are presented in figures as well as numerical tables. The restriction that refined measurements of the invisible width of the $Z$ boson place on the room for mixing of the three species of light neutrinos with possible externals (heavy neutral leptons and/or sterile neutrinos) is discussed.
hub tools
citation-role summary
citation-polarity summary
roles
background 8
polarities
background 8
representative citing papers
Embedding CUDA Graphs in UCX for multi-path intra-node GPU communication yields up to 2.95x bandwidth improvement over single-path UCX on a four-GPU node for large messages.
JetSCI is a hybrid JAX-PETSc framework that delivers scalable differentiable finite element simulations and outperforms pure JAX implementations on heterogeneous micromechanics problems.
QiankunNet-cuSCI achieves up to 2.32x end-to-end speedup on 64 A100 GPUs for NNQS-SCI while preserving chemical accuracy by fully accelerating global de-duplication and coupled-configuration generation on the device.
Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.
GTaP delivers a GPU-resident fork-join task-parallel runtime with pragma support and EPAQ that outperforms CPU OpenMP on several irregular applications.
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
EnergyLens predicts multi-GPU LLM inference energy consumption with 9-13% MAPE and identifies configurations with up to 52x energy efficiency differences.
A proxy-method direct solver for Helmholtz transmission problems with many inclusions compresses the linear system to O(ωD) size and runs in O(N^{1.5}) time using the PMCHWT formulation, outperforming Burton-Miller.
A new GPU-oriented batch SVD solver based on the one-sided Jacobi method delivers significant speedups over vendor libraries and prior open-source implementations across precisions and matrix shapes.
FlashInfer delivers a customizable attention kernel that reduces inter-token latency by 29-69% in LLM serving benchmarks via optimized KV-cache storage and load-balanced scheduling compatible with CUDA graphs.
PhantomRun standardizes CI build log retrieval and reproduction for embedded systems, reconstructing 91.8% of 4628 failing runs while preserving outcomes in 98% of cases.
KV-RM regularizes KV-cache movement via block paging and coalesced transfers to improve throughput, tail latency, and memory efficiency in static-graph LLM serving without changing the decoder interface.
FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.
A clustering-aware correction algorithm using spatial partitioning and projected gradient descent preserves single-linkage clusters in lossy-compressed particle data while keeping competitive compression ratios.
Aurora reached 1.01 EF/s FP64 HPL and 11.64 EF/s HPL-MxP through locality-aware mapping, CPU-GPU pipelining, mixed-precision orchestration, and hybrid resilience on a large Intel GPU-based system.
Review chapter summarizing advances in parallel sparse direct solvers along communication reduction and data-sparse compression axes.
citing papers explorer
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters