A systematic analysis of 301 real-world code generation bugs in tile programs, categorizing root causes, symptoms, input patterns, test oracles, and fix strategies from curated GitHub reports.
arXiv preprint arXiv:2410.20399 , year=
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 11roles
background 2polarities
background 2representative citing papers
CODA re-expresses most non-attention Transformer computations as GEMM-plus-epilogue programs using a constrained set of composable primitives to keep intermediate results on-chip and cut global memory traffic.
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
A system that auto-generates statically verified megakernels for LLM forward passes, retargets across NVIDIA architectures from one source, matches reference outputs exactly, and self-improves via an agent loop.
Veda formulates tile selection in video diffusion attention as a reconstruction problem from full attention maps, using statistics-aware and head-aware scoring to enable high sparsity with maintained quality and hardware speedups up to 5.1x end-to-end.
TLX introduces MIMW-based extensions to Triton that let developers orchestrate warp-group execution and asynchronous hardware features while preserving blocked programming productivity, with kernels deployed in large-scale training and inference.
DAK enables direct GPU access to remote memory for LLM inference via TMA repurposing and a greedy offloading algorithm, achieving up to 3x gains over prefetching baselines on NVLink-C2C and 1.8x on PCIe.
Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.
citing papers explorer
No citing papers match the current filters.