Title resolution pending

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints · 2023

5 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

Flex Attention: A Programming Model for Generating Optimized Attention Kernels

cs.LG · 2024-12-07 · unverdicted · novelty 6.0

FlexAttention supplies a compiler-driven interface that expresses common attention variants in a few lines of PyTorch and emits optimized kernels whose speed matches hand-written implementations.

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

cs.CL · 2024-06-16 · unverdicted · novelty 6.0

Quest speeds up long-context LLM self-attention by up to 2.23x via query-dependent selection of top-K critical KV cache pages, cutting overall latency by 7.03x with negligible accuracy loss.

Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

cs.DC · 2026-05-08 · unverdicted · novelty 5.0

FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.

Simply Stabilizing the Loop via Fully Looped Transformer

cs.LG · 2026-05-11

citing papers explorer

Showing 5 of 5 citing papers.

Search Your Block Floating Point Scales! · cs.LG · 2026-05-12 · unverdicted · none · ref 29
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
Flex Attention: A Programming Model for Generating Optimized Attention Kernels · cs.LG · 2024-12-07 · unverdicted · none · ref 5
FlexAttention supplies a compiler-driven interface that expresses common attention variants in a few lines of PyTorch and emits optimized kernels whose speed matches hand-written implementations.
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference · cs.CL · 2024-06-16 · unverdicted · none · ref 71
Quest speeds up long-context LLM self-attention by up to 2.23x via query-dependent selection of top-K critical KV cache pages, cutting overall latency by 7.03x with negligible accuracy loss.
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP · cs.DC · 2026-05-08 · unverdicted · none · ref 27
FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
Simply Stabilizing the Loop via Fully Looped Transformer · cs.LG · 2026-05-11 · unreviewed · ref 43
