QuIP: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36, 2024

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher M De Sa · 2024

1 Pith paper cites this work.

representative citing papers

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

cs.LG · 2024-07-11 · accept · novelty 7.0

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.

Cited in FlashAttention-3 as ref 9.