pith. sign in

Gqa: Training generalized multi-query transformer models from multi-head checkpoints

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

citation-role summary

method 1

citation-polarity summary

years

2026 7

verdicts

UNVERDICTED 7

roles

method 1

polarities

use method 1

representative citing papers

Context Memorization for Efficient Long Context Generation

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Attention-state memory externalizes long prefixes into a lightweight lookup table of precomputed attention states, yielding higher accuracy than standard in-context learning at fixed memory budgets and lower latency than full attention.

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

cs.DC · 2026-04-16 · unverdicted · novelty 6.0

PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 64% lower P90 TTFT than homogeneous baselines in a 1T-parameter case study.

Cubit: Token Mixer with Kernel Ridge Regression

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.

citing papers explorer

Showing 7 of 7 citing papers.