GQA: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai · 2023 · DOI 10.18653/v1/2023.emnlp-main.298

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Layer-wise Token Compression for Efficient Document Reranking

cs.IR · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS by up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs.

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

cs.DC · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

SAGA introduces workflow-atomic scheduling for compound AI agents, achieving 1.64x lower task completion time and 1.22x better memory utilization than vLLM on a 64-GPU cluster at the cost of 30% lower peak throughput.

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

cs.CL · 2025-12-18 · unverdicted · novelty 7.0

Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.

MosaicKV: Serving Long-Context LLM with Dynamic Two-D KV Cache Compression

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

MosaicKV achieves up to 16x attention speedup, 4.8x lower decode latency, 7.3x higher throughput, and 3x memory reduction with 1.76% accuracy loss via dynamic two-D KV cache compression and management on H800 GPUs.

LMs as Task-Specific Knowledge Bases: An Interpretability Analysis

cs.CL · 2026-06-25 · unverdicted · novelty 6.0

LMs store facts in task-specific parameter subsets, shown by inconsistent emergence across tasks during training and distinct localized parameters for the same fact.

Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality, Diversity and Novelty

cs.AI · 2026-06-23 · unverdicted · novelty 6.0 · 2 refs

Heuresis evaluates six search strategies for autonomous ML research agents and finds that novel ideas are rare, with none rated original and only one reaching top-10 quality; the strategies steer axes but do not expand the quality-novelty frontier.

Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation

cs.CV · 2026-06-18 · unverdicted · novelty 6.0

QK Product Steering suppresses dominant singular modes in the per-head QK product of selected middle layers via a closed-form query-only update, yielding 4.0% average relative CHAIR_s reduction on three GQA VLMs.

Priming: Hybrid State Space Models From Pre-trained Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.

EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

cs.CL · 2025-09-22 · unverdicted · novelty 6.0

EpiCache clusters long conversation history into coherent episodes for per-episode KV cache eviction, delivering up to 30% accuracy gains and 3.7x peak memory reduction on LongConvQA tasks under fixed budgets.

StarCoder 2 and The Stack v2: The Next Generation

cs.SE · 2024-02-29 · accept · novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

MOSAIC: Efficient Mixture-of-Agent Scheduling via Adaptive Aggregation and Inference Concurrency

cs.LG · 2026-06-02 · unverdicted · novelty 5.0

MOSAIC uses an Integer Linear Program scheduler for expert placement and prompt assignment plus adaptive aggregation to achieve 1.7-2.3x end-to-end speedup on 4-GPU MoA workloads while keeping accuracy within 0.1pp.

MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers

cs.CL · 2026-06-29 · unverdicted · novelty 4.0

MATCH augments sparsified attention with an efficient in-context retrieval system to boost performance on long-range recall tasks in transformers.

citing papers explorer

Showing 2 of 2 citing papers after filters.

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

cs.DC · 2026-05-01 · unverdicted · none · ref 3 · 2 links

SAGA introduces workflow-atomic scheduling for compound AI agents, achieving 1.64x lower task completion time and 1.22x better memory utilization than vLLM on a 64-GPU cluster at the cost of 30% lower peak throughput.

Priming: Hybrid State Space Models From Pre-trained Transformers

cs.LG · 2026-05-08 · unverdicted · none · ref 1

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.
