Lynx partitions KV cache bits into anchor and residual streams for progressive transfer, enabling speculative decoding on partial data followed by verification to match BF16 accuracy at 4-bit-like TTFT.
Radev , editor =
7 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
TwinRouterBench supplies 970 execution-verified router prefixes across five datasets plus a live harness for 100 held-out SWE-bench cases, scoring routers on tier accuracy, trajectory success, and realized token cost without LLM judges.
A 1.88-million-article biomedical summarization dataset is released and quality-aware selection of training data based on abstract alignment outperforms random sampling on factuality metrics.
Interdomain Attention integrates SSMs into attention via finite feature maps and basis projections to enable query-conditioned attention over fixed states, showing gains over SSM baselines and matching softmax at 1.3B scale with length-flat scaling.
Introduces predictive prefetching for RAG that anticipates retrieval needs several tokens ahead via three components, reporting up to 43.5% latency reduction and 62.4% TTFT improvement while preserving answer quality.
Constrained candidate selection from retrieved chunks raises Recall@5 from 31.9% to 50.0% and parseable outputs on 420 queries from 200 municipal meeting transcripts.
Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.
citing papers explorer
-
Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference
Lynx partitions KV cache bits into anchor and residual streams for progressive transfer, enabling speculative decoding on partial data followed by verification to match BF16 accuracy at 4-bit-like TTFT.
-
Less is More: Quality-Aware Training Data Selection for Scientific Summarization
A 1.88-million-article biomedical summarization dataset is released and quality-aware selection of training data based on abstract alignment outperforms random sampling on factuality metrics.
-
Interdomain Attention: Beyond Token-Level Key-Value Memory
Interdomain Attention integrates SSMs into attention via finite feature maps and basis projections to enable query-conditioned attention over fixed states, showing gains over SSM baselines and matching softmax at 1.3B scale with length-flat scaling.
-
Predictive Prefetching for Retrieval-Augmented Generation
Introduces predictive prefetching for RAG that anticipates retrieval needs several tokens ahead via three components, reporting up to 43.5% latency reduction and 62.4% TTFT improvement while preserving answer quality.
-
Topic-to-Timestamp Alignment by Constrained Evidence Selection
Constrained candidate selection from retrieved chunks raises Recall@5 from 31.9% to 50.0% and parseable outputs on 420 queries from 200 municipal meeting transcripts.
-
Gated Delta Networks: Improving Mamba2 with Delta Rule
Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.