
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

2022 · cs.LG · arXiv 2205.14135

Canonical reference. 94% of citing Pith papers cite this work as background.

70 Pith papers citing it



abstract

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware: accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on Long-Range Arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
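
To make the tiling idea in the abstract concrete, here is a minimal pure-Python sketch. It is not the authors' CUDA kernel: it assumes a single attention head with no masking or dropout, and it handles one query row at a time while streaming K and V in blocks, whereas the real algorithm also tiles the queries. The names `naive_attention`, `tiled_attention`, and `block_size` are illustrative and do not come from the paper. The running maximum and running denominator let each block be folded in without holding a full row of scores, which is the property that allows working data to stay in on-chip SRAM.

```python
import math


def naive_attention(Q, K, V):
    """Reference softmax(Q K^T / sqrt(d)) V, scoring every key per query at once."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Same result, but K/V are visited block by block with a running softmax."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf     # running maximum of the scores seen so far
        l = 0.0           # running softmax denominator
        acc = [0.0] * dv  # running unnormalized weighted sum of values
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            m_new = max(m, max(scores))
            # Rescale earlier partial results so they share the new maximum.
            correction = math.exp(m - m_new)
            l *= correction
            acc = [a * correction for a in acc]
            for s, v in zip(scores, v_blk):
                w = math.exp(s - m_new)
                l += w
                acc = [a + w * vj for a, vj in zip(acc, v)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

On small inputs the two functions agree to floating-point tolerance for any block size, which illustrates why the tiled form counts as exact attention rather than an approximation.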


citation-role summary

background: 15 · method: 1

citation-polarity summary

background: 15 · use method: 1

representative citing papers

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

cs.CL · 2026-06-11 · accept · novelty 8.0

Rigel reverse-engineers the Metal 4.1 tensor compute path on M4 Max, finding fp8 matmul2d is emulated on GPU shader cores at 0.94x fp16 throughput with an 8x8 fragment layout and no ANE involvement.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

cs.DC · 2026-04-11 · unverdicted · novelty 8.0

Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Garment Particles: A 2D--3D Symmetric Garment Representation for Generation and Editing

cs.GR · 2026-05-25 · unverdicted · novelty 7.0

Garment Particles is a 5D point cloud representation jointly encoding 2D sewing patterns and 3D geometry, supporting rectified flow generation from high-level inputs and diffusion-based editing of patterns or shapes.

Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

cs.DC · 2026-05-20 · conditional · novelty 7.0

LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.

PilotWiMAE: Pilot-Native Representation Learning for Wireless Channels

eess.SP · 2026-05-19 · unverdicted · novelty 7.0

PilotWiMAE pretrains an encoder on noisy pilots with factorized attention, 99% masking, patch-normalized reconstruction, scale loss, and AWGN curriculum to outperform supervised baselines in cross-frequency beam selection and channel tasks from 3.5 GHz pretraining to 28 GHz evaluation.

CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

cs.SE · 2026-05-04 · conditional · novelty 7.0 · 2 refs

Kerncap automates extraction of faithful, self-contained GPU kernel reproducers from AMD HIP and Triton workloads via HSA interception and address-space closure, delivering 13.6x faster isolated tuning.

Projection-Free Transformers via Gaussian Kernel Attention

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

cs.DC · 2026-04-28 · unverdicted · novelty 7.0

CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.

Transactional Attention: Semantic Sponsorship for KV-Cache Retention

cs.CL · 2026-04-13 · unverdicted · novelty 7.0

Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.

Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design

cs.AR · 2026-02-16 · unverdicted · novelty 7.0

FFM finds optimal fused mappings for tensor accelerators over 10,000 times faster than prior mappers while cutting energy-delay product by up to 1.8x versus hand-tuned designs.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

cs.CL · 2024-10-14 · conditional · novelty 7.0

DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

cs.LG · 2022-10-31 · unverdicted · novelty 7.0

GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.

ABot-M0.5: Unified Mobility-and-Manipulation World Action Model

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.

KernelSight-LM: A Kernel-Level LLM Inference Simulator

cs.PF · 2026-06-26 · unverdicted · novelty 6.0 · 2 refs

KernelSight-LM simulates LLM inference at kernel granularity with cross-generation (12.1% per-kernel error) and target-measured (3.8% error) tiers, yielding end-to-end median errors of 15.4%/12.8%/3.0% and 14.3%/6.2%/2.7% for TTFT/TPOT/throughput across six model families.

Better Queries, Cheaper Attention: Adapting Transformers for Efficient Sparse Reconstruction

hep-ex · 2026-06-16 · unverdicted · novelty 6.0

A geometry-aware dynamic-query transformer decoder with Local Strided Cross-Attention raises track reconstruction efficiency from 94.1% to 98.1%, halves latency, and cuts memory use by over 10x versus fixed-query baselines in a simplified HL-LHC simulation.

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

cs.AI · 2026-06-08 · unverdicted · novelty 6.0

EntropyInfer adaptively allocates inference compute using per-head attention entropy for rigid/dynamic classification during prefilling and compresses KV cache with generated tokens, achieving up to 2.39x speedup on long contexts.

Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

MoA framework derives a denotational normal form for attention that eliminates all intermediate arrays by algebraic construction, yielding O(n_dk + n_dv) memory traffic with numerical verification against PyTorch.

Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

TRI trains LLMs on goal-conditioned fill-in-the-middle tasks via PSM token rearrangement and symbolic verification to surgically repair erroneous CoT segments.

citing papers explorer

Showing 50 of 50 citing papers after filters.

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU cs.CL · 2026-06-11 · accept · none · ref 3 · internal anchor
Rigel reverse-engineers the Metal 4.1 tensor compute path on M4 Max, finding fp8 matmul2d is emulated on GPU shader cores at 0.94x fp16 throughput with an 8x8 fragment layout and no ANE involvement.
Efficient Training on Multiple Consumer GPUs with RoundPipe cs.DC · 2026-04-29 · conditional · none · ref 8 · internal anchor
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation cs.DC · 2026-04-11 · unverdicted · none · ref 27 · internal anchor
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
Garment Particles: A 2D--3D Symmetric Garment Representation for Generation and Editing cs.GR · 2026-05-25 · unverdicted · none · ref 7 · internal anchor
Garment Particles is a 5D point cloud representation jointly encoding 2D sewing patterns and 3D geometry, supporting rectified flow generation from high-level inputs and diffusion-based editing of patterns or shapes.
Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference cs.LG · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.
Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU cs.DC · 2026-05-20 · conditional · none · ref 8 · internal anchor
LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.
PilotWiMAE: Pilot-Native Representation Learning for Wireless Channels eess.SP · 2026-05-19 · unverdicted · none · ref 31 · internal anchor
PilotWiMAE pretrains an encoder on noisy pilots with factorized attention, 99% masking, patch-normalized reconstruction, scale loss, and AWGN curriculum to outperform supervised baselines in cross-frequency beam selection and channel tasks from 3.5 GHz pretraining to 28 GHz evaluation.
CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs cs.LG · 2026-05-08 · unverdicted · none · ref 6 · internal anchor
CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.
Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs cs.SE · 2026-05-04 · conditional · none · ref 4 · 2 links · internal anchor
Kerncap automates extraction of faithful, self-contained GPU kernel reproducers from AMD HIP and Triton workloads via HSA interception and address-space closure, delivering 13.6x faster isolated tuning.
Projection-Free Transformers via Gaussian Kernel Attention cs.LG · 2026-05-04 · unverdicted · none · ref 7 · internal anchor
Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.
CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration cs.DC · 2026-04-28 · unverdicted · none · ref 1 · internal anchor
CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.
Transactional Attention: Semantic Sponsorship for KV-Cache Retention cs.CL · 2026-04-13 · unverdicted · none · ref 10 · internal anchor
Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.
Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design cs.AR · 2026-02-16 · unverdicted · none · ref 13 · internal anchor
FFM finds optimal fused mappings for tensor accelerators over 10,000 times faster than prior mappers while cutting energy-delay product by up to 1.8x versus hand-tuned designs.
ABot-M0.5: Unified Mobility-and-Manipulation World Action Model cs.CV · 2026-07-01 · unverdicted · none · ref 14 · internal anchor
ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.
KernelSight-LM: A Kernel-Level LLM Inference Simulator cs.PF · 2026-06-26 · unverdicted · none · ref 10 · 2 links · internal anchor
KernelSight-LM simulates LLM inference at kernel granularity with cross-generation (12.1% per-kernel error) and target-measured (3.8% error) tiers, yielding end-to-end median errors of 15.4%/12.8%/3.0% and 14.3%/6.2%/2.7% for TTFT/TPOT/throughput across six model families.
Better Queries, Cheaper Attention: Adapting Transformers for Efficient Sparse Reconstruction hep-ex · 2026-06-16 · unverdicted · none · ref 9 · internal anchor
A geometry-aware dynamic-query transformer decoder with Local Strided Cross-Attention raises track reconstruction efficiency from 94.1% to 98.1%, halves latency, and cuts memory use by over 10x versus fixed-query baselines in a simplified HL-LHC simulation.
From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs cs.AI · 2026-06-08 · unverdicted · none · ref 9 · internal anchor
EntropyInfer adaptively allocates inference compute using per-head attention entropy for rigid/dynamic classification during prefilling and compresses KV cache with generated tokens, achieving up to 2.39x speedup on long contexts.
Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels cs.LG · 2026-06-05 · unverdicted · none · ref 4 · internal anchor
MoA framework derives a denotational normal form for attention that eliminates all intermediate arrays by algebraic construction, yielding O(n_dk + n_dv) memory traffic with numerical verification against PyTorch.
Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair cs.CL · 2026-06-03 · unverdicted · none · ref 4 · internal anchor
TRI trains LLMs on goal-conditioned fill-in-the-middle tasks via PSM token rearrangement and symbolic verification to surgically repair erroneous CoT segments.
KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators cs.LG · 2026-06-01 · unverdicted · none · ref 6 · internal anchor
KForge uses dual LLM agents for cross-platform kernel generation, reporting 2.12% throughput gain on NVIDIA B200 vs TensorRT-LLM and 5.13x geometric mean speedup on Intel Arc B580 vs PyTorch on 37 workloads.
Do Value Vectors in Deep Layers Need Context from the Residual Stream? cs.CL · 2026-06-01 · unverdicted · none · ref 69 · internal anchor
Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.
Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers cs.CV · 2026-05-26 · unverdicted · none · ref 9 · internal anchor
Tensor Memory augments Transformers with a constant-size 3D voxel grid using differentiable soft writes at predicted locations, local interaction, and gated recurrent dynamics to decouple memory capacity from sequence length.
HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs cs.DC · 2026-05-22 · unverdicted · none · ref 6 · 2 links · internal anchor
HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x on Ascend A3 clusters via tile-level heterogeneous scheduling that overlaps communication, matrix, and vector computation inside a single kernel launch.
Adaptive Mass-Segmented KV Compression for Long-Context Reasoning cs.LG · 2026-05-22 · unverdicted · none · ref 10 · internal anchor
AMS KV compression adaptively partitions the cache by attention mass regions and assigns quotas to protect contiguous reasoning blocks during long-context LLM inference.
Towards Understanding Self-Pretraining for Sequence Classification cs.LG · 2026-05-20 · unverdicted · none · ref 6 · internal anchor
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference cs.LG · 2026-05-13 · unverdicted · none · ref 6 · internal anchor
Spherical KV combines angle-domain attention using spherical key codes with rate-distortion retention to cut KV cache residency and HBM traffic while keeping a paged, fusion-friendly decode path.
Remember to Forget: Gated Adaptive Positional Encoding cs.LG · 2026-05-11 · unverdicted · none · ref 8 · internal anchor
GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation cs.DC · 2026-05-08 · unverdicted · none · ref 15 · 2 links · internal anchor
Dooly reduces LLM inference profiling GPU-hours by 56.4% across 12 models while keeping simulation MAPE under 5% for TTFT and 8% for TPOT by making profiling configuration-agnostic and redundancy-aware.
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism cs.DC · 2026-05-06 · unverdicted · none · ref 39 · internal anchor
Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study cs.SE · 2026-04-27 · unverdicted · none · ref 8 · internal anchor
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models cs.LG · 2026-04-24 · unverdicted · none · ref 15 · internal anchor
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering cs.AI · 2026-04-22 · unverdicted · none · ref 73 · internal anchor
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
TiledAttention: a CUDA Tile SDPA Kernel for PyTorch cs.LG · 2026-03-02 · unverdicted · none · ref 10 · internal anchor
TiledAttention is a cuTile-based SDPA kernel that balances performance with Python-level customizability for attention research in PyTorch.
SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference cs.AI · 2026-02-05 · unverdicted · none · ref 7 · internal anchor
SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across multiple LLMs.
How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models cs.LG · 2026-06-05 · unverdicted · none · ref 4 · internal anchor
An oracle shows sparse token support preserves near-dense performance on Qwen retrieval tasks, and a KL-distilled head-collapsed indexer delivers 1.7-1.9x speedups with small quality gaps.
Building The Ph(ysical)AI Layer Of Machine Intelligence cs.LG · 2026-06-02 · unverdicted · none · ref 56 · internal anchor
A principle-driven RF encoder achieves 77.7% average accuracy across 15 cross-modal tasks, performing better on physically grounded tasks than semantic ones.
On Efficient Scaling of GNNs via IO-Aware Layers Implementations cs.LG · 2026-05-29 · unverdicted · none · ref 7 · internal anchor
IO-aware GPU kernels for SpMM convolutions, degree-aware reductions, and fused attention layers deliver median speedups of 1.6-2.6x (up to 10x) and memory reductions up to 76x over DGL/PyG baselines on realistic graphs.
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production cs.CE · 2026-05-12 · unverdicted · none · ref 38 · internal anchor
LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving cs.AR · 2026-05-10 · unverdicted · none · ref 10 · 2 links · internal anchor
KV-RM regularizes KV-cache movement via block paging and coalesced transfers to improve throughput, tail latency, and memory efficiency in static-graph LLM serving without changing the decoder interface.
Kaczmarz Linear Attention cs.LG · 2026-05-09 · unverdicted · none · ref 9 · internal anchor
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.
A Hybrid Method for Low-Resource Named Entity Recognition cs.CE · 2026-05-06 · unverdicted · none · ref 25 · internal anchor
The hybrid method with LLM-augmented data achieves F1 improvements of 7-24 points over baselines on five Vietnamese domain datasets.
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k cs.LG · 2026-05-04 · accept · none · ref 7 · internal anchor
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
From Spherical to Gaussian: A Comparative Analysis of Point Cloud Cropping Strategies in Large-Scale 3D Environments cs.CV · 2026-05-03 · unverdicted · none · ref 68 · internal anchor
Gaussian and related cropping strategies for point cloud subclouds improve 3D neural network performance over spherical cropping on large outdoor scenes.
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training cs.DC · 2026-04-21 · unverdicted · none · ref 9 · internal anchor
UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.
Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports cs.AI · 2026-04-21 · unverdicted · none · ref 28 · internal anchor
SFT followed by GRPO improves LLM accuracy and reasoning recall in disease classification from radiology reports on three radiologist-annotated datasets.
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention cs.DC · 2026-04-18 · unverdicted · none · ref 44 · internal anchor
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
DeltaMCP: Incremental Regeneration via Spec-Aware Transformation for MCP servers cs.SE · 2026-05-27 · unverdicted · none · ref 22 · internal anchor
DeltaMCP introduces specification-aware incremental regeneration to keep MCP servers synchronized with evolving OpenAPI specifications.
Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM cs.DC · 2026-04-20 · unverdicted · none · ref 14 · internal anchor
A framework combines multi-LoRA runtime switching, multi-stream stylistic decoding, and Dynamic Self-Speculative Decoding with INT4 quantization to achieve 4-6x memory and latency gains for on-device inference of a one-for-all foundational LLM on Qualcomm chipsets.
Toward Native Multimodal Modeling: A Roadmap cs.CV · 2026-05-25 · unverdicted · none · ref 225 · internal anchor
A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.
Six Open Questions in Machine-Learned Interatomic Potential Foundation Models cond-mat.mtrl-sci · 2026-06-05 · unverdicted · none · ref 243 · internal anchor
This perspective article develops a definition of foundational MLIPs and poses six open questions that the authors believe will define future research in machine-learned interatomic potentials.
