FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
6 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (6)
Verdicts: UNVERDICTED (6)
citing papers explorer
- QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling
  QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
- SHIELD: A Segmented Hierarchical Memory Architecture for Energy-Efficient LLM Inference on Edge NPUs
  SHIELD reduces eDRAM refresh energy by 35% for LLM inference on edge NPUs by isolating sign/exponent from mantissa bits, disabling refresh on transient QO mantissas, and relaxing it on persistent KV mantissas while keeping accuracy intact.
- FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture
  FusionCIM is a fusion-driven CIM accelerator for LLM inference that maps QK^T to IP-CIM and PV to OP-CIM, uses QO-stationary dataflow, and applies pattern-aware online softmax, delivering up to 3.86x energy savings and 1.98x speedup on LLaMA-3 at 29.4 TOPS/W.
- EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models
  A CIM-based hardware-software co-design in 65 nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 across multiple SLMs.
- Rethinking Efficiency in Neural Combinatorial Optimization: Batched Preference Optimization with Mamba
  ECO uses supervised warm-up plus iterative batched DPO on a Mamba backbone to reach top neural performance on TSP and CVRP while lowering memory growth and raising throughput.
- Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO
  StableHLO serves as a viable unified representation for cross-architecture performance modeling of distributed ML workloads, preserving relative trends while exposing fidelity trade-offs.