Humble, Alexander McCaskey, Dmitry I
7 Pith papers cite this work.
roles: background (4)
polarities: background (4)
citing papers explorer
- TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency
  TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
- SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
  SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
- ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration
  ACALSim is a new simulation framework with customizable threading, event-driven execution, and shared-memory model that reports over 14x speedup versus SST and enables simulation of large LLaMA models that SST cannot complete.
- GPIR: Enabling Practical Private Information Retrieval with GPUs
  GPIR achieves up to 297 times higher throughput than prior GPU PIR systems by fusing operations in stages and using pipelined transposed layouts to cut DRAM traffic during batched lattice-based queries.
- Managing Classical Processing Requirements for Quantum Error Correction
  A two-level decoder scheduling framework reduces classical processing requirements for quantum error correction by 10-40% on fault-tolerant benchmarks by managing bursty workloads as shared resources.
- Comparing the Performance of Heterogeneous Conjugate Gradient and Cholesky Solvers on Various Hardware Using SYCL
  Heterogeneous SYCL-based CG and Cholesky solvers deliver up to 32% and 29% faster runtimes than GPU-only versions for large matrices across multiple GPU vendors.
- Beyond Silicon: Materials, Mechanisms, and Methods for Physical Neural Computing