OrbitQuant is a data-agnostic PTQ technique for DiTs that uses RPBH rotation in a normalized basis to enable a single codebook across all inputs, achieving SOTA low-bit performance on FLUX.1, CogVideoX and similar models.
hub
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
36 Pith papers cite this work. Polarity classification is still indexing.
abstract
Vector quantization, a problem rooted in Shannon's source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply an optimal scalar quantizer to each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on the best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant ($\approx 2.7$) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.
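The rotate-then-scalar-quantize idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`random_rotation`, `lloyd_max_codebook`, `quantize`, `dequantize`) are hypothetical, the dimension and bit-width are arbitrary, and the codebook is fitted empirically with Lloyd's algorithm on coordinates of random unit vectors rather than derived from the closed-form Beta density. Only the MSE stage is shown; the 1-bit QJL stage on the residual is omitted.

```python
import numpy as np

def random_rotation(d, rng):
    """Draw a d x d orthogonal matrix (Haar-distributed) via QR decomposition."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # fix column signs so the draw is uniform

def lloyd_max_codebook(samples, bits, iters=50):
    """Fit a 1-D codebook with 2**bits levels using Lloyd's algorithm."""
    n_levels = 2 ** bits
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                levels[k] = samples[idx == k].mean()
    return np.sort(levels)

def quantize(x, rotation, levels):
    """Rotate a unit vector, then snap each coordinate to its nearest level."""
    y = rotation @ x
    return np.abs(y[:, None] - levels[None, :]).argmin(axis=1)

def dequantize(codes, rotation, levels):
    """Look up the levels and undo the rotation."""
    return rotation.T @ levels[codes]

rng = np.random.default_rng(0)
d, bits = 128, 3  # assumed values, chosen only for illustration
rotation = random_rotation(d, rng)

# Data-independent "training" set: one coordinate of random unit vectors,
# which is how a rotated unit vector's coordinates are distributed.
g = rng.standard_normal((2000, d))
coord_samples = (g / np.linalg.norm(g, axis=1, keepdims=True))[:, 0]
levels = lloyd_max_codebook(coord_samples, bits)

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
codes = quantize(x, rotation, levels)
x_hat = dequantize(codes, rotation, levels)
mse = float(np.sum((x - x_hat) ** 2))
print(f"{bits}-bit squared reconstruction error: {mse:.4f}")
```

Because the codebook depends only on the dimension and bit-width, not on the data, it can be computed once and reused for every incoming vector, which is what makes the scheme suitable for online use.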
hub tools
citation-role summary
citation-polarity summary
years
2026: 36
roles
background: 2
polarities
background: 2
representative citing papers
Shared-embedding sequence models cannot achieve Semantic-Faithful Control over control-authoritative actions due to provenance-recovery impossibility, control-path exposure, and finite-coverage invariance gap.
APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.
HQMQ quantizes KV cache chunks as quaternions using Hurwitz group elements multiplied by per-layer random unit quaternions plus median outlier handling, matching fp16 perplexity at ~5 bits without calibration on tested models.
BlockQuant is a new block quantization algorithm on the sphere after random rotation that theoretically improves reconstruction MSE and expected inner-product distortion over EDEN, RabitQ, and TurboQuant.
LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.
IVF-TQ replaces learned codebooks with a fixed random rotation and precomputed scalar quantizer in the residual layer of an IVF index, delivering streaming recall stability at fixed bit budgets via a uniform-over-sphere inner-product bound.
PrismQuant achieves near rate-distortion optimality for Gaussian-mixture sources by losslessly transmitting the mixture component label at H(C)/n bits per dimension and applying component-matched KLT plus scalar quantization, with vanishing gap to the genie-aided bound.
Two randomized Hadamard transforms suffice to make coordinate marginals O(d^{-1/2})-close to Gaussian for most quantization methods, with three needed for vector quantization to match uniform random rotations asymptotically.
A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.
3DTurboQuant achieves training-free near-optimal quantization for 3DGS and DUSt3R models via random rotations inducing Beta distributions, enabling precomputed Lloyd-Max quantizers that deliver 3.5x and 7.9x compression with negligible quality loss.
SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.
HyperQuant unifies Hadamard transform, optimal lattice quantization, and entropy coding to outperform prior schemes on LLM weight and KV cache quantization down to 1.7 bits per scalar while preserving quality on a 19B DiT model.
Fast-TurboQuant substitutes a structured fast Johnson-Lindenstrauss transform for dense random projections in 1-bit vector quantization, cutting arithmetic to additions only and reporting 19.7x speedup plus lower MSE on DBpedia OpenAI-3 embeddings.
MonaVec provides a training-free 4-bit vector quantization and deterministic search kernel using Randomized Hadamard Transform and ChaCha20 seeding for embedded and offline use.
A tiered KV cache architecture computes per-head per-step error bounds on quantized attention and uses adaptive fallback to guarantee bounded or exact outputs relative to FP16 reference.
PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.
OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.
VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.
Design Conductor 2.0 uses April 2026 frontier models to autonomously create a 5129-unit FP16/32 TurboQuant inference accelerator mapped to FPGA at 125 MHz in 80 hours.
At 4-bit budget KQV wins on KL divergence, geometric K error and 6D distance with unconditional K-V asymmetry; QKQV wins geometrically at other budgets because the Jensen-amplified variance inflation from QJL on K does not bind.
HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.