Title resolution pending

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L · 2024 · arXiv 2404.00456

23 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

cs.CV · 2026-05-27 · unverdicted · novelty 7.0

Omega-QVLA is a post-training quantization framework achieving uniform W4A4 for VLA models' LLM backbone and DiT action head via composite SVD-Hadamard rotation and per-step scaling, matching FP16 success rates on LIBERO.

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

cs.PF · 2026-05-07 · unverdicted · novelty 7.0

A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

cs.LG · 2026-02-23 · unverdicted · novelty 7.0

QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

cs.CL · 2025-12-01 · conditional · novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks

cs.RO · 2026-06-26 · unverdicted · novelty 6.0 · 2 refs

TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.

Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

Qift defines a fixed no-zero W2 level set for rotated weights that improves W2A4 perplexity and accuracy on LLaMA-2-7B and LLaMA-3.1-8B over the standard {-2,-1,0,1} set.

Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization

cs.LG · 2026-04-30 · unverdicted · novelty 6.0

ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.

MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

cs.AR · 2026-04-17 · unverdicted · novelty 6.0

MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.

Rethinking Residual Errors in Compensation-based LLM Quantization

cs.LG · 2026-04-09 · conditional · novelty 6.0

Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

cs.CL · 2026-01-09 · unverdicted · novelty 6.0

Double achieves up to 5.3x inference speedup on 70B LLMs via synchronous double retrieval speculative parallelism that is lossless and outperforms trained baselines like EAGLE-3.

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

cs.LG · 2025-04-28 · unverdicted · novelty 6.0

TurboQuant achieves near-optimal vector quantization distortion for both MSE and inner products via random rotation and per-coordinate scalar quantization, with a formal proof that it matches lower bounds within a factor of approximately 2.7.

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

cs.CL · 2026-06-05 · unverdicted · novelty 5.0

Learned diagonal scaling matrices optimized with activation-aware loss reduce effective rank in LLM weight matrices and yield competitive perplexity and zero-shot results versus prior SVD methods on Llama 3.1 8B and Qwen3-8B.

QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer

cs.CV · 2026-05-29 · unverdicted · novelty 5.0

QVGGT uses per-block mixed-precision analysis, outlier token filtering with PCA compensation, and task-aware scale search to achieve near-lossless W4A16 quantization of VGGT with 3-4.9x memory savings and 2.8x speedup.

31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding

cs.AR · 2026-05-10 · unverdicted · novelty 5.0

A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

cs.LG · 2026-05-05 · unverdicted · novelty 5.0 · 2 refs

HeadQ applies score-space logit corrections for keys and attention-weighted surrogates for values to KV-cache quantization, removing 84-94% of excess perplexity in 2-bit key experiments across six models.

StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

cs.LG · 2026-05-04 · accept · novelty 5.0

Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

High-Rate Quantized Matrix Multiplication I

cs.IT · 2026-01-23 · unverdicted · novelty 5.0

High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

cs.AR · 2025-09-11 · unverdicted · novelty 5.0

PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

cs.LG · 2024-12-19 · unverdicted · novelty 5.0

MixLLM uses global output-feature importance to set mixed bit-widths for LLM quantization and adds two-step dequantization plus software pipelining for system efficiency.

Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization

cs.LG · 2026-05-24 · unverdicted · novelty 4.0

A WHT rotation plus per-coordinate activation-energy rescaling before auto-round quantization lowers WikiText-2 perplexity 15-58% versus vanilla auto-round at W2A16 on models from 135M to 1.5B parameters.

DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization

cs.CV · 2026-04-20 · unverdicted · novelty 4.0

DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3 with lower cost.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

cs.CL · 2025-11-09

citing papers explorer

Showing 23 of 23 citing papers.

Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling
cs.CV · 2026-05-27 · unverdicted · none · ref 1

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
cs.PF · 2026-05-07 · unverdicted · none · ref 7

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
cs.LG · 2026-02-23 · unverdicted · none · ref 1

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
cs.CL · 2025-12-01 · conditional · none · ref 37

The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks
cs.RO · 2026-06-26 · unverdicted · none · ref 59 · 2 links

Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference
cs.LG · 2026-06-01 · unverdicted · none · ref 3

Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
cs.LG · 2026-04-30 · unverdicted · none · ref 8

MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
cs.AR · 2026-04-17 · unverdicted · none · ref 4

Rethinking Residual Errors in Compensation-based LLM Quantization
cs.LG · 2026-04-09 · conditional · none · ref 1

Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism
cs.CL · 2026-01-09 · unverdicted · none · ref 2

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
cs.LG · 2025-04-28 · unverdicted · none · ref 8

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices
cs.CL · 2026-06-05 · unverdicted · none · ref 36

QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer
cs.CV · 2026-05-29 · unverdicted · none · ref 2

31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
cs.AR · 2026-05-10 · unverdicted · none · ref 11

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG · 2026-05-05 · unverdicted · none · ref 13 · 2 links

StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
cs.LG · 2026-05-04 · accept · none · ref 26

High-Rate Quantized Matrix Multiplication I
cs.IT · 2026-01-23 · unverdicted · none · ref 17

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
cs.AR · 2025-09-11 · unverdicted · none · ref 6

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
cs.LG · 2024-12-19 · unverdicted · none · ref 3

Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization
cs.LG · 2026-05-24 · unverdicted · none · ref 5

DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
cs.CV · 2026-04-20 · unverdicted · none · ref 1

A Survey on Efficient Inference for Large Language Models
cs.CL · 2024-04-22 · accept · none · ref 217

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations
cs.CL · 2025-11-09 · unreviewed · ref 5
