pith. machine review for the scientific record.

arxiv: 2605.11864 · v1 · submitted 2026-05-12 · 💻 cs.IR · cs.AI · cs.CV · cs.MM

Recognition: 1 theorem link · Lean Theorem

Very Efficient Listwise Multimodal Reranking for Long Documents

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:12 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CV · cs.MM
keywords multimodal reranking · listwise reranking · efficient retrieval · long documents · vision-language models · inference optimization · MMDocIR benchmark

The pith

ZipRerank achieves state-of-the-art multimodal listwise reranking accuracy on long documents while cutting inference latency by up to an order of magnitude.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ZipRerank as a listwise reranker for multimodal documents that shortens visual inputs through early query-image interaction and scores every candidate in one model pass rather than generating tokens step by step. It trains first on large-scale text rendered as images, then fine-tunes with soft ranking labels distilled from a larger vision-language model. If this works, reranking becomes fast enough to use routinely in vision-centric search and multimodal retrieval-augmented generation over long documents, without sacrificing ranking quality.

Core claim

ZipRerank reduces input length via lightweight query-image early interaction and replaces autoregressive decoding with single-forward-pass scoring of all candidates. A two-stage training process—listwise pretraining on rendered text images followed by multimodal fine-tuning using VLM-teacher-distilled soft supervision—enables the model to match or exceed prior multimodal rerankers on the MMDocIR benchmark while lowering LLM inference latency by up to ten times.
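The single-pass claim can be made concrete. In FIRST-style single-token listwise scoring (reference [1] in the graph below), each candidate is tagged with an identifier token ("A", "B", ...), and ranking collapses to one gather over the model's output logits at a single position, with no decoding loop. A minimal sketch, assuming hypothetical identifier token ids and a precomputed logit vector rather than the paper's actual implementation:

```python
import numpy as np

def single_pass_listwise_rank(final_logits, candidate_token_ids):
    """Rank all candidates from ONE forward pass (illustrative sketch).

    final_logits: vocabulary-sized logit vector the model emits at the
    single scoring position.
    candidate_token_ids: vocab ids of the identifier tokens "A", "B", ...
    assigned to each candidate (hypothetical ids in this sketch).
    Returns candidate indices sorted best-first.
    """
    logits = np.asarray(final_logits, dtype=float)
    scores = logits[candidate_token_ids]          # one gather, no autoregressive loop
    return [int(i) for i in np.argsort(-scores)]  # higher logit = better rank

# toy vocabulary of 10 tokens; the four candidates use ids 3, 5, 7, 9
logits = [0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 3.0, 0.0, -1.0]
ranking = single_pass_listwise_rank(logits, [3, 5, 7, 9])  # -> [2, 0, 1, 3]
```

An autoregressive listwise reranker would instead generate a permutation string token by token, paying one forward pass per emitted token; collapsing that to a single gather is where the latency saving comes from.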

What carries the argument

Lightweight query-image early interaction combined with single-pass listwise scoring that avoids autoregressive token generation.
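What that early interaction might look like mechanically, reduced to its simplest possible form: score each image-patch token against a pooled query embedding and keep only the top ⌈ρ·n⌉ tokens. The cosine-similarity scoring rule and the function below are illustrative assumptions, not the paper's module; they only show the keep-ratio ρ mechanics.

```python
import numpy as np

def prune_visual_tokens(query_emb, patch_embs, keep_ratio):
    """Query-aware visual token pruning (illustrative sketch).

    Scores each patch token by cosine similarity to the query embedding
    and keeps the top ceil(keep_ratio * n) tokens in their original order,
    so the LLM sees a much shorter visual sequence.
    """
    q = np.asarray(query_emb, dtype=float)
    v = np.asarray(patch_embs, dtype=float)
    q = q / np.linalg.norm(q)
    v_norm = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = v_norm @ q
    n_keep = max(1, int(np.ceil(keep_ratio * len(v))))
    keep = np.sort(np.argsort(-sims)[:n_keep])  # restore positional order
    return v[keep]

# 8 patch tokens of dim 4; keep ratio 0.5 leaves 4 tokens
patches = np.arange(32, dtype=float).reshape(8, 4) + 1
pruned = prune_visual_tokens(np.ones(4), patches, keep_ratio=0.5)  # shape (4, 4)
```

The pruning happens before the expensive LLM forward pass, so the quadratic attention cost shrinks with ρ; Figure 3 below is the paper's study of exactly this knob.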

If this is right

  • Listwise reranking can now be applied in real-time multimodal retrieval systems that previously found it too slow.
  • The same efficiency pattern extends directly to other long-document vision-language tasks that rely on ranking.
  • Two-stage training starting from rendered text allows reuse of large text datasets for multimodal rerankers.
  • Single-pass scoring removes the need for multi-step decoding in listwise settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may scale to documents longer than those tested if the early-interaction compression remains effective.
  • Similar early-interaction designs could be tested in non-ranking multimodal generation pipelines to reduce token counts.
  • If the distilled supervision generalizes, the approach offers a template for distilling efficiency into other vision-language ranking models.

Load-bearing premise

That the early interaction and single-pass scoring, after training that begins with rendered text images, continue to preserve ranking quality on actual long multimodal documents.

What would settle it

Compare ZipRerank against a standard VLM-based listwise reranker on the same MMDocIR queries; if NDCG or similar ranking metrics drop by more than a few points even while the claimed latency gains hold, the claim of efficiency without quality loss is refuted.
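To make "drop by more than a few points" operational, the comparison would be scored with a graded ranking metric such as NDCG@k; a standard implementation (plain NumPy, the usual log2 discount) for reference:

```python
import numpy as np

def ndcg_at_k(relevances_in_ranked_order, k):
    """NDCG@k over graded relevance labels, in the order a reranker returned them."""
    rel = np.asarray(relevances_in_ranked_order, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # positions 1..k
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances_in_ranked_order, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# a reranker that puts the only relevant page third out of three
score = ndcg_at_k([0, 0, 1], k=3)  # -> 1/log2(4) = 0.5
```

Averaged over the benchmark's queries, a gap of "a few points" between ZipRerank and the baseline is then a concrete, reproducible number rather than an impression.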

Figures

Figures reproduced from arXiv: 2605.11864 by Lawrence B. Hsieh, Pengfei Wei, Yiqun Sun.

Figure 1
Figure 1. Speed-accuracy trade-off on MMDocIR (Dong et al., 2025) for page-level reranking (Recall@3 vs. LLM latency). ZipRerank (red; varying token keep ratios ρ) achieves state-of-the-art performance comparable to MM-R5 (Xu et al., 2025) while reducing latency by around 10×, and substantially narrows the gap to GPT-5-mini at about 58× lower inference cost.
Figure 2
Figure 2. Overview of the ZipRerank framework. ZipRerank integrates a two-stage training pipeline: (i) listwise pretraining on large-scale text data rendered as images, followed by (ii) vision-centric finetuning with VLM-teacher-distilled soft supervision. The efficient inference design combines query-aware visual token pruning and single-token listwise scoring, enabling end-to-end reranking in a single LLM forward pass.
Figure 3
Figure 3. Parameter study on the effect of image token keep ratio ρ on reranking effectiveness (Recall@1, 3, 5) and latency (LLM Time in ms) on first-stage results from DSEwiki-ss.
Figure 4
Figure 4. Parameter study on the effect of the number of input passages k on reranking effectiveness (Recall@1, 3, 5) and latency (LLM Time in ms) on first-stage results from DSEwiki-ss.
Figure 5
Figure 5. The reranking input prompt template and example target generation sequence for ZipRerank with Qwen3-VL.
Figure 6
Figure 6. Parameter studies on first-stage results from ColQwen.
read the original abstract

Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at https://github.com/dukesun99/ZipRerank.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ZipRerank, a listwise multimodal reranker for long documents that shortens visual input via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in one forward pass. It uses a two-stage training process consisting of listwise pretraining on large-scale text rendered as images followed by finetuning with soft-ranking labels distilled from a VLM teacher. On the MMDocIR benchmark, ZipRerank is reported to match or exceed state-of-the-art multimodal rerankers while cutting LLM inference latency by up to an order of magnitude.

Significance. If the empirical claims hold under scrutiny, the work is significant for practical vision-centric retrieval and M-RAG pipelines, where listwise reranking over long multimodal documents has been limited by high latency. The open-sourced code at the provided GitHub link strengthens reproducibility and potential adoption.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (two-stage training): the claim that listwise pretraining on rendered text images followed by VLM-distilled finetuning preserves ranking quality on real long multimodal documents is load-bearing for the MMDocIR results, yet no ablations are described that test transfer to documents containing non-textual visuals (charts, photos, complex layouts) absent from the rendered-text pretraining data.
  2. [Experiments] Experiments section: aggregate benchmark results are presented without reported statistical significance tests, exact train/validation/test splits for MMDocIR, or implementation details of the early-interaction module and single-pass scorer, which are required to substantiate the latency gains and performance parity with SOTA rerankers.
minor comments (1)
  1. The abstract states that code is available at https://github.com/dukesun99/ZipRerank; confirming that the repository includes the exact training scripts and model checkpoints used for the reported numbers would aid verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of our training strategy and experimental reporting that we will address to strengthen the paper. We respond point-by-point to the major comments below.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (two-stage training): the claim that listwise pretraining on rendered text images followed by VLM-distilled finetuning preserves ranking quality on real long multimodal documents is load-bearing for the MMDocIR results, yet no ablations are described that test transfer to documents containing non-textual visuals (charts, photos, complex layouts) absent from the rendered-text pretraining data.

    Authors: We acknowledge the value of explicit ablations to support the transfer claim. The finetuning stage is performed on the MMDocIR training set, which contains real long documents with charts, photos, and complex layouts, allowing adaptation beyond text-only pretraining. To directly test preservation of ranking quality, we will add ablation studies in the revised manuscript comparing models with and without the rendered-text pretraining stage, evaluated on MMDocIR subsets stratified by visual complexity (e.g., text-heavy vs. chart/photo-heavy documents). These results will be reported in §3 and the experiments section. revision: yes

  2. Referee: [Experiments] Experiments section: aggregate benchmark results are presented without reported statistical significance tests, exact train/validation/test splits for MMDocIR, or implementation details of the early-interaction module and single-pass scorer, which are required to substantiate the latency gains and performance parity with SOTA rerankers.

    Authors: We agree these details are essential for reproducibility and validating the claims. In the revised manuscript, we will update the Experiments section to report: (i) statistical significance tests (e.g., paired bootstrap or t-tests with p-values) for all main results against baselines; (ii) the exact train/validation/test splits used from MMDocIR; and (iii) expanded implementation details, including architecture specifications, hyperparameters, and pseudocode for the query-image early interaction module and single-pass scorer (moved to an appendix if needed for space). The open-sourced code already implements these, and we will cross-reference it explicitly. revision: yes
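The paired bootstrap the rebuttal commits to can be sketched in a few lines: resample the per-query score differences with replacement and count how often the mean difference fails to favor system A. The function name and resample count below are illustrative:

```python
import numpy as np

def paired_bootstrap_pvalue(metric_a, metric_b, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap p-value for "system A beats system B".

    metric_a / metric_b: per-query scores (e.g. Recall@3) for the two
    systems on the SAME queries, so the differences are paired.
    """
    diffs = np.asarray(metric_a, dtype=float) - np.asarray(metric_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = diffs.size
    wins = 0
    for _ in range(n_resamples):
        sample = diffs[rng.integers(0, n, size=n)]  # resample queries with replacement
        if sample.mean() > 0:
            wins += 1
    return 1.0 - wins / n_resamples  # fraction of resamples where A does not win
```

A small value (say below 0.05) would support the claim that a reported gap over a baseline is not an artifact of which queries happened to be sampled.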

Circularity Check

0 steps flagged

No circularity: empirical architecture and benchmark results

full rationale

The paper describes an engineering proposal (lightweight query-image interaction plus single-pass scoring) trained via a two-stage process (rendered-text pretraining followed by VLM distillation) and validated through direct comparisons on the MMDocIR benchmark. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters, self-citations, or renamed inputs. All load-bearing claims rest on external experimental measurements rather than internal tautologies.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach assumes standard VLM capabilities for image rendering and distillation; no explicit free parameters or invented entities are named in the abstract.

axioms (2)
  • domain assumption Vision-language models can effectively process rendered text images for pretraining
    Invoked in the listwise pretraining stage on text data rendered as images.
  • domain assumption VLM-teacher soft rankings provide useful supervision for multimodal finetuning
    Used in the second training stage.
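The second axiom can be given concrete shape. One common way to turn a teacher's ranking into soft supervision is a position-decayed target distribution over candidates, trained against the student's listwise scores with cross-entropy; the geometric decay below is an illustrative choice, not necessarily the paper's exact form.

```python
import numpy as np

def position_decayed_targets(teacher_ranking, decay=0.7):
    """Soft target distribution from a teacher ranking (illustrative sketch).

    teacher_ranking lists candidate indices best-first; the candidate at
    teacher rank r gets unnormalized weight decay**r, so early positions
    dominate but later ones still carry signal.
    """
    weights = np.array([decay ** r for r in range(len(teacher_ranking))])
    probs = weights / weights.sum()
    targets = np.empty_like(probs)
    targets[np.asarray(teacher_ranking)] = probs  # map rank position -> candidate index
    return targets

def soft_ranking_loss(student_logits, targets):
    """Cross-entropy between the soft targets and the student's listwise scores."""
    logits = np.asarray(student_logits, dtype=float)
    log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    return float(-(targets * log_probs).sum())

# teacher says candidate 2 > candidate 0 > candidate 1
targets = position_decayed_targets([2, 0, 1])
```

Because the targets are a full distribution rather than a single hard label, the student is rewarded for agreeing with the teacher's entire ordering, which is the sense in which the soft rankings "provide useful supervision".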

pith-pipeline@v0.9.0 · 5507 in / 1309 out tokens · 59238 ms · 2026-05-13T05:12:18.624791+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 11 internal anchors

  1. [1]

    FIRST: Faster Improved Listwise Reranking with Single Token Decoding

    Gangi Reddy, Revanth and Doo, JaeHyeok and Xu, Yifei and Sultan, Md Arafat and Swain, Deevya and Sil, Avirup and Ji, Heng. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.491

  2. [2]

    An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking

    arXiv preprint arXiv:2411.05508

  3. [3]

    RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!

    arXiv preprint arXiv:2312.02724. 2023

  4. [4]

    Learning to Rank Using Gradient Descent

    Proceedings of the 22nd International Conference on Machine Learning

  5. [5]

    MMDocIR: Benchmarking Multimodal Retrieval for Long Documents

    Dong, Kuicai and Chang, Yujing and Goh Xin Deik, Derrick and Li, Dexun and Tang, Ruiming and Liu, Yong. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025

  6. [6]

    MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval

    arXiv preprint arXiv:2506.12364

  7. [7]

    Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

    arXiv preprint arXiv:2510.15253

  8. [8]

    Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings

    Ma, Yubo and Li, Jinsong and Zang, Yuhang and Wu, Xiaobao and Dong, Xiaoyi and Zhang, Pan and Cao, Yuhang and Duan, Haodong and Wang, Jiaqi and Cao, Yixin and Sun, Aixin. Findings of the Association for Computational Linguistics: ACL 2025. 2025

  9. [9]

    Hierarchical Multimodal Transformers for Multipage DocVQA

    Pattern Recognition

  10. [10]

    Ryota Tanaka and Kyosuke Nishida and Kosuke Nishida and Taku Hasegawa and Itsumi Saito and Kuniko Saito

    Proceedings of the AAAI Conference on Artificial Intelligence

  11. [11]

    Towards Complex Document Understanding By Discrete Reasoning

    Fengbin Zhu and Wenqiang Lei and Fuli Feng and Chao Wang and Haozhou Zhang and Tat.

  12. [12]

    SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation

    2024

  13. [13]

    Document Understanding Dataset and Evaluation

    Jordy Van Landeghem and Rafal Powalski and Rub.

  14. [14]

    Dan Hendrycks and Collin Burns and Anya Chen and Spencer Ball

    Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks

  15. [15]

    Unifying Multimodal Retrieval via Document Screenshot Embedding

    Ma, Xueguang and Lin, Sheng-Chieh and Li, Minghan and Chen, Wenhu and Lin, Jimmy. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024

  16. [16]

    ColPali: Efficient Document Retrieval with Vision Language Models

    The Thirteenth International Conference on Learning Representations

  17. [17]

    Qwen3-VL Technical Report

    arXiv preprint arXiv:2511.21631

  18. [18]

    Multi-modal Information Integration for Document Retrieval

    2013 12th International Conference on Document Analysis and Recognition. 2013

  19. [19]

    Unified Multimodal Interleaved Document Representation for Retrieval

    arXiv preprint arXiv:2410.02729

  20. [20]

    Visual Question Answering: A Survey of Methods, Datasets, Evaluation, and Challenges

    ACM Computing Surveys

  21. [21]

    Retrieving Multimodal Information for Augmented Generation: A Survey

    Findings of the Association for Computational Linguistics: EMNLP 2023

  22. [22]

    A Survey of Multimodal Retrieval-Augmented Generation

    arXiv preprint arXiv:2504.08748

  23. [23]

    VQA: Visual Question Answering

    Proceedings of the IEEE International Conference on Computer Vision

  24. [24]

    Cross-modal Retrieval for Knowledge-based Visual Question Answering

    European Conference on Information Retrieval. 2024

  25. [25]

    The State of the Art for Cross-modal Retrieval: A Survey

    IEEE Access

  26. [26]

    MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines

    arXiv preprint arXiv:2409.12959. 2024

  27. [27]

    MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-Augmented Generation via Knowledge-Enhanced Reranking and Noise-Injected Training

    arXiv preprint arXiv:2407.21439

  28. [28]

    Re-ranking the Context for Multimodal Retrieval Augmented Generation

    arXiv preprint arXiv:2501.04695

  29. [29]

    When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios

    arXiv preprint arXiv:2507.20198

  30. [30]

    VoCo-LLaMA: Towards Vision Compression with Large Language Models

    Proceedings of the Computer Vision and Pattern Recognition Conference

  31. [31]

    VisionZip: Longer is Better but Not Necessary in Vision Language Models

    Proceedings of the Computer Vision and Pattern Recognition Conference

  32. [32]

    Token Merging: Your ViT But Faster

    The Eleventh International Conference on Learning Representations

  33. [33]

    Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

    Proceedings of the IEEE/CVF International Conference on Computer Vision

  34. [34]

    SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

    Forty-second International Conference on Machine Learning

  35. [35]

    Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs

    Song, Dingjie and Wang, Wenjun and Chen, Shunian and Wang, Xidong and Guan, Michael X. and Wang, Benyou. Proceedings of the 31st International Conference on Computational Linguistics. 2025

  36. [36]

    AdaFV: Rethinking of Visual-Language Alignment for VLM Acceleration

    arXiv preprint arXiv:2501.09532

  37. [37]

    An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

    European Conference on Computer Vision. 2024

  38. [38]

    Visual Instruction Tuning

    Advances in Neural Information Processing Systems

  39. [39]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    2024. doi:10.1016/j.neucom.2023.127063

  40. [40]

    Diver: A Multi-stage Approach for Reasoning-intensive Information Retrieval

    arXiv preprint arXiv:2508.07995

  41. [41]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    arXiv preprint arXiv:2506.05176

  42. [42]

    RankLLM: A Python Package for Reranking with LLMs

    Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval

  43. [43]

    Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents

    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

  44. [44]

    MTEB: Massive Text Embedding Benchmark

    Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

  45. [45]

    BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

    arXiv preprint arXiv:2104.08663

  46. [46]

    Passage Re-ranking with BERT

    arXiv preprint arXiv:1901.04085

  47. [47]

    AoE: Angle-optimized Embeddings for Semantic Textual Similarity

    Li, Xianming and Li, Jing. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). 2024

  48. [48]

    Reimers, Nils and Gurevych, Iryna

  49. [49]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    arXiv preprint arXiv:2212.03533

  50. [50]

    M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    arXiv preprint arXiv:2402.03216

  51. [51]

    Approximate Nearest Neighbor Search on High Dimensional Data: Experiments, Analyses, and Improvement

    IEEE Transactions on Knowledge and Data Engineering

  52. [52]

    Dense Passage Retrieval for Open-Domain Question Answering

    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  53. [53]

    RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models

    arXiv preprint arXiv:2309.15088

  54. [54]

    ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability

    arXiv preprint arXiv:2508.07050

  55. [55]

    Guiding Retrieval Using LLM-based Listwise Rankers

    European Conference on Information Retrieval. 2025

  56. [56]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    arXiv preprint arXiv:2307.08691

  57. [57]

    Decoupled Weight Decay Regularization

    arXiv preprint arXiv:1711.05101

  58. [58]

    ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

    Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval

  59. [59]

    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    Advances in Neural Information Processing Systems

  60. [60]

    SnapKV: LLM Knows What You Are Looking For Before Generation

    Advances in Neural Information Processing Systems

  61. [61]

    Quest: Query-aware Sparsity for Efficient Long-context LLM Inference

    arXiv preprint arXiv:2406.10774

  62. [62]

    OmniKV: Dynamic Context Selection for Efficient Long-context LLMs

    The Thirteenth International Conference on Learning Representations

  63. [63]

    PQCache: Product Quantization-based KVCache for Long Context LLM Inference

    Proceedings of the ACM on Management of Data

  64. [64]

    DiversiNews: Enriching News Consumption with Relevant Yet Diverse News Articles Retrieval

    Proceedings of the VLDB Endowment

  65. [65]

    A General Framework for Producing Interpretable Semantic Text Embeddings

    The Thirteenth International Conference on Learning Representations (ICLR)

  66. [66]

    PRISM: A Framework for Producing Interpretable Political Bias Embeddings with Political-Aware Cross-Encoder

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025

  67. [67]

    Don't Reinvent the Wheel: Efficient Instruction-Following Text Embedding based on Guided Space Transformation

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025

  68. [68]

    Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning

    arXiv preprint arXiv:2506.08354

  69. [69]

    One Swallow Does Not Make a Summer: Understanding Semantic Structures in Embedding Spaces

    arXiv preprint arXiv:2512.00852

  70. [70]

    Reverse Query-aware Locality-sensitive Hashing for High-dimensional Furthest Neighbor Search

    2017 IEEE 33rd International Conference on Data Engineering (ICDE)

  71. [71]

    Accurate and Fast Asymmetric Locality-sensitive Hashing Scheme for Maximum Inner Product Search

    Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD)

  72. [72]

    Sublinear Time Nearest Neighbor Search over Generalized Weighted Space

    International Conference on Machine Learning. 2019

  73. [73]

    Locality-sensitive Hashing Scheme Based on Longest Circular Co-substring

    Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD)

  74. [74]

    Point-to-Hyperplane Nearest Neighbor Search Beyond the Unit Hypersphere

    Proceedings of the 2021 International Conference on Management of Data (SIGMOD)

  75. [75]

    Huang, Qiang and Wang, Yanhao and Tung, Anthony KH

  76. [76]

    Lightweight-yet-Efficient: Revitalizing Ball-Tree for Point-to-Hyperplane Nearest Neighbor Search

    2023 IEEE 39th International Conference on Data Engineering (ICDE)

  77. [77]

    Diversity-Aware k-Maximum Inner Product Search Revisited

    arXiv preprint arXiv:2402.13858

  78. [78]

    Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

    Proceedings of the AAAI Conference on Artificial Intelligence

  79. [79]

    Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

    arXiv preprint arXiv:2603.06213

  80. [80]

    MG^2-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

    arXiv preprint arXiv:2604.04969

Showing first 80 references.