Pith Number

pith:LE7ARFCN

pith:2024:LE7ARFCNHEWEZEENCZ3Z4QA3V5
not attested · not anchored · not stored · refs resolved

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Bokai Xu, Chaoyue Tang, Junbo Cui, Junhao Ran, Maosong Sun, Shi Yu, Shuo Wang, Xu Han, Yukun Yan, Zhenghao Liu, Zhiyuan Liu

VisRAG retrieves and generates from multi-modal documents by embedding them directly as images rather than parsing to text.

arxiv:2410.10594 v2 · 2024-10-14 · cs.IR · cs.AI · cs.CL · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{LE7ARFCNHEWEZEENCZ3Z4QA3V5}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 20-40% end-to-end performance gain over a traditional text-based RAG pipeline.

C2 · weakest assumption

That vision-language models can reliably embed and retrieve relevant information directly from document images without text parsing, and that the collected open-source plus synthetic training data generalizes to unseen real-world multi-modality documents.

C3 · one-line summary

VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.

References

43 extracted · 43 resolved · 11 Pith anchors

[1] GPT-4 Technical Report · arXiv:2303.08774
[2] A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity · 2023
[3] ALLaVA: Harnessing GPT4V-synthesized Data for a Lite Vision-Language Model
[4] PP-OCR: A Practical Ultra Lightweight OCR System · arXiv:2009.09941
[5] ColPali: Efficient Document Retrieval with Vision Language Models · arXiv:2407.01449

Formal links

1 machine-checked theorem link

Cited by

25 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:47.418247Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

593e08944d392c4c908d16779e401baf6845fa73cb450646cc58fec8f40735bd

Aliases

arxiv: 2410.10594 · arxiv_version: 2410.10594v2 · doi: 10.48550/arxiv.2410.10594 · pith_short_12: LE7ARFCNHEWE · pith_short_16: LE7ARFCNHEWEZEEN · pith_short_8: LE7ARFCN
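The short aliases above are plain prefixes of the full Pith Number. For this record, the full number also coincides with the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. That second relationship is inferred from the values on this page, not from a published specification, so the sketch below is an observation about this one record rather than the builder's documented algorithm. The function names `short_alias` and `base32_prefix` are hypothetical.

```python
import base64

# Values copied from this record.
PITH_NUMBER = "LE7ARFCNHEWEZEENCZ3Z4QA3V5"
CANONICAL_HASH = "593e08944d392c4c908d16779e401baf6845fa73cb450646cc58fec8f40735bd"


def short_alias(pith_number: str, length: int) -> str:
    """Return the first `length` characters of a Pith Number."""
    return pith_number[:length]


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: standard RFC 4648 Base32 (uppercase alphabet) with a 26-character
    cut. This matches the values on this page but is not a documented rule.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


if __name__ == "__main__":
    for n in (8, 12, 16):
        print(f"pith_short_{n}: {short_alias(PITH_NUMBER, n)}")
    # Compare the Base32 prefix of the hash with the listed Pith Number.
    print("base32 prefix equals Pith Number:", base32_prefix(CANONICAL_HASH) == PITH_NUMBER)
```

If the relationship holds for other records too, a client could resolve a short alias by prefix match and cross-check the full number against the canonical hash without a network call.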
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/LE7ARFCNHEWEZEENCZ3Z4QA3V5 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 593e08944d392c4c908d16779e401baf6845fa73cb450646cc58fec8f40735bd
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "ef855c401c9db5f58828228443d2d54b7befe49e7a2d658a3c722ed3ecc37174",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL",
      "cs.CV"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.IR",
    "submitted_at": "2024-10-14T15:04:18Z",
    "title_canon_sha256": "06d798e0973d1a421d2517422dcf3d932d2229c42b4b5b9dc66fd712adbdc73e"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.10594",
    "kind": "arxiv",
    "version": 2
  }
}
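The verification one-liner above can be unpacked into a small offline script. This is a minimal sketch that assumes the canonical form is exactly what the one-liner implies: keys sorted, compact separators, non-ASCII characters kept as-is, UTF-8 bytes. The record literal is copied from the JSON listing above, the helper names are hypothetical, and the script reports whether the digest matches rather than assuming that it does.

```python
import hashlib
import json

# Expected digest, copied from the "Canonical hash" field of this record.
EXPECTED_SHA256 = "593e08944d392c4c908d16779e401baf6845fa73cb450646cc58fec8f40735bd"

# Canonical record, copied from the "Canonical record JSON" listing.
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "ef855c401c9db5f58828228443d2d54b7befe49e7a2d658a3c722ed3ecc37174",
        "cross_cats_sorted": ["cs.AI", "cs.CL", "cs.CV"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.IR",
        "submitted_at": "2024-10-14T15:04:18Z",
        "title_canon_sha256": "06d798e0973d1a421d2517422dcf3d932d2229c42b4b5b9dc66fd712adbdc73e",
    },
    "schema_version": "1.0",
    "source": {"id": "2410.10594", "kind": "arxiv", "version": 2},
}


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and compact separators, then encode as UTF-8."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


if __name__ == "__main__":
    digest = canonical_sha256(CANONICAL_RECORD)
    print("computed:", digest)
    print("expected:", EXPECTED_SHA256)
    print("match" if digest == EXPECTED_SHA256 else "mismatch")
```

Because the keys are sorted before hashing, the digest does not depend on the order in which a mirror happens to store the fields, which is what lets independent copies of the record agree on one hash.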