VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.
It is built upon SigLIP-400M and Qwen2-7B (Yang et al.,
1 Pith paper cites this work.
Fields: cs.IR (1)
Years: 2024 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents