IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
Late chunking: Contextual chunk embeddings using long-context embedding models, 2024
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
SPIRE presents a tree-structured retrieval method using subdocuments, paths, and dual contextualization that produces higher-quality and more diverse citations than passage-based baselines on HTML QA benchmarks.
Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improves when initialized from pretrained CLM models.
Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.
A RAG pipeline with contextual PDF chunking, question-and-answer-aware retrieval and reranking using Qwen3 models reaches 0.96 accuracy on a Ukrainian multi-domain document QA shared task.
citing papers explorer
-
IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
-
Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
-
SPIRE: Structure-Preserving Interpretable Retrieval of Evidence
SPIRE presents a tree-structured retrieval method using subdocuments, paths, and dual contextualization that produces higher-quality and more diverse citations than passage-based baselines on HTML QA benchmarks.
-
Should We Still Pretrain Encoders with Masked Language Modeling?
Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improves when initialized from pretrained CLM models.
-
Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering
Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.
-
Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding
A RAG pipeline with contextual PDF chunking, question-and-answer-aware retrieval and reranking using Qwen3 models reaches 0.96 accuracy on a Ukrainian multi-domain document QA shared task.