arXiv preprint arXiv:2108.13897
10 Pith papers cite this work.
citing papers explorer
- When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval
  Identifies the generative-discriminative gap in LLM hard negative synthesis for retrieval and proposes CausalNeg using CoT counterfactual perturbation plus query-view entropy maximization to generate more effective negatives.
- M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
  M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
- C-Pack: Packed Resources For General Chinese Embeddings
  C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
- Field Order Should Not Matter: Permutation-Invariant Embedding Model Fine-Tuning for Structured Metadata Retrieval
  Permutation-invariant fine-tuning (PI-FT) randomizes field order and applies dropout during embedding model training to eliminate sensitivity to serialization order, reducing order-change penalty from 7.4 to 0.2 nDCG@10 on a generated multilingual DevDataBench while outperforming zero-shot baselines.
- Lost in Decoding? Reproducing and Stress-Testing the Look-Ahead Prior in Generative Retrieval
  Reproduction confirms PAG boosts generative retrieval effectiveness, but its look-ahead planning signal collapses under intent-preserving typos and query mismatches, reverting performance to unguided decoding.
- MARCA: A Checklist-Based Benchmark for Multilingual Web Search
  MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.
- Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation
  SG-SRL applies cross-lingual semantic RL on source monolingual data plus a recovery stage to improve semantic grounding over standard SFT in low-resource target-language generation.
- All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG
  Multilingual RAG rerankers exhibit language bias that limits cross-lingual evidence use, and the proposed LAURA method aligns ranking with downstream generation utility to reduce the bias and improve performance.
- FRAGATA: Semantic Retrieval of HPC Support Tickets via Hybrid RAG over 20 Years of Request Tracker History
  Fragata applies hybrid RAG to enable semantic retrieval of HPC support tickets across 20 years of history, handling language differences, typos, and varied wording better than traditional keyword search.
- From Tokens to Concepts: Leveraging SAE for SPLADE