Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Pith reviewed 2026-05-10 13:41 UTC · model grok-4.3
The pith
Qwen3 Embedding models set new highs on multilingual text retrieval by synthesizing their own training data from the Qwen3 foundation models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Qwen3 Embedding series, built on Qwen3 foundation models in sizes 0.6B, 4B, and 8B, reaches state-of-the-art results on the multilingual MTEB benchmark for text embedding as well as on code retrieval, cross-lingual retrieval, and multilingual retrieval tasks. This performance comes from a training pipeline that uses the Qwen3 models both as the core architecture and as the source for synthesizing high-quality, diverse training data across domains and languages, followed by supervised fine-tuning and model merging to improve robustness.
What carries the argument
A multi-stage training pipeline that treats the Qwen3 LLMs as both backbone models and generators of rich, domain-specific training data, combined with model merging after supervised fine-tuning.
If this is right
- Developers can select among the three model sizes to match available compute when deploying embedding or reranking systems.
- The same pipeline supports stronger performance on code search and cross-language document retrieval without custom architectures for each task.
- Public release of the models under the Apache 2.0 license enables direct use and further fine-tuning by the community.
- The approach shows that foundation models can generate the data needed to specialize themselves for embedding tasks across many languages.
- Model merging after fine-tuning provides a practical way to combine strengths from different training stages.
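To make the model-merging point concrete, here is a minimal sketch that assumes merging means a weighted average of parameters across checkpoints with identical names and shapes. The page does not say which merging strategy the paper uses, so treat this as one plausible instance rather than the authors' method. The `Params` type, the `merge_checkpoints` function, and the toy checkpoints are all hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical flat parameter store: parameter name -> list of values.
Params = Dict[str, List[float]]


def merge_checkpoints(checkpoints: Sequence[Params],
                      weights: Sequence[float]) -> Params:
    """Return the weighted average of checkpoints with matching names and shapes.

    Assumption: linear averaging. The source does not confirm the merge rule.
    """
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]

    merged: Params = {}
    for name, reference in checkpoints[0].items():
        accumulated = [0.0] * len(reference)
        for checkpoint, weight in zip(checkpoints, normalized):
            values = checkpoint[name]
            if len(values) != len(reference):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            for i, value in enumerate(values):
                accumulated[i] += weight * value
        merged[name] = accumulated
    return merged


# Toy checkpoints standing in for two fine-tuning runs.
ckpt_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
ckpt_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = merge_checkpoints([ckpt_a, ckpt_b], [1.0, 1.0])
print(merged)  # {'layer.weight': [2.0, 3.0], 'layer.bias': [0.5]}
```

Equal weights give a plain average. Unequal weights would let one checkpoint dominate, which is the knob a practitioner would tune when combining checkpoints from different training stages.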
Where Pith is reading between the lines
- If self-synthesis of training data works reliably, it could reduce dependence on large manually labeled datasets for future embedding models.
- The results suggest that scaling the same family of models for both generation and embedding may create tighter feedback loops than using separate models for each role.
- Testing the models on entirely held-out languages or domains not used in the synthesis step would clarify how far the generalization extends.
- Similar data-generation and merging steps could be tried on other open foundation models to see whether the performance pattern repeats.
Load-bearing premise
The reported gains reflect real generalization from the training methods rather than overlap with benchmark data or overfitting to known test sets.
What would settle it
Evaluating the released models on a new retrieval benchmark created after the paper's release, with no possible overlap with the synthesized training data, and checking whether performance remains at the claimed level.
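One way to run such a check is to score the ranked results on the new benchmark with a standard retrieval metric. The sketch below assumes nDCG@10, a common choice for retrieval benchmarks. The page does not say which metric the comparison would use, and the function names and toy data are hypothetical.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (rank 1 is undiscounted)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str],
              relevance: Dict[str, float],
              k: int = 10) -> float:
    """nDCG@k for one query: DCG of the system ranking divided by the ideal DCG."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Toy query with a single relevant document, "d1".
relevance = {"d1": 1.0}
perfect = ndcg_at_k(["d1", "d2", "d3"], relevance)
second_place = ndcg_at_k(["d2", "d1", "d3"], relevance)
print(perfect, round(second_place, 4))
```

Averaging `ndcg_at_k` over all queries of the new benchmark gives a single score that can be set against the level the paper reports.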
Original abstract
In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Qwen3 Embedding series, built on Qwen3 foundation models, as an advance over GTE-Qwen for text embedding and reranking. It describes a multi-stage training pipeline that combines large-scale unsupervised pre-training, supervised fine-tuning on high-quality datasets synthesized by the Qwen3 LLMs themselves, and model merging strategies. Models are released in 0.6B, 4B, and 8B sizes for both embedding and reranking, with the central claim being state-of-the-art results on the multilingual MTEB benchmark as well as on code retrieval, cross-lingual retrieval, and multilingual retrieval tasks. The models are made publicly available under Apache 2.0.
Significance. If the empirical claims hold after verification, the work would demonstrate a practical way to leverage the same foundation LLM family for both backbone architecture and training-data synthesis, yielding gains in multilingual and retrieval settings. The public release of the model weights is a clear strength that supports reproducibility and community follow-up work.
major comments (2)
- [Abstract and Evaluation sections] The manuscript asserts SOTA performance on MTEB multilingual embedding and multiple retrieval tasks, yet the abstract supplies no quantitative scores, baseline comparisons, ablation results, or evaluation-protocol details. Without these, it is impossible to determine whether the reported gains are statistically meaningful or free of common confounds such as test-set leakage.
- [Training pipeline description] The multi-stage pipeline uses Qwen3 LLMs both as the backbone and to synthesize the high-quality, diverse training data. No decontamination, n-gram overlap filtering, or membership-inference checks against MTEB test sets or retrieval corpora are described. Because the same model family generates the training examples, any overlap would produce inflated scores without genuine generalization; this directly undermines the central SOTA claim.
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one or two concrete performance numbers (e.g., MTEB average score and the strongest baseline) to allow readers to gauge the magnitude of the advance immediately.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which have helped clarify the presentation of our results and the rigor of our training pipeline description. We address each major comment below and have revised the manuscript accordingly.
-
Referee: [Abstract and Evaluation sections] The manuscript asserts SOTA performance on MTEB multilingual embedding and multiple retrieval tasks, yet the abstract supplies no quantitative scores, baseline comparisons, ablation results, or evaluation-protocol details. Without these, it is impossible to determine whether the reported gains are statistically meaningful or free of common confounds such as test-set leakage.
Authors: We agree that the abstract would benefit from explicit quantitative results to allow readers to immediately assess the claimed improvements. In the revised manuscript we have updated the abstract to include key MTEB multilingual scores, comparisons against GTE-Qwen and other strong baselines, and a concise statement of the evaluation protocol. The Evaluation section already reports full results, ablations, and protocol details; we have added further text on statistical significance testing and explicit checks for test-set leakage to strengthen this part of the paper.
Revision: yes
-
Referee: [Training pipeline description] The multi-stage pipeline uses Qwen3 LLMs both as the backbone and to synthesize the high-quality, diverse training data. No decontamination, n-gram overlap filtering, or membership-inference checks against MTEB test sets or retrieval corpora are described. Because the same model family generates the training examples, any overlap would produce inflated scores without genuine generalization; this directly undermines the central SOTA claim.
Authors: The referee correctly notes that the original manuscript did not explicitly describe decontamination steps. We have added a dedicated paragraph in the Training Pipeline section that details the decontamination process: n-gram overlap filtering (with a conservative threshold) was applied to remove any potential overlap with MTEB test sets and retrieval corpora, and membership-inference-style checks were performed on the synthesized data. These steps were part of the data-preparation pipeline and ensure that the reported gains reflect generalization rather than leakage.
Revision: yes
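The n-gram overlap filtering named in the rebuttal can be sketched as follows. This is a minimal illustration, not the authors' procedure: the page gives neither the n-gram length nor the threshold, so `n`, `max_overlap`, the whitespace tokenization, and every function name here are assumptions.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Lowercased, whitespace-tokenized n-grams of a text (empty if the text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_test_index(test_texts: Iterable[str], n: int) -> Set[NGram]:
    """Union of all n-grams that appear anywhere in the benchmark test texts."""
    index: Set[NGram] = set()
    for text in test_texts:
        index |= ngrams(text, n)
    return index


def overlap_ratio(example: str, test_index: Set[NGram], n: int) -> float:
    """Fraction of the example's n-grams that also occur in the test index."""
    grams = ngrams(example, n)
    if not grams:
        return 0.0
    return len(grams & test_index) / len(grams)


def decontaminate(train_texts: Iterable[str],
                  test_texts: Iterable[str],
                  n: int = 8,
                  max_overlap: float = 0.0) -> List[str]:
    """Keep training examples whose overlap with the test set is at most max_overlap."""
    index = build_test_index(test_texts, n)
    return [t for t in train_texts if overlap_ratio(t, index, n) <= max_overlap]


# Toy data; n=3 so that short sentences produce n-grams at all.
test_set = ["the quick brown fox jumps over the lazy dog"]
synthetic = ["a quick brown fox appears", "embeddings map text to vectors"]
kept = decontaminate(synthetic, test_set, n=3)
print(kept)  # ['embeddings map text to vectors']
```

A threshold of zero is the most conservative setting: one shared n-gram is enough to drop an example. Exact n-gram matching misses paraphrased leakage, which is presumably why the rebuttal also mentions membership-inference-style checks.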
Circularity Check
No significant circularity; empirical claims rest on external benchmarks
The paper presents an empirical ML engineering effort: a multi-stage training pipeline (unsupervised pre-training + supervised fine-tuning + model merging) on data synthesized by Qwen3 LLMs, with final performance measured on external benchmarks such as MTEB and various retrieval tasks. No mathematical derivation chain, first-principles equations, or 'predictions' exist that could reduce to the inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, uniqueness theorems imported via self-citation, or ansatz smuggling appear. The SOTA claims are supported by reported benchmark numbers rather than any closed-loop logic internal to the paper. Minor self-citations to prior Qwen/GTE work are present but not load-bearing for the central result, which remains independently falsifiable via the external evaluations.
Axiom & Free-Parameter Ledger
free parameters (2)
- model sizes
- training hyperparameters
axioms (2)
- domain assumption: Qwen3 LLMs have robust capabilities in multilingual text understanding and generation
- domain assumption: High-quality, diverse training data can be synthesized by the same LLMs across domains and languages
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel
Relation to the paper passage: unclear
Paper passage: multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets... Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced
Relation to the paper passage: unclear
Paper passage: achieves state-of-the-art results across diverse benchmarks... excels on the multilingual evaluation benchmark MTEB
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
STRABLE: Benchmarking Tabular Machine Learning with Strings
A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
-
ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
-
FollowTable: A Benchmark for Instruction-Following Table Retrieval
FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond ...
-
DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making
DermAgent orchestrates seven vision-language tools in a Plan-Execute-Reflect loop with dual-modality retrieval from 413k cases and a critic module to outperform GPT-4o by 17.6% in zero-shot dermatological diagnosis accuracy.
-
BOOKMARKS: Efficient Active Storyline Memory for Role-playing
BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.
-
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
-
AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions
AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.
-
LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving
LeanSearch v2 recovers 46.1% of ground-truth premise groups for research-level Lean 4 theorems within 10 candidates and raises fixed-loop proof success to 20%.
-
LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving
LeanSearch v2 recovers 46.1% of ground-truth premise groups on research-level Mathlib theorems and raises fixed-loop proof success from 4% to 20% via embedding-reranker plus iterative sketch-retrieve-reflect retrieval.
-
AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
AssemblyBench dataset and AssemblyDyno transformer model enable physics-aware prediction of assembly sequences and trajectories for complex industrial objects from multimodal instructions and 3D shapes.
-
Very Efficient Listwise Multimodal Reranking for Long Documents
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
-
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...
-
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.
-
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
-
Skill Description Deception Attack against Task Routing in Internet of Agents
Malicious agents can deceive LLM-based task routers in Internet of Agents systems by generating fake skill descriptions, achieving up to 98% success rate across nine domains.
-
CHASM: Online Changepoint Detection in Temporal and Cross-Variable Dependence
CHASM detects changes in temporal and cross-variable dependence in multivariate time series by monitoring the truncated eigenvalue sequence of a recursively estimated DMD operator, using optimal assignment and augment...
-
Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning
PIQL integrates train-time-only privileged information into tabular foundation models via new constructions and a reconstruction architecture to achieve faster convergence and better generalization.
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.
-
LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
-
OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries
OBLIQ-Bench reveals that modern retrievers fail to surface documents for latent and implicit queries even though LLMs reliably recognize relevance when those documents are provided.
-
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.
-
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
-
TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding
TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.
-
Rational Communication Shapes Morphological Composition
Using historical corpora and the Rational Speech Act framework, attested English morphological compositions are ranked higher than plausible alternatives from the same time period when both semantic recoverability and...
-
ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
ReasonAudio benchmark shows current text-audio retrieval models fail at reasoning tasks like negation and duration discrimination beyond simple semantic matching.
-
Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval
Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations ...
-
Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields full agreement on 16 LLM rankings across judges on Brazilian Portuguese chats, compared to only 7/16 under holistic scoring, while ...
-
Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models
CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking model...
-
Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
-
E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems
E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-bas...
-
UnIte: Uncertainty-based Iterative Document Sampling for Domain Adaptation in Information Retrieval
UnIte selects target-domain documents for pseudo-query generation by filtering high aleatoric uncertainty and prioritizing high epistemic uncertainty, yielding +2.45 to +3.49 nDCG@10 gains on BEIR with ~4k samples.
-
Similar Users-Augmented Interest Network
SUIN improves CTR prediction by augmenting target user sequences with similar users' behaviors via embedding-based retrieval, user-specific position encoding, and user-aware target attention.
-
AsmRAG: LLM-Driven Malware Detection by Retrieving Functionally Similar Assembly Code
AsmRAG detects malware at 96% F1 and attributes families at 95% F1 by retrieving functionally similar assembly code via LLM embeddings and density-weighted anchor selection, remaining robust to metamorphic obfuscation.
-
ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression
ResRank unifies retrieval and listwise reranking by compressing passages to one token each, using residual connections and cosine-similarity scoring, achieving competitive effectiveness on TREC DL and BEIR benchmarks ...
-
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
-
TeleEmbedBench: A Multi-Corpus Embedding Benchmark for RAG in Telecommunications
TeleEmbedBench is the first multi-corpus benchmark showing LLM-based embedding models significantly outperform traditional sentence-transformers on telecommunications specifications and code for retrieval accuracy and...
-
Matlas: A Semantic Search Engine for Mathematics
Matlas introduces a semantic retrieval system over 8.07 million mathematical statements from papers and textbooks, using dependency graphs and topological unfolding for self-contained search via natural language queries.
-
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
-
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
-
Crowded in B-Space: Calibrating Shared Directions for LoRA Merging
Pico reduces LoRA merge interference by calibrating over-shared directions in the B matrix before merging, yielding 3.4-8.3 point accuracy gains and sometimes beating joint training.
-
On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability
LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulne...
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
-
Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs
A multi-agent framework reconstructs the evolutionary graph of post-training LLM datasets, revealing domain patterns like vertical refinement in math data and systemic issues like redundancy and benchmark contaminatio...
-
Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities
TransFIR enables reasoning on temporal knowledge graphs for emerging entities by clustering them into semantic groups and borrowing interaction histories from similar known entities, yielding 28.6% average MRR gains.
-
Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning
Multimodal contrastive learning using multilinear products is fragile to single bad modalities, and a gated version improves top-1 retrieval accuracy on synthetic and real trimodal data.
-
Retrieval Augmented Conversational Recommendation with Reinforcement Learning
RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.
-
Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation
The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.
-
BBC: Improving Large-k Approximate Nearest Neighbor Search with a Bucket-based Result Collector
BBC improves large-k ANN efficiency via bucketed candidate buffers and optimized re-ranking, delivering up to 3.8x speedup at recall@k=0.95.
-
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs
LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
-
Task-Adaptive Embedding Refinement via Test-time LLM Guidance
Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.
-
Letting the neural code speak: Automated characterization of monkey visual neurons through human language
Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.
-
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
-
POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery task...
-
Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel
False-positive bug reports in the Linux kernel consume effort comparable to real bugs and can be filtered by LLMs using retrieval-augmented generation at 88% F1.
-
Do not copy and paste! Rewriting strategies for code retrieval
Full natural-language rewriting of code and queries boosts retrieval on code benchmarks while corpus-only rewriting often hurts, with token entropy difference serving as a cheap predictor of gains.
-
MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal
MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide di...
-
Topic Is Not Agenda: A Citation-Community Audit of Text Embeddings
Embeddings retrieve same-subfield papers at 45-52% but same-agenda papers at only 15-21%; citation rerank reaches 57-59% on agenda queries.
Reference graph
Works this paper leans on
-
[1]
Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024 , pp. 2318–2335, Bangkok, Thailand, August
work page 2024
-
[2]
URL https:// aclanthology.org/2024.findings-acl.137/
Association for Computational Linguistics. URL https:// aclanthology.org/2024.findings-acl.137/. Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, M´arton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemi ´nski, Genta Indra Winata, et al. MMTEB: Massive multilingual text embedding benchmark. In The Thirteenth International Conferen...
work page 2024
-
[3]
Scaling synthetic data creation with 1,000,000,000 personas
URL https://openreview.net/forum?id=zl3pfz4VCV. Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094,
-
[4]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. arXiv preprint arXiv:2405.17428,
work page internal anchor Pith review arXiv
-
[6]
Gemini embedding: Gen- eralizable embeddings from gemini.arXiv preprint arXiv:2503.07891, 2025
Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NV-embed: Improved techniques for training LLMs as generalist embedding models. In The Thirteenth International Conference on Learning Representations , 2025a. URL https: //openreview.net/forum?id=lgsyLSsDRe. Jinhyuk Lee, Feiyang Chen, Sahil Dua, Danie...
-
[7]
Towards General Text Embeddings with Multi-stage Contrastive Learning
URL https://arxiv.org/ abs/2308.03281. Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. Zero-shot listwise document reranking with a large language model. arXiv preprint arXiv:2305.02156,
work page internal anchor Pith review arXiv
-
[8]
MTEB: Massive text embed- ding benchmark
Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embed- ding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014–2037, Dubrovnik, Croatia, May
work page 2014
-
[9]
URL https://aclanthology.org/2023.eacl-main.148/
Association for Computa- tional Linguistics. URL https://aclanthology.org/2023.eacl-main.148/. Niklas Muennighoff, Hongjin SU, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning. In The Thirteenth International Con- ference on Learning Representations,
work page 2023
-
[10]
Representation Learning with Contrastive Predictive Coding
URL https://openreview.net/forum?id=BC4lIvfSzv. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748,
-
[11]
Rankvicuna: Zero-shot listwise document reranking with open-source large language models
Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. Rankvicuna: Zero-shot listwise document reranking with open-source large language models. arXiv preprint arXiv:2309.15088,
-
[12]
Sentence-BERT: Sentence embeddings using Siamese BERT-networks
Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992, Hong Kong, China, November
-
[13]
URL https://aclanthology.org/D19-1410/
Association for Computational Linguistics. URL https://aclanthology.org/D19-1410/. Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction-finetuned text embeddings. In Findings of the Association for Computational Linguistics: ACL 2023, pp. ...
-
[14]
Text Embeddings by Weakly-Supervised Contrastive Pre-training
URL https://arxiv.org/abs/2212.03533. Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11897–11916, Bangkok, Thailand, August
-
[15]
URL https://aclanthology.org/2024.acl-long.642/
Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.642/. Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, and Luca Soldaini. Followir: Evaluating and teaching information retrieval models to follow instructions. arXiv preprint arXiv:2403.15246,
-
[16]
Association for Computing Machinery. URL https://doi.org/10.1145/3626772.3657878. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,
-
[17]
A two-stage adaptation of large language models for text ranking
Longhui Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, and Min Zhang. A two-stage adaptation of large language models for text ranking. In Findings of the Association for Computational Linguistics ACL 2024, pp. 11880–11891, 2024a. Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baoso...
-
[18]
Embedding in recommender systems: A survey
Xiangyu Zhao, Maolin Wang, Xinjian Zhao, Jiansheng Li, Shucheng Zhou, Dawei Yin, Qing Li, Jiliang Tang, and Ruocheng Guo. Embedding in recommender systems: A survey. arXiv preprint arXiv:2310.18608,
-
[19]
A Appendix
A.1 Synthetic Data
We construct four types of synthetic data (retrieval, bitext mining, semantic textual similarity, and classification) to enable the model to adapt to various similarity tasks during pre-training. To ensure both multilingual and cross-lingual diversity, the data is generated using Qwen3 32B. Below is an examp...
-
[20]
(MTEB(cmn, v1).
MTEB(Code, v1)
Model | Avg. | Apps | COIR-CodeSearch-Net | Code-Edit-Search | Code-Feedback-MT | Code-Feedback-ST | Code-SearchNet-CCR | Code-SearchNet | Code-Trans-Ocean-Contest | Code-Trans-Ocean-DL | CosQA | Stack-Overflow-QA | Synthetic-Text2SQL
BGEmultilingual | 62.04 | 22.93 | 68.14 | 60.48 | 60.52 | 76.70 | 73.23 | 83.43 | 86.84 | 32.64 | 27.93 | 92.93 | 58.67
NV-Embed-v2 | 63.74 | 29.72 | 61.85 | 73.96...