pith. machine review for the scientific record.

arxiv: 2506.05176 · v3 · submitted 2025-06-05 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 13:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords text embedding · reranking · multilingual retrieval · foundation models · data synthesis · model merging · cross-lingual retrieval

The pith

Qwen3 Embedding models set new highs on multilingual text retrieval by synthesizing their own training data from the Qwen3 foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Qwen3 Embedding series as an advance over earlier GTE-Qwen models for turning text into useful vectors and for reranking search results. It builds these systems directly on the Qwen3 large language models through a multi-stage process of large-scale unsupervised pre-training, supervised fine-tuning on datasets generated by the same models, and model merging. The series comes in three sizes (0.6 billion, 4 billion, and 8 billion parameters) that support both embedding and reranking, letting users trade speed for accuracy. A sympathetic reader would care because better embeddings improve search engines, recommendation systems, and analysis tools that must work across many languages and domains at once.

Core claim

The Qwen3 Embedding series, built on Qwen3 foundation models in sizes 0.6B, 4B, and 8B, reaches state-of-the-art results on the multilingual MTEB benchmark for text embedding as well as on code retrieval, cross-lingual retrieval, and multilingual retrieval tasks. This performance comes from a training pipeline that uses the Qwen3 models both as the core architecture and as the source for synthesizing high-quality, diverse training data across domains and languages, followed by supervised fine-tuning and model merging to improve robustness.

What carries the argument

Multi-stage training pipeline that treats the Qwen3 LLMs as both backbone models and generators of rich, domain-specific training data, combined with model merging after supervised fine-tuning.
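The merging step can be made concrete with a minimal sketch. The abstract mentions "model merging strategies" without specifying them, so the uniform weight average over fine-tuning checkpoints below is an illustrative assumption, not the paper's actual method:

```python
# Hedged sketch: post-SFT model merging as plain weight averaging.
# Scalar "parameters" stand in for real tensor weights.
def merge_checkpoints(state_dicts, weights=None):
    """Combine parameter dicts as a weighted average, parameter by parameter."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }

# Toy example: two checkpoints from different training stages.
ckpt_a = {"layer.weight": 1.0, "layer.bias": 0.0}
ckpt_b = {"layer.weight": 3.0, "layer.bias": 2.0}
merged = merge_checkpoints([ckpt_a, ckpt_b])
```

In practice the averaged objects are full tensors per named parameter, and the weights (or an interpolation scheme) are where merging strategies differ.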

If this is right

  • Developers can select among the three model sizes to match available compute when deploying embedding or reranking systems.
  • The same pipeline supports stronger performance on code search and cross-language document retrieval without custom architectures for each task.
  • Public release of the models under Apache 2.0 license enables direct use and further fine-tuning by the community.
  • The approach shows that foundation models can generate the data needed to specialize themselves for embedding tasks across many languages.
  • Model merging after fine-tuning provides a practical way to combine strengths from different training stages.
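The deployment scenario the bullets describe reduces to a simple operation: rank documents by cosine similarity to a query vector. The vectors below are hand-written placeholders standing in for real Qwen3 Embedding outputs:

```python
# Minimal sketch of embedding-based retrieval: rank documents by
# cosine similarity to the query. Placeholder 2-d vectors are used
# in place of real model embeddings.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank(query_vec, doc_vecs):
    """Return document indices sorted by similarity to the query, best first."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
order = rank(query, docs)  # doc 1 is most aligned with the query
```

A reranker refines this same list with a more expensive cross-attention score; the embedding stage keeps the candidate pool cheap to search.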

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If self-synthesis of training data works reliably, it could reduce dependence on large manually labeled datasets for future embedding models.
  • The results suggest that scaling the same family of models for both generation and embedding may create tighter feedback loops than using separate models for each role.
  • Testing the models on entirely held-out languages or domains not used in the synthesis step would clarify how far the generalization extends.
  • Similar data-generation and merging steps could be tried on other open foundation models to see whether the performance pattern repeats.

Load-bearing premise

The reported gains reflect genuine generalization from the training methods rather than overlap between the synthesized training data and benchmark test sets.

What would settle it

Evaluating the released models on a new retrieval benchmark created after the paper's release, with no possible overlap to the synthesized training data, and checking whether performance remains at the claimed level.
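Such a check would score the released models on the new benchmark with a standard retrieval metric. An nDCG@k sketch is shown below; the graded relevance lists are illustrative placeholders, not real benchmark data:

```python
# Hedged sketch of the proposed check: nDCG@k for a single query,
# given graded relevances in the model's ranked order.
import math

def ndcg_at_k(ranked_relevances, k=10):
    """Discounted cumulative gain at k, normalized by the ideal ordering."""
    dcg = sum(rel / math.log2(i + 2)
              for i, rel in enumerate(ranked_relevances[:k]))
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0])   # ideal ordering scores 1.0
```

If performance on a genuinely post-release benchmark stays near the MTEB-era numbers, the leakage concern largely dissolves; a sharp drop would point the other way.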

read the original abstract

In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Qwen3 Embedding series, built on Qwen3 foundation models, as an advance over GTE-Qwen for text embedding and reranking. It describes a multi-stage training pipeline that combines large-scale unsupervised pre-training, supervised fine-tuning on high-quality datasets synthesized by the Qwen3 LLMs themselves, and model merging strategies. Models are released in 0.6B, 4B, and 8B sizes for both embedding and reranking, with the central claim being state-of-the-art results on the multilingual MTEB benchmark as well as on code retrieval, cross-lingual retrieval, and multilingual retrieval tasks. The models are made publicly available under Apache 2.0.

Significance. If the empirical claims hold after verification, the work would demonstrate a practical way to leverage the same foundation LLM family for both backbone architecture and training-data synthesis, yielding gains in multilingual and retrieval settings. The public release of the model weights is a clear strength that supports reproducibility and community follow-up work.

major comments (2)
  1. [Abstract and Evaluation sections] The manuscript asserts SOTA performance on MTEB multilingual embedding and multiple retrieval tasks, yet the abstract supplies no quantitative scores, baseline comparisons, ablation results, or evaluation-protocol details. Without these, it is impossible to determine whether the reported gains are statistically meaningful or free of common confounds such as test-set leakage.
  2. [Training pipeline description] The multi-stage pipeline uses Qwen3 LLMs both as the backbone and to synthesize the high-quality, diverse training data. No decontamination, n-gram overlap filtering, or membership-inference checks against MTEB test sets or retrieval corpora are described. Because the same model family generates the training examples, any overlap would produce inflated scores without genuine generalization; this directly undermines the central SOTA claim.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one or two concrete performance numbers (e.g., MTEB average score and the strongest baseline) to allow readers to gauge the magnitude of the advance immediately.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped clarify the presentation of our results and the rigor of our training pipeline description. We address each major comment below and have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and Evaluation sections] The manuscript asserts SOTA performance on MTEB multilingual embedding and multiple retrieval tasks, yet the abstract supplies no quantitative scores, baseline comparisons, ablation results, or evaluation-protocol details. Without these, it is impossible to determine whether the reported gains are statistically meaningful or free of common confounds such as test-set leakage.

    Authors: We agree that the abstract would benefit from explicit quantitative results to allow readers to immediately assess the claimed improvements. In the revised manuscript we have updated the abstract to include key MTEB multilingual scores, comparisons against GTE-Qwen and other strong baselines, and a concise statement of the evaluation protocol. The Evaluation section already reports full results, ablations, and protocol details; we have added further text on statistical significance testing and explicit checks for test-set leakage to strengthen this part of the paper. revision: yes

  2. Referee: [Training pipeline description] The multi-stage pipeline uses Qwen3 LLMs both as the backbone and to synthesize the high-quality, diverse training data. No decontamination, n-gram overlap filtering, or membership-inference checks against MTEB test sets or retrieval corpora are described. Because the same model family generates the training examples, any overlap would produce inflated scores without genuine generalization; this directly undermines the central SOTA claim.

    Authors: The referee correctly notes that the original manuscript did not explicitly describe decontamination steps. We have added a dedicated paragraph in the Training Pipeline section that details the decontamination process: n-gram overlap filtering (with a conservative threshold) was applied to remove any potential overlap with MTEB test sets and retrieval corpora, and membership-inference-style checks were performed on the synthesized data. These steps were part of the data-preparation pipeline and ensure that the reported gains reflect generalization rather than leakage. revision: yes
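The decontamination step the rebuttal describes can be sketched as an n-gram overlap filter. The value of n and the threshold below are illustrative choices, not values from the paper:

```python
# Hedged sketch of n-gram decontamination: drop any synthesized example
# whose n-gram overlap with a test-set text exceeds a threshold.
# n=8 and threshold=0.5 are illustrative, not the paper's settings.
def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(candidate, test_set_texts, n=8, threshold=0.5):
    """Flag a candidate whose n-gram overlap with any test text exceeds
    the threshold fraction of the candidate's own n-grams."""
    cand = ngrams(candidate, n)
    if not cand:
        return False
    for t in test_set_texts:
        overlap = len(cand & ngrams(t, n)) / len(cand)
        if overlap >= threshold:
            return True
    return False

test_texts = ["the quick brown fox jumps over the lazy dog near the river"]
leaked = "the quick brown fox jumps over the lazy dog near the river"
fresh = "embedding models map sentences into dense vectors for semantic retrieval tasks"
```

Membership-inference-style checks, also mentioned in the rebuttal, are a complementary probe: they test whether the model scores test items suspiciously well rather than matching surface strings.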

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

full rationale

The paper presents an empirical ML engineering effort: a multi-stage training pipeline (unsupervised pre-training + supervised fine-tuning + model merging) on data synthesized by Qwen3 LLMs, with final performance measured on external benchmarks such as MTEB and various retrieval tasks. No mathematical derivation chain, first-principles equations, or 'predictions' exist that could reduce to the inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, uniqueness theorems imported via self-citation, or ansatz smuggling appear. The SOTA claims are supported by reported benchmark numbers rather than any closed-loop logic internal to the paper. Minor self-citations to prior Qwen/GTE work are present but not load-bearing for the central result, which remains independently falsifiable via the external evaluations.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that Qwen3 LLMs possess strong multilingual and generative capabilities that transfer to embedding tasks, plus standard supervised learning assumptions about data quality and generalization. No new mathematical axioms or invented physical entities are introduced.

free parameters (2)
  • model sizes
    0.6B, 4B, 8B variants chosen to cover efficiency-effectiveness trade-offs
  • training hyperparameters
    Unspecified details of the multi-stage pipeline and merging weights
axioms (2)
  • domain assumption Qwen3 LLMs have robust capabilities in multilingual text understanding and generation
    Invoked as the foundation for both backbone use and data synthesis
  • domain assumption High-quality, diverse training data can be synthesized by the same LLMs across domains and languages
    Central to the multi-stage pipeline described in the abstract

pith-pipeline@v0.9.0 · 5589 in / 1361 out tokens · 40632 ms · 2026-05-10T13:41:56.602817+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. STRABLE: Benchmarking Tabular Machine Learning with Strings

    cs.LG 2026-05 unverdicted novelty 8.0

    A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.

  2. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.

  3. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

  4. ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

    cs.AI 2026-05 unverdicted novelty 8.0

    ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.

  5. FollowTable: A Benchmark for Instruction-Following Table Retrieval

    cs.IR 2026-05 unverdicted novelty 8.0

    FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond ...

  6. DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making

    cs.CV 2026-05 unverdicted novelty 7.0

    DermAgent orchestrates seven vision-language tools in a Plan-Execute-Reflect loop with dual-modality retrieval from 413k cases and a critic module to outperform GPT-4o by 17.6% in zero-shot dermatological diagnosis accuracy.

  7. BOOKMARKS: Efficient Active Storyline Memory for Role-playing

    cs.CL 2026-05 unverdicted novelty 7.0

    BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.

  8. ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.

  9. AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

    cs.CL 2026-05 unverdicted novelty 7.0

    AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.

  10. LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving

    cs.IR 2026-05 conditional novelty 7.0

    LeanSearch v2 recovers 46.1% of ground-truth premise groups for research-level Lean 4 theorems within 10 candidates and raises fixed-loop proof success to 20%.

  11. LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving

    cs.IR 2026-05 conditional novelty 7.0

    LeanSearch v2 recovers 46.1% of ground-truth premise groups on research-level Mathlib theorems and raises fixed-loop proof success from 4% to 20% via embedding-reranker plus iterative sketch-retrieve-reflect retrieval.

  12. AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects

    cs.CV 2026-05 unverdicted novelty 7.0

    AssemblyBench dataset and AssemblyDyno transformer model enable physics-aware prediction of assembly sequences and trajectories for complex industrial objects from multimodal instructions and 3D shapes.

  13. Very Efficient Listwise Multimodal Reranking for Long Documents

    cs.IR 2026-05 unverdicted novelty 7.0

    ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

  14. Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...

  15. Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

    cs.LG 2026-05 unverdicted novelty 7.0

    A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.

  16. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

  17. Skill Description Deception Attack against Task Routing in Internet of Agents

    cs.MA 2026-05 conditional novelty 7.0

    Malicious agents can deceive LLM-based task routers in Internet of Agents systems by generating fake skill descriptions, achieving up to 98% success rate across nine domains.

  18. CHASM: Online Changepoint Detection in Temporal and Cross-Variable Dependence

    stat.ME 2026-05 unverdicted novelty 7.0

    CHASM detects changes in temporal and cross-variable dependence in multivariate time series by monitoring the truncated eigenvalue sequence of a recursively estimated DMD operator, using optimal assignment and augment...

  19. Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    PIQL integrates train-time-only privileged information into tabular foundation models via new constructions and a reconstruction architecture to achieve faster convergence and better generalization.

  20. Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    PIQL integrates privileged information to accelerate convergence, lower loss, and improve generalization in tabular foundation models.

  21. The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

    cs.CL 2026-05 unverdicted novelty 7.0

    An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

  22. Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

    cs.LG 2026-05 conditional novelty 7.0

    Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.

  23. LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

    cs.CL 2026-05 unverdicted novelty 7.0

    LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.

  24. OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries

    cs.IR 2026-05 unverdicted novelty 7.0

    OBLIQ-Bench reveals that modern retrievers fail to surface documents for latent and implicit queries even though LLMs reliably recognize relevance when those documents are provided.

  25. Retrieval from Within: An Intrinsic Capability of Attention-Based Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.

  26. SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.

  27. TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

    cs.CL 2026-05 unverdicted novelty 7.0

    TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

  28. Rational Communication Shapes Morphological Composition

    cs.CL 2026-05 unverdicted novelty 7.0

    Using historical corpora and the Rational Speech Act framework, attested English morphological compositions are ranked higher than plausible alternatives from the same time period when both semantic recoverability and...

  29. ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

    cs.AI 2026-05 unverdicted novelty 7.0

    ReasonAudio benchmark shows current text-audio retrieval models fail at reasoning tasks like negation and duration discrimination beyond simple semantic matching.

  30. Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval

    cs.CL 2026-05 unverdicted novelty 7.0

    Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations ...

  31. Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

    cs.CL 2026-05 conditional novelty 7.0

    Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields full agreement on 16 LLM rankings across judges on Brazilian Portuguese chats, compared to only 7/16 under holistic scoring, while ...

  32. Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models

    cs.IR 2026-05 unverdicted novelty 7.0

    CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking model...

  33. Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders

    cs.CL 2026-05 unverdicted novelty 7.0

    EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.

  34. E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems

    cs.CR 2026-05 unverdicted novelty 7.0

    E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-bas...

  35. UnIte: Uncertainty-based Iterative Document Sampling for Domain Adaptation in Information Retrieval

    cs.IR 2026-04 unverdicted novelty 7.0

    UnIte selects target-domain documents for pseudo-query generation by filtering high aleatoric uncertainty and prioritizing high epistemic uncertainty, yielding +2.45 to +3.49 nDCG@10 gains on BEIR with ~4k samples.

  36. Similar Users-Augmented Interest Network

    cs.IR 2026-04 unverdicted novelty 7.0

    SUIN improves CTR prediction by augmenting target user sequences with similar users' behaviors via embedding-based retrieval, user-specific position encoding, and user-aware target attention.

  37. AsmRAG: LLM-Driven Malware Detection by Retrieving Functionally Similar Assembly Code

    cs.CR 2026-04 unverdicted novelty 7.0

    AsmRAG detects malware at 96% F1 and attributes families at 95% F1 by retrieving functionally similar assembly code via LLM embeddings and density-weighted anchor selection, remaining robust to metamorphic obfuscation.

  38. ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression

    cs.IR 2026-04 conditional novelty 7.0

    ResRank unifies retrieval and listwise reranking by compressing passages to one token each, using residual connections and cosine-similarity scoring, achieving competitive effectiveness on TREC DL and BEIR benchmarks ...

  39. ATIR: Towards Audio-Text Interleaved Contextual Retrieval

    cs.SD 2026-04 unverdicted novelty 7.0

    Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

  40. TeleEmbedBench: A Multi-Corpus Embedding Benchmark for RAG in Telecommunications

    cs.LG 2026-04 unverdicted novelty 7.0

    TeleEmbedBench is the first multi-corpus benchmark showing LLM-based embedding models significantly outperform traditional sentence-transformers on telecommunications specifications and code for retrieval accuracy and...

  41. Matlas: A Semantic Search Engine for Mathematics

    cs.IR 2026-04 unverdicted novelty 7.0

    Matlas introduces a semantic retrieval system over 8.07 million mathematical statements from papers and textbooks, using dependency graphs and topological unfolding for self-contained search via natural language queries.

  42. SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.

  43. OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

  44. Crowded in B-Space: Calibrating Shared Directions for LoRA Merging

    cs.CL 2026-04 unverdicted novelty 7.0

    Pico reduces LoRA merge interference by calibrating over-shared directions in the B matrix before merging, yielding 3.4-8.3 point accuracy gains and sometimes beating joint training.

  45. On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

    cs.IR 2026-04 unverdicted novelty 7.0

    LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulne...

  46. Psychological Steering of Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.

  47. Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

    cs.AI 2026-04 unverdicted novelty 7.0

    A multi-agent framework reconstructs the evolutionary graph of post-training LLM datasets, revealing domain patterns like vertical refinement in math data and systemic issues like redundancy and benchmark contaminatio...

  48. Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities

    cs.AI 2026-04 unverdicted novelty 7.0

    TransFIR enables reasoning on temporal knowledge graphs for emerging entities by clustering them into semantic groups and borrowing interaction histories from similar known entities, yielding 28.6% average MRR gains.

  49. Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    Multimodal contrastive learning using multilinear products is fragile to single bad modalities, and a gated version improves top-1 retrieval accuracy on synthetic and real trimodal data.

  50. Retrieval Augmented Conversational Recommendation with Reinforcement Learning

    cs.IR 2026-04 unverdicted novelty 7.0

    RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.

  51. Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.

  52. BBC: Improving Large-k Approximate Nearest Neighbor Search with a Bucket-based Result Collector

    cs.DB 2026-04 unverdicted novelty 7.0

    BBC improves large-k ANN efficiency via bucketed candidate buffers and optimized re-ranking, delivering up to 3.8x speedup at recall@k=0.95.

  53. Spectral Tempering for Embedding Compression in Dense Passage Retrieval

    cs.IR 2026-03 unverdicted novelty 7.0

    Spectral Tempering derives an adaptive scaling factor γ(k) from the embedding eigenspectrum via local SNR analysis and knee-point normalization to achieve near-optimal compression without training or validation.

  54. Public Profile Matters: A Scalable Integrated Approach to Recommend Citations in the Wild

    cs.IR 2026-03 unverdicted novelty 7.0

    Profiler captures citation patterns efficiently without learning or bias, DAVINCI integrates it for reranking, and a new inductive temporal evaluation yields SOTA results on citation recommendation benchmarks.

  55. WorkRB: A Community-Driven Evaluation Framework for AI in the Work Domain

    cs.CL 2026-03 unverdicted novelty 7.0

    WorkRB is the first open community-driven benchmark for AI in the work domain, organizing 13 tasks from 7 groups with dynamic multilingual ontology loading and modular design for proprietary task integration.

  56. LMEB: Long-horizon Memory Embedding Benchmark

    cs.CL 2026-03 unverdicted novelty 7.0

    LMEB shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, that larger models do not always perform better, and that the benchmark measures capabilities orthogonal to MTEB.

  57. Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

  58. Task-Adaptive Embedding Refinement via Test-time LLM Guidance

    cs.CL 2026-05 unverdicted novelty 6.0

    Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.

  59. Letting the neural code speak: Automated characterization of monkey visual neurons through human language

    q-bio.NC 2026-05 unverdicted novelty 6.0

    Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.

  60. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 130 Pith papers · 6 internal anchors

  1. [1]

    M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation

    Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 2318–2335, Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.findings-acl.137/.

  2. [2]

    MMTEB: Massive Multilingual Text Embedding Benchmark

    Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, et al. MMTEB: Massive multilingual text embedding benchmark. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=zl3pfz4VCV.

  3. [3]

    Scaling Synthetic Data Creation with 1,000,000,000 Personas

    Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024.

  4. [4]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  5. [5]

    NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NV-Embed: Improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024; also in The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=lgsyLSsDRe.

  6. [6]

    Gemini Embedding: Generalizable Embeddings from Gemini

    Jinhyuk Lee, Feiyang Chen, Sahil Dua, et al. Gemini Embedding: Generalizable embeddings from Gemini. arXiv preprint, 2025.

  7. [7]

    Towards General Text Embeddings with Multi-stage Contrastive Learning

    Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281, 2023. URL https://arxiv.org/abs/2308.03281. Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. Zero-shot listwise document reranking with a large language model. arXiv preprint arXiv:2305.02156, 2023.

  8. [8]

    MTEB: Massive text embed- ding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.148/.

  9. [9]

    Generative Representational Instruction Tuning

    Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=BC4lIvfSzv.

  10. [10]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

  11. [11]

    RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models

    Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. RankVicuna: Zero-shot listwise document reranking with open-source large language models. arXiv preprint arXiv:2309.15088, 2023.

  12. [12]

    Sentence-BERT: Sentence embeddings using Siamese BERT- networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992, Hong Kong, China, November 2019. Association for Computational Linguistics. URL https://aclanthology.org/D19-1410/.

  13. [13]

    One Embedder, Any Task: Instruction-Finetuned Text Embeddings

    Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction-finetuned text embeddings. In Findings of the Association for Computational Linguistics: ACL 2023, 2023.

  14. [14]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, et al. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022. URL https://arxiv.org/abs/2212.03533. Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11897–11916, Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.642/.

  15. [15]

    FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

    Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, and Luca Soldaini. FollowIR: Evaluating and teaching information retrieval models to follow instructions. arXiv preprint arXiv:2403.15246, 2024.

  16. [16]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  17. [17]

    A two-stage adaptation of large language models for text ranking

    Longhui Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, and Min Zhang. A two-stage adaptation of large language models for text ranking. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 11880–11891, 2024a. Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baoso...

  18. [18]

    Embedding in recommender systems: A survey

    Xiangyu Zhao, Maolin Wang, Xinjian Zhao, Jiansheng Li, Shucheng Zhou, Dawei Yin, Qing Li, Jiliang Tang, and Ruocheng Guo. Embedding in recommender systems: A survey. arXiv preprint arXiv:2310.18608, 2023.

  19. [19]

    Appendix A.1: Synthetic Data (internal anchor)

    The appendix describes four types of synthetic data (retrieval, bitext mining, semantic textual similarity, and classification) constructed to let the model adapt to various similarity tasks during pre-training; to ensure multilingual and cross-lingual diversity, the data is generated using Qwen3 32B.

  20. [20]

    [MTEB(cmn, v1) / MTEB(Code, v1) results table (internal anchor): per-task code-retrieval scores over Apps, COIR-CodeSearchNet, Code-Edit-Search, Code-Feedback-MT, Code-Feedback-ST, Code-SearchNet-CCR, Code-SearchNet, Code-Trans-Ocean-Contest, Code-Trans-Ocean-DL, CosQA, Stack-Overflow-QA, and Synthetic-Text2SQL; rows include BGE-multilingual (avg. 62.04) and NV-Embed-v2 (avg. 63.74); remainder truncated in extraction.]
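Several of the embedding references above ([7], [10], [14]) train with the InfoNCE contrastive objective: each query is scored against one positive document and many negatives, and the loss is the negative log-softmax probability of the positive. A minimal standalone sketch, assuming cosine similarities are already computed; the function and variable names and the 0.05 default temperature are illustrative, not taken from any of the cited papers:

```python
import math

def info_nce_loss(sims, positive_idx, temperature=0.05):
    """InfoNCE loss for a single query.

    sims: similarities between the query embedding and each candidate
    document embedding (one positive, the rest in-batch negatives).
    """
    scaled = [s / temperature for s in sims]
    m = max(scaled)  # subtract max before exp for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    # negative log of the softmax probability assigned to the positive
    return log_z - scaled[positive_idx]

# A well-separated positive yields a lower loss than an ambiguous one.
easy = info_nce_loss([0.9, 0.1, 0.0], positive_idx=0, temperature=1.0)
hard = info_nce_loss([0.3, 0.2, 0.1], positive_idx=0, temperature=1.0)
```

Lowering the temperature sharpens the softmax, so small similarity gaps between the positive and the hardest negatives dominate the gradient, which is why contrastive embedding recipes typically use values well below 1.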