arxiv: 2309.07597 · v5 · submitted 2023-09-14 · 💻 cs.CL · cs.AI · cs.IR

C-Pack: Packed Resources For General Chinese Embeddings

Shitao Xiao , Zheng Liu , Peitian Zhang , Niklas Muennighoff , Defu Lian , Jian-Yun Nie

Pith reviewed 2026-05-13 13:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR

keywords Chinese text embeddings · C-MTEB benchmark · text embedding models · C-MTP dataset · natural language processing · embedding training · multilingual embeddings · semantic similarity

The pith

C-Pack supplies a benchmark, a training dataset, and a family of models whose Chinese text embeddings outperform all earlier ones by up to 10 percent on a benchmark of 35 datasets spanning 6 task types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents C-Pack as a bundled set of three resources aimed at general Chinese text embeddings. It includes C-MTEB, a new evaluation covering six task types across 35 datasets, C-MTP, a large collection of labeled and unlabeled Chinese text for training, and C-TEM, a family of embedding models in several sizes. The authors report that their C-TEM models exceed previous Chinese embeddings on C-MTEB by as much as 10 percent at release time. They also optimize the full training pipeline for these models and release parallel English data and models that reach state-of-the-art on the English MTEB benchmark, with the English data twice the size of the Chinese data. All components are released publicly to support further work on Chinese embeddings.

Core claim

We introduce C-Pack consisting of C-MTEB, a comprehensive Chinese embedding benchmark with 6 tasks and 35 datasets, C-MTP, a massive curated text embedding training set drawn from Chinese corpora, and C-TEM, a family of embedding models that achieve up to 10 percent higher scores than prior Chinese models on C-MTEB when trained with the integrated suite of methods.

What carries the argument

C-TEM models trained on the C-MTP dataset and evaluated on the C-MTEB benchmark.

If this is right

Downstream Chinese NLP systems can adopt higher-quality embeddings for retrieval, classification, and semantic similarity tasks.
Open release of both the benchmark and the training data allows direct replication and extension by other researchers.
The English models and twice-larger English data set provide a parallel resource that reaches top MTEB scores.
The optimized training pipeline can be applied to produce embeddings in additional languages or sizes.
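As an illustration of the retrieval and semantic-similarity use cases listed above, the sketch below ranks documents by cosine similarity to a query embedding. The vectors are hypothetical 3-dimensional stand-ins for real model outputs, and `cosine` and `top_k` are illustrative helper names, not part of any released C-Pack code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most similar to the query."""
    order = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return order[:k]

# Hypothetical embeddings; a real embedding model would emit
# vectors with hundreds of dimensions.
docs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query = [0.8, 0.6, 0.0]
ranking = top_k(query, docs, k=2)
```

With these toy vectors, `ranking` is `[1, 0]`: document 1 has cosine similarity 0.96 to the query, document 0 has 0.8, and document 2 is orthogonal.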

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same packing approach of benchmark plus data plus model could be replicated for other languages to close performance gaps.
If C-MTEB becomes widely adopted it may standardize evaluation and reduce hidden selection effects in future Chinese embedding papers.
Larger-scale training on the released C-MTP data could further widen the gap over prior methods.
Integration of the English and Chinese resources may support improved bilingual or multilingual embedding models.

Load-bearing premise

The C-MTEB collection of 35 datasets supplies an unbiased and comprehensive test of general Chinese embedding quality.
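One narrow way to probe this premise is to measure textual overlap between training data and benchmark examples. The sketch below is an illustration under stated assumptions, not a procedure described in the paper: it uses character 5-grams (a hypothetical choice that sidesteps word segmentation for Chinese), made-up sentences, and illustrative function names.

```python
def char_ngrams(text, n=5):
    """Set of character n-grams; texts shorter than n yield an empty set."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_ratio(test_text, train_ngrams, n=5):
    """Share of the test text's n-grams that also occur in the training set."""
    grams = char_ngrams(test_text, n)
    if not grams:
        return 0.0
    return len(grams & train_ngrams) / len(grams)

# Hypothetical training sentences (not taken from C-MTP).
train_texts = ["今天天气很好我们去公园散步", "机器学习模型需要大量数据"]
train_ngrams = set().union(*(char_ngrams(t) for t in train_texts))

# A test sentence copied from the training set versus an unrelated one.
leaked = overlap_ratio("今天天气很好我们去公园散步", train_ngrams)
clean = overlap_ratio("这本书的内容非常有趣", train_ngrams)
```

A ratio near 1.0 flags a benchmark example that largely reappears in the training text. Low overlap alone would not establish that a benchmark is unbiased, since it says nothing about task or domain selection.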

What would settle it

Release of a new Chinese embedding model that scores higher than the largest C-TEM variant on every C-MTEB task without using the C-MTP training data.
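The criterion above is a per-task dominance test, which is stricter than comparing averages. A minimal sketch, assuming scores are stored as task-to-score dictionaries; the task names and numbers are placeholders, not figures from the paper.

```python
def dominates(challenger, incumbent):
    """True if the challenger scores strictly higher on every task."""
    if set(challenger) != set(incumbent):
        raise ValueError("both models must report the same tasks")
    return all(challenger[task] > incumbent[task] for task in incumbent)

def mean_score(scores):
    """Unweighted average over tasks."""
    return sum(scores.values()) / len(scores)

# Hypothetical per-task scores for illustration only.
incumbent = {"retrieval": 70.0, "sts": 55.0, "classification": 68.0}
challenger_a = {"retrieval": 71.5, "sts": 56.0, "classification": 69.0}
challenger_b = {"retrieval": 75.0, "sts": 54.0, "classification": 70.0}

a_wins_everywhere = dominates(challenger_a, incumbent)
b_wins_everywhere = dominates(challenger_b, incumbent)
```

In this toy data `challenger_b` has a higher average than the incumbent but loses on one task, so it would not meet the stated criterion, while `challenger_a` would.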

Original abstract

We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces C-Pack, a package of resources for general Chinese embeddings consisting of (1) C-MTEB, a benchmark spanning 6 tasks and 35 datasets, (2) C-MTP, a large curated training corpus from labeled and unlabeled Chinese text, and (3) C-TEM, a family of embedding models of varying sizes. The central claim is that the released C-TEM models outperform all prior Chinese text embeddings on C-MTEB by up to 10% at the time of release; the authors also release English data (twice the size of the Chinese data) and models that reach SOTA on MTEB.

Significance. If the performance claims hold under rigorous scrutiny, the work supplies valuable, publicly released resources that address the relative scarcity of high-quality Chinese embedding benchmarks and training data. The integration of multiple training methods into C-TEM and the dual-language release could accelerate progress in multilingual embedding research.

major comments (3)

[§3] §3 (C-MTEB construction): the description of how the 35 datasets were selected and filtered lacks explicit criteria for avoiding task-specific overfitting or selection effects that could favor the proposed models; a clear protocol for dataset inclusion/exclusion is needed to substantiate the claim that C-MTEB is an unbiased measure of general Chinese embedding quality.
[Table 2] Table 2 (main results): the reported gains of up to +10% are presented without standard deviations across runs, statistical significance tests, or details on the exact baseline implementations and hyper-parameters; this information is load-bearing for the central empirical claim.
[§4.2] §4.2 (training procedure): the statement that the authors 'integrate and optimize the entire suite of training methods' is not accompanied by sufficient ablation results or hyper-parameter schedules to allow reproduction or assessment of whether the gains derive from data scale, model architecture, or training tricks.

minor comments (2)

[Abstract] The GitHub link in the abstract should be repeated in the conclusion or data-availability statement for reader convenience.
[Figure 1] Figure 1 caption could explicitly state the number of parameters for each C-TEM variant shown.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.

Point-by-point responses

Referee: [§3] §3 (C-MTEB construction): the description of how the 35 datasets were selected and filtered lacks explicit criteria for avoiding task-specific overfitting or selection effects that could favor the proposed models; a clear protocol for dataset inclusion/exclusion is needed to substantiate the claim that C-MTEB is an unbiased measure of general Chinese embedding quality.

Authors: We agree that an explicit protocol strengthens the benchmark's credibility. In the revised manuscript we will add a dedicated subsection to §3 that details the inclusion/exclusion criteria, including steps taken to ensure task diversity, domain coverage, and avoidance of selection bias toward our training data. This protocol draws on established practices from MTEB while adapting for Chinese-specific considerations. (Revision: yes)
Referee: [Table 2] Table 2 (main results): the reported gains of up to +10% are presented without standard deviations across runs, statistical significance tests, or details on the exact baseline implementations and hyper-parameters; this information is load-bearing for the central empirical claim.

Authors: We acknowledge the value of statistical rigor. The revision will expand the experimental section and Table 2 caption with full hyper-parameter settings and exact baseline re-implementation details (including sources and any adaptations). Standard deviations are not reported in the current version because experiments used fixed seeds for reproducibility; we will add a note on this limitation and include variance estimates from additional runs where compute permits. (Revision: partial)
Referee: [§4.2] §4.2 (training procedure): the statement that the authors 'integrate and optimize the entire suite of training methods' is not accompanied by sufficient ablation results or hyper-parameter schedules to allow reproduction or assessment of whether the gains derive from data scale, model architecture, or training tricks.

Authors: We will revise §4.2 to include expanded ablation tables and a hyper-parameter schedule appendix. These additions will isolate the contributions of data scale, contrastive objectives, and other optimizations, enabling readers to assess the sources of improvement. (Revision: yes)
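The variance and significance reporting requested for Table 2 could take roughly the following form. This is a sketch under assumptions: all scores are invented, `summarize` and `paired_bootstrap` are illustrative names, and the paired bootstrap over datasets is one common choice rather than a test the paper or the referee specifies.

```python
import random
import statistics

def summarize(scores):
    """Mean and sample standard deviation of scores from repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_bootstrap(model_scores, baseline_scores, n_resamples=2000, seed=0):
    """Fraction of resamples in which the model does not beat the baseline.

    Scores are paired per dataset, and datasets are resampled with
    replacement. A small return value suggests the gain is stable
    across datasets.
    """
    rng = random.Random(seed)
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_resamples

# Hypothetical scores from three training seeds of one model.
mean, std = summarize([63.1, 63.5, 62.9])

# Hypothetical per-dataset scores for a model and a baseline.
model = [70.0, 60.0, 65.0, 72.0]
baseline = [65.0, 58.0, 66.0, 68.0]
p_not_better = paired_bootstrap(model, baseline)
```

Reporting `mean ± std` over seeds addresses run-to-run variance, while the bootstrap addresses sensitivity to which datasets are in the benchmark; the two answer different questions and neither substitutes for the other.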

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an applied resource paper; the central claims rest on standard machine-learning practices for contrastive embedding training rather than new axioms or invented entities.

axioms (1)

domain assumption: Contrastive learning objectives on curated text pairs produce effective general-purpose embeddings
The paper states it integrates and optimizes the entire suite of training methods for C-TEM.
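To make the domain assumption concrete, the sketch below computes an InfoNCE-style loss with in-batch negatives, one widely used contrastive objective over text pairs. It is not confirmed here as the authors' exact loss; the temperature value, the function names, and the 2-dimensional vectors are all assumptions for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    passages[i] is the positive for queries[i]; every other passage
    in the batch serves as a negative for that query.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in passages]
        # log-sum-exp with the max subtracted for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Hypothetical unit-length embeddings.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # each positive matches its query
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # positives swapped

loss_aligned = info_nce(queries, aligned)
loss_misaligned = info_nce(queries, misaligned)
```

The loss is close to zero when each query is most similar to its own passage and large when the pairing is swapped, which is the signal that pulls paired texts together and pushes unpaired ones apart during training.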

Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection
cs.CL 2026-03 unverdicted novelty 8.0

Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
cs.LG 2026-05 unverdicted novelty 7.0

Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.
Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval
cs.IR 2026-04 accept novelty 7.0

Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.
HaS: Accelerating RAG through Homology-Aware Speculative Retrieval
cs.IR 2026-04 unverdicted novelty 7.0

HaS accelerates RAG retrieval via homology-aware speculative retrieval and homologous query re-identification validation, cutting latency 24-37% with 1-2% accuracy drop on tested datasets.
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
cs.CL 2026-04 unverdicted novelty 7.0

METRO induces both short-term actions and long-term planning from expert transcripts into a Strategy Forest, outperforming prior methods by 9-10% on two non-collaborative dialogue benchmarks.
DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?
cs.AI 2026-04 unverdicted novelty 7.0

DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human vali...
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
cs.CL 2024-01 accept novelty 7.0

MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
An Annotation Scheme and Classifier for Personal Facts in Dialogue
cs.CL 2026-05 accept novelty 6.0

An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 ...
SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution
cs.CL 2026-05 unverdicted novelty 6.0

SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
cs.LG 2026-05 unverdicted novelty 6.0

Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
Agentic Retrieval-Augmented Generation for Financial Document Question Answering
cs.AI 2026-05 unverdicted novelty 6.0

FinAgent-RAG achieves 76.81-78.46% execution accuracy on financial QA benchmarks by combining contrastive retrieval, program-of-thought code generation, and adaptive strategy routing, outperforming baselines by 5.62-9...
A Replicability Study of XTR
cs.IR 2026-05 accept novelty 6.0

XTR training does not improve retrieval effectiveness over ColBERT but enhances IVF engine efficiency by flattening token scores to produce more discriminative centroids.
MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents
cs.CL 2026-05 unverdicted novelty 6.0

A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment
cs.CV 2026-04 unverdicted novelty 6.0

MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.
EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
cs.DB 2026-04 unverdicted novelty 6.0

EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
EpiAgent: An Agent-Centric System for Ancient Inscription Restoration
cs.CV 2026-04 unverdicted novelty 6.0

EpiAgent is a new agent-centric system that restores degraded ancient inscriptions with better quality and generalization than prior rigid AI methods by using an LLM planner to coordinate multimodal tools and iterativ...
Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA
cs.IR 2026-04 conditional novelty 6.0

Two-hop QA retrieval performance depends on whether the hop-2 entity is in the question or bridge passage, and a simple predicate-based router trained on one dataset transfers to improve R@5 on others.
ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation
cs.AI 2026-04 unverdicted novelty 6.0

ResearchEVO automates the discover-then-explain cycle by evolving algorithms via fitness-driven LLM co-evolution and generating grounded, anti-hallucination research papers through sentence-level RAG.
SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval
cs.IR 2026-04 conditional novelty 6.0

SelRoute routes queries to type-specific retrieval pipelines, achieving Recall@5 of 0.800 with a 109M model on LongMemEval_M and outperforming LLM-augmented baselines including a strong zero-ML lexical method.
ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering
cs.DL 2026-03 accept novelty 6.0

ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic clusters.
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 5.0

OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
cs.AI 2026-04 unverdicted novelty 5.0

DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-effic...
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
cs.AI 2026-04 unverdicted novelty 5.0

DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.
MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes
cs.CL 2026-04 unverdicted novelty 5.0

A hybrid cross-modal attention model using CLIP and BGE-M3 improves hate detection F1-macro by 5.9% over text-only baselines on Nepali memes while revealing failures of English-centric vision models and ensembles on s...
Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts
cs.CL 2026-04 unverdicted novelty 5.0

Domain-adapted utterance-level retrieval raises Cohen's kappa for tutoring dialogue act annotation to 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, beating no-retrieval baselines by large margins across three LLMs.
The Geometry of Forgetting
q-bio.NC 2026-03 unverdicted novelty 5.0

Interference among memories in embedding spaces produces human-like power-law forgetting (b≈0.46) and false memories (false alarm rate 0.583) from raw pre-trained embeddings with zero tuning.
Retrieval-Augmented Generation for AI-Generated Content: A Survey
cs.CV 2024-02 accept novelty 5.0

A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.
Multilingual E5 Text Embeddings: A Technical Report
cs.CL 2024-02 unverdicted novelty 5.0

Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.
Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data
cs.CL 2026-04 conditional novelty 4.0

Mira-Embeddings-V1 adapts embeddings for recruitment reranking by synthesizing positive and hard-negative samples with LLMs, then applies JD-JD contrastive and JD-CV triplet training plus a BoundaryHead MLP, lifting R...
Hybrid Retrieval for COVID-19 Literature: Comparing Rank Fusion and Projection Fusion with Diversity Reranking
cs.IR 2026-04 unverdicted novelty 4.0

Rank fusion (RRF) reaches the highest relevance (nDCG@10 = 0.828) on expert COVID-19 queries while a projection fusion variant (B5) is 33% faster and produces more diverse results.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 28 Pith papers · 17 internal anchors

[1]

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. 2015. Semeval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015). 252–263

work page 2015
[2]

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M Cer, Mona T Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe

work page
[3]

In SemEval@COLING

SemEval-2014 Task 10: Multilingual Semantic Textual Similarity. In SemEval@COLING. 81–91

work page 2014
[4]

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez Agirre, Rada Mihalcea, German Rigau Claramunt, and Janyce Wiebe. 2016. Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In SemEval-2016. 10th International Workshop on Semantic Evaluation; 2016 Jun 16-17; San Diego, CA. Stroudsburg (PA): ACL; 2016....

work page 2016
[5]

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. Semeval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Eval...

work page 2012
[6]

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo

work page
[7]

In Second joint conference on lexical and computational semantics (*SEM), volume 1: proceedings of the Main conference and the shared task: semantic textual similarity

*SEM 2013 shared task: Semantic textual similarity. In Second joint conference on lexical and computational semantics (*SEM), volume 1: proceedings of the Main conference and the shared task: semantic textual similarity. 32–43

work page 2013
[8]

Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. 2023. SantaCoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988 (2023)

work page arXiv 2023
[9]

Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, and Wen-tau Yih. 2022. Task-aware retrieval with instructions. arXiv preprint arXiv:2211.09260 (2022)

work page arXiv 2022
[10]

Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, and Rodrigo Nogueira. 2021. mmarco: A multilingual version of the ms marco passage ranking dataset. arXiv preprint arXiv:2108.13897 (2021)

work page arXiv 2021
[11]

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International conference on machine learning. PMLR, 2206–2240

work page 2022
[12]

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning

work page
[13]

arXiv preprint arXiv:1508.05326 (2015)

A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326 (2015)

work page arXiv 2015
[14]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

work page 2020
[15]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[16]

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Alexis Conneau and Douwe Kiela. 2018. SentEval: An Evaluation Toolkit for Universal Sentence Representations. arXiv preprint arXiv:1803.05449 (2018)

work page arXiv 2018
[18]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[19]

Luyu Gao and Jamie Callan. 2021. Condenser: a pre-training architecture for dense retrieval. arXiv preprint arXiv:2104.08253 (2021)

work page arXiv 2021
[20]

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A framework for few-shot language model evaluation. https://doi.org/10.5281/zenodo.5371628

work page doi:10.5281/zenodo.5371628 2021
[21]

Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. 2021. Scaling deep contrastive learning batch size under memory limited setup. arXiv preprint arXiv:2101.06983 (2021)

work page arXiv 2021
[22]

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International conference on machine learning. PMLR, 3929–3938

work page 2020
[23]

Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, et al. 2017. Dureader: a chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073 (2017)

work page arXiv 2017
[24]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv...

(Running header of the source paper: C-Pack: Packed Resources For General Chinese Embeddings, SIGIR ’24, July 14–18, 2024, Washington, DC, USA)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[26]

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 (2022)

work page arXiv 2022
[27]

Ehsan Kamalloo, Nandan Thakur, Carlos Lassance, Xueguang Ma, Jheng-Hong Yang, and Jimmy Lin. 2023. Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard. arXiv preprint arXiv:2306.07471 (2023)

work page arXiv 2023
[28]

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020)

work page arXiv 2020
[29]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474

work page 2020
[30]

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards General Text Embeddings with Multi-stage Contrastive Learning. arXiv preprint arXiv:2308.03281 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[33]

Zheng Liu and Yingxia Shao. 2022. Retromae: Pre-training retrieval-oriented transformers via masked auto-encoder. arXiv preprint arXiv:2205.12035 (2022)

work page arXiv 2022
[34]

Dingkun Long, Qiong Gao, Kuan Zou, Guangwei Xu, Pengjun Xie, Ruijie Guo, Jian Xu, Guanjun Jiang, Luxi Xing, and Ping Yang. 2022. Multi-cpr: A multi domain Chinese dataset for passage retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3046–3056

work page 2022
[35]

Niklas Muennighoff. 2022. Sgpt: Gpt sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904 (2022)

work page arXiv 2022
[36]

Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. 2023. OctoPack: Instruction Tuning Code Large Language Models. arXiv preprint arXiv:2308.07124 (2023)

work page arXiv 2023
[37]

Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. 2023. Scaling Data-Constrained Language Models. arXiv preprint arXiv:2305.16264 (2023)

work page arXiv 2023
[38]

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2022. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316 (2022)

work page internal anchor Pith review arXiv 2022
[39]

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786 (2022)

work page arXiv 2022
[40]

Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al

work page
[41]

arXiv preprint arXiv:2201.10005

Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005 (2022)

work page arXiv 2022
[42]

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human-generated machine reading comprehension dataset. (2016)

work page 2016
[43]

Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B Hall, Daniel Cer, and Yinfei Yang. 2021. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877 (2021)

work page arXiv 2021
[44]

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y Zhao, Yi Luan, Keith B Hall, Ming-Wei Chang, et al. 2021. Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899 (2021)

work page arXiv 2021
[45]

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2023. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint arXiv:2307.16789 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu, and Haifeng Wang. 2022. DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine. arXiv preprint arXiv:2203.10232 (2022)

[47]

Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2020. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2010.08191 (2020)

[48]

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446 (2021)

[49]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551

[50]

Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 (2019)

[51]

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207 (2021)

[53]

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652 (2023)

[54]

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615 (2022)

[55]

Hongjin Su, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, Tao Yu, et al. 2022. One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741 (2022)

[56]

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663 (2021)

[57]

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Simlm: Pre-training with representation bottleneck for dense passage retrieval. arXiv preprint arXiv:2207.02578 (2022)

[58]

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 (2022)

[59]

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021)

[60]

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426 (2017)

[61]

Shitao Xiao, Zheng Liu, Weihao Han, Jianjin Zhang, Defu Lian, Yeyun Gong, Qi Chen, Fan Yang, Hao Sun, Yingxia Shao, et al. 2022. Distill-vq: Learning retrieval oriented vector quantization by distilling knowledge from dense embeddings. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1513–1523

[62]

Shitao Xiao, Zheng Liu, Weihao Han, Jianjin Zhang, Yingxia Shao, Defu Lian, Chaozhuo Li, Hao Sun, Denvy Deng, Liangjie Zhang, et al. 2022. Progressively optimized bi-granular document representation for scalable embedding based retrieval. In Proceedings of the ACM Web Conference 2022. 286–296

[63]

Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao. 2023. RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models. arXiv preprint arXiv:2305.02564 (2023)

[64]

Shitao Xiao, Zheng Liu, Yingxia Shao, Defu Lian, and Xing Xie. 2021. Matching-oriented product quantization for ad-hoc retrieval. arXiv preprint arXiv:2104.07858 (2021)

[65]

Xiaohui Xie, Qian Dong, Bingning Wang, Feiyang Lv, Ting Yao, Weinan Gan, Zhijing Wu, Xiangsheng Li, Haitao Li, Yiqun Liu, et al. 2023. T2Ranking: A large-scale Chinese Benchmark for Passage Ranking. arXiv preprint arXiv:2304.03679 (2023)

[66]

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808 (2020)

[67]

Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, et al. 2020. CLUE: A Chinese language understanding evaluation benchmark. arXiv preprint arXiv:2004.05986 (2020)

[68]

Sha Yuan, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang, and Jie Tang. 2021. Wudaocorpora: A super large-scale chinese corpora for pre-training language models. AI Open 2 (2021), 65–68

[69]

Jianjin Zhang, Zheng Liu, Weihao Han, Shitao Xiao, Ruicheng Zheng, Yingxia Shao, Hao Sun, Hanqing Zhu, Premkumar Srinivasan, Weiwei Deng, et al. 2022. Uni-retriever: Towards learning the unified embedding based retriever in bing sponsored search. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4493–4501

[70]

S. Zhang, X. Zhang, H. Wang, L. Guo, and S. Liu. 2018. Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection. IEEE Access 6 (2018), 74061–74071. https://doi.org/10.1109/ACCESS.2018.2883637
