pith. machine review for the scientific record.

arXiv: 2309.07597 · v5 · submitted 2023-09-14 · 💻 cs.CL · cs.AI · cs.IR

C-Pack: Packed Resources For General Chinese Embeddings

Pith reviewed 2026-05-13 13:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords Chinese text embeddings · C-MTEB benchmark · text embedding models · C-MTP dataset · natural language processing · embedding training · multilingual embeddings · semantic similarity

The pith

C-Pack supplies a benchmark, a training dataset, and models that let Chinese text embeddings outperform all earlier ones by up to 10 percent on the 35-dataset C-MTEB benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents C-Pack as a bundled set of three resources aimed at general Chinese text embeddings. It includes C-MTEB, a new evaluation covering six task types across 35 datasets, C-MTP, a large collection of labeled and unlabeled Chinese text for training, and C-TEM, a family of embedding models in several sizes. The authors report that their C-TEM models exceed previous Chinese embeddings on C-MTEB by as much as 10 percent at release time. They also optimize the full training pipeline for these models and release parallel English data and models that reach state-of-the-art on the English MTEB benchmark, with the English data twice the size of the Chinese data. All components are released publicly to support further work on Chinese embeddings.

Core claim

We introduce C-Pack, consisting of C-MTEB, a comprehensive Chinese embedding benchmark with 6 tasks and 35 datasets; C-MTP, a massive curated text embedding training set drawn from Chinese corpora; and C-TEM, a family of embedding models that achieve up to 10 percent higher scores than prior Chinese models on C-MTEB when trained with the integrated suite of methods.

What carries the argument

C-TEM models trained on the C-MTP dataset and evaluated on the C-MTEB benchmark.
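A benchmark of this shape turns many per-dataset scores into a handful of headline numbers. The sketch below shows one plausible aggregation, assuming simple averaging within each task type and over all datasets; the summary above does not state the aggregation rule, and the task names, dataset names, and scores are hypothetical placeholders rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-dataset scores: (task_type, dataset_name) -> score in [0, 100].
# Names and numbers are placeholders, not values from the paper.
scores = {
    ("retrieval", "dataset_a"): 70.0,
    ("retrieval", "dataset_b"): 60.0,
    ("classification", "dataset_c"): 80.0,
    ("sts", "dataset_d"): 50.0,
}

def per_task_average(scores):
    """Average the dataset scores inside each task type."""
    by_task = defaultdict(list)
    for (task, _dataset), value in scores.items():
        by_task[task].append(value)
    return {task: mean(values) for task, values in by_task.items()}

def overall_average(scores):
    """Average over all datasets, so tasks with more datasets weigh more."""
    return mean(list(scores.values()))

task_avg = per_task_average(scores)
overall = overall_average(scores)
```

Averaging over datasets rather than over tasks is a design choice: it lets task types with many datasets dominate the headline figure, which is one reason the dataset selection protocol matters.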

If this is right

  • Downstream Chinese NLP systems can adopt higher-quality embeddings for retrieval, classification, and semantic similarity tasks.
  • Open release of both the benchmark and the training data allows direct replication and extension by other researchers.
  • The English models reach top MTEB scores, and the released English data, twice the size of the Chinese data, provides a parallel resource.
  • The optimized training pipeline can be applied to produce embeddings in additional languages or sizes.
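The retrieval use named in the first bullet above usually reduces to ranking documents by vector similarity. A minimal sketch, assuming cosine similarity as the scoring function and using toy 3-dimensional vectors in place of real model outputs; the document ids and values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, corpus_vecs, k=2):
    """Return the ids of the k corpus vectors most similar to the query."""
    ranked = sorted(
        corpus_vecs,
        key=lambda doc_id: cosine(query_vec, corpus_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors standing in for embedding model outputs (hypothetical values).
corpus = {
    "doc_weather": [0.9, 0.1, 0.0],
    "doc_sports": [0.0, 1.0, 0.1],
    "doc_finance": [0.1, 0.0, 0.9],
}
query = [1.0, 0.2, 0.0]
best = top_k(query, corpus, k=2)
```

In practice the vectors would come from an embedding model and the search would use an approximate nearest-neighbor index; the ranking logic stays the same.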

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same packing approach of benchmark plus data plus model could be replicated for other languages to close performance gaps.
  • If C-MTEB becomes widely adopted, it may standardize evaluation and reduce hidden selection effects in future Chinese embedding papers.
  • Larger-scale training on the released C-MTP data could further widen the gap over prior methods.
  • Integration of the English and Chinese resources may support improved bilingual or multilingual embedding models.

Load-bearing premise

The C-MTEB collection of 35 datasets supplies an unbiased and comprehensive test of general Chinese embedding quality.

What would settle it

Release of a new Chinese embedding model that scores higher than the largest C-TEM variant on every C-MTEB task without using the C-MTP training data.
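The settling condition is a strict dominance test: the challenger must win on every task, not just on average. A small sketch of that check, assuming per-task scores are available as dictionaries; the task names and numbers are hypothetical.

```python
def beats_on_every_task(challenger, reference):
    """True only if the challenger scores strictly higher on every task in the reference."""
    return all(challenger[task] > reference[task] for task in reference)

# Hypothetical per-task averages; not taken from the paper.
reference = {"retrieval": 70.0, "sts": 55.0, "classification": 68.0}
challenger_a = {"retrieval": 71.0, "sts": 56.0, "classification": 69.5}
challenger_b = {"retrieval": 75.0, "sts": 54.0, "classification": 72.0}

a_settles_it = beats_on_every_task(challenger_a, reference)
b_settles_it = beats_on_every_task(challenger_b, reference)
```

Here `challenger_b` has the higher mean but loses on one task, so it would not meet the condition as stated.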

Original abstract

We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces C-Pack, a package of resources for general Chinese embeddings consisting of (1) C-MTEB, a benchmark spanning 6 tasks and 35 datasets, (2) C-MTP, a large curated training corpus from labeled and unlabeled Chinese text, and (3) C-TEM, a family of embedding models of varying sizes. The central claim is that the released C-TEM models outperform all prior Chinese text embeddings on C-MTEB by up to 10% at the time of release; the authors also release English data (twice the size of the Chinese data) and models that reach SOTA on MTEB.

Significance. If the performance claims hold under rigorous scrutiny, the work supplies valuable, publicly released resources that address the relative scarcity of high-quality Chinese embedding benchmarks and training data. The integration of multiple training methods into C-TEM and the dual-language release could accelerate progress in multilingual embedding research.

major comments (3)
  1. [§3] §3 (C-MTEB construction): the description of how the 35 datasets were selected and filtered lacks explicit criteria for avoiding task-specific overfitting or selection effects that could favor the proposed models; a clear protocol for dataset inclusion/exclusion is needed to substantiate the claim that C-MTEB is an unbiased measure of general Chinese embedding quality.
  2. [Table 2] Table 2 (main results): the reported gains of up to +10% are presented without standard deviations across runs, statistical significance tests, or details on the exact baseline implementations and hyper-parameters; this information is load-bearing for the central empirical claim.
  3. [§4.2] §4.2 (training procedure): the statement that the authors 'integrate and optimize the entire suite of training methods' is not accompanied by sufficient ablation results or hyper-parameter schedules to allow reproduction or assessment of whether the gains derive from data scale, model architecture, or training tricks.
minor comments (2)
  1. [Abstract] The GitHub link in the abstract should be repeated in the conclusion or data-availability statement for reader convenience.
  2. [Figure 1] Figure 1 caption could explicitly state the number of parameters for each C-TEM variant shown.
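The statistics requested in major comment 2 can be illustrated with a short sketch. It assumes two things the report does not specify: that run-to-run variation is summarized by the sample standard deviation over seeds, and that significance is assessed with a paired bootstrap over per-dataset scores. All numbers are hypothetical.

```python
import random
from statistics import mean, stdev

def summarize_runs(run_scores):
    """Mean and sample standard deviation of one model's score across seeds."""
    return mean(run_scores), stdev(run_scores)

def paired_bootstrap(new_scores, base_scores, n_resamples=2000, seed=0):
    """Fraction of resamples in which the new model's mean gain is not positive.

    Both lists hold one score per dataset, in the same dataset order.
    A small returned value suggests the gain does not hinge on a few datasets.
    """
    rng = random.Random(seed)
    diffs = [n - b for n, b in zip(new_scores, base_scores)]
    not_better = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        if mean(sample) <= 0:
            not_better += 1
    return not_better / n_resamples

# Hypothetical numbers for illustration only.
seed_runs = [63.1, 63.5, 62.9]
avg, spread = summarize_runs(seed_runs)
frac_not_better = paired_bootstrap(
    [64.0, 71.2, 58.3, 66.0],
    [60.5, 69.0, 55.1, 61.2],
)
```

Because the new model wins on every dataset in this toy input, no resample can have a non-positive mean gain; real benchmark results would typically give a value strictly between 0 and 1.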

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: [§3] §3 (C-MTEB construction): the description of how the 35 datasets were selected and filtered lacks explicit criteria for avoiding task-specific overfitting or selection effects that could favor the proposed models; a clear protocol for dataset inclusion/exclusion is needed to substantiate the claim that C-MTEB is an unbiased measure of general Chinese embedding quality.

    Authors: We agree that an explicit protocol strengthens the benchmark's credibility. In the revised manuscript we will add a dedicated subsection to §3 that details the inclusion/exclusion criteria, including steps taken to ensure task diversity, domain coverage, and avoidance of selection bias toward our training data. This protocol draws on established practices from MTEB while adapting for Chinese-specific considerations.
    Revision: yes

  2. Referee: [Table 2] Table 2 (main results): the reported gains of up to +10% are presented without standard deviations across runs, statistical significance tests, or details on the exact baseline implementations and hyper-parameters; this information is load-bearing for the central empirical claim.

    Authors: We acknowledge the value of statistical rigor. The revision will expand the experimental section and Table 2 caption with full hyper-parameter settings and exact baseline re-implementation details (including sources and any adaptations). Standard deviations are not reported in the current version because experiments used fixed seeds for reproducibility; we will add a note on this limitation and include variance estimates from additional runs where compute permits.
    Revision: partial

  3. Referee: [§4.2] §4.2 (training procedure): the statement that the authors 'integrate and optimize the entire suite of training methods' is not accompanied by sufficient ablation results or hyper-parameter schedules to allow reproduction or assessment of whether the gains derive from data scale, model architecture, or training tricks.

    Authors: We will revise §4.2 to include expanded ablation tables and a hyper-parameter schedule appendix. These additions will isolate the contributions of data scale, contrastive objectives, and other optimizations, enabling readers to assess the sources of improvement.
    Revision: yes

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an applied resource paper; the central claims rest on standard machine-learning practices for contrastive embedding training rather than new axioms or invented entities.

axioms (1)
  • domain assumption: Contrastive learning objectives on curated text pairs produce effective general-purpose embeddings.
    The paper states it integrates and optimizes the entire suite of training methods for C-TEM.
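To make the domain assumption concrete, here is a minimal sketch of a contrastive objective on text pairs. It assumes the InfoNCE loss with in-batch negatives, a common choice for embedding training; the summary does not say which loss the paper uses, and the temperature value and toy vectors are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    queries[i] and passages[i] form a positive pair; every other passage
    in the batch acts as a negative for queries[i].
    """
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, pj) / temperature for pj in p]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(q)

# Toy 2-dimensional embeddings (hypothetical values).
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
```

The loss is near zero when each query sits closest to its own passage and grows when the pairing is scrambled, which is the pressure that pulls matched texts together in embedding space.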

Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection

    cs.CL 2026-03 unverdicted novelty 8.0

    Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.

  2. Retrieval from Within: An Intrinsic Capability of Attention-Based Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.

  3. Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval

    cs.IR 2026-04 accept novelty 7.0

    Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.

  4. HaS: Accelerating RAG through Homology-Aware Speculative Retrieval

    cs.IR 2026-04 unverdicted novelty 7.0

    HaS accelerates RAG retrieval via homology-aware speculative retrieval and homologous query re-identification validation, cutting latency 24-37% with 1-2% accuracy drop on tested datasets.

  5. METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues

    cs.CL 2026-04 unverdicted novelty 7.0

    METRO induces both short-term actions and long-term planning from expert transcripts into a Strategy Forest, outperforming prior methods by 9-10% on two non-collaborative dialogue benchmarks.

  6. DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

    cs.AI 2026-04 unverdicted novelty 7.0

    DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human vali...

  7. MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

    cs.CL 2024-01 accept novelty 7.0

    MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.

  8. An Annotation Scheme and Classifier for Personal Facts in Dialogue

    cs.CL 2026-05 accept novelty 6.0

    An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 ...

  9. SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution

    cs.CL 2026-05 unverdicted novelty 6.0

    SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.

  10. Retrieval from Within: An Intrinsic Capability of Attention-Based Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.

  11. Agentic Retrieval-Augmented Generation for Financial Document Question Answering

    cs.AI 2026-05 unverdicted novelty 6.0

    FinAgent-RAG achieves 76.81-78.46% execution accuracy on financial QA benchmarks by combining contrastive retrieval, program-of-thought code generation, and adaptive strategy routing, outperforming baselines by 5.62-9...

  12. A Replicability Study of XTR

    cs.IR 2026-05 accept novelty 6.0

    XTR training does not improve retrieval effectiveness over ColBERT but enhances IVF engine efficiency by flattening token scores to produce more discriminative centroids.

  13. MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.

  14. MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment

    cs.CV 2026-04 unverdicted novelty 6.0

    MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.

  15. EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

    cs.DB 2026-04 unverdicted novelty 6.0

    EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

  16. EpiAgent: An Agent-Centric System for Ancient Inscription Restoration

    cs.CV 2026-04 unverdicted novelty 6.0

    EpiAgent is a new agent-centric system that restores degraded ancient inscriptions with better quality and generalization than prior rigid AI methods by using an LLM planner to coordinate multimodal tools and iterativ...

  17. Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA

    cs.IR 2026-04 conditional novelty 6.0

    Two-hop QA retrieval performance depends on whether the hop-2 entity is in the question or bridge passage, and a simple predicate-based router trained on one dataset transfers to improve R@5 on others.

  18. ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation

    cs.AI 2026-04 unverdicted novelty 6.0

    ResearchEVO automates the discover-then-explain cycle by evolving algorithms via fitness-driven LLM co-evolution and generating grounded, anti-hallucination research papers through sentence-level RAG.

  19. SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval

    cs.IR 2026-04 conditional novelty 6.0

    SelRoute routes queries to type-specific retrieval pipelines, achieving Recall@5 of 0.800 with a 109M model on LongMemEval_M and outperforming LLM-augmented baselines including a strong zero-ML lexical method.

  20. ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering

    cs.DL 2026-03 accept novelty 6.0

    ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic clusters.

  21. OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 5.0

    OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.

  22. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

    cs.AI 2026-04 unverdicted novelty 5.0

    DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-effic...

  23. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

    cs.AI 2026-04 unverdicted novelty 5.0

    DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.

  24. MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes

    cs.CL 2026-04 unverdicted novelty 5.0

    A hybrid cross-modal attention model using CLIP and BGE-M3 improves hate detection F1-macro by 5.9% over text-only baselines on Nepali memes while revealing failures of English-centric vision models and ensembles on s...

  25. Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

    cs.CL 2026-04 unverdicted novelty 5.0

    Domain-adapted utterance-level retrieval raises Cohen's kappa for tutoring dialogue act annotation to 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, beating no-retrieval baselines by large margins across three LLMs.

  26. The Geometry of Forgetting

    q-bio.NC 2026-03 unverdicted novelty 5.0

    Interference among memories in embedding spaces produces human-like power-law forgetting (b≈0.46) and false memories (false alarm rate 0.583) from raw pre-trained embeddings with zero tuning.

  27. Retrieval-Augmented Generation for AI-Generated Content: A Survey

    cs.CV 2024-02 accept novelty 5.0

    A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.

  28. Multilingual E5 Text Embeddings: A Technical Report

    cs.CL 2024-02 unverdicted novelty 5.0

    Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.

  29. Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data

    cs.CL 2026-04 conditional novelty 4.0

    Mira-Embeddings-V1 adapts embeddings for recruitment reranking by synthesizing positive and hard-negative samples with LLMs, then applies JD-JD contrastive and JD-CV triplet training plus a BoundaryHead MLP, lifting R...

  30. Hybrid Retrieval for COVID-19 Literature: Comparing Rank Fusion and Projection Fusion with Diversity Reranking

    cs.IR 2026-04 unverdicted novelty 4.0

    Rank fusion (RRF) reaches the highest relevance (nDCG@10 = 0.828) on expert COVID-19 queries while a projection fusion variant (B5) is 33% faster and produces more diverse results.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 28 Pith papers · 17 internal anchors

  1. [1]

    Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al . 2015. Semeval-2015 task 2: Semantic textual similarity, eng- lish, spanish and pilot on interpretability. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015) . 252–263

  2. [2]

    Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M Cer, Mona T Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe

  3. [3]

    In Se- mEval@ COLING

    SemEval-2014 Task 10: Multilingual Semantic Textual Similarity.. In Se- mEval@ COLING. 81–91

  4. [4]

    Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez Agirre, Rada Mihalcea, German Rigau Claramunt, and Janyce Wiebe. 2016. Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In SemEval-2016. 10th International Workshop on Semantic Evaluation; 2016 Jun 16-17; San Diego, CA. Stroudsburg (PA): ACL; 2016....

  5. [5]

    Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. Semeval- 2012 task 6: A pilot on semantic textual similarity. In * SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Eval...

  6. [6]

    Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo

  7. [7]

    In Second joint confer- ence on lexical and computational semantics (* SEM), volume 1: proceedings of the Main conference and the shared task: semantic textual similarity

    * SEM 2013 shared task: Semantic textual similarity. In Second joint confer- ence on lexical and computational semantics (* SEM), volume 1: proceedings of the Main conference and the shared task: semantic textual similarity . 32–43

  8. [8]

    Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, 15huggingface.co/hfl/chinese-roberta-wwm-ext-large Manan Dey, et al. 2023. SantaCoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988 (2023)

  9. [9]

    Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, and Wen-tau Yih. 2022. Task-aware retrieval with instructions. arXiv preprint arXiv:2211.09260 (2022)

  10. [10]

    Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, and Rodrigo Nogueira. 2021. mmarco: A multilingual version of the ms marco passage ranking dataset. arXiv preprint arXiv:2108.13897 (2021)

  11. [11]

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Ruther- ford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bog- dan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International conference on machine learning . PMLR, 2206–2240

  12. [12]

    Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning

  13. [13]

    arXiv preprint arXiv:1508.05326 (2015)

    A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326 (2015)

  14. [14]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

  15. [15]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Se- bastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)

  16. [16]

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)

  17. [17]

    Alexis Conneau and Douwe Kiela. 2018. SentEval: An Evaluation Toolkit for Universal Sentence Representations. arXiv preprint arXiv:1803.05449 (2018)

  18. [18]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805 (2018)

  19. [19]

    Luyu Gao and Jamie Callan. 2021. Condenser: a pre-training architecture for dense retrieval. arXiv preprint arXiv:2104.08253 (2021)

  20. [20]

    Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A framework for few-shot language model evaluation . https: //doi.org/10.5281/zenodo.5371628

  21. [21]

    Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. 2021. Scaling deep contrastive learning batch size under memory limited setup. arXiv preprint arXiv:2101.06983 (2021)

  22. [22]

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International conference on machine learning. PMLR, 3929–3938

  23. [23]

    Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, et al. 2017. Dureader: a chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073 (2017)

  24. [24]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes C-Pack: Packed Resources For General Chinese Embeddings SIGIR ’24, July 14–18, 2024, Washington, DC, USA Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv...

  25. [25]

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bo- janowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense in- formation retrieval with contrastive learning. arXiv preprint arXiv:2112.09118 (2021)

  26. [26]

    Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Few-shot learning with retrieval augmented language models.arXiv preprint arXiv:2208.03299 (2022)

  27. [27]

    Ehsan Kamalloo, Nandan Thakur, Carlos Lassance, Xueguang Ma, Jheng-Hong Yang, and Jimmy Lin. 2023. Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard. arXiv preprint arXiv:2306.07471 (2023)

  28. [28]

    Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open- domain question answering. arXiv preprint arXiv:2004.04906 (2020)

  29. [29]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474

  30. [30]

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023)

  31. [31]

    Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards General Text Embeddings with Multi-stage Contrastive Learning. arXiv preprint arXiv:2308.03281 (2023)

  32. [32]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

  33. [33]

    Zheng Liu and Yingxia Shao. 2022. Retromae: Pre-training retrieval-oriented transformers via masked auto-encoder. arXiv preprint arXiv:2205.12035 (2022)

  34. [34]

    Dingkun Long, Qiong Gao, Kuan Zou, Guangwei Xu, Pengjun Xie, Ruijie Guo, Jian Xu, Guanjun Jiang, Luxi Xing, and Ping Yang. 2022. Multi-cpr: A multi domain Chinese dataset for passage retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval . 3046–3056

  35. [35]

    Niklas Muennighoff. 2022. Sgpt: Gpt sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904 (2022)

  36. [36]

    Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. 2023. OctoPack: Instruction Tuning Code Large Language Models. arXiv preprint arXiv:2308.07124 (2023)

  37. [37]

    Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. 2023. Scaling Data-Constrained Language Models. arXiv preprint arXiv:2305.16264 (2023)

  38. [38]

    Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2022. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316 (2022)

  39. [39]

    Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786 (2022)

  40. [40]

    Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al

  41. [41]

    arXiv preprint arXiv:2201.10005 , year=

    Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005 (2022)

  42. [42]

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human-generated machine reading comprehension dataset. (2016)

  43. [43]

    Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B Hall, Daniel Cer, and Yinfei Yang. 2021. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877 (2021)

  44. [44]

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y Zhao, Yi Luan, Keith B Hall, Ming-Wei Chang, et al. 2021. Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899 (2021)

  45. [45]

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2023. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs.arXiv preprint arXiv:2307.16789 (2023)

  46. [46]

    Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu, and Haifeng Wang. 2022. DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine. arXiv preprint arXiv:2203.10232 (2022)

  47. [47]

    Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxi- ang Dong, Hua Wu, and Haifeng Wang. 2020. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2010.08191 (2020)

  48. [48]

    Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446 (2021)

  49. [49]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551

  50. [50]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 (2019)

  51. [51]

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207 (2021)

  53. [53]

    Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652 (2023)

  54. [54]

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615 (2022)

  55. [55]

    Hongjin Su, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, Tao Yu, et al. 2022. One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741 (2022)

  56. [56]

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663 (2021)

  57. [57]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Simlm: Pre-training with representation bottleneck for dense passage retrieval. arXiv preprint arXiv:2207.02578 (2022)

  58. [58]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 (2022)

  59. [59]

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021)

  60. [60]

    Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426 (2017)

  61. [61]

    Shitao Xiao, Zheng Liu, Weihao Han, Jianjin Zhang, Defu Lian, Yeyun Gong, Qi Chen, Fan Yang, Hao Sun, Yingxia Shao, et al. 2022. Distill-vq: Learning retrieval oriented vector quantization by distilling knowledge from dense embeddings. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1513–1523

  62. [62]

    Shitao Xiao, Zheng Liu, Weihao Han, Jianjin Zhang, Yingxia Shao, Defu Lian, Chaozhuo Li, Hao Sun, Denvy Deng, Liangjie Zhang, et al. 2022. Progressively optimized bi-granular document representation for scalable embedding based retrieval. In Proceedings of the ACM Web Conference 2022. 286–296

  63. [63]

    Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao. 2023. RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models. arXiv preprint arXiv:2305.02564 (2023)

  64. [64]

    Shitao Xiao, Zheng Liu, Yingxia Shao, Defu Lian, and Xing Xie. 2021. Matching-oriented product quantization for ad-hoc retrieval. arXiv preprint arXiv:2104.07858 (2021)

  65. [65]

    Xiaohui Xie, Qian Dong, Bingning Wang, Feiyang Lv, Ting Yao, Weinan Gan, Zhijing Wu, Xiangsheng Li, Haitao Li, Yiqun Liu, et al. 2023. T2Ranking: A large-scale Chinese Benchmark for Passage Ranking. arXiv preprint arXiv:2304.03679 (2023)

  66. [66]

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808 (2020)

  67. [67]

    Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, et al. 2020. CLUE: A Chinese language understanding evaluation benchmark. arXiv preprint arXiv:2004.05986 (2020)

  68. [68]

    Sha Yuan, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang, and Jie Tang. 2021. Wudaocorpora: A super large-scale chinese corpora for pre-training language models. AI Open 2 (2021), 65–68

  69. [69]

    Jianjin Zhang, Zheng Liu, Weihao Han, Shitao Xiao, Ruicheng Zheng, Yingxia Shao, Hao Sun, Hanqing Zhu, Premkumar Srinivasan, Weiwei Deng, et al. 2022. Uni-retriever: Towards learning the unified embedding based retriever in bing sponsored search. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4493–4501

  70. [70]

    S. Zhang, X. Zhang, H. Wang, L. Guo, and S. Liu. 2018. Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection. IEEE Access 6 (2018), 74061–74071. https://doi.org/10.1109/ACCESS.2018.2883637