C-Pack: Packed Resources For General Chinese Embeddings
Pith reviewed 2026-05-13 13:22 UTC · model grok-4.3
The pith
C-Pack supplies a benchmark, a training dataset, and models that let Chinese text embeddings outperform all earlier ones by up to 10 percent on a benchmark of 35 datasets spanning 6 tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce C-Pack consisting of C-MTEB, a comprehensive Chinese embedding benchmark with 6 tasks and 35 datasets, C-MTP, a massive curated text embedding training set drawn from Chinese corpora, and C-TEM, a family of embedding models that achieve up to 10 percent higher scores than prior Chinese models on C-MTEB when trained with the integrated suite of methods.
What carries the argument
C-TEM models trained on the C-MTP dataset and evaluated on the C-MTEB benchmark.
If this is right
- Downstream Chinese NLP systems can adopt higher-quality embeddings for retrieval, classification, and semantic similarity tasks.
- Open release of both the benchmark and the training data allows direct replication and extension by other researchers.
- The English models, which reach top MTEB scores, and the English dataset, which is twice the size of the Chinese one, provide a parallel resource.
- The optimized training pipeline can be applied to produce embeddings in additional languages or sizes.
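As a minimal sketch of how a downstream system might use such embeddings for retrieval, the snippet below ranks documents by cosine similarity to a query. The vectors are hand-written toy values standing in for model outputs, and the helper names are hypothetical; no C-TEM model or library call is involved.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, normalize(d))) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy 3-dimensional "embeddings" (assumed values, for illustration only).
toy_docs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
ranking = rank_by_cosine([1.0, 0.0, 0.0], toy_docs)
print(ranking)  # [0, 2, 1]
```

In practice the toy vectors would be replaced by the output of an embedding model, and the linear scan by an approximate nearest-neighbor index once the corpus grows.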
Where Pith is reading between the lines
- The same packing approach of benchmark plus data plus model could be replicated for other languages to close performance gaps.
- If C-MTEB becomes widely adopted, it may standardize evaluation and reduce hidden selection effects in future Chinese embedding papers.
- Larger-scale training on the released C-MTP data could further widen the gap over prior methods.
- Integration of the English and Chinese resources may support improved bilingual or multilingual embedding models.
Load-bearing premise
The C-MTEB collection of 35 datasets supplies an unbiased and comprehensive test of general Chinese embedding quality.
What would settle it
Release of a new Chinese embedding model that scores higher than the largest C-TEM variant on every C-MTEB task without using the C-MTP training data.
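The settling condition above is stricter than a higher average: the challenger must win on every task. A rough sketch of that check follows; the task names and scores are hypothetical placeholders, not figures from the paper, and the function name is an assumption made for illustration.

```python
def beats_on_every_task(challenger, reference):
    """True only if the challenger scores strictly higher on every reference task."""
    missing = set(reference) - set(challenger)
    if missing:
        raise ValueError(f"challenger lacks scores for: {sorted(missing)}")
    return all(challenger[task] > reference[task] for task in reference)

# Hypothetical per-task scores, for illustration only.
reference_scores = {"retrieval": 70.0, "classification": 68.0, "sts": 55.0}
strong_challenger = {"retrieval": 71.0, "classification": 69.5, "sts": 56.0}
mixed_challenger = {"retrieval": 75.0, "classification": 67.0, "sts": 58.0}

print(beats_on_every_task(strong_challenger, reference_scores))  # True
print(beats_on_every_task(mixed_challenger, reference_scores))   # False
```

The second challenger has the higher mean score yet fails the test, which is why a per-task comparison and an averaged leaderboard score can disagree.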
Original abstract
We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces C-Pack, a package of resources for general Chinese embeddings consisting of (1) C-MTEB, a benchmark spanning 6 tasks and 35 datasets, (2) C-MTP, a large curated training corpus from labeled and unlabeled Chinese text, and (3) C-TEM, a family of embedding models of varying sizes. The central claim is that the released C-TEM models outperform all prior Chinese text embeddings on C-MTEB by up to 10% at the time of release; the authors also release English data (twice the size of the Chinese data) and models that reach SOTA on MTEB.
Significance. If the performance claims hold under rigorous scrutiny, the work supplies valuable, publicly released resources that address the relative scarcity of high-quality Chinese embedding benchmarks and training data. The integration of multiple training methods into C-TEM and the dual-language release could accelerate progress in multilingual embedding research.
major comments (3)
- [§3] §3 (C-MTEB construction): the description of how the 35 datasets were selected and filtered lacks explicit criteria for avoiding task-specific overfitting or selection effects that could favor the proposed models; a clear protocol for dataset inclusion/exclusion is needed to substantiate the claim that C-MTEB is an unbiased measure of general Chinese embedding quality.
- [Table 2] Table 2 (main results): the reported gains of up to +10% are presented without standard deviations across runs, statistical significance tests, or details on the exact baseline implementations and hyper-parameters; this information is load-bearing for the central empirical claim.
- [§4.2] §4.2 (training procedure): the statement that the authors 'integrate and optimize the entire suite of training methods' is not accompanied by sufficient ablation results or hyper-parameter schedules to allow reproduction or assessment of whether the gains derive from data scale, model architecture, or training tricks.
minor comments (2)
- [Abstract] The GitHub link in the abstract should be repeated in the conclusion or data-availability statement for reader convenience.
- [Figure 1] Figure 1 caption could explicitly state the number of parameters for each C-TEM variant shown.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.
Point-by-point responses
- Referee: [§3] §3 (C-MTEB construction): the description of how the 35 datasets were selected and filtered lacks explicit criteria for avoiding task-specific overfitting or selection effects that could favor the proposed models; a clear protocol for dataset inclusion/exclusion is needed to substantiate the claim that C-MTEB is an unbiased measure of general Chinese embedding quality.
  Authors: We agree that an explicit protocol strengthens the benchmark's credibility. In the revised manuscript we will add a dedicated subsection to §3 that details the inclusion/exclusion criteria, including steps taken to ensure task diversity, domain coverage, and avoidance of selection bias toward our training data. This protocol draws on established practices from MTEB while adapting for Chinese-specific considerations.
  Revision: yes
- Referee: [Table 2] Table 2 (main results): the reported gains of up to +10% are presented without standard deviations across runs, statistical significance tests, or details on the exact baseline implementations and hyper-parameters; this information is load-bearing for the central empirical claim.
  Authors: We acknowledge the value of statistical rigor. The revision will expand the experimental section and Table 2 caption with full hyper-parameter settings and exact baseline re-implementation details (including sources and any adaptations). Standard deviations are not reported in the current version because experiments used fixed seeds for reproducibility; we will add a note on this limitation and include variance estimates from additional runs where compute permits.
  Revision: partial
- Referee: [§4.2] §4.2 (training procedure): the statement that the authors 'integrate and optimize the entire suite of training methods' is not accompanied by sufficient ablation results or hyper-parameter schedules to allow reproduction or assessment of whether the gains derive from data scale, model architecture, or training tricks.
  Authors: We will revise §4.2 to include expanded ablation tables and a hyper-parameter schedule appendix. These additions will isolate the contributions of data scale, contrastive objectives, and other optimizations, enabling readers to assess the sources of improvement.
  Revision: yes
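The variance estimates and significance tests raised in the Table 2 exchange could be produced along the following lines. This is only a sketch under stated assumptions: the function names are hypothetical, the paired bootstrap over per-dataset score differences is one common choice rather than a method the paper reports, and any numbers passed in would have to come from actual runs.

```python
import random
import statistics

def mean_and_std(scores):
    """Mean and sample standard deviation of one model's scores across runs."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_bootstrap_win_rate(model_scores, baseline_scores, n_resamples=10_000, seed=0):
    """Fraction of bootstrap resamples in which the model's total beats the baseline's.

    Scores are paired by dataset: position i in both lists refers to the same dataset.
    """
    rng = random.Random(seed)
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    wins = 0
    for _ in range(n_resamples):
        # Resample datasets with replacement and check the sign of the summed gap.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) > 0:
            wins += 1
    return wins / n_resamples
```

A call such as `mean_and_std([60.0, 62.0, 64.0])` returns a mean of 62.0 and a standard deviation of 2.0. A win rate close to 1.0 from `paired_bootstrap_win_rate` would suggest the gain is not driven by a handful of datasets.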
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Contrastive learning objectives on curated text pairs produce effective general-purpose embeddings.
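To make the axiom concrete, here is a minimal sketch of one widely used contrastive objective over text pairs, an InfoNCE-style loss with in-batch negatives. The page does not state which loss the authors use, so the loss form, the temperature value, and the function names are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(queries, passages, temperature=0.05):
    """Mean in-batch contrastive loss; passages[i] is the positive for queries[i]."""
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the true pair.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy 2-dimensional embeddings (assumed values): correct pairing vs. swapped pairing.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

The loss is near zero when each query is closest to its own passage and grows when the pairing is wrong, which is the pressure that pulls matched texts together in the embedding space.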
Forward citations
Cited by 49 Pith papers
- Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection
  Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.
- IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions
  IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
- Retrieval from Within: An Intrinsic Capability of Attention-Based Models
  Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.
- Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval
  Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.
- HaS: Accelerating RAG through Homology-Aware Speculative Retrieval
  HaS accelerates RAG retrieval via homology-aware speculative retrieval and homologous query re-identification validation, cutting latency 24-37% with 1-2% accuracy drop on tested datasets.
- METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
  METRO induces both short-term actions and long-term planning from expert transcripts into a Strategy Forest, outperforming prior methods by 9-10% on two non-collaborative dialogue benchmarks.
- DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?
  DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human vali...
- PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
  PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
- CHIMERA: A Knowledge Base of Scientific Idea Recombinations for Research Analysis and Ideation
  CHIMERA is the first large-scale mined KB of concept recombinations from scientific literature, created via a new IE task and LLM extraction, with demonstrated uses in pattern analysis and hypothesis generation.
- VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
  VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.
- MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
  MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
- Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance
  Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.
- An Annotation Scheme and Classifier for Personal Facts in Dialogue
  An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 ...
- SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution
  SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
- Retrieval from Within: An Intrinsic Capability of Attention-Based Models
  Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
- Agentic Retrieval-Augmented Generation for Financial Document Question Answering
  FinAgent-RAG achieves 76.81-78.46% execution accuracy on financial QA benchmarks by combining contrastive retrieval, program-of-thought code generation, and adaptive strategy routing, outperforming baselines by 5.62-9...
- A Replicability Study of XTR
  XTR training does not improve retrieval effectiveness over ColBERT but enhances IVF engine efficiency by flattening token scores to produce more discriminative centroids.
- MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents
  A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
- MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment
  MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.
- EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
  EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
- EpiAgent: An Agent-Centric System for Ancient Inscription Restoration
  EpiAgent is a new agent-centric system that restores degraded ancient inscriptions with better quality and generalization than prior rigid AI methods by using an LLM planner to coordinate multimodal tools and iterativ...
- Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA
  Two-hop QA retrieval performance depends on whether the hop-2 entity is in the question or bridge passage, and a simple predicate-based router trained on one dataset transfers to improve R@5 on others.
- ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation
  ResearchEVO automates the discover-then-explain cycle by evolving algorithms via fitness-driven LLM co-evolution and generating grounded, anti-hallucination research papers through sentence-level RAG.
- SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval
  SelRoute routes queries to type-specific retrieval pipelines, achieving Recall@5 of 0.800 with a 109M model on LongMemEval_M and outperforming LLM-augmented baselines including a strong zero-ML lexical method.
- ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering
  ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic clusters.
- LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning
  LLM agents iteratively generate and optimize data processing strategies for fine-tuning, delivering over 80% win rates versus unprocessed data and 65% versus LLM-based AutoML baselines while cutting search time by up to 10x.
- Retrieval-Augmented Generation for Natural Language Processing: A Survey
  The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.
- Crystallizing Schemas with Teleoscope: Thematic Curation of Large Text Corpora on Reddit
  Teleoscope enables thematic curation of large Reddit corpora via interactive refinement, with three deployments indicating benefits in serendipitous keyword discovery, search saturation confidence, and collaborative c...
- OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
  OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
- DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
  DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-effic...
- DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
  DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.
- MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes
  A hybrid cross-modal attention model using CLIP and BGE-M3 improves hate detection F1-macro by 5.9% over text-only baselines on Nepali memes while revealing failures of English-centric vision models and ensembles on s...
- Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts
  Domain-adapted utterance-level retrieval raises Cohen's kappa for tutoring dialogue act annotation to 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, beating no-retrieval baselines by large margins across three LLMs.
- The Geometry of Forgetting
  Interference among memories in embedding spaces produces human-like power-law forgetting (b≈0.46) and false memories (false alarm rate 0.583) from raw pre-trained embeddings with zero tuning.
- Robustness Risk of Conversational Retrieval: Identifying and Mitigating Noise Sensitivity in Qwen3-Embedding Model
  Qwen3-embedding models show noise sensitivity in conversational retrieval where dialogue artifacts rank highly despite lacking semantic value, a problem reduced by query prompting and more severe than in prior Qwen ve...
- Predicting Intermittent Job Failure Categories for Diagnosis Using Few-Shot Fine-Tuned Language Models
  FlaXifyer applies few-shot learning on pre-trained language models to categorize intermittent CI job failures from logs at 84.3% Macro F1 and 92.0% Top-2 accuracy using 12 examples per category, with LogSift reducing ...
- Geometric Organization of Cognitive States in Transformer Embedding Spaces
  Transformer sentence embeddings linearly recover continuous energy scores and seven-tier cognitive labels from 480 annotated sentences, with UMAP showing a coherent low-to-high gradient.
- Contradictions in Context: Challenges for Retrieval-Augmented Generation in Healthcare
  Contradictions between highly similar medical abstracts degrade the factual accuracy and consistency of LLM responses in retrieval-augmented generation.
- Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
  A 300M multilingual embedding model matches or exceeds 7B retrieval performance via optimized data scale, hard negatives, and task diversity over language diversity.
- Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
  ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
- E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning
  E2LLM uses encoder-based soft prompt compression for long contexts to improve LLM reasoning on tasks like summarization and QA while maintaining efficiency.
- Retrieval-Augmented Generation for AI-Generated Content: A Survey
  A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.
- Multilingual E5 Text Embeddings: A Technical Report
  Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.
- Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning
  Data-CUBE applies a two-level curriculum (TSP-based task ordering via simulated annealing plus difficulty-sorted mini-batches) to multi-task instruction tuning and reports gains on MTEB sentence representation tasks.
- Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems
  GE2 tops BEIR and Italian RAG benchmarks at nDCG@10 of 0.638 and 0.282 but with 231.6 ms latency; mE5-L is competitive on Italian at 31 ms while LaBSE underperforms all dedicated retrieval models.
- HPC-LLM: Practical Domain Adaptation and Retrieval-Augmented Generation for HPC Support
  HPC-LLM fine-tunes Llama 3.1 8B via QLoRA on 9k-24k HPC examples and adds dense retrieval to deliver practical support for job scheduling, MPI, and GPU workflows, approaching the performance of larger general models a...
- Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data
  Mira-Embeddings-V1 adapts embeddings for recruitment reranking by synthesizing positive and hard-negative samples with LLMs, then applies JD-JD contrastive and JD-CV triplet training plus a BoundaryHead MLP, lifting R...
- Hybrid Retrieval for COVID-19 Literature: Comparing Rank Fusion and Projection Fusion with Diversity Reranking
  Rank fusion (RRF) reaches the highest relevance (nDCG@10 = 0.828) on expert COVID-19 queries while a projection fusion variant (B5) is 33% faster and produces more diverse results.
- From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems
  Coreference resolution improves retrieval relevance and QA performance in RAG systems, with mean pooling performing best and smaller models benefiting more.