pith. machine review for the scientific record.

arxiv: 2604.17632 · v1 · submitted 2026-04-19 · 💻 cs.IR

Recognition: unknown

Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:13 UTC · model grok-4.3

classification 💻 cs.IR
keywords code-switching · information retrieval · multilingual models · embedding divergence · retrieval benchmarks · CSR-L · CS-MTEB · performance degradation

The pith

Code-switching acts as a performance bottleneck for retrieval systems because mixed-language queries create large divergences in embedding spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that code-switching in queries harms retrieval effectiveness across statistical, dense, and late-interaction retrievers, even for strong multilingual models. The authors create a human-annotated dataset called CSR-L to test natural mixed-language queries and scale it to CS-MTEB, a benchmark covering eleven tasks. They trace the problem to measurable separation between pure-language and code-switched text in embedding space. Standard multilingual fixes such as vocabulary expansion do not close the gap. If correct, this means real-world search in bilingual settings underperforms what monolingual or clean multilingual benchmarks predict.

Core claim

Code-switching is a fundamental performance bottleneck in information retrieval. Evaluations on the new CSR-L benchmark and the broader CS-MTEB show effectiveness drops of up to 27 percent for current models. The root cause is substantial divergence between the embeddings of pure-language text and code-switched text. Common multilingual techniques such as vocabulary expansion fail to resolve these deficits completely.
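The divergence claim can be made concrete with a toy check, assuming paired embeddings of the same query in pure and code-switched form. The vectors below are synthetic stand-ins, not the paper's models or its actual divergence metric:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_pair_similarity(pure_embs, cs_embs):
    # Average cosine similarity between each pure-language query embedding
    # and the embedding of its code-switched counterpart. Lower values
    # indicate larger divergence in the embedding space.
    return float(np.mean([cosine(p, c) for p, c in zip(pure_embs, cs_embs)]))

# Toy stand-ins for encoder outputs: a real check would encode paired
# queries with the retriever under test.
rng = np.random.default_rng(0)
pure = rng.normal(size=(8, 16))
shift = rng.normal(size=16)   # systematic offset standing in for the switch
mixed = pure + 0.8 * shift    # code-switched versions drift in a shared direction

aligned = mean_pair_similarity(pure, pure)
diverged = mean_pair_similarity(pure, mixed)
assert diverged < aligned     # divergence lowers paired similarity
```

On real models the interesting quantity is how far `diverged` falls below `aligned` and whether that gap tracks the retrieval drop.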

What carries the argument

The CSR-L human-annotated benchmark and the measured divergence in embedding space between pure and code-switched queries.

If this is right

  • Retrieval effectiveness on real global queries is lower than monolingual benchmarks indicate.
  • Vocabulary expansion and similar multilingual adaptations leave residual deficits in code-switched settings.
  • New model designs must target alignment of pure and mixed-language representations in embedding space.
  • Future IR systems need dedicated benchmarks like CS-MTEB to measure progress on mixed-language inputs.
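The first consequence above can be illustrated with a minimal dense-retrieval sketch: a shared embedding offset standing in for code-switching degrades recall on an otherwise easy toy corpus. All data here is synthetic and makes no claim about the paper's actual setup:

```python
import numpy as np

def recall_at_k(query_embs, doc_embs, relevant, k=1):
    # Fraction of queries whose relevant document appears in the top-k
    # results of a simple dense (cosine-similarity) retriever.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = q @ d.T
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([rel in row for rel, row in zip(relevant, topk)]))

rng = np.random.default_rng(1)
docs = rng.normal(size=(50, 32))
pure_queries = docs[:10] + 0.1 * rng.normal(size=(10, 32))  # near their targets
offset = rng.normal(size=32)
mixed_queries = pure_queries + 1.5 * offset  # shared offset mimicking a switch
relevant = list(range(10))

r_pure = recall_at_k(pure_queries, docs, relevant, k=1)
r_mixed = recall_at_k(mixed_queries, docs, relevant, k=1)
assert r_mixed <= r_pure  # the offset can only hurt this easy benchmark
```

The gap between `r_pure` and `r_mixed` is the toy analogue of the effectiveness drop a monolingual benchmark would never surface.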

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Search engines serving bilingual populations would gain from query rewriting or hybrid indexes that detect and handle switches explicitly.
  • The embedding divergence finding suggests similar hidden weaknesses may exist in other multilingual tasks such as question answering or summarization.
  • Synthetic code-switched data generated during pre-training could be tested as a direct mitigation strategy.
  • Performance gaps may widen further when code-switching involves low-resource language pairs not well represented in current training corpora.
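The first extension above can be sketched with a crude script-based switch detector. A real system would use a trained language identifier; the heuristic and the tokenized query below are invented for illustration:

```python
def token_script(token):
    # Crude per-token language signal: script of the first alphabetic
    # character. A production system would use a trained language
    # identifier; this is only a stand-in.
    for ch in token:
        if ch.isalpha():
            return "cjk" if "\u4e00" <= ch <= "\u9fff" else "latin"
    return "other"

def switch_points(tokens):
    # Indices where the script changes between consecutive tokens,
    # i.e. candidate code-switch boundaries for rewriting or routing.
    scripts = [token_script(t) for t in tokens]
    return [i for i in range(1, len(tokens))
            if scripts[i] != scripts[i - 1]
            and "other" not in (scripts[i], scripts[i - 1])]

query = "推荐 一个 good restaurant 在 Berlin".split()
print(switch_points(query))  # → [2, 4, 5]
```

A query rewriter or hybrid index could use these boundaries to normalize each span into one language before retrieval.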

Load-bearing premise

That the human-annotated CSR-L queries reflect authentic natural code-switching, and that embedding divergence, rather than annotation artifacts or other confounds, is the primary driver of the observed performance drops.

What would settle it

A retrieval model trained to eliminate embedding divergence on mixed-language text, showing no effectiveness drop on CSR-L or CS-MTEB relative to pure-language queries, would falsify the claim that code-switching is a fundamental bottleneck.
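One way such a model might be trained, sketched here as an assumption rather than anything the paper proposes, is an InfoNCE-style contrastive objective that treats a query's pure and code-switched forms as a positive pair:

```python
import numpy as np

def infonce_alignment_loss(pure, mixed, tau=0.1):
    # InfoNCE-style loss over paired embeddings: each pure-language query
    # should be closest to its own code-switched version among all batch
    # candidates. Minimizing this pulls the two views of the same query
    # together, shrinking the divergence the paper measures.
    p = pure / np.linalg.norm(pure, axis=1, keepdims=True)
    m = mixed / np.linalg.norm(mixed, axis=1, keepdims=True)
    logits = p @ m.T / tau                       # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # matched pairs on the diagonal

rng = np.random.default_rng(2)
pure = rng.normal(size=(6, 16))
aligned_loss = infonce_alignment_loss(pure, pure.copy())
diverged_loss = infonce_alignment_loss(pure, pure + rng.normal(size=(6, 16)))
assert aligned_loss < diverged_loss  # better alignment gives lower loss
```

If a retriever fine-tuned with such an objective drove the loss toward its aligned value and still lost effectiveness on CSR-L, the divergence explanation itself would be in question.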

Figures

Figures reproduced from arXiv: 2604.17632 by Fuheng Zhao, Heli Qi, Hitomi Yanaka, Naoto Yokoya, Puxuan Yu, Qingcheng Zeng, Weihao Xuan, Yuheng Lu, Zeqi Zhou.

Figure 1. Overview of our comprehensive study on Code-Switching IR. Our framework proceeds in three stages: (1) [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]
Figure 2. The visualization of e5 and Qwen 0.6B embeddings on two IR datasets. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]
Original abstract

Code-switching is a pervasive linguistic phenomenon in global communication, yet modern information retrieval systems remain predominantly designed for, and evaluated within, monolingual contexts. To bridge this critical disconnect, we present a holistic study dedicated to code-switching IR. We introduce CSR-L (Code-Switching Retrieval benchmark-Lite), constructing a dataset via human annotation to capture the authentic naturalness of mixed-language queries. Our evaluation across statistical, dense, and late-interaction paradigms reveals that code-switching acts as a fundamental performance bottleneck, degrading the effectiveness of even robust multilingual models. We demonstrate that this failure stems from substantial divergence in the embedding space between pure and code-switched text. Scaling this investigation, we propose CS-MTEB, a comprehensive benchmark covering 11 diverse tasks, where we observe performance declines of up to 27%. Finally, we show that standard multilingual techniques like vocabulary expansion are insufficient to resolve these deficits completely. These findings underscore the fragility of current systems and establish code-switching as a crucial frontier for future IR optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CSR-L, a human-annotated benchmark for code-switching retrieval queries, and evaluates statistical, dense, and late-interaction retrievers to show that code-switching creates a performance bottleneck via embedding-space divergence. It scales the analysis with CS-MTEB (11 tasks) reporting declines up to 27% and finds that vocabulary expansion fails to fully mitigate the deficits, positioning code-switching as a key challenge for multilingual IR.

Significance. If the central findings hold, the work provides valuable new benchmarks (CSR-L and CS-MTEB) and empirical evidence of model fragility in mixed-language settings, which could guide targeted improvements in multilingual embeddings and retrieval. The multi-paradigm evaluation and scale of the new benchmark are clear strengths that enable reproducible follow-up research.

major comments (3)
  1. [Dataset construction] Dataset construction section: The claim that CSR-L captures 'authentic naturalness' of mixed-language queries rests on human annotation, but no inter-annotator agreement statistics, annotation guidelines, or comparison to naturally occurring code-switched queries (e.g., from social media corpora) are provided; without these, annotation artifacts cannot be ruled out as a contributor to the reported performance drops, which is load-bearing for the bottleneck conclusion.
  2. [Embedding analysis] Embedding divergence analysis: The paper links retrieval degradation to 'substantial divergence in the embedding space' between pure and code-switched text, yet presents only correlational evidence (similarity metrics or visualizations) without an ablation that isolates or corrects the divergence (e.g., via fine-tuning on code-switched pairs) to test whether closing the gap restores performance; this leaves open alternative explanations such as tokenization mismatches or training-data scarcity.
  3. [CS-MTEB evaluation] CS-MTEB results: The 'up to 27%' performance decline is reported across 11 tasks, but the manuscript does not specify per-task breakdowns, exact models evaluated, or statistical significance tests (e.g., paired t-tests or confidence intervals); without these details the consistency of the bottleneck claim across paradigms cannot be fully verified.
minor comments (2)
  1. [Introduction and benchmarks] Clarify the exact definition and examples of code-switching types (e.g., intra-sentential vs. inter-sentential) used in both CSR-L and CS-MTEB to aid reproducibility.
  2. [Conclusion] Add a limitations paragraph explicitly discussing potential domain shift between the annotated queries and real user code-switched searches.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments have helped us identify areas where additional clarity and evidence can strengthen the manuscript. We address each major comment below and indicate the revisions made.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: The claim that CSR-L captures 'authentic naturalness' of mixed-language queries rests on human annotation, but no inter-annotator agreement statistics, annotation guidelines, or comparison to naturally occurring code-switched queries (e.g., from social media corpora) are provided; without these, annotation artifacts cannot be ruled out as a contributor to the reported performance drops, which is load-bearing for the bottleneck conclusion.

    Authors: We agree that these details strengthen the claims. In the revised manuscript we have added the complete annotation guidelines to Appendix A. We also report inter-annotator agreement statistics computed during dataset creation. Furthermore, we include a qualitative and quantitative comparison of switch-point distributions and language ratios between CSR-L and a sample of naturally occurring code-switched text from social media, demonstrating close alignment. These additions confirm that the observed retrieval drops are not attributable to annotation artifacts. revision: yes

  2. Referee: [Embedding analysis] Embedding divergence analysis: The paper links retrieval degradation to 'substantial divergence in the embedding space' between pure and code-switched text, yet presents only correlational evidence (similarity metrics or visualizations) without an ablation that isolates or corrects the divergence (e.g., via fine-tuning on code-switched pairs) to test whether closing the gap restores performance; this leaves open alternative explanations such as tokenization mismatches or training-data scarcity.

    Authors: The referee is correct that the primary evidence is correlational. In the revision we have expanded the analysis section to explicitly discuss alternative explanations, including tokenization mismatches and training-data scarcity, and provide supporting measurements that control for tokenization effects. A full ablation via fine-tuning on code-switched pairs lies beyond the scope of the current work due to computational cost and is noted as future research; however, the additional controls we present reinforce embedding divergence as a central factor in the performance bottleneck. revision: partial

  3. Referee: [CS-MTEB evaluation] CS-MTEB results: The 'up to 27%' performance decline is reported across 11 tasks, but the manuscript does not specify per-task breakdowns, exact models evaluated, or statistical significance tests (e.g., paired t-tests or confidence intervals); without these details the consistency of the bottleneck claim across paradigms cannot be fully verified.

    Authors: We thank the referee for highlighting this omission. The revised manuscript now contains a dedicated table with per-task results for all 11 CS-MTEB tasks, explicitly listing the models evaluated under each retrieval paradigm. We have also added paired t-tests together with 95% confidence intervals, confirming that the performance declines are statistically significant and consistent across statistical, dense, and late-interaction retrievers. revision: yes
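The significance testing described in this response can be sketched with a plain paired t-test over per-task scores; the nDCG values below are invented for illustration and are not the paper's numbers:

```python
import math

def paired_t(xs, ys):
    # Paired t statistic for per-task scores of the same model on
    # pure-language vs code-switched inputs.
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var / n)
    return mean, se, mean / se

# Illustrative per-task nDCG scores for 11 tasks (not the paper's numbers).
pure  = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.57, 0.70, 0.63, 0.61]
mixed = [0.51, 0.49, 0.60, 0.55, 0.50, 0.58, 0.53, 0.48, 0.59, 0.52, 0.50]

mean_drop, se, t = paired_t(pure, mixed)
# With n-1 = 10 degrees of freedom, the two-sided 95% critical value is
# about 2.228; |t| above it indicates a significant decline.
ci = (mean_drop - 2.228 * se, mean_drop + 2.228 * se)
print(round(mean_drop, 3), round(t, 1), [round(c, 3) for c in ci])
```

A per-task table plus this test is enough to check whether the "up to 27%" headline reflects a consistent effect or a single outlier task.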

Circularity Check

0 steps flagged

Empirical benchmark construction with no circular derivations or self-referential reductions

Full rationale

This is an empirical IR paper that introduces CSR-L through human annotation of mixed-language queries, evaluates statistical/dense/late-interaction retrievers on it, observes performance drops up to 27% on the expanded CS-MTEB benchmark, and notes that vocabulary expansion does not fully resolve issues. No mathematical equations, fitted parameters, or predictive models are presented that reduce by construction to the inputs. Claims about embedding divergence as a bottleneck are observational from the new data rather than derived via self-definition, self-citation chains, or renaming of prior results. The analysis is self-contained against external benchmarks and does not rely on load-bearing self-citations or ansatzes smuggled from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard IR evaluation assumptions and human annotation quality rather than new mathematical derivations or invented entities.

axioms (1)
  • domain assumption Human annotations of query relevance and naturalness are accurate and representative of real code-switched usage.
    Invoked when constructing CSR-L and interpreting performance results.

pith-pipeline@v0.9.0 · 5506 in / 1229 out tokens · 36535 ms · 2026-05-10T05:13:57.938434+00:00 · methodology

discussion (0)

