Introduces Indi-RomCoM benchmark for evaluating LLMs on Romanized code-mixed Indic-English instructions across seven tasks, four languages, and three mixing levels.
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
BrahmicTokenizer-131K is a 131K-vocab tokenizer constructed via script-prune crop and linear-programming retrofit to o200k_base, achieving 26.7% fewer tokens on Indic text while matching o200k_base on English fertility and outperforming alternatives on code/math benchmarks.
citing papers explorer
- Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions
Introduces Indi-RomCoM benchmark for evaluating LLMs on Romanized code-mixed Indic-English instructions across seven tasks, four languages, and three mixing levels.
- BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base
BrahmicTokenizer-131K is a 131K-vocab tokenizer constructed via script-prune crop and linear-programming retrofit to o200k_base, achieving 26.7% fewer tokens on Indic text while matching o200k_base on English fertility and outperforming alternatives on code/math benchmarks.