MimeLens uses position-agnostic BERT encoders pretrained on random-offset binary windows to output one of 125 libmagic MIME labels, beating Magika on full files and enabling accurate classification on mid-file fragments.
Clark, Dan Garrette, Iulia Turc, and John Wieting
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Tokenizer fertility varies 2.5x across 25 European languages with domain-invariant rankings, morphological fragmentation in high-fertility cases, and a Ukrainian penalty from pre-training underrepresentation.
citing papers explorer
-
MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments
MimeLens uses position-agnostic BERT encoders pretrained on random-offset binary windows to output one of 125 libmagic MIME labels, beating Magika on full files and enabling accurate classification on mid-file fragments.
-
The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty
Tokenizer fertility varies 2.5x across 25 European languages with domain-invariant rankings, morphological fragmentation in high-fertility cases, and a Ukrainian penalty from pre-training underrepresentation.