Recognition: unknown
FastText.zip: Compressing text classification models
Original abstract:
We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory. After considering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent quantization artefacts. Our experiments carried out on several benchmarks show that our approach typically requires two orders of magnitude less memory than fastText while being only slightly inferior with respect to accuracy. As a result, it outperforms the state of the art by a good margin in terms of the compromise between memory usage and accuracy.
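The compression described in the abstract rests on product quantization (PQ) of the word-embedding matrix. The sketch below is a generic, illustrative PQ implementation in Python, not the authors' fastText code: the names pq_train, pq_encode, and pq_decode and the choices of 4 sub-vectors and 256 centroids per sub-space are assumptions for the example, and the paper's adaptations for limiting quantization artefacts (as well as its hashing-inspired baselines) are omitted. With 256 centroids, each sub-vector index fits in one byte, which is where the memory saving comes from.

```python
# Illustrative product-quantization (PQ) sketch for a word-embedding matrix.
# Assumed setup for this example: m=4 sub-vectors, k=256 centroids per
# sub-space, so each word costs m bytes of codes plus shared codebooks.
import numpy as np
from sklearn.cluster import KMeans

def pq_train(emb, m=4, k=256, seed=0):
    """Learn one k-means codebook per contiguous sub-vector block."""
    n, d = emb.shape
    assert d % m == 0, "embedding dimension must be divisible by m"
    d_sub = d // m
    codebooks = []
    for j in range(m):
        block = emb[:, j * d_sub:(j + 1) * d_sub]
        km = KMeans(n_clusters=k, n_init=4, random_state=seed).fit(block)
        codebooks.append(km.cluster_centers_.astype(np.float32))  # (k, d_sub)
    return codebooks

def pq_encode(emb, codebooks):
    """Store each sub-vector as the index of its nearest centroid (1 byte when k <= 256)."""
    d_sub = codebooks[0].shape[1]
    codes = np.empty((emb.shape[0], len(codebooks)), dtype=np.uint8)
    for j, cb in enumerate(codebooks):
        block = emb[:, j * d_sub:(j + 1) * d_sub]
        # squared Euclidean distances to all centroids, computed without a 3-D tensor
        dists = (block ** 2).sum(1, keepdims=True) - 2.0 * block @ cb.T + (cb ** 2).sum(1)
        codes[:, j] = dists.argmin(axis=1)
    return codes

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    return np.hstack([cb[codes[:, j]] for j, cb in enumerate(codebooks)])

# Toy usage: 2,000 "words" with 100-dim float32 embeddings (~800 kB) become
# 8 kB of codes plus 4 shared codebooks, at the price of reconstruction error.
emb = np.random.randn(2_000, 100).astype(np.float32)
codebooks = pq_train(emb)
codes = pq_encode(emb, codebooks)
approx = pq_decode(codes, codebooks)
print(codes.shape, float(((emb - approx) ** 2).mean()))
```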
This paper has not been read by Pith yet.
Forward citations
Cited by 13 Pith papers
- Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
  Prepending stochastic sequences from Lorem Ipsum vocabulary to prompts during GRPO resampling broadens reasoning exploration and outperforms standard resampling on hard tasks for 1.7B-7B models.
- Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
  SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
- Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation
  Gemma 3 27B and Aya Expanse 32B are effective multilingual teachers; data qualities such as diversity and fluency, rather than model size, explain over 93% of the variance in student performance.
- DocAtlas: Multilingual Document Understanding Across 80+ Languages
  DocAtlas creates multilingual document datasets across 82 languages and shows that DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.
- Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data
  Synthetic pre-pre-training on structured data improves LLM robustness to noisy pre-training, matching baseline loss with up to 49% fewer natural tokens for a 1B model.
- A Comparative Study of Semantic Log Representations for Software Log-based Anomaly Detection
  QTyBERT matches or exceeds BERT-based log anomaly detection effectiveness while reducing embedding generation time to near static word embedding levels.
- The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
  FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
  Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
- Cross-lingual Comparison of Research Funding Projects with Multilingual Sentence-BERT: Evidence from KAKENHI, NIH, NSF, and UKRI
  Multilingual Sentence-BERT places original Japanese and machine-translated English versions of the same KAKENHI projects closer together than to native English projects from other agencies, yet nearest-neighbor overla...
- Language as a Latent Variable for Reasoning Optimization
  Treating language as a latent variable via polyGRPO RL improves Qwen2.5-7B-Instruct by 6.72% on English reasoning benchmarks and 6.89% on multilingual ones, with cross-task gains on commonsense reasoning from math-onl...
- Emerging Properties in Unified Multimodal Pretraining
  BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
  SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
- Human-Machine Co-Boosted Bug Report Identification with Mutualistic Neural Active Learning
  MNAL reduces human effort in bug report labeling by up to 95.8% for readability and 196% for identifiability while improving identification performance and working with various neural models.
Discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.