arxiv: 1607.01759 · v3 · pith:AUAWS3VHnew · submitted 2016-07-06 · 💻 cs.CL

Bag of Tricks for Efficient Text Classification

classification 💻 cs.CL
keywords text, classification, efficient, fasttext, less, accuracy, baseline, billion

This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.

This paper has not been read by Pith yet.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

    cs.CL 2026-05 unverdicted novelty 7.0

    An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

  2. Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

    cs.CL 2026-04 unverdicted novelty 7.0

    Kathleen performs byte-level text classification via recurrent oscillator banks, FFT wavetable encoding, and phase harmonics, matching pretrained baselines on standard benchmarks with 36% fewer parameters.

  3. Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

    cs.CL 2026-04 unverdicted novelty 7.0

    Kathleen uses recurrent oscillator banks, an efficient wavetable encoder, and phase harmonics to classify text at the byte level with high accuracy and low parameter count.

  4. ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation

    cs.IR 2026-02 unverdicted novelty 7.0

    ScrapeGraphAI-100k releases 93,695 real telemetry examples pairing web page content with prompts, schemas, and LLM responses to support training and benchmarking of schema-constrained generation.

  5. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  6. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    cs.CL 2018-04 unverdicted novelty 7.0

    GLUE is a multi-task benchmark for general natural language understanding that includes a diagnostic test suite and finds limited gains from current multi-task learning methods over single-task training.

  7. Finding Meaning in Embeddings: Concept Separation Curves

    cs.CL 2026-04 unverdicted novelty 6.0

    Concept Separation Curves provide a classifier-independent method to visualize and quantify how sentence embeddings distinguish conceptual meaning from syntactic variations across languages and domains.

  8. SubFLOT: Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport

    cs.LG 2026-04 unverdicted novelty 6.0

    SubFLOT uses optimal transport to generate data-aware personalized submodels via server-side pruning and scaling-based adaptive regularization to mitigate parametric divergence in heterogeneous federated learning.

  9. How Do Data Owners Say No? A Case Study of Data Consent Mechanisms in Web-Scraped Vision-Language AI Training Datasets

    cs.CY 2025-11 unverdicted novelty 6.0

    Analysis of the DataComp dataset finds at least 122 million samples with copyright notices, 60% of samples from top domains on sites whose terms prohibit scraping, and 9-13% of samples containing watermarks that stand...

  10. Muon is Scalable for LLM Training

    cs.LG 2025-02 unverdicted novelty 6.0

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  11. GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization

    cs.CL 2024-10 unverdicted novelty 6.0

    GigaCheck detects LLM-generated text at both document and span levels by combining fine-tuned language-model embeddings with a DETR-like architecture that treats generated intervals as detectable objects.

  12. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    cs.CL 2024-06 unverdicted novelty 6.0

    FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.

  13. Aligning AI With Shared Human Values

    cs.CY 2020-08 conditional novelty 6.0

    Introduces ETHICS benchmark showing current language models have promising but incomplete ability to predict basic human ethical judgments on text scenarios.

  14. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  15. Galactica: A Large Language Model for Science

    cs.CL 2022-11 unverdicted novelty 5.0

    Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

  16. Learning Compressed Sentence Representations for On-Device Text Processing

    cs.CL 2019-06 unverdicted novelty 5.0

    Four binarization strategies turn continuous sentence embeddings into binary form, cutting storage by over 98% with only about 2% performance drop on downstream tasks.

  17. Low-supervision urgency detection and transfer in short crisis messages

    cs.CL 2019-07 unverdicted novelty 4.0

    Presents a low-supervision urgency detection system using ensembles and transfer learning that outperforms baselines on multiple disaster datasets.

  18. Simple Natural Language Processing Tools for Danish

    cs.CL 2019-06 unverdicted novelty 3.0

    The paper presents baseline NLP tools for Danish text using machine learning models trained on previously annotated documents, maintained freely at ITU Copenhagen.

  19. Cross-lingual Data Transformation and Combination for Text Classification

    cs.IR 2019-06 unverdicted novelty 3.0

    Cross-lingual data combined via translation or aligned embeddings can improve performance of CNN and RNN text classifiers.