
arxiv: 1503.05543 · v1 · pith:DKO6Z3STnew · submitted 2015-03-18 · 💻 cs.CL · cs.IR

Text Segmentation based on Semantic Word Embeddings

classification 💻 cs.CL cs.IR
keywords segmentation · text · word · algorithms · embeddings · greedy · known · performance

We explore the use of semantic word embeddings in text segmentation algorithms, including the C99 segmentation algorithm and new algorithms inspired by the distributed word vector representation. By developing a general framework for discussing a class of segmentation objectives, we study the effectiveness of greedy versus exact optimization approaches and suggest a new iterative refinement technique for improving the performance of greedy strategies. We compare our results to known benchmarks, using known metrics. We demonstrate state-of-the-art performance for an untrained method with our Content Vector Segmentation (CVS) on the Choi test set. Finally, we apply the segmentation procedure to an in-the-wild dataset consisting of text extracted from scholarly articles in the arXiv.org database.
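The contrast between greedy and exact optimization of a segmentation objective can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the segment score (the L1 norm of the summed word vectors in a segment), the fixed number of splits, and the function names `segment_score`, `total_score`, `greedy_segment`, and `exact_segment` are all hypothetical. The greedy routine adds one boundary at a time, while the exact routine uses dynamic programming over all boundary placements.

```python
def segment_score(vectors, start, end):
    """Assumed score: L1 norm of the summed vectors in [start, end)."""
    dim = len(vectors[0])
    total = [0.0] * dim
    for vec in vectors[start:end]:
        for d in range(dim):
            total[d] += vec[d]
    return sum(abs(x) for x in total)


def total_score(vectors, boundaries):
    """Sum of segment scores for a set of split positions."""
    edges = [0] + sorted(boundaries) + [len(vectors)]
    return sum(segment_score(vectors, a, b) for a, b in zip(edges, edges[1:]))


def greedy_segment(vectors, num_splits):
    """Add one boundary at a time, each time the one that maximizes the total."""
    boundaries = []
    for _ in range(num_splits):
        candidates = [i for i in range(1, len(vectors)) if i not in boundaries]
        best = max(candidates,
                   key=lambda i: total_score(vectors, boundaries + [i]))
        boundaries.append(best)
    return sorted(boundaries)


def exact_segment(vectors, num_splits):
    """Dynamic programming over (number of segments, end position)."""
    n = len(vectors)
    num_segments = num_splits + 1
    neg_inf = float("-inf")
    # best[s][end]: best total for splitting vectors[:end] into s segments
    best = [[neg_inf] * (n + 1) for _ in range(num_segments + 1)]
    back = [[0] * (n + 1) for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for s in range(1, num_segments + 1):
        for end in range(s, n + 1):
            for start in range(s - 1, end):
                if best[s - 1][start] == neg_inf:
                    continue
                cand = best[s - 1][start] + segment_score(vectors, start, end)
                if cand > best[s][end]:
                    best[s][end] = cand
                    back[s][end] = start
    # Walk the back-pointers to recover the split positions.
    boundaries = []
    end = n
    for s in range(num_segments, 0, -1):
        start = back[s][end]
        if start > 0:
            boundaries.append(start)
        end = start
    return sorted(boundaries)


# Toy input: two "topics" whose vectors point in opposite directions.
toy_vectors = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
greedy_result = greedy_segment(toy_vectors, 1)
exact_result = exact_segment(toy_vectors, 1)
```

Under this assumed score, adding a split can never lower the total (by the triangle inequality), which is why the sketch fixes the number of splits instead of letting the objective choose it. The exact search always scores at least as well as the greedy one for the same number of splits, at the cost of more computation.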



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

    cs.CL · 2026-05 · unverdicted · novelty 3.0

    Recursive character-based chunking at 300 characters outperforms Sentence-Based, Khmer-Aware, and LLM-Based methods on L2 distance, answer relevance, and Khmer IoU in a 5-fold evaluation on 18 Khmer agricultural QA pairs.