arxiv: 1708.04120 · v2 · pith:2FD27XR4new · submitted 2017-07-28 · 💻 cs.IR · cs.CL

Putting Self-Supervised Token Embedding on the Tables

classification 💻 cs.IR cs.CL
keywords information, many, messages, self-supervised, structure, tables, tokens, address

Electronic messages are a favored channel for distributing information among many businesses and individuals, and that information often takes the form of plain-text tables. As the number of such messages grows, extracting their text and numbers by algorithm rather than by hand becomes necessary. The usual methods rely on regular expressions or on a strict structure in the data, and they are ineffective when the data has many variations, fuzzy structure, or implicit labels. In this paper we introduce SC2T, a fully self-supervised model that builds vector representations of tokens in semi-structured messages from both the character level and the context level, and thereby addresses these issues. These representations can then be used for unsupervised labeling of tokens, or serve as the basis for a semi-supervised information extraction system.
