Putting Self-Supervised Token Embedding on the Tables

Gautier Marti; Marc Szafraniec; Philippe Donnat

arxiv: 1708.04120 · v2 · pith:2FD27XR4new · submitted 2017-07-28 · 💻 cs.IR · cs.CL

keywords information, many, messages, self-supervised, structure, tables, tokens, address

Abstract

Distributing information by electronic message is a preferred means of transmission for many businesses and individuals, often in the form of plain-text tables. As the number of such messages grows, it becomes necessary to extract text and numbers with an algorithm rather than by hand. Usual methods rely on regular expressions or on a strict structure in the data, but they are inefficient when there are many variations, fuzzy structure, or implicit labels. In this paper we introduce SC2T, a fully self-supervised model that addresses these issues by constructing vector representations of tokens in semi-structured messages from both character-level and context-level information. It can then be used for unsupervised labeling of tokens, or serve as the basis for a semi-supervised information extraction system.
