META - MultilanguagE Text Analyzer

Document Representation

The internal representation of META is a collection that contains a list of documents. Each document is subdivided into segments, each one corresponding to a specific part of the document. Each document is composed of at least one segment. Each segment contains a list of tokens, each one associated with at least one annotation. An annotation represents a particular feature extracted during text processing (e.g. token, stem, lemma, entity, sense, ...). The logical structure of a document is depicted in Fig. 1.

Fig. 1 Document
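
The hierarchy described above can be sketched as a set of nested containers. The following Python sketch is illustrative only: the class and field names (Collection, Document, Segment, Token, Annotation) and the segment name "body" are assumptions chosen to mirror the text, not META's actual API. The "at least one" constraints are enforced here by simple checks at construction time.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    # Feature type (e.g. "token", "stem", "lemma") and its extracted value.
    type: str
    value: str


@dataclass
class Token:
    annotations: List[Annotation]

    def __post_init__(self) -> None:
        # Each token is associated with at least one annotation.
        if not self.annotations:
            raise ValueError("a token needs at least one annotation")


@dataclass
class Segment:
    # Hypothetical label for the part of the document this segment covers.
    name: str
    tokens: List[Token] = field(default_factory=list)


@dataclass
class Document:
    segments: List[Segment]

    def __post_init__(self) -> None:
        # Each document is composed of at least one segment.
        if not self.segments:
            raise ValueError("a document needs at least one segment")


@dataclass
class Collection:
    documents: List[Document] = field(default_factory=list)


# Build a one-document collection holding a single annotated token.
token = Token([Annotation("token", "paper")])
doc = Document([Segment("body", [token])])
collection = Collection([doc])
```

Validating in the constructors keeps an invalid document (no segments) or token (no annotations) from ever entering the collection, which is one possible way to honor the constraints stated in the text.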

For example, suppose the user wants to analyze the text 'In this paper we present META (MultilanguagE Text Analyzer), it's a tool for text analysis which implements some NLP functionalities.' The NLP Engine executes the following operations: tokenization, stemming, POS tagging, lemmatization and word sense disambiguation (WSD). The output of the system is a list of tokens with their corresponding annotations. Fig. 2 shows the logical structure for the token 'paper'. In particular, the token has a sense annotation produced by the WSD annotator; its value, n12660433, is the identifier of the WordNet synset assigned by JIGSAW.

Fig. 2 Token
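
As a rough illustration of the token structure, the annotations of the token 'paper' can be viewed as a mapping from annotation type to value. In the Python sketch below, only the "token" and "sense" values come from the example; the key names and the helper function annotation are hypothetical, and the remaining annotations (stem, POS tag, lemma) are left out because their values are not given in the text.

```python
from typing import Dict, Optional

# Annotations for the token "paper", keyed by annotation type.
# The key names are assumptions; the values are taken from the example.
paper: Dict[str, str] = {
    "token": "paper",
    "sense": "n12660433",  # WordNet synset identifier assigned by JIGSAW
}


def annotation(token: Dict[str, str], kind: str) -> Optional[str]:
    """Return the value of the requested annotation, or None if absent."""
    return token.get(kind)


print(annotation(paper, "sense"))
print(annotation(paper, "lemma"))
```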

A snapshot of the META GUI displaying the output of the example above is shown in Fig. 3. The GUI presents the output in a table format: tokens are represented in rows and annotations in columns. The GUI also provides access to the Export Manager functionalities.

Fig. 3 META GUI
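
The table layout used by the GUI (tokens in rows, annotations in columns) can be approximated in plain text. The Python sketch below is not META's GUI code: the function render_table, the dictionary keys, and the "-" placeholder for a missing annotation are assumptions made for illustration.

```python
from typing import Dict, List


def render_table(tokens: List[Dict[str, str]], columns: List[str]) -> str:
    """Lay out tokens as rows and annotation types as columns."""
    # The header row comes first, then one row per token; "-" marks a missing value.
    rows = [columns] + [[t.get(c, "-") for c in columns] for t in tokens]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(columns))]
    lines = [
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    ]
    return "\n".join(lines)


# Two tokens from the example sentence; only "paper" carries a sense here.
tokens = [
    {"token": "this"},
    {"token": "paper", "sense": "n12660433"},
]
print(render_table(tokens, ["token", "sense"]))
```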