META - MultilanguagE Text Analyzer

Architecture

The architecture of META is depicted in Fig. 1, which shows the three main components of the system:

  • Collection Manager - This component provides tools for importing documents in different formats (HTML, PDF, DOC, ...), lets the user organize them into collections, and includes algorithms for document segmentation; that is, each document is logically viewed as a structure of sections (e.g., a scientific paper can be segmented into title, abstract, authors, body and references). The Collection Manager also allows sections to be annotated with tags stored in a domain ontology;
  • NLP Engine - This engine manages the different NLP annotators. An annotator is a component that performs a specific NLP task (e.g., tokenization, stop word elimination, POS-tagging). The NLP Engine schedules the annotators, loads the lexical resources required by each annotator, and runs each annotator over all the documents in a collection;
  • Export Manager - This component exports the results produced by the NLP Engine into different formats, according to the user's request (XML, RDF, specific DBMS, ...).

Fig. 1 System Architecture

The whole document analysis process performed by META is described below. The Collection Manager imports the documents to be processed from the user's file system (HTML, DOC, RTF, PDF) and groups them into a collection. Each document is assigned a unique identifier (ID) within the collection; then segmentation is performed and the raw text is extracted from the original document. At this stage, it is also possible to associate both collections and single documents with tags stored in a domain ontology. After these preliminary steps, the documents are ready for the next stage, performed by the NLP Engine.
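The import stage described above can be illustrated with a minimal sketch. It is not META's actual code: the names Document, Collection and segment, the two-section segmentation rule and the sample tag are all assumptions made for illustration, and real format parsing (HTML, PDF, DOC) is left out.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: int                               # unique within its collection
    sections: dict                            # section name -> raw text
    tags: set = field(default_factory=set)    # ontology tags (hypothetical labels)


def segment(raw_text):
    """Toy segmentation: the first non-empty line is the title, the rest is the body."""
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    title = lines[0] if lines else ""
    return {"title": title, "body": " ".join(lines[1:])}


class Collection:
    """Groups documents and hands out collection-wide unique IDs."""

    def __init__(self, name):
        self.name = name
        self.tags = set()                     # tags can also be attached to the collection
        self.documents = {}
        self._next_id = itertools.count(1)

    def add(self, raw_text):
        doc = Document(next(self._next_id), segment(raw_text))
        self.documents[doc.doc_id] = doc
        return doc


collection = Collection("papers")
first = collection.add("A Study of Stemming\n\nStemming reduces words to stems.")
second = collection.add("Tagging with HMMs\nTrigram taggers emit words.")
first.tags.add("InformationRetrieval")        # made-up ontology tag
```

Keeping the ID counter inside the collection makes identifiers unique per collection, which is all the description above requires.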
First, the NLP Engine detects the document language; this step is required in order to load the right lexical resources for each language. Then, the NLP Engine normalizes the text (for example, all formatting characters are removed) and tokenizes it. At this stage, each document is turned into a list of tokens. Each token can be associated with a set of annotations. An annotation is a pair (annotation_name, value), which specifies the kind of annotation and the corresponding value (e.g., the position of the token in the text). Annotations are produced by different components called NLP Annotators, whose scheduling is managed by the NLP Engine.
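As a rough illustration of the token and annotation model, the sketch below stores annotations as name/value pairs on each token and runs annotators in a fixed order. All names (Token, run_engine, the two toy annotators) are hypothetical, language detection and resource loading are omitted, and the regular-expression tokenizer is only a stand-in for whatever META really uses.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    annotations: dict = field(default_factory=dict)   # annotation_name -> value


def normalize(text):
    """Collapse formatting characters (tabs, newlines, repeated spaces)."""
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split into word tokens and record each token's position in the text."""
    return [Token(match.group(), {"position": index})
            for index, match in enumerate(re.finditer(r"\w+", text))]


def lowercase_annotator(tokens):
    for token in tokens:
        token.annotations["lowercase"] = token.text.lower()


def length_annotator(tokens):
    for token in tokens:
        token.annotations["length"] = len(token.text)


def run_engine(raw_text, annotators):
    tokens = tokenize(normalize(raw_text))
    for annotate in annotators:       # annotators run in the scheduled order
        annotate(tokens)
    return tokens


tokens = run_engine("META\tanalyzes\n text", [lowercase_annotator, length_annotator])
```

Because every annotator only adds entries to the per-token dictionary, new annotators can be appended to the schedule without changing the token structure.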
Currently, the following annotators have been developed and included in META:

  • Stop word elimination: all commonly used words (stop words) are removed;
  • Stemming: the process of reducing inflected (or sometimes derived) words to their stem. In META, we adopt the Snowball stemmer;
  • POS-tagging: the process of assigning a part-of-speech to each token. We developed a Java version of the ACOPOST tagger using the Trigram Tagger T3 algorithm, which is based on Hidden Markov Models in which the states are tag pairs that emit words;
  • Lemmatization: the process of determining the lemma of a given word. For English, we use the WordNet Default Morphological Processor (included in the WordNet distribution). For Italian, we have built a different lemmatizer that exploits the Morph-it! morphological resource;
  • Entity Recognition Driven by Ontologies: the process of finding ontology instances in the text;
  • Named Entity Recognition: the process of finding named entities in the text. We use a Support Vector Machine classifier based on YAMCHA;
  • Word Sense Disambiguation (WSD): the problem of selecting a sense for a word from a set of predefined possibilities, by exploiting a sense inventory that usually comes from an electronic dictionary or thesaurus. We have implemented a WSD algorithm, called JIGSAW, able to disambiguate both English and Italian text.
    More details are in: P. Basile, M. de Gemmis, A. Gentile, P. Lops, and G. Semeraro, 'Jigsaw algorithm for word sense disambiguation'. In SemEval-2007: 4th Int. Workshop on Semantic Evaluations. ACL Press, 2007, pp. 398-401.
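Two of the simpler annotators in this list, stop word elimination and ontology-driven entity recognition, can be sketched as plain dictionary lookups. This is an assumption-laden toy: the stop word list, the ontology contents, the sample sentence and the function names are invented for illustration, and META's real annotators are likely to use richer resources and matching.

```python
STOP_WORDS = {"the", "is", "a", "of", "in"}        # tiny illustrative list

# Hypothetical ontology: lowercase instance label -> concept it belongs to.
ONTOLOGY_INSTANCES = {"bari": "City", "wordnet": "LexicalResource"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def recognize_ontology_entities(tokens):
    """Return (token, concept) pairs for tokens matching an ontology instance."""
    return [(t, ONTOLOGY_INSTANCES[t.lower()])
            for t in tokens if t.lower() in ONTOLOGY_INSTANCES]


tokens = "The University of Bari is a user of WordNet".split()
content_tokens = remove_stop_words(tokens)
entities = recognize_ontology_entities(content_tokens)
```

Here content_tokens keeps only the four content words, and entities pairs "Bari" and "WordNet" with their made-up concepts.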

At the end of the pipeline run by the NLP Engine, the output can be exported in different formats by the Export Manager. This component turns the internal output produced by META into formats such as XML or RDF.
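A possible shape for the XML export is sketched below using Python's standard xml.etree.ElementTree module. The element and attribute names (document, token, annotation) are assumptions chosen for the example, not META's actual export schema.

```python
import xml.etree.ElementTree as ET


def export_to_xml(doc_id, tokens):
    """Serialize (text, annotations) pairs; element names are illustrative only."""
    root = ET.Element("document", id=str(doc_id))
    for text, annotations in tokens:
        token_el = ET.SubElement(root, "token", text=text)
        for name, value in annotations.items():
            ET.SubElement(token_el, "annotation", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")


xml_output = export_to_xml(1, [("cats", {"position": 0, "stem": "cat"})])
```

Each annotation pair becomes one annotation element nested under its token, so the exported file mirrors the internal token list.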