META - MultilanguagE Text Analyzer

Introduction

A large portion of the Web consists of text documents, so methods for automatically analyzing text are of great importance in this context.
Several techniques have been developed within the fields of Information Retrieval (IR) and Information Filtering (IF), including indexing, scoring, and categorization of textual documents. Filtering and retrieval systems both rank textual documents in order of relevance. Retrieval refers to the selection of documents from a fixed set, whereas filtering typically refers to the selection of relevant documents from a stream of incoming data. Retrieval systems are generally concerned with satisfying a user's one-off information need (a query); filtering systems are usually applied to obtaining information that matches a user's long-term interests (a profile). Categorization, or classification, of documents is another useful technique, somewhat related to IR and IF, that consists of assigning a document to one or more predefined categories. A classifier can be used, for example, to distinguish between relevant and irrelevant documents (where relevance can be personalized for a particular user or group of users), or to help in the semiautomatic construction of large Web-based knowledge bases or hierarchical directories of topics such as the Open Directory.
In this scenario, the development of robust tools for both basic and more complex natural language processing (NLP) tasks is becoming crucial. This paper describes META (MultilanguagE Text Analyzer), an infrastructure for processing textual documents in different languages. The main features of the proposed tool are:

  • The system is designed to clearly separate low-level tasks (such as data storage, location and loading of language resources) from data structures and algorithms.
  • The tool provides a baseline set of NLP components (Tokenizer, POS-tagger, ...) that can be extended and modified by the user according to the tasks to be accomplished.
  • The architecture was conceived so that language-independent components for both basic and more complex tasks, such as Word Sense Disambiguation, can be easily included.
  • Indexing structures produced by META can be exported in different formats, thus allowing easy integration with both IR and IF systems.
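
The design goals listed above can be illustrated with a minimal Python sketch. It is not META's actual API: every name in it (Pipeline, regex_tokenizer, make_stopword_filter, export_json, export_tsv) is hypothetical and chosen only for illustration. The sketch assumes that a component is a function from a token list to a token list, that language resources (here, a stopword set) are passed in as data rather than embedded in the algorithms, and that the index is a simple term-frequency table.

```python
import json
import re
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical component type: maps a token list to a token list.
Component = Callable[[List[str]], List[str]]


def regex_tokenizer(text: str) -> List[str]:
    """Baseline tokenizer: lowercased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def make_stopword_filter(stopwords: Set[str]) -> Component:
    """Build a filter from a language resource supplied by the caller."""
    def _filter(tokens: List[str]) -> List[str]:
        return [t for t in tokens if t not in stopwords]
    return _filter


class Pipeline:
    """Runs a tokenizer followed by user-supplied components, in order."""

    def __init__(self, tokenizer: Callable[[str], List[str]],
                 components: Iterable[Component] = ()):
        self.tokenizer = tokenizer
        self.components = list(components)

    def add(self, component: Component) -> "Pipeline":
        # Extension point: users can append their own components.
        self.components.append(component)
        return self

    def index(self, text: str) -> Dict[str, int]:
        tokens = self.tokenizer(text)
        for component in self.components:
            tokens = component(tokens)
        # Term-frequency table as a stand-in for an indexing structure.
        return dict(Counter(tokens))


def export_json(index: Dict[str, int]) -> str:
    """Export the index as a JSON object with sorted keys."""
    return json.dumps(index, sort_keys=True)


def export_tsv(index: Dict[str, int]) -> str:
    """Export the index as tab-separated term/count lines, sorted by term."""
    return "\n".join(f"{term}\t{count}" for term, count in sorted(index.items()))


if __name__ == "__main__":
    english_stopwords = {"the", "a"}  # would normally be loaded from a resource file
    pipeline = Pipeline(regex_tokenizer, [make_stopword_filter(english_stopwords)])
    idx = pipeline.index("The cat saw a cat")
    print(export_json(idx))
    print(export_tsv(idx))
```

Under these assumptions, supporting another language amounts to supplying a different resource (for example, another stopword set) or a different tokenizer, while the pipeline and the exporters remain unchanged.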