META - MultilanguagE Text Analyzer

META @Work

META has been used to process document collections in several scenarios in order to evaluate its performance.

In the following, we describe each scenario in which the system was tested.

WSD on English

JIGSAW, the WSD algorithm included in META, has been tested in the SemEval-1 Task 1 competition. This task is application-driven: the application is a fixed Cross-Lingual Information Retrieval (CLIR) system. Participants must disambiguate text by assigning WordNet synsets; the CLIR system then expands the documents to other languages and indexes the expanded documents; the final step is batch retrieval for all the languages. The retrieval results are taken as a measure of disambiguation accuracy. The dataset consisted of 29,681 documents, including 300 topics (short texts). Results are reported in Table 1. Besides the two systems that participated in SemEval-1 Task 1 (JIGSAW and PART-B), a third system (ORGANIZERS), developed by the organizers themselves, was included in the competition. The systems were scored according to standard IR/CLIR measures as implemented in the TREC evaluation package.

System           IR documents   IR topics   CLIR
no expansion     0.3599         -           0.1446
full expansion   0.1610         0.1410      0.2676
1st sense        0.2862         0.1172      0.2637
ORGANIZERS       0.2886         0.1587      0.2664
JIGSAW           0.3030         0.1521      0.1373
PART-B           0.3036         0.1482      0.1734

All systems showed similar results on the IR tasks, while their behaviour differed considerably on the CLIR task. The poor results of JIGSAW on the CLIR task probably depend on a complex interaction of WSD, expansion and indexing. Unlike for other tasks, the organizers do not plan to provide a ranking of systems for SemEval-1 Task 1. As a consequence, the question behind this task - what is the best WSD system in the context of a CLIR system? - is still open.
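The text does not say which TREC measure Table 1 reports. As an illustration only, the sketch below computes mean average precision, one of the standard ranked-retrieval measures implemented in the TREC evaluation package. The function names, topic ids and document ids are hypothetical and are not taken from the task data.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of this relevant hit
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-topic average precision over all judged topics."""
    scores = [average_precision(runs.get(topic, []), relevant)
              for topic, relevant in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy relevance judgments and system output (hypothetical data).
qrels = {"t1": {"d1", "d3"}, "t2": {"d2"}}
runs = {"t1": ["d1", "d2", "d3"], "t2": ["d1", "d2"]}

map_score = mean_average_precision(runs, qrels)
print(f"MAP = {map_score:.4f}")
```

Because the WSD output only changes what is indexed and expanded, a score of this kind measures disambiguation quality indirectly, through its effect on the ranking.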

WSD on Italian

An important application scenario is EVALITA, an initiative devoted to the evaluation of Natural Language Processing tools for Italian. In this context, we evaluated META on Italian. Experiments followed the instructions for the EVALITA WSD All-Words Task. The dataset consisted of about 5000 words. Precision and recall are reported in Table 2.

System      Precision   Recall   Attempted
JIGSAW      0.560       0.414    73.95%
1st sense   0.669       0.669    100%

The results are encouraging as regards precision, considering that our system exploits only ItalWordNet as its knowledge base. JIGSAW was compared only with the baseline (for every word, the first sense in ItalWordNet is selected), which achieves very high results. Table 3 shows the precision for each POS tag. Precision is acceptable for nouns and very high for proper nouns, because they generally have only one sense. Verb disambiguation is very hard due to high polysemy. High precision is achieved for adjectives and adverbs (the OTHERS row), but recall is lower due to POS-tagger errors. WSD requires lemmatization and POS tagging, both of which introduce errors that lower recall. We estimated the precision of lemmatization and POS tagging at 77.66% and 76.23%, respectively.

POS tag        Precision   Recall   Attempted
NOUN           0.556       0.444    79.96%
VERB           0.375       0.283    75.60%
OTHERS         0.676       0.321    47.55%
PROPER NOUNS   0.913       0.724    79.25%
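The three columns in the tables above are linked: assuming the usual definitions (precision over the attempted words, recall over all words), recall equals precision times the attempted fraction, for example 0.560 x 0.7395 is about 0.414 for JIGSAW. The sketch below illustrates this on toy data, together with a first-sense baseline that answers every word. The sense inventory, sense labels and function names are hypothetical, and this is not the official EVALITA scorer.

```python
from typing import Dict, List, Optional, Tuple


def score_wsd(gold: Dict[str, str],
              predicted: Dict[str, Optional[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, attempted) for one system.

    precision = correct answers / attempted words
    recall    = correct answers / all gold words
    attempted = attempted words / all gold words
    """
    attempted = [w for w in gold if predicted.get(w) is not None]
    correct = sum(1 for w in attempted if predicted[w] == gold[w])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    coverage = len(attempted) / len(gold) if gold else 0.0
    return precision, recall, coverage


def first_sense_baseline(inventory: Dict[str, List[str]]) -> Dict[str, Optional[str]]:
    """Pick the first listed sense for every word instance."""
    return {w: senses[0] if senses else None for w, senses in inventory.items()}


# Toy sense inventory and gold annotation (hypothetical labels).
inventory = {
    "w1": ["bank#1", "bank#2"],
    "w2": ["run#1", "run#2", "run#3"],
    "w3": ["light#1", "light#2"],
    "w4": ["Rome#1"],
}
gold = {"w1": "bank#1", "w2": "run#3", "w3": "light#2", "w4": "Rome#1"}

# A system that abstains on one word, e.g. after a POS-tagging error.
system = {"w1": "bank#1", "w2": "run#3", "w3": "light#1", "w4": None}

sys_p, sys_r, sys_a = score_wsd(gold, system)
base_p, base_r, base_a = score_wsd(gold, first_sense_baseline(inventory))

print(f"system:   P={sys_p:.3f} R={sys_r:.3f} attempted={sys_a:.2%}")
print(f"baseline: P={base_p:.3f} R={base_r:.3f} attempted={base_a:.2%}")
```

A system that attempts every word necessarily has equal precision and recall, which is why the first-sense row reports 0.669 for both.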

META in an Information Filtering Scenario

META has been used as the content analyzer in a content-based recommender system. The recommender automatically infers the user profile, a structured model of the user's interests, from documents that the user has already deemed relevant. The profile is used to filter new documents and to produce personalized suggestions. We used META in the indexing phase to extract both lexical and semantic features from documents. The learning algorithms embedded in the recommender infer user profiles from the features produced by META. The system produced both a classical Bag-of-Words (BOW) document representation and a new representation that we call Bag-of-Synsets (BOS). In the BOS model, a document is represented by a vector of the WordNet synsets recognized by the WSD procedure.
More details are in: G. Semeraro, M. Degemmis, P. Lops, and P. Basile, 'Combining learning and word sense disambiguation for intelligent user profiling'. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07), 2007, pp. 2856-2861, Morgan Kaufmann, San Francisco, California. ISBN: 978-1-57735-298-3.
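The difference between the two document representations can be sketched as follows. This is a minimal illustration that assumes the WSD step has already attached one synset identifier (or none) to each token. The identifiers such as "syn:car" are made-up placeholders rather than real WordNet synsets, and the function names are not part of META.

```python
from collections import Counter
from typing import List, Optional


def bag_of_words(tokens: List[str]) -> Counter:
    """BOW: count the lowercased surface forms."""
    return Counter(token.lower() for token in tokens)


def bag_of_synsets(senses: List[Optional[str]]) -> Counter:
    """BOS: count the synset ids assigned by WSD, skipping untagged tokens."""
    return Counter(sense for sense in senses if sense is not None)


# Two toy documents with hypothetical WSD output (one entry per token).
tokens_a = ["The", "car", "stopped"]
senses_a = [None, "syn:car", "syn:stop"]

tokens_b = ["An", "automobile", "stopped"]
senses_b = [None, "syn:car", "syn:stop"]

bow_a, bow_b = bag_of_words(tokens_a), bag_of_words(tokens_b)
bos_a, bos_b = bag_of_synsets(senses_a), bag_of_synsets(senses_b)

# Features shared by the two documents under each representation.
shared_words = set(bow_a) & set(bow_b)
shared_synsets = set(bos_a) & set(bos_b)
print("shared words:  ", sorted(shared_words))
print("shared synsets:", sorted(shared_synsets))
```

In this toy case the synonyms "car" and "automobile" are distinct BOW features but map to the same BOS feature, so the two documents look more alike to a learner trained on synsets.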