SWAP - Semantic Web Access and Personalization Research Group

Ph.D. Graduates


Pierpaolo Basile (ciclo XXI)

Annalina Caputo (ciclo XXIII)

Marco de Gemmis (ciclo XVI)

Anna Lisa Gentile (ciclo XXII)

Leo Iaquinta (ciclo XXII)

Oriana Licchelli (ciclo XVII)

Pasquale Lops (ciclo XVI)

Ignazio Palmisano (ciclo XIX)

Domenico Redavid (ciclo XX)

Eufemia Tinelli (ciclo XXI)





Marco de Gemmis

Learning User Profiles from Text for Personalized Information Access

Abstract

Advances in the Internet and the creation of huge stores of digitized text have opened the gateway to a deluge of information that is difficult to navigate. Although the information is widely available, exploring Web sites and finding information relevant to a user's interests is a challenging task. The first obstacle is research, where you must first identify the appropriate information sources and then retrieve the relevant data. Then, you have to sort through this data to filter out the unfocused and unimportant information. Lastly, in order for the information to be truly useful, you must take the time to figure out how to organize and abstract it in a manner that is easy to understand and analyze. To say the least, all of these steps are extremely time consuming. This "relevant information problem" leads to a clear demand for automated methods able to support users in searching large document repositories in order to retrieve relevant information with respect to their preferences. Capturing user interests and representing them in a structured form is a problematic activity. Algorithms designed for this purpose base their relevance computations on so-called user profiles, in which representations of the users' interests are maintained. The central argument of this dissertation is the use of Supervised Machine Learning techniques to induce user profiles from text data for Intelligent Information Access. Intelligent Information Access is a user-centric and semantically rich approach to accessing information: information preferences vary greatly across users, therefore information access must be highly personalized by profiles to serve the individual interests of the user. Moreover, users want to retrieve information on the basis of conceptual content, but individual words provide unreliable evidence about the meaning of documents. Thus, methods for extracting meaning from documents must be considered in order to effectively find relevant information.

First, we describe content-based learning algorithms designed to learn about users' interests. The input, given as a set of text documents marked by the user as relevant or not relevant, is used to find characteristics that distinguish relevant documents from irrelevant ones. The induced target concept is a user profile appropriate for the classification of new documents. Documents are represented as bags of words (BOW): a document is encoded as a feature vector, with each element in the vector indicating the presence or absence of a word in the document. This approach was used as a baseline to determine how well a standard keyword-based learner performs on this task.
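The bag-of-words encoding and the idea of inducing a profile from labeled documents can be illustrated with a minimal sketch. The abstract does not name the learner that was used, so the Bernoulli naive Bayes classifier below is an assumption chosen for brevity; the function names and the four sample documents are hypothetical.

```python
import math
import re

def tokens(text):
    """Lowercase a text and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def build_vocabulary(docs):
    """Sorted list of every word seen in the training documents."""
    return sorted(set().union(*(tokens(d) for d in docs)))

def to_vector(text, vocabulary):
    """Binary BOW vector: 1 if the word occurs in the text, else 0."""
    present = tokens(text)
    return [1 if w in present else 0 for w in vocabulary]

def train_profile(vectors, labels):
    """Per class: prior and word-presence probabilities (Laplace smoothed)."""
    profile = {}
    for cls in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        n = len(rows)
        probs = [(sum(col) + 1) / (n + 2) for col in zip(*rows)]
        profile[cls] = (n / len(vectors), probs)
    return profile

def classify(vector, profile):
    """Return the class with the highest log-probability for the vector."""
    def log_score(cls):
        prior, probs = profile[cls]
        return math.log(prior) + sum(
            math.log(p if x else 1 - p) for x, p in zip(vector, probs))
    return max(profile, key=log_score)

# Hypothetical training set: documents the user marked as relevant or not.
docs = ["machine learning for text classification",
        "learning user profiles from text",
        "football match results and scores",
        "tennis match scores today"]
labels = ["relevant", "relevant", "not relevant", "not relevant"]

vocab = build_vocabulary(docs)
profile = train_profile([to_vector(d, vocab) for d in docs], labels)
print(classify(to_vector("text learning methods", vocab), profile))
```

Words that never appeared in the training documents (here "methods") are simply ignored, which is one of the limits of a purely keyword-based profile.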

Second, current limits in the state of the art in profiles generated from BOW-represented documents are analyzed. Though many linguistic techniques have been employed, problems such as polysemy and synonymy remain unsolved. A possible solution to this kind of issue is explored: shifting the level of abstraction from words up to concepts. Profiles will no longer contain words. They will contain references to concepts defined in lexicons or, in a further step, ontologies. A first advance in this direction consists of employing WordNet as a reference lexicon to substitute word forms with word meanings in profiles. We show how the described content-based algorithms can be extended using a new, enriched document representation obtained by adding features generated using a new WordNet-based procedure.
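The effect of moving from words to concepts can be shown with a small sketch. It does not use WordNet or the procedure developed in the thesis: `TOY_LEXICON` is a hypothetical hand-made mapping from word forms to concept identifiers, and word forms are simply replaced rather than added as extra features.

```python
import re

# Hypothetical mini-lexicon (not WordNet): word form -> concept identifier.
TOY_LEXICON = {
    "car": "C_AUTOMOBILE",
    "automobile": "C_AUTOMOBILE",
    "auto": "C_AUTOMOBILE",
    "film": "C_MOVIE",
    "movie": "C_MOVIE",
}

def bag_of_words(text):
    """Set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def bag_of_concepts(text, lexicon=TOY_LEXICON):
    """Replace each known word form with its concept; keep unknown words."""
    return {lexicon.get(w, w) for w in bag_of_words(text)}

def jaccard(a, b):
    """Overlap between two feature sets, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0

d1 = "A movie about a fast car"
d2 = "A film about a fast automobile"
print(jaccard(bag_of_words(d1), bag_of_words(d2)))        # word level
print(jaccard(bag_of_concepts(d1), bag_of_concepts(d2)))  # concept level
```

The two sentences share only three of seven distinct words, yet they map to exactly the same set of concepts, which is the synonymy problem the paragraph describes. Polysemy is not handled here, since every word form has a single concept in the toy lexicon.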

The dissertation concludes with the description of the empirical study that evaluates the effectiveness of the proposed approach.



Pasquale Lops

Hybrid Recommendation Techniques based on User Profiles

Abstract

Nowadays users are overwhelmed by the abundance of information, and this is not a problem for just a minority of the population; it is a problem for everyone in their daily life. In fact, we no longer get information just from newspapers, colleagues, family members and friends, but also largely from the Internet.

How can people deal with this information overload problem? Individuals tend to filter and ignore information as a way to cope with information overload.

Recommender systems constitute one of the fastest growing segments of the Internet economy today. They help reduce information overload and provide customized information access for targeted domains. Such systems take input directly or indirectly from users and, based on their needs, preferences and usage patterns, provide personalized advice about products or services and can help people to filter useful information, thus easing the information search and decision processes.

Among the different recommendation techniques proposed in the literature, the collaborative filtering approach is the most successful and widely adopted to date. However, collaborative filtering by itself cannot always guarantee a good prediction. The effectiveness of predictions relies on the confidence of the computation of the similarity between users. Correlation between users can only be computed if they have rated a sufficient number of common items. Since users can choose among thousands of items to rate, especially in online catalogues, and new items become available continuously, it is likely that the overlap of rated items between two users will be minimal in many cases. Therefore, many of the computed correlation coefficients are based on just a few observations. As a result, correlation based only on co-rated items cannot be regarded as a reliable similarity measure.
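The reliability problem can be made concrete with a short sketch of Pearson correlation restricted to co-rated items. The function name and the two rating dictionaries are hypothetical; the point is only that a coefficient computed on two shared items can look perfect while carrying almost no evidence.

```python
import math

def pearson_on_corated(ratings_a, ratings_b):
    """Return (correlation, number of co-rated items) for two users.

    The correlation is None when it cannot be computed: fewer than two
    co-rated items, or no variance in one user's co-rated ratings.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < 2:
        return None, n
    xs = [ratings_a[i] for i in common]
    ys = [ratings_b[i] for i in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None, n
    return cov / math.sqrt(var_x * var_y), n

# Hypothetical ratings on a 1-5 scale, keyed by item identifier.
alice = {"i1": 5, "i2": 3, "i3": 4, "i7": 1}
bob = {"i2": 2, "i3": 5, "i9": 4}

corr, support = pearson_on_corated(alice, bob)
print(corr, support)
```

With only two co-rated items any pair of users with non-constant ratings obtains a correlation of exactly +1 or -1, so the coefficient should always be read together with the size of the overlap.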

One of the primary contributions of this thesis is the investigation of how knowledge about users can be exploited to improve recommendations. In particular, it investigates how overlaps between users' interests can be used to define the similarity among users in order to improve recommendations.

The combination of classic collaborative filtering techniques and user profiles inferred using content-based methods for designing a new hybrid recommendation technique is presented.

The study describes the process of learning content-based profiles to be used in one of the main steps of the process for producing social recommendations: neighborhood formation. A clustering technique for grouping user profiles is proposed in order to identify the set of neighbors for those users for whom recommendations must be produced.

More specifically, the process of grouping user profiles learns two different profiles of the user: one, learned from positive examples of interesting items, represents the interests of the user; the other, learned from negative examples, represents items the user dislikes. The observation is that two users can be considered similar not only if they like the same items, but also if they dislike the same ones.
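As a rough illustration of this observation, the sketch below scores two users by combining the similarity of their positive profiles with the similarity of their negative ones. It is not the clustering technique proposed in the thesis: the cosine measure, the equal weighting `alpha`, the nearest-neighbor selection and the sample profiles are all assumptions.

```python
import math

def cosine(p, q):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm = (math.sqrt(sum(w * w for w in p.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm else 0.0

def user_similarity(u, v, alpha=0.5):
    """Weighted mix of agreement on likes and agreement on dislikes."""
    return (alpha * cosine(u["likes"], v["likes"])
            + (1 - alpha) * cosine(u["dislikes"], v["dislikes"]))

def neighbors(target, users, k=1):
    """Names of the k users most similar to `target`."""
    others = [name for name in users if name != target]
    others.sort(key=lambda name: user_similarity(users[target], users[name]),
                reverse=True)
    return others[:k]

# Hypothetical content-based profiles: term -> weight.
users = {
    "ann":  {"likes": {"thriller": 1.0, "crime": 0.5},
             "dislikes": {"romance": 1.0}},
    "bob":  {"likes": {"thriller": 1.0}, "dislikes": {"romance": 1.0}},
    "carl": {"likes": {"thriller": 1.0}, "dislikes": {"horror": 1.0}},
}
print(neighbors("ann", users, k=1))
```

Bob and Carl have identical positive profiles, so only the negative profiles separate them: Bob shares Ann's dislikes and is therefore ranked as the closer neighbor.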

Finally, advanced semantic user profiles based on concepts instead of keywords have been used for improving the accuracy of collaborative recommendations.

Several experiments have been carried out in order to evaluate the effectiveness of the approaches. Some baseline experiments on classic collaborative filtering have been carried out as a benchmark. The final experimental analysis provides evidence of the improvements achieved by the proposed approaches.



Oriana Licchelli

Personalization in Digital Libraries for Education

Abstract

The rapid evolution of Internet services has led to a constantly increasing number of web sites and to an increase in the available information. Today, the main challenge is to support web users in order to facilitate navigation through web sites and to improve searching among extremely large web repositories, such as Digital Libraries or other generic information sources. Personalization, a possible approach to the problem, involves techniques and mechanisms that reduce this information overload and facilitate the delivery of relevant information that has been personalized for the preferences of individual users. Machine Learning techniques have a significant role to play in the development of personalized services within Digital Libraries. For example, many Machine Learning techniques are well suited for transforming user-activity data into useful preference rules as part of a user profile. In web systems, user profiles hold information that refers to the user's knowledge in a domain, to her/his personality, her/his preferences, or to any other information on the user that can be useful in the configuration of an application.

This thesis explores the role of user profiles in web applications such as online bookshops, Digital Libraries and e-Learning. In particular, it analyzes the possibility of enlarging the availability of teaching materials provided by an e-learning system by reusing materials existing in external sources, such as digital libraries. Therefore, the major research topic addressed by the thesis is improving the search for educational materials on the Web. Looking at it from an educational perspective, related questions include: What type of search tool should be provided to students to assist them in their search for course-related materials on the Web? Should it leave the students in control of their search strategy or should it use a meta-search-like automatic modification of their search queries? During a search session in an e-learning system, the learner can obtain a sequence that can help motivate her/him to learn and prevent her/him from being frustrated. This sequence is the result of a search modified on the basis of the information contained in the student model, which describes the preferences, needs and interests of the student and her/his learning performance.

This thesis describes the design and implementation of a personalization system (Profile Extractor) that analyzes the data coming from the interaction between the users and the web application to automatically discover, using Machine Learning techniques, the user preferences, needs and interests. Moreover, it shows the possible uses of the user profiles created by the Profile Extractor system in two domains, online bookshops and digital libraries, where several different experiments have been carried out in order to measure the efficiency of the user profiles. These two domains have been used as test beds for the implemented techniques and, since the results of the experiments have been encouraging, these techniques have been applied in the area of Student Modelling, that is, the adoption of user profiles in the e-learning domain. Several experiments have been carried out to check the efficiency of the user profiles in this context and to compare the effectiveness of the numeric algorithms implemented by the Profile Extractor system with the symbolic ones, from the area of Inductive Logic Programming, implemented by another system. Their efficiency has also been evaluated in order to decide how best to exploit them in the induction of student profiles in future work.



Ignazio Palmisano

A Machine Learning Approach to Ontology Alignment

Abstract

The main problem that this thesis aims to address is Ontology Alignment. It can be described as the problem of how to move knowledge between different possible representations or formalizations, not in the sense of different knowledge representation formalisms, but in the sense of different conceptualizations within the same expression language and the same knowledge domain. Conceptualization here is intended as ontology, where the key expression is "An ontology is a formalization of a conceptualization". The term ontology is borrowed from philosophy, where its sense is "discourse about being"; in current Computer Science, its meaning could be described as "formalization of relationships between entities, both physical and abstract ones".

In this work, my aim is to use Machine Learning techniques applied to ontologies expressed in Description Logics (DL) formalisms in order to solve some of the issues that arise in trying to address an Ontology Alignment problem. Description Logics (a family of knowledge representation formalisms aimed at describing knowledge as concepts and relations between concepts and entities, whether abstract or real-world ones) have been chosen as the logical foundation for many languages that the W3 Consortium has endorsed, in particular those involved in the realization of the Semantic Web. The Semantic Web is the evolution of the current web that aims at formally capturing the knowledge expressed by web contents, so that automatic reasoning can be applied to accomplish a wide variety of tasks, to name a few: increasing search effectiveness, simplifying data migration, automating knowledge exchange between systems, and enhancing automatic service discovery and service composition planning.



Domenico Redavid

Towards the Orchestration of Semantic Web Services

Abstract

The evolution and ubiquity of the Internet have facilitated the proliferation of distributed resources, such as computer systems and software applications. Organizations are increasingly utilizing resources that span traditional organizational boundaries, like shared databases or processor farms, to share expensive computing resources and other equipment (e.g. networked scientific instruments for e-Science or cyber infrastructure projects) or to pool together Enterprise resources distributed widely across networks and geographies. Software applications are also evolving from monolithic, stove-pipe applications to loosely federated, interacting services that are dependent on networked resources to provide optimal functionality. This evolution, powered by the dot-com bubble at the turn of the century, emerged to automate and outsource business processes to a worldwide audience, both at the Business-to-Business (B2B) level and in Business-to-Customer (B2C) applications, improving the user experience. This new software engineering approach enabled distributed, heterogeneous software components to communicate and interoperate through declarative, machine-readable descriptions of the services that different organizations offer. The emergence of Web Service technology allowed this migration for both enterprise and Grid-based applications due to its exploitation of the near ubiquitous World-Wide-Web infrastructure, its cross-platform interoperability, and the fact that it is built upon de facto Web standards for syntax, addressing, and communication protocols. Several conceptualizations of a service have been proposed, ranging from electronic services that facilitate B2B e-commerce to business entities grounded within the real world (such as those offered by network or utility providers) that may offer some provision of value in some domain.
As the provision of these services has moved from a developer-driven mechanism to one involving automatic runtime selection (requiring service discovery support), the descriptions of the APIs and protocols have become increasingly declarative. The Web Service paradigm introduced the concept of a homogeneous, XML-based representation of service descriptions using interface and workflow definition languages such as WSDL, BPEL4WS, and WS-Choreography. Nevertheless, whilst these approaches facilitated easier access to and usage of web services for developers, they have failed to address many of the knowledge-based problems associated with the diversity of service providers, i.e. interface and data heterogeneity. Semantic Web Services (SWS) address this problem by providing a declarative, ontological framework for describing services, messages, and concepts in a machine-readable format that can also facilitate logical reasoning. Thus, service descriptions can be interpreted based on their meanings, rather than simply being a symbolic representation. Semantic Web Services aim to extend the Web Service integration process in order to facilitate automated (or semi-automated) composition, discovery, dynamic binding, and invocation of services within open, scalable environments. Where there is a need to use a SWS infrastructure without human intervention, two very important concepts can be used to describe the issues to be solved: Orchestration and Choreography. These definitions can be applied both to Web Service applications and to agent-based systems, and in general to any system for which the notion of collaboration and planning makes sense, i.e. systems including more than one active entity. The following analogy illustrates these concepts and their differences:

Consider a dance with more than one dancer. Each dancer has a set of steps that he will perform; they orchestrate their own steps because they are in complete control of their domain (their body). A choreographer ensures that the steps all of the dancers make are according to some overall scheme; we call this a choreography. The dancers have a single view point of the dance, while the choreography has a multi-party or global view point of the dance. Orchestration is about describing and executing a single view point model, while choreography is about describing and guiding a global model. It is possible to derive the single view point model from the global model by projecting based on participant.

From the Web Service perspective, an orchestration is a declarative specification that describes a workflow to support the execution of a specific business process, operation or service; i.e., it describes how Web Services can interact with each other at the message level, including the business logic and execution order of their interactions. In contrast, from a SWS perspective, in order to automatically orchestrate a set of services, we need to be able to find, select, combine and perform them. Therefore, the following use cases must be realized: Discovery, Selection, Composition and Invocation.

The aim of this thesis is the formalization of some aspects inherent in the orchestration of Web Services while remaining entirely within the Semantic Web sphere. In particular, it presents an in-depth analysis of Semantic Web Services from the orchestration perspective, including the state of the art and a comparison of the most widely adopted SWS representation languages. A solution for the SWS Composition use case is presented. A prototype based on a backward chaining algorithm has been implemented using SWRL (Semantic Web Rule Language) as the representation language for OWL-S services. It is an original solution since it has been realized entirely using Semantic Web technologies. Furthermore, there are two Semantic Web problems that unavoidably impact the development of Semantic Web Services: ontology alignment and monotonic knowledge base management. These issues are also discussed, and some possible solutions developed in the Semantic Web context are proposed for Semantic Web Services. Finally, an empirical evaluation of our prototype and the conclusions are presented.
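The general idea of composing services by backward chaining can be sketched in a few lines. This is plain Python over sets of concept names, not the SWRL and OWL-S prototype described in the abstract; the `compose` function, the service names and their inputs and outputs are hypothetical.

```python
def compose(goal, available, services, _pending=frozenset()):
    """Backward-chain from `goal` to the data already available.

    `services` maps a service name to (inputs, outputs), both sets of
    concept names. Returns an ordered list of service names whose
    execution yields `goal`, or None if no composition is found.
    """
    if goal in available:
        return []                  # nothing to invoke
    if goal in _pending:
        return None                # goal depends on itself: give up
    for name, (inputs, outputs) in services.items():
        if goal not in outputs:
            continue
        plan = []
        for needed in sorted(inputs):
            sub = compose(needed, available, services, _pending | {goal})
            if sub is None:
                break              # this service cannot be fed
            plan.extend(s for s in sub if s not in plan)
        else:
            return plan + [name]   # all inputs satisfied
    return None

# Hypothetical service descriptions: name -> (inputs, outputs).
services = {
    "GeocodeAddress": ({"Address"}, {"Coordinates"}),
    "FindNearbyHotels": ({"Coordinates"}, {"HotelList"}),
    "BookHotel": ({"HotelList", "CreditCard"}, {"Booking"}),
}
print(compose("Booking", {"Address", "CreditCard"}, services))
```

Starting from the requested output, each step looks for a service that produces it and then treats that service's inputs as new sub-goals, which is the backward direction. Matching here is by exact concept name; a semantic approach would instead reason over an ontology to decide whether an output can satisfy an input.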



Pierpaolo Basile

Word Sense Disambiguation and Intelligent Information Access

Abstract

In the field of computational linguistics, researchers are mainly concerned with the computational processing of natural language. A number of results have already been obtained, ranging from concrete and applicable systems able to understand or produce language to theoretical descriptions of the underlying algorithms. However, a number of important research problems have not been solved. A particular challenge for computational linguistics, pertaining to all levels of language, is ambiguity. Most people are quite unaware of how vague and ambiguous human languages really are, and they are disappointed when computers are hardly able to understand language and linguistic communication the way humans do. Ambiguity means that a word can be interpreted in more than one way, that is, it has more than one meaning. Ambiguity mostly does not pose a problem for humans and is therefore not perceived as such. For a computer, however, ambiguity is one of the main problems encountered in the analysis and generation of natural languages.

Moreover, advances in the Internet and the creation of huge stores of digitized text have opened the gateway to a deluge of information that is difficult to navigate. Although the information is widely available, exploring Web sites and finding information relevant to a user's needs is a challenging task. One of the obstacles is language ambiguity: for example, if you want to search for all the documents about bat as a small nocturnal creature, the retrieval system will most probably also return the documents that contain bat as a piece of sports equipment (a club used for hitting a ball in various games). In order to solve this problem, a method able to disambiguate word meanings across the documents is needed.

The central argument of this dissertation is the use of Word Sense Disambiguation for Intelligent Information Access. Word Sense Disambiguation (WSD) refers to the resolution of lexical semantic ambiguity and its goal is to attribute the correct sense to a word used in a given context, while Intelligent Information Access is a user-centric and semantically rich approach to access information.

After a brief introduction to the problem of lexical semantic ambiguity, I propose several methods for word sense disambiguation that attempt to disambiguate words by exploiting a semantic knowledge resource like WordNet. WordNet is a lexical database whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets (synsets), each representing one underlying lexical concept.
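To give a feel for knowledge-based disambiguation, here is a minimal sketch in the spirit of the simplified Lesk algorithm, which picks the sense whose gloss shares the most words with the context. It is a stand-in technique, not one of the methods proposed in the dissertation, and `TOY_SENSES` is a hypothetical two-sense inventory rather than WordNet.

```python
import re

# Hypothetical sense inventory (not WordNet): word -> {sense id: gloss}.
TOY_SENSES = {
    "bat": {
        "bat#animal": "small nocturnal flying mammal that lives in caves",
        "bat#sport": "club used for hitting a ball in various games",
    }
}
STOPWORDS = {"a", "an", "the", "in", "of", "for", "that", "and", "to",
             "at", "is", "with"}

def content_words(text):
    """Distinct lowercase words of the text, without stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def disambiguate(word, context, senses=TOY_SENSES):
    """Pick the sense whose gloss overlaps most with the context words."""
    ctx = content_words(context) - {word}
    return max(senses[word],
               key=lambda s: len(ctx & content_words(senses[word][s])))

print(disambiguate("bat", "The bat flew out of the caves at night"))
print(disambiguate("bat", "He swung the bat and hit the ball"))
```

When no gloss overlaps with the context, `max` simply returns the first sense listed, so a practical system would need a better fallback, such as the most frequent sense.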

The second part of the dissertation takes into account the evaluation of WSD strategies. In particular, the proposed algorithms are first tested independently of any application, using specially constructed benchmarks, and are then evaluated in terms of their contribution to the overall performance of a system designed for Intelligent Information Access, such as Semantic Searching and Intelligent User Profiling.



Eufemia Tinelli

Efficient Reasoning Techniques for Large Datasets of DLs Instances: Approaches And Applications

Abstract

Nowadays more and more people choose to employ the Internet and/or automated procedures as an infrastructure and a means for communication, search and resource storage. In this context, both services and goods are considered resources. The main aim is to provide new business opportunities by allowing a more efficient management of information.

In order to offer a powerful and automated retrieval process, simple keyword-based search is not sufficient. In fact, on online websites and portals the search process can be time-consuming and unsatisfactory. Generally the user can express only her mandatory requirements (there is no possibility to select features according to wishes or negotiable constraints). Such systems often return irrelevant results without explanations. The efficiency of such retrieval engines is therefore determined by the efficacy of their underlying frameworks, which perform the match between user requests and offers. If requests and offers are simple names or strings, the only possible match would be identity, resulting in an all-or-nothing outcome. On the other hand, pure knowledge-based approaches require heavy computational capabilities, hence response times are often unacceptable.

In business scenarios, another important issue is dealing with very large datasets of resources. Hence, retrieval efficiency is measured both by data scalability and by parameters such as allowed match classes, relevant results obtained, ranking functions and query language expressivity. Of course, it is not possible in this work to cover the full range of reasoning services. Instead, the focus of the thesis is the presentation of efficient resource matchmaking and composition approaches in several business contexts and, in particular, the contribution that Knowledge Representation (KR), specifically Description Logics (DLs), can provide to improve scenarios where demand (user request) meets offers (goods and services). The final goal is to retrieve only the best offers, suitably ranked, with respect to the user request.

The matchmaking approaches presented in this work are based on KB pre-processing in order to reduce on-line reasoning. A relevant aspect of the thesis work is the exploitation of classical relational database management systems (RDBMS) and languages, i.e. standard SQL, for storing the KB and performing reasoning tasks, respectively. Several approaches have been presented in which databases allow users and applications to access both ontologies and other structured data in a seamless way. An overview and a comparison of these will also be presented. Finally, preference-based models and systems sharing some characteristics with the approaches of this work will be discussed, because the problem of preference handling in RDBMS is not new in information retrieval systems.

This thesis aims at advancing the state of the art in research on efficient reasoning techniques for managing very large datasets of DLs instances; in particular, the work intends to show how appropriate modeling of the KB can improve semantic matchmaking and, eventually, match explanation. The discussion will take into account several aspects and dimensions.

The contribution of this research could be summarized as follows:

  • We describe a preliminary matchmaking approach which investigates instance modeling in a relational database. It is not dependent on the domain and it implements several match classes, exploiting standard SQL only. Moreover, two ontologies modeling different domains are used to build several datasets of instances in order to better verify the performance of the services.
  • On the basis of the results of the above-mentioned approach, we present a complete matchmaking algorithm especially suitable for skill matching. Distinguishing features include: the possibility to express both strict requirements and preferences in the user request, a logic-based ranking of retrieved instances and the explanation of rank results. All services rely only on ad hoc queries translated into standard SQL: no built-in operators and/or new constructors are exploited.
  • As a proof of concept, we present a tool developed for providing skill matching and team-work composition. The main issue is to design a user-friendly GUI both for easily browsing the domain ontology (in order to compose the query) and for better explaining the retrieved results.
  • We describe other efficient techniques for resource retrieval and composition in domains such as Ubiquitous Computing and Business Process. Moreover, we present a possible integration between semantic matchmaking services and user profiling ones and, finally, we investigate the problem of core competence extraction.
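The combination of strict requirements and preferences in a plain SQL query can be sketched as follows. This is only an illustration using Python's `sqlite3` module: the `has_skill` table, the `match` function and the sample data are hypothetical, skills are flat strings rather than DL instances, and the score is a simple count of matched preferences rather than the logic-based ranking described in the contributions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE has_skill (
                    candidate TEXT, skill TEXT,
                    PRIMARY KEY (candidate, skill))""")
conn.executemany("INSERT INTO has_skill VALUES (?, ?)", [
    ("anna", "Java"), ("anna", "SQL"), ("anna", "OWL"),
    ("luca", "Java"), ("luca", "SQL"),
    ("sara", "SQL"), ("sara", "OWL"),
])

def match(conn, required, preferred):
    """Candidates holding every required skill, ranked by preferences met.

    Assumes both lists are non-empty and contain no duplicates.
    """
    req_marks = ",".join("?" * len(required))
    pref_marks = ",".join("?" * len(preferred))
    query = f"""
        SELECT candidate,
               SUM(CASE WHEN skill IN ({pref_marks}) THEN 1 ELSE 0 END)
                   AS score
        FROM has_skill
        GROUP BY candidate
        HAVING SUM(CASE WHEN skill IN ({req_marks}) THEN 1 ELSE 0 END) = ?
        ORDER BY score DESC, candidate
    """
    # Parameters follow the order of the placeholders in the query text.
    params = [*preferred, *required, len(required)]
    return conn.execute(query, params).fetchall()

print(match(conn, required=["Java", "SQL"], preferred=["OWL"]))
```

The `HAVING` clause enforces the strict requirements (sara is excluded because she lacks Java), while the `score` column orders the surviving candidates by how many preferences they satisfy. Because each score is derived from identifiable rows, the same table could also be queried to explain why a candidate was ranked where it was.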


Anna Lisa Gentile

Entities and Identities: Named Entity Processing with cultural Knowledge

Abstract

Natural Language is a means to express and discuss concepts, objects and events, i.e. it carries semantic content. Reading a written text implies comprehending the information that the words carry. Comprehension is an intrinsic capacity for a human, but not for a machine. One of the ultimate roles of Natural Language Processing techniques is identifying the meaning of the text, providing effective ways to make a proper linkage between textual references and real world objects, thus enabling machines to have a bit of the understanding that is proper to a human.

A proper name is a word or a list of words that refers to a real world object. Linguistic Expressions with the same reference may have different senses, so it is necessary to disambiguate between them.

Natural Language Processing (NLP) operations include text normalization, tokenization, stop word elimination, stemming, Part Of Speech tagging and lemmatization. Further steps, such as Word Sense Disambiguation (WSD) or Named Entity Recognition (NER), are aimed at enriching texts with semantic information. Named Entity Disambiguation (NED) is the procedure that resolves the correspondence between real-world entities and mentions within text. One of the ultimate goals of NLP techniques is to identify the meaning of the text, providing effective ways to make a proper linkage between textual references and real world objects. The thesis addresses the problem of giving a sense to proper names in a text, that is, the problem of automatically associating words representing Named Entities with their identities, i.e. unique real world objects. The thesis also copes with the problem of the lack of training and testing data for such a task.

Proposed approaches automatically associate each entity in a text with a unique identifier, a URI from Wikipedia, which is used as an "entity-provider".

The main contribution consists of proposing knowledge-based approaches for NED, which do not require training data. Specifically, the thesis proposes two solutions:

  • a completely knowledge-based algorithm for NED, exploiting Wikipedia data
  • a Semantic Relatedness (SR) approach for the NED task: SR scores are obtained by a graph-based model over Wikipedia

The first solution has been tested for the Italian language: due to the lack of Italian testing data for such a task, the thesis shows a method to automatically build a testbed dataset from Wikipedia. The second solution has been tested over a gold standard dataset for NED: the proposed algorithm achieves results competitive with the state of the art.

Both suggested solutions are completely knowledge-based, with the advantage that no training data is needed: indeed, manually annotated data for this task is not easily available and acquiring such data can be expensive.



Leo Iaquinta

Serendipity in Context: Context-aware Recommendations of Serendipitous Items

Abstract

When a person searches for a piece of information about a topic, she finds so much information available that she can hardly unearth the web pages, books, papers, articles, music, videos, etc. that are actually relevant to the searched topic. For instance, most search engines on the Internet return thousands of results for every query, while only a few of those results are really relevant for the searcher and they are not always at the top of the returned list. Furthermore, what is relevant and interesting for one searcher may not be relevant and interesting for another searcher, even if they submit the same query.

The extensive options lead the user to feel that she looses control on handling the amount of information and she becomes worried whether something interesting or important is being missed. This problem is often referred to as the information overload.

Recommender systems help to reduce information overload and provide customized information access for targeted domains. Such systems take direct or indirect input from users and, based on their needs, preferences and usage patterns, provide personalized advices about products or services so that users are assisted to filter useful information.

Recommender systems became an important research area since the appearance of the first papers on collaborative filtering since the mid-1990s. There has been much work done both in the industry and academia to develop and to improve new approaches to recommendations over the last decade. The interest in this area still remains high because it constitutes a problem-rich research area and because of the plenty of practical applications that help users to deal with the information overload and that provide them with personalized recommendations, content and services. In addition, despite all the advances, the current generation of recommender systems still requires further improvements to make recommendation methods more effective and applicable to an even broader range of real-life applications. These improvements include better methods for representing the user behavior and the information about the items to be recommended, more advanced recommendation modeling methods, and exploitation of contextual information into the recommendation process.

For some approaches, such as the content-based one, the item representation plays a key role, so choosing proper facets to represent items is a fundamental task for deploying effective recommender systems. Contextual facets are often only marginally relevant for learning and predicting user preferences, but in some domains disregarding them makes recommendations useless. Consequently, the thesis addresses the contextual dimension, proposing a strategy to improve the effectiveness of a content-based recommender system by exploiting contextual facets. The demonstrative scenario concerns the dynamic suggestion of personalized tours within a museum: the contextual facets deal with the physical layout of items and the interaction of users with the physical environment.

The thesis also deals with the serendipitous dimension. Recommender systems commonly recommend items that score highly against a user's profile and, consequently, the user is recommended items similar to those already rated. When this feature becomes a limitation, the recommender system suffers from over-specialization and fails to meet the common expectations of novelty and surprise. Novelty occurs when the system suggests an unknown item that the user might have autonomously discovered. A serendipitous recommendation, on the other hand, helps the user to find a surprisingly interesting item that she might not otherwise have discovered (or that would have been really hard to discover). Although serendipity is a difficult concept to research, because it is by definition not particularly susceptible to systematic control and prediction, the thesis proposes a strategy to mitigate over-specialization by exploiting the learned user profiles.

Finally, the contextual and serendipitous dimensions are synergistic. The contextual dimension is used to refine the selection of supposedly serendipitous items and to provide a practical interpretation of the serendipity-augmented recommendation task. Conversely, the serendipitous dimension makes it possible to introduce greater dynamism into the handling of contextual facets.



Annalina Caputo

Semantics and Information Retrieval: Models, Techniques and Applications

Abstract

The dialogue between humans and machines takes place on two different levels. Along the path from the user's mind to a machine representation, concepts, relationships and meanings are translated into a flat unstructured form deprived of its original meaning. This process, which also affects text representation, has an impact on Information Access systems, and in particular on Information Retrieval (IR) systems. The key concept in such systems is the word information, but when text is represented as an unordered sequence of words the retrieval task becomes a mere string-matching process. In this context, user vagueness and word ambiguity become a big challenge for IR systems.

Over the past decades several attempts have been made to move away from the traditional keyword search paradigm, often by introducing techniques to capture word meanings. The result is a vast range of approaches that aim to harness semantics for Information Retrieval, working on two different fronts. The first tries to introduce semantics by modeling word meaning directly into the document representation. The second tries to build an improved query representation by shifting from what the user asks to what the user wants. However, the general feeling is that dealing explicitly with semantic information alone does not significantly improve the performance of text retrieval systems.

The work presented in this thesis explores the usage of semantics in Information Retrieval on two separate fronts: documents and queries. Semantics has many facets and several interpretations in Computer Science, but this thesis focuses on lexical semantics.

The first part of this dissertation deals with semantics in documents. First, it presents SENSE (SEmantic N-levels Search Engine), an IR system that tries to overcome the limitations of the ranked keyword approach by introducing semantic levels that integrate (and do not simply replace) the lexical level represented by keywords.

Two algorithms are proposed for representing word meanings in SENSE: the former is based on Word Sense Disambiguation, while the latter exploits Word Sense Discrimination.

The second part of this work tackles semantics in queries. One approach of this kind is Query Expansion (QE). Two well-known QE algorithms are investigated within the SENSE framework: Rocchio and Local Context Analysis.
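
As an illustration of query expansion, the classic Rocchio formulation moves the query vector towards the centroid of documents judged relevant and away from the centroid of non-relevant ones: q' = alpha * q + beta * centroid(relevant) - gamma * centroid(non-relevant). The following is a minimal sketch of that textbook formula, not the implementation used within SENSE. The function name, the dictionary-based term vectors and the default weights (alpha = 1.0, beta = 0.75, gamma = 0.15) are assumptions made for the example.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Textbook Rocchio update (illustrative sketch).

    query:        dict mapping term -> weight
    relevant:     list of document vectors (dicts) judged relevant
    non_relevant: list of document vectors (dicts) judged non-relevant
    The default weights are conventional values, assumed for this example.
    """
    expanded = defaultdict(float)

    # Keep the original query terms, scaled by alpha.
    for term, weight in query.items():
        expanded[term] += alpha * weight

    # Add the centroid of the relevant documents, scaled by beta.
    for doc in relevant:
        for term, weight in doc.items():
            expanded[term] += beta * weight / len(relevant)

    # Subtract the centroid of the non-relevant documents, scaled by gamma.
    for doc in non_relevant:
        for term, weight in doc.items():
            expanded[term] -= gamma * weight / len(non_relevant)

    # Terms whose weight is not positive are dropped, a common convention.
    return {term: w for term, w in expanded.items() if w > 0}


if __name__ == "__main__":
    q = {"jaguar": 1.0}
    rel = [{"jaguar": 1.0, "car": 1.0}, {"car": 1.0, "engine": 1.0}]
    non_rel = [{"jaguar": 1.0, "cat": 1.0}]
    print(rocchio(q, rel, non_rel))
```

In this toy example the expanded query gains the terms "car" and "engine" from the relevant documents, while "cat", which occurs only in the non-relevant document, is excluded. When no explicit judgments are available, the top-ranked documents of an initial retrieval run are often treated as relevant (pseudo-relevance feedback).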

Lastly, this thesis addresses the problem of building complex queries able to represent concepts and their relationships. Complex queries are built by exploiting the quantum algebra for structured queries within the Quantum IR framework.

All proposed algorithms and approaches are evaluated on standard test collections, and results show that most of them are effective ways of improving the retrieval task. The methods presented in this thesis demonstrate that the question is how, rather than whether, to add semantics to IR.


