SWAP - Semantic Web Access and Personalization Research Group - PhDGraduates

The dialogue between humans and machines takes place on two different levels. Along the path from the user’s mind to a machine representation, concepts, relationships and meanings are translated into a flat, unstructured form deprived of its original meaning. This process, which also affects text representation, has an impact on Information Access systems, and on Information Retrieval (IR) systems in particular. The key concept in such systems is the word information, but when text is represented as an unordered sequence of words, the retrieval task becomes a mere string-matching process. In this context, user vagueness and word ambiguity become a major challenge for IR systems.

Over the past decades, several attempts have been made to move away from the traditional keyword search paradigm, often by introducing techniques to capture word meanings. The result is a vast body of approaches aimed at harnessing semantics in Information Retrieval, working on two different fronts. The first tries to introduce semantics by modeling word meaning directly in the document representation. The second tries to build an improved query representation by shifting from what the user asks to what the user wants. However, the general feeling is that dealing explicitly with semantic information alone does not significantly improve the performance of text retrieval systems.

The work presented in this thesis explores the usage of semantics in Information Retrieval on two separate fronts: documents and queries. Semantics has many facets and several interpretations in Computer Science, but this thesis focuses on lexical semantics.

The first part of this dissertation deals with semantics in documents. First, it presents SENSE (SEmantic N-levels Search Engine), an IR system that tries to overcome the limitations of the ranked keyword approach by introducing semantic levels that integrate (and do not simply replace) the lexical level represented by keywords.

Two algorithms are proposed for representing word meanings in SENSE: the former is based on Word Sense Disambiguation, while the latter exploits Word Sense Discrimination.

The second part of this work tackles semantics in queries. One such approach is Query Expansion (QE). Two well-known QE algorithms are investigated within the SENSE framework: Rocchio and Local Context Analysis.
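As a rough illustration of query expansion, the sketch below applies the classic Rocchio update to term-weight vectors stored as dictionaries. The function name, the `alpha`, `beta` and `gamma` values, and the toy vectors are assumptions chosen for illustration; they are not the configuration used in SENSE.

```python
from collections import defaultdict


def rocchio_expand(query, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Return a re-weighted query vector (term -> weight).

    All vectors are dicts mapping a term to its weight. The default
    coefficients are common textbook values, assumed here for the example.
    """
    new_query = defaultdict(float)

    # Keep the original query terms, scaled by alpha.
    for term, weight in query.items():
        new_query[term] += alpha * weight

    # Move towards the centroid of relevant documents and away from
    # the centroid of non-relevant documents.
    for docs, coeff in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)

    # Terms whose weight ends up non-positive are dropped.
    return {term: w for term, w in new_query.items() if w > 0}


# Hypothetical feedback for a query about "bat" as an animal.
query = {"bat": 1.0}
relevant = [{"bat": 1.0, "cave": 1.0}, {"bat": 1.0, "nocturnal": 1.0}]
nonrelevant = [{"bat": 1.0, "baseball": 1.0}]
expanded = rocchio_expand(query, relevant, nonrelevant)
```

In this toy run the expanded query gains the terms `cave` and `nocturnal` from the relevant documents, while `baseball` receives a negative weight and is discarded.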

Lastly, this thesis addresses the problem of building complex queries able to represent concepts and their relationships. Complex queries are built by exploiting the quantum algebra for structured queries within the Quantum IR framework.

All proposed algorithms and approaches are evaluated on standard test collections, and the results show that most of them are effective ways of improving the retrieval task. The methods presented in this thesis demonstrate that the question is how, rather than whether, to add semantics to IR.



October 01, 2010, at 04:26 AM EST by 193.204.187.101 -

Changed line 244 from:

This thesis aims to advance the state of the art in research on efficient reasoning

to:

This thesis aims at advancing the state of the art in research on efficient reasoning

Changed lines 250-253 from:

The contributions of this research could be summarized as follows:

We describe a preliminary matchmaking approach which investigates instance modeling in a relational database. It is not dependent on

the domain and it implements several match classes, exploiting SQL standard only. Moreover, two ontologies modeling different domains are used to built several datasets of instances in order to better verify services performance.

On the basis of results of the above mentioned approach, we present a complete matchmaking algorithm specially suitable for skill matching. Distinguishing features include: the possibility to express both strict requirements and preferences in the user request, a logic-based ranking of retrieved instances and the explanation of rank results. All services only rely on ad hoc queries translated in standard SQL: no built-in operator and/or new constructor are exploited. In this approach, both requests and offers must be expressed using the same reference template which is necessary to define their structure and expressiveness.

to:

The contribution of this research could be summarized as follows:

We describe a preliminary matchmaking approach which investigates instance modeling in a relational database. It is not dependent on the domain and it implements several match classes, exploiting standard SQL only. Moreover, two ontologies modeling different domains are used to build several datasets of instances in order to better verify service performance.
On the basis of the results of the above-mentioned approach, we present a complete matchmaking algorithm especially suitable for skill matching. Distinguishing features include: the possibility to express both strict requirements and preferences in the user request, a logic-based ranking of retrieved instances and the explanation of ranking results. All services rely only on ad hoc queries translated into standard SQL: no built-in operators or new constructors are exploited.


October 01, 2010, at 04:06 AM EST by 193.204.187.101 -

Changed line 210 from:

In order to offer a powerful and automated retrieval process, the simple keywordbased

to:

In order to offer a powerful and automated retrieval process, the simple keyword-based

Changed lines 212-213 from:

process can be time-consuming and unsatisfactory. Generally speaking, those systems are keyword-based and then a user can express only her mandatory requirements (there

to:

process can be time-consuming and unsatisfactory. Generally the user can express only her mandatory requirements (there

Changed line 214 from:

systems return not ranked and often irrelevant results without explanations. The efficiency

to:

systems return often irrelevant results without explanations. The efficiency

Changed lines 216-221 from:

frameworks able to perform the match among user requests and offers. From this point of view, it is noteworthy that non-logical approaches to resource retrieval and matchmaking have serious limitations. For example, by exploiting standard relational database techniques to model a resource retrieval framework, there is the need to completely align the attributes of the offered and requested resources, in order to perform a match. If requests and offers are simple names or strings, the only possible

to:

frameworks able to perform the match among user requests and offers. If requests and offers are simple names or strings, the only possible

Deleted lines 220-224:

Moreover, in real contexts, very often there are no offers that are better than the others ones from every user selection criteria. We consider that in these cases, i.e., when exact matches are lacking, instead of receiving an empty set as search result, user could accept worse alternatives gradly or she could negotiate the original requirements for compromises.

Changed line 230 from:

The final goal is to retrieve only the best offers, opportunely ranked, w.r.t. the user

to:

The final goal is to retrieve only the best offers, opportunely ranked, with respect to the user

Changed lines 233-239 from:

The problemof reasoning efficiency is not new in literature. Knowledge Compilation, infact, is a technique exploited for making reasoning computationally easier in a knowledge base (KB) typicallymodelled using a logical formalism. The idea of knowledge compilation is to split query answering into two phases:

in the first one the knowledge base is preprocessed, thus obtaining an appropriate data structure (such a phase is sometimes called off-line reasoning);
in the second phase, the query is actually answered using the output of the first phase (such a phase is sometimes called on-line reasoning).

to:

Changed lines 236-237 from:

classical relational database systems (RDBMS) and languages i.e., SQL, for storing the KB and to perform reasoning tasks. Several approaches have been presented in which

to:

classical relational database systems (RDBMS) and languages i.e., standard SQL, for storing the KB and to perform reasoning tasks respectively. Several approaches have been presented in which

Changed line 241 from:

with this work approaches will be discussed because the problem of preference

to:

with the approaches of this work will be discussed because the problem of preference

Changed line 246 from:

intends to show how appropriatemodeling of the KB can improve semantic matchmaking

to:

intends to show how appropriate modeling of the KB can improve semantic matchmaking

Changed lines 248-255 from:

aspects and dimensions:

Application – For what is resource retrieval used? And resource composition? Application fields are several and different.
Efficient semantic matchmaking – What language is necessary to build both user request and offer semantic description? What data structure enables to perform reasoning tasks? How can we evaluate the reasoning efficiency?
- KB modeling – How is it possible to respect an Open-world Assumption by means of an RDBMS based on the Closed-world Assumption? Other issue is related to stored information useful for retrieval. In other terms, we discuss which data (structured and not, instances and ontological inforation) have to be stored in order to provide services such as matchmaking, ranking and match explanation.
- Match classes – Which match classes are allowed? Which algorithms are implemented? Is the system scalable in the sense that the retrieval time quite linearly increases with the data size?
- Complementary facilities – Which information is used to explain the score obtained for each result? For the end user is important to express both necessary requirements and desiderable ones in her request. Hence, the matchmaker have to be able to deal efficiently with strict and soft constraint, respectively.

to:

aspects and dimensions.

Changed lines 251-254 from:

We describe a preliminary matchmaking approach which investigates instance modeling in a relational database1. It is domain independent and it implements several match classes, exploiting SQL standard only. Limits are the followings: no ranked list of results is returned and the potential match is not complete because generally it retrieves a bigger set of results containing irrelevant results also. Moreover, two ontology modeling different domains are used to built several datasets of instances in order to better verify services performance.
On the basis of results of the above mentioned approach, we present a complete matchmaking algorithmspecially suitable for skill matching. Distinguishing features include: the possibility to express both strict requirements and preferences in the user request, a logic-based ranking of retrieved instances and the explanation of rank results. All services only rely on ad hoc queries translated in standard SQL: no built-in operator and/or new constructor are exploited. In this approach, both requests and offers must be expressed using the same reference template which is necessary to define their structure and expressivity.
As proof-of-concept, we present a tool developed for providing skill matching and team-work composition. A main aim is the design of an user-friendly GUI both for browsing easily the domain ontology (in order to compose the query) and for better explain the retrieved results.
We describe other efficient techniques for resource retrieval and composition in domains as Ubiquitous Computing and Business Process. We present a possible integration between semantic matchmaking services and user profiling ones and, finally, we investigate the problem of core competence extraction.

to:

We describe a preliminary matchmaking approach which investigates instance modeling in a relational database. It is not dependent on

On the basis of the results of the above-mentioned approach, we present a complete matchmaking algorithm especially suitable for skill matching. Distinguishing features include: the possibility to express both strict requirements and preferences in the user request, a logic-based ranking of retrieved instances and the explanation of ranking results. All services rely only on ad hoc queries translated into standard SQL: no built-in operators or new constructors are exploited. In this approach, both requests and offers must be expressed using the same reference template, which is necessary to define their structure and expressiveness.
As proof-of-concept, we present a tool developed for providing skill matching and team-work composition. The main issue is to design a user-friendly GUI both for easily browsing the domain ontology (in order to compose the query) and for better explaining the retrieved results.
We describe other efficient techniques for resource retrieval and composition in domains such as Ubiquitous Computing and Business Process. Moreover, we present a possible integration between semantic matchmaking services and user profiling ones and, finally, we investigate the problem of core competence extraction.


September 21, 2010, at 04:52 AM EST by 193.206.186.106 -

Added lines 75-86:

Word Sense Disambiguation and Intelligent Information Access

Abstract

In the field of computational linguistics, researchers are mainly concerned with the computational processing of natural language. A number of results have already been obtained, ranging from concrete and applicable systems able to understand or produce language to theoretical descriptions of the underlying algorithms. However, a number of important research problems have not been solved. A particular challenge for computational linguistics, pertaining to all levels of language, is ambiguity. Most people are quite unaware of how vague and ambiguous human languages really are, and they are disappointed when computers are hardly able to understand language and linguistic communication the way humans do. Ambiguity means that a word can be interpreted in more than one way, that is, it has more than one meaning. Ambiguity mostly does not pose a problem for humans and is therefore not perceived as such. For a computer, however, ambiguity is one of the main problems encountered in the analysis and generation of natural languages.

Moreover, advances in the Internet and the creation of huge stores of digitized text have opened the gateway to a deluge of information that is difficult to navigate. Although the information is widely available, exploring Web sites and finding information relevant to a user's needs is a challenging task. One of the obstacles is language ambiguity. For example, if you want to search for all the documents about bat as a small nocturnal creature, the retrieval system will most probably also return the documents that contain bat as a piece of sports equipment (a club used for hitting a ball in various games). In order to solve this problem, a method able to disambiguate word meanings across the documents is needed.

The central argument of this dissertation is the use of Word Sense Disambiguation for Intelligent Information Access. Word Sense Disambiguation (WSD) refers to the resolution of lexical semantic ambiguity and its goal is to attribute the correct sense to a word used in a given context, while Intelligent Information Access is a user-centric and semantically rich approach to access information.

After a brief introduction to the problem of lexical semantic ambiguity, I propose several methods for word sense disambiguation that attempt to disambiguate words by exploiting a semantic knowledge resource like WordNet. WordNet is a lexical database whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets (synsets), each representing one underlying lexical concept.

The second part of the dissertation takes into account the evaluation of WSD strategies. In particular, the proposed algorithms are first tested independently of any application, using specially constructed benchmarks, and are then evaluated in terms of their contribution to the overall performance of systems designed for Intelligent Information Access, such as Semantic Searching and Intelligent User Profiling.


September 21, 2010, at 04:48 AM EST by 193.204.187.140 -

Changed lines 11-27 from:

to:

Learning User Profiles from Text for Personalized Information Access

Abstract

Advances in the Internet and the creation of huge stores of digitized text have opened the gateway to a deluge of information that is difficult to navigate. Although the information is widely available, exploring Web sites and finding information relevant to a user's interests is a challenging task. The first obstacle is research, where you must first identify the appropriate information sources and then retrieve the relevant data. Then, you have to sort through this data to filter out the unfocused and unimportant information. Lastly, in order for the information to be truly useful, you must take the time to figure out how to organize and abstract it in a manner that is easy to understand and analyze. To say the least, all of these steps are extremely time-consuming. This "relevant information problem" leads to a clear demand for automated methods able to support users in searching large document repositories in order to retrieve relevant information with respect to their preferences. Catching user interests and representing them in a structured form is a problematic activity. Algorithms designed for this purpose base their relevance computations on so-called user profiles in which representations of the users' interests are maintained. The central argument of this dissertation is the use of Supervised Machine Learning techniques to induce user profiles from text data for Intelligent Information Access. Intelligent Information Access is a user-centric and semantically rich approach to access information: information preferences vary greatly across users, therefore information access must be highly personalized by profiles to serve the individual interests of the user. Moreover, users want to retrieve information on the basis of conceptual content, but individual words provide unreliable evidence about the meaning of documents. Thus, methods for extracting meaning from documents must be considered in order to effectively find relevant information.

First, we describe content-based learning algorithms designed to learn about users' interests. The input, given as a set of text documents marked by the user as relevant or not relevant, is used to find characteristics that distinguish relevant documents from irrelevant ones. The induced target concept is a user profile appropriate for the classification of new documents. Documents are represented as a bag of words (BOW): a document is encoded as a feature vector, with each element in the vector indicating the presence or absence of a word in the document. This approach was used as a baseline to determine how well a standard keyword-based learner performs on this task.
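The bag-of-words encoding described above can be sketched in a few lines. This is a minimal illustration only: the tokenizer, the helper names and the two sample sentences are assumptions made for the example, not the preprocessing used in the dissertation.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Collect every distinct token in the collection, in sorted order."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def to_binary_vector(text, vocabulary):
    """Encode a document as 1/0 flags: is each vocabulary word present or absent?"""
    present = set(tokenize(text))
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical two-document collection.
docs = ["The bat flew out of the cave", "He swung the bat at the ball"]
vocab = build_vocabulary(docs)
vectors = [to_binary_vector(doc, vocab) for doc in docs]
```

Both vectors have a 1 in the position of `bat`, even though the word has a different sense in each sentence, which is the kind of limitation a purely keyword-based representation runs into.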

Second, current limits in the state of the art in profiles generated from BOW-represented documents are analyzed. Though many linguistic techniques have been employed, some problems still remain unsolved, such as polysemy and synonymy. A possible solution for this kind of issue is explored: the shift of the level of abstraction from words up to concepts. Profiles will not contain words anymore. They will contain references to concepts defined in lexicons or, in a further step, ontologies. A first advance in this direction consists of employing WordNet as a reference lexicon to substitute word forms with word meanings in profiles. We show how the described content-based algorithms can be extended using a new, enriched document representation obtained by adding features generated using a new WordNet-based procedure.

The dissertation concludes with the description of the empirical study that evaluates the effectiveness of the proposed approach.


September 20, 2010, at 04:46 PM EST by 93.43.209.83 -

Changed line 25 from:

Recommender systems constitute one of the fastest growing segments of the Internet economy today. They help reduce information overload and provide customized information access for targeted domains. Such systems take input directly or indirectly from users and, based on their needs, preferences and ''usage patterns', provide personalized advices about products or services and can help people to filter useful information, thus giving users easing the information search and decision processes.

to:

Recommender systems constitute one of the fastest growing segments of the Internet economy today. They help reduce information overload and provide customized information access for targeted domains. Such systems take input directly or indirectly from users and, based on their needs, preferences and usage patterns, provide personalized advice about products or services and can help people to filter useful information, thus easing the information search and decision processes.


September 20, 2010, at 04:45 PM EST by 93.43.209.83 -

Added lines 17-39:

Hybrid Recommendation Techniques based on User Profiles

Abstract

Nowadays users are overwhelmed by the amount of information available, and this is not just a problem for a minority of the population; it is a problem for everyone in their daily life. In fact, we now get information not just from newspapers, colleagues, family members and friends, but also largely from the Internet.

How can people deal with this information overload problem? Individuals tend to filter and ignore information as a way to cope with information overload.

Among the different recommendation techniques proposed in the literature, the collaborative filtering approach is the most successful and widely adopted to date. However, collaborative filtering by itself cannot always guarantee a good prediction. The effectiveness of predictions relies on the confidence of the computation of the similarity between users. Correlation between users can only be computed if they have rated a sufficient number of common items. Since users can choose among thousands of items to rate, especially in online catalogues, and new items become available continuously, it is likely that the overlap of rated items between two users will be minimal in many cases. Therefore, many of the computed correlation coefficients are based on just a few observations. As a result, correlation based only on co-rated items cannot be regarded as a reliable similarity measure.
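To make the sparsity issue concrete, the sketch below computes a Pearson correlation restricted to co-rated items and returns the overlap size alongside it. The function name, the `min_overlap` parameter, the choice to compute means over co-rated items only, and the ratings themselves are assumptions made for illustration.

```python
from math import sqrt


def pearson_on_corated(ratings_a, ratings_b, min_overlap=2):
    """Pearson correlation over items rated by both users.

    Ratings are dicts mapping item id -> rating. Returns (correlation, overlap);
    the correlation is None when it cannot be computed.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < min_overlap:
        return None, n

    a = [ratings_a[item] for item in common]
    b = [ratings_b[item] for item in common]
    mean_a = sum(a) / n
    mean_b = sum(b) / n

    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A user who gave identical ratings has no variance to correlate.
        return None, n
    return cov / sqrt(var_a * var_b), n


# Hypothetical users: only two items are co-rated.
alice = {"i1": 5, "i2": 3, "i3": 4, "i7": 1}
bob = {"i1": 4, "i2": 2, "i9": 5}
carol = {"i8": 3}

corr_ab, overlap_ab = pearson_on_corated(alice, bob)
corr_ac, overlap_ac = pearson_on_corated(alice, carol)
```

Here the first pair reaches a perfect correlation on the strength of only two shared ratings, while the second pair has no co-rated items at all, so no similarity can be computed.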

One of the primary contributions of this thesis is the investigation of how knowledge about users can be exploited to improve recommendations. In particular, it investigates how overlaps between users' interests could be used to define the similarity among users in order to improve recommendations.

A new hybrid recommendation technique is presented, which combines classic collaborative filtering techniques with user profiles inferred using content-based methods.

The study describes the process of learning content-based profiles to be used in one of the main steps of the process for producing social recommendations: neighborhood formation. A clustering technique for grouping user profiles is proposed in order to identify the set of neighbors for those users for whom recommendations must be produced.

More specifically, the process of grouping user profiles learns two different profiles of the user: one, learned from positive examples of interesting items, represents the interests of the user; the other, learned from negative examples, represents items the user dislikes. The observation is that two users can be considered similar not only if they like the same items, but also if they dislike the same ones.

Finally, advanced semantic user profiles based on concepts instead of keywords have been used for improving the accuracy of collaborative recommendations.

Several experiments have been carried out in order to evaluate the effectiveness of the approaches. Some baseline experiments on classic collaborative filtering have been carried out as a benchmark. The final experimental analysis provides evidence of the improvements achieved by the proposed approaches.
