Overview

WebClass is an adaptive Web intermediary agent that performs content-based classification of HTML pages. The intermediary considers both the textual content of web pages and the layout structure defined by HTML tags. Four classification models are implemented for the classification task: Bayesian, decision tree, centroid-based, and k-nearest-neighbor.
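
As an illustration of how textual content and tag structure can be combined, the following is a minimal sketch of a centroid-style classifier. It is not the WebClass implementation: the "tag:word" feature scheme, the cosine similarity measure, and all names (TagAwareExtractor, train_centroids, classify) are assumptions made for this example.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagAwareExtractor(HTMLParser):
    """Counts plain words plus hypothetical 'tag:word' features."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags, innermost last
        self.features = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unbalanced HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.features[word] += 1                          # textual content
            if self.stack:
                self.features[f"{self.stack[-1]}:{word}"] += 1  # layout context


def extract(html):
    parser = TagAwareExtractor()
    parser.feed(html)
    parser.close()
    return parser.features


def cosine(a, b):
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labelled_pages):
    """Average the feature vectors of the pages in each category."""
    sums, counts = {}, Counter()
    for html, label in labelled_pages:
        sums.setdefault(label, Counter()).update(extract(html))
        counts[label] += 1
    return {
        label: {key: value / counts[label] for key, value in vector.items()}
        for label, vector in sums.items()
    }


def classify(html, centroids):
    """Return the category whose centroid is most similar to the page."""
    features = extract(html)
    return max(centroids, key=lambda label: cosine(features, centroids[label]))
```

Under these assumptions, a word that appears inside a title contributes both a plain-word feature and a "title:word" feature, so two pages that use the same word in the same structural position score as more similar than pages that merely share the word.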

The system supports two categories of users: administrators and final users. The former can train the system through a graphical user interface, which supports browsing functions together with functions typical of a learning system, such as parameter setup, feature extraction, definition of training and test sets, classifier generation, and testing.

By contrast, the system is almost entirely transparent to final users, who simply use their browsers to search for information relevant to their activity.

WebClassIII is the latest version of WebClass and implements a hierarchical text categorization framework.

The hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning, and classification of a new document. The main innovative aspects concern the feature selection method, the automated determination of thresholds for classification scores, and new measures for evaluating system performance. Methodological novelties concern the three learning methods tested in this work, namely centroid-based, naïve Bayes, and SVM. The hierarchies considered in WebClassIII are quite general and include internal nodes with no associated documents or with only a single child.
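
One way to picture how a hierarchy and per-category score thresholds can interact at classification time is a top-down descent, sketched below. This is only an illustration under stated assumptions: the source does not specify the descent strategy, WebClassIII determines its thresholds automatically whereas here they are fixed by hand, and the Node class, the keyword-overlap score function, and the example categories are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    threshold: float = 0.0                      # hand-set here (assumption)
    children: list = field(default_factory=list)


def classify_top_down(doc, root, score):
    """Follow the best-scoring child while its score reaches its threshold.

    Returns the list of category names from the root to the node where
    the descent stopped. Single-child nodes need no special handling.
    """
    path = [root.name]
    node = root
    while node.children:
        best = max(node.children, key=lambda child: score(doc, child))
        if score(doc, best) < best.threshold:
            break                               # document stays at this level
        path.append(best.name)
        node = best
    return path


# Hypothetical scoring: fraction of the document's distinct words that
# belong to a category's keyword set.
KEYWORDS = {
    "science": {"experiment", "theory", "data"},
    "physics": {"quantum", "particle"},
    "arts": {"painting", "music"},
}


def keyword_score(doc, node):
    words = set(doc.lower().split())
    return len(words & KEYWORDS[node.name]) / len(words) if words else 0.0


physics = Node("physics", threshold=0.5)
science = Node("science", threshold=0.3, children=[physics])  # single child
arts = Node("arts", threshold=0.3)
root = Node("root", children=[science, arts])

print(classify_top_down("quantum particle theory", root, keyword_score))
```

In this sketch a document that does not reach the threshold of any child remains assigned to the internal node where the descent stopped, which is one possible reading of why thresholds matter in a hierarchical setting.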