WebClassIII is the latest version of WebClass and implements a hierarchical text categorization framework.
The hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning, and classification of a new document. The main innovative aspects concern the feature selection method, the automated threshold determination for classification scores, and new measures for the evaluation of system performance. Methodological novelties concern the three learning methods tested in this work, namely centroid-based, naïve Bayes and SVM. The hierarchies considered in WebClassIII are quite general and may include internal nodes with no associated documents or with only a single child.
WebClassIII is a prototypical Java workbench for experimenting with the application of statistical and case-based reasoning methods to the automatic hierarchical classification of Web pages.
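To make the kind of hierarchy described above concrete, the following is a minimal sketch of a category tree in which an internal node may have no documents of its own and may have only one child. The class name CategoryNode, its methods, and the category and file names are hypothetical and chosen for illustration only; they are not taken from the WebClassIII source code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical category node; not the actual WebClassIII class. */
class CategoryNode {
    final String name;
    // Documents directly associated with this category (may stay empty).
    final List<String> documents = new ArrayList<>();
    // Sub-categories (may contain a single child, or none for a leaf).
    final List<CategoryNode> children = new ArrayList<>();

    CategoryNode(String name) {
        this.name = name;
    }

    /** Creates a child category, attaches it to this node, and returns it. */
    CategoryNode addChild(String childName) {
        CategoryNode child = new CategoryNode(childName);
        children.add(child);
        return child;
    }

    /** Counts the documents in this node and in all of its descendants. */
    int subtreeDocumentCount() {
        int total = documents.size();
        for (CategoryNode child : children) {
            total += child.subtreeDocumentCount();
        }
        return total;
    }

    /** Builds a tiny example with made-up category and document names. */
    static CategoryNode example() {
        CategoryNode root = new CategoryNode("Science");
        // Internal node with no documents and exactly one child.
        CategoryNode internal = root.addChild("Physics");
        CategoryNode leaf = internal.addChild("Optics");
        leaf.documents.add("doc1.html");
        leaf.documents.add("doc2.html");
        return root;
    }

    public static void main(String[] args) {
        CategoryNode root = example();
        System.out.println(root.name + " subtree documents: "
                + root.subtreeDocumentCount());
    }
}
```

In this sketch an empty internal node still contributes to the structure of the tree, so any per-node processing has to tolerate categories whose only documents live in their descendants.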
Platform: any Java 2 (or higher) enabled platform. Since WebClassIII interfaces with an MS Access database, we recommend installing it on a Windows machine so that the whole system runs on a single machine.
Installation Procedure & Testing
Download the application (.zip file) here.
the .jar file.
Download an empty database (.zip file) here. (UserID: dmoz_hier; Password: prova)
o Unzip the file bin.zip into the application folder.
o Create an ODBC connection to the database. The name of the ODBC connection must be “Web”.
o Run WebClassIII by double-clicking the webclassIII.bat file.
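Before launching the application, it can be useful to check that the ODBC connection named “Web” is reachable from Java. The sketch below assumes the JDBC-ODBC bridge driver (sun.jdbc.odbc.JdbcOdbcDriver), which shipped with Java up to version 7 and was removed in Java 8. How WebClassIII itself opens its connection is not documented here, so the class OdbcConnectionCheck and its methods are hypothetical helpers, not part of the application.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Hypothetical helper for testing the ODBC data source; not part of WebClassIII. */
class OdbcConnectionCheck {
    // Name of the ODBC connection required by the installation steps.
    static final String DSN = "Web";

    /** Builds the JDBC-ODBC bridge URL for the given data source name. */
    static String jdbcUrl(String dsn) {
        return "jdbc:odbc:" + dsn;
    }

    /** Returns true if a connection to the data source can be opened. */
    static boolean canConnect(String dsn, String user, String password) {
        try {
            // The bridge driver is only present in Java 7 and earlier.
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        } catch (ClassNotFoundException e) {
            System.err.println("JDBC-ODBC bridge not available in this JVM.");
            return false;
        }
        try (Connection connection =
                 DriverManager.getConnection(jdbcUrl(dsn), user, password)) {
            return connection != null;
        } catch (SQLException e) {
            System.err.println("Connection failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Credentials are passed on the command line: <userId> <password>.
        if (args.length != 2) {
            System.err.println("Usage: java OdbcConnectionCheck <userId> <password>");
            return;
        }
        boolean ok = canConnect(DSN, args[0], args[1]);
        System.out.println(ok ? "ODBC connection \"Web\" is reachable."
                              : "ODBC connection \"Web\" is NOT reachable.");
    }
}
```

The user ID and password to pass are the ones listed next to the database being used.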
To run WebClassIII, you must first import Web pages into the database.
Alternatively, you can run WebClassIII on non-empty databases:
-- Yahoo Science database (UserID: default; Password: prova)
It is obtained from the documents referenced in the Yahoo! Search Directory. We extracted all 907 actual Web documents referenced at the top three levels of the Web directory http://dir.yahoo.com/Science. Empty documents and documents containing only scripts have been removed.
There are 6 categories at the first level, 27 categories at the second level and 35 categories at the third level.
Documents were downloaded on 15 July 2003. The dataset is electronically available at http://lacam.di.uniba.it:8000/phd/micFiles/yahoo_science_docs.zip
-- Dmoz database (UserID: dmoz_hier; Password: prova)
It is obtained from the documents referenced by the Open Directory Project (ODP) (http://www.dmoz.org/). We extracted all actual Web documents referenced at the top five levels of the Web directory rooted in the branch Health\Conditions_and_Diseases\. Empty documents and documents containing only scripts have been removed.
The dataset contains 5,612 documents in 221 categories organized in a five-level hierarchy as follows:
· In the first level there are 21 categories and 340 documents.
· In the second level there are 81 categories and 1,514 documents.
· In the third level there are 85 categories and 2,604 documents.
· In the fourth level there are 32 categories and 1,099 documents.
· In the fifth level there are 2 categories and 55 documents.
Documents were downloaded in April 2004. The dataset is electronically available at http://lacam.di.uniba.it:8000/phd/micFiles/dmoz_health_conditions_and_diseases_docs.zip.
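The per-level figures listed for the Dmoz dataset can be cross-checked against the stated totals with a few lines of Java. The class name DmozLevelTotals is hypothetical; the numbers are copied from the list above.

```java
/** Hypothetical helper that sums the per-level Dmoz figures listed above. */
class DmozLevelTotals {
    // Categories and documents at levels 1 to 5, as listed on this page.
    static final int[] CATEGORIES = {21, 81, 85, 32, 2};
    static final int[] DOCUMENTS = {340, 1514, 2604, 1099, 55};

    /** Returns the sum of all values in the array. */
    static int sum(int[] values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        // Prints: 221 categories, 5612 documents
        System.out.println(sum(CATEGORIES) + " categories, "
                + sum(DOCUMENTS) + " documents");
    }
}
```

The sums match the 221 categories and 5,612 documents reported for the dataset.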
We also performed experiments on the RCV1 (Reuters Corpus Volume 1) dataset. However, due to copyright restrictions, this dataset cannot be made available on this web site. For further information, do not hesitate to contact us.
None yet available. Send all requests/comments to: Michelangelo Ceci.
Last modified 01/10/2004