WebClassIII
Short Description
WebClassIII is the latest version of WebClass and implements a hierarchical text categorization framework. The hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning, and classification of a new document. The main innovative aspects concern the feature selection method, the automated determination of thresholds for classification scores, and new measures for evaluating system performance. Methodological novelties concern the three learning methods tested in this work, namely centroid-based, naïve Bayes and SVM. The hierarchies considered in WebClassIII are quite general and include internal nodes with no associated document or with only a single child.
WebClassIII is a prototypical Java workbench for experimenting with the application of Statistical and Case-Based Reasoning methods to the automatic hierarchical classification of Web pages.
System Requirements
Platform: Java 2 (or higher) enabled platform. Since WebClassIII interfaces with an MS Access database, we recommend installing it on a Windows machine so that the whole system can run on a single machine.
Installation Procedure & Testing
Download the application (.zip file) here. the .jar file.
Download an empty database (.zip file) here. (UserID: dmoz_hier; Password: prova)
Installation procedure:
o Unzip the file bin.zip into the application folder.
o Create an ODBC connection to the database. The name of the ODBC connection must be: “Web”.
o Run WebClassIII by double-clicking on the webclassIII.bat file.
In order to run WebClassIII, you should import web pages into the database. Alternatively, you can run WebClassIII on non-empty databases:
-- Yahoo Science database (UserID: default; Password: prova)
It is obtained from the documents referenced in the Yahoo! Search Directory.
We extracted all 907 actual Web documents referenced at the top three levels of the Web directory http://dir.yahoo.com/Science. Empty documents and documents containing only scripts have been removed. There are 6 categories at the first level, 27 categories at the second level and 35 categories at the third level. Documents were downloaded on the 15th of July 2003. The dataset is electronically available at http://lacam.di.uniba.it:8000/phd/micFiles/yahoo_science_docs.zip
-- Dmoz database (UserID: dmoz_hier; Password: prova)
It is obtained from the documents referenced by the Open Directory Project (ODP) (http://www.dmoz.org/). We extracted all actual Web documents referenced at the top five levels of
the Web directory rooted in the branch Health\Conditions_and_Diseases\. Empty documents and documents containing only scripts have been removed. The dataset contains 5,612 documents in 221 categories organized in a five-level hierarchy as follows:
· In the first level there are 21 categories and 340 documents.
· In the second level there are 81 categories and 1,514 documents.
· In the third level there are 85 categories and 2,604 documents.
· In the fourth level there are 32 categories and 1,099 documents.
· In the fifth level there are 2 categories and 55 documents.
Documents were downloaded in April 2004. The dataset is electronically available at http://lacam.di.uniba.it:8000/phd/micFiles/dmoz_health_conditions_and_diseases_docs.zip
We also performed experiments on the RCV1 (Reuters Corpus Volume 1) dataset. However, due to copyright restrictions, this dataset cannot be made available on this web site. For further information, do not hesitate to contact us.
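The per-level figures reported above for the Dmoz dataset can be cross-checked against the stated totals (221 categories and 5,612 documents). The following is only a minimal Java sketch of that check: the class name DmozLevelStats and its members are hypothetical and are not part of WebClassIII, and the numbers are simply copied from the description above.

```java
// Illustrative sketch only: DmozLevelStats is a hypothetical helper,
// not a class shipped with WebClassIII.
final class DmozLevelStats {
    // Per-level figures as reported for the Dmoz dataset (levels 1 to 5).
    static final int[] CATEGORIES = {21, 81, 85, 32, 2};
    static final int[] DOCUMENTS = {340, 1514, 2604, 1099, 55};

    // Sums all entries of an array of per-level counts.
    static int sum(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        for (int i = 0; i < CATEGORIES.length; i++) {
            System.out.println("Level " + (i + 1) + ": " + CATEGORIES[i]
                    + " categories, " + DOCUMENTS[i] + " documents");
        }
        // Totals to compare with the figures stated for the whole dataset.
        System.out.println("Total categories: " + sum(CATEGORIES));
        System.out.println("Total documents: " + sum(DOCUMENTS));
    }
}
```

A similar check could be run after importing a dataset into the database, to confirm that no level of the hierarchy was lost during the import.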
FAQs
None yet available.
Send all requests/comments to: Michelangelo Ceci.
Last modified
01/10/2004