
The block classification in WISDOM++

Blocks determined by the RLSA (Run Length Smoothing Algorithm) during the segmentation step enclose different types of content. In order to separate text from graphics, the type of content has to be recognized. This recognition task is entrusted to the so-called block classifiers.
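As an illustration, the following Python sketch shows the horizontal pass of a generic RLSA: white gaps shorter than a threshold are smeared into black, so that characters and words merge into solid blocks. This is a minimal sketch, not the exact WISDOM++ implementation; the threshold value and the treatment of runs at the row borders are assumptions, and the full algorithm also applies a vertical pass and combines the two results.

    import numpy as np

    def rlsa_horizontal(img, threshold):
        # img: 2-D array with 1 = black pixel, 0 = white pixel.
        # Fill every white gap between two black pixels whose length
        # is at most `threshold` (illustrative variant of the RLSA).
        out = img.copy()
        for row in out:
            run_start = None
            for j, pix in enumerate(row):
                if pix == 0 and run_start is None:
                    run_start = j
                elif pix == 1 and run_start is not None:
                    if j - run_start <= threshold:
                        row[run_start:j] = 1
                    run_start = None
        return out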

In WISDOM++, block classifiers are automatically built by means of inductive learning techniques, namely decision tree induction. The choice of a tree-based method over the more common generalized linear models is due to its inherent flexibility: decision trees can handle complicated interactions among features and give results that are easy to interpret.
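As a rough illustration of the approach (not the actual C4.5/ITI implementation used by WISDOM++), the sketch below trains scikit-learn's CART-style decision tree on feature vectors of the kind described in the table below; the numeric values are invented for the example.

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical pre-classified blocks, described by the eleven numeric
    # attributes listed in the table below (height, length, area, eccen,
    # blackpix, bw_trans, pblack, mean_tr, f1, f2, f3).
    X = [
        [20, 400, 8000, 20.0, 3200, 800, 0.40, 4.0, 0.60, 2.0, 1.1],      # text
        [200, 180, 36000, 0.9, 27000, 450, 0.75, 60.0, 0.10, 9.0, 25.0],  # picture
        [3, 500, 1500, 166.7, 1450, 3, 0.97, 483.3, 0.01, 400.0, 900.0],  # line
    ]
    y = ["text", "picture", "horizontal line"]

    clf = DecisionTreeClassifier().fit(X, y)
    print(clf.predict([[22, 380, 8360, 17.3, 3100, 760, 0.37, 4.1, 0.58, 2.1, 1.2]]))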

WISDOM++ discriminates between text blocks, horizontal lines, vertical lines, pictures (half-tone images), and graphics (e.g., line drawings).

Each block is represented by a set of numeric attribute-value pairs:

Attribute   Definition
Height      Height of the reduced block image.
Length      Length of the reduced block image.
Area        Area of the reduced block image (height*length).
Eccen       Eccentricity of the reduced block image (length/height).
Blackpix    Total number of black pixels in the reduced block image.
Bw_trans    Total number of black-white transitions in all rows of the reduced block image.
Pblack      Percentage of black pixels in the reduced block image (blackpix/area).
Mean_tr     Average number of black pixels per black-white transition (blackpix/bw_trans).
F1          Short run emphasis.
F2          Long run emphasis.
F3          Extra long run emphasis.
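A minimal sketch of how such a feature vector could be extracted from a reduced block image is given below. The run-emphasis formulas follow standard run-length texture statistics; the fourth-power weighting used here for F3 is an assumption, since the source does not spell out its definition.

    import numpy as np

    def black_runs(row):
        # Lengths of the maximal runs of black (1) pixels in one row.
        runs, n = [], 0
        for pix in row:
            if pix:
                n += 1
            elif n:
                runs.append(n)
                n = 0
        if n:
            runs.append(n)
        return runs

    def block_features(block):
        # block: 2-D array with 1 = black, 0 = white (reduced block image).
        # Assumes the block contains at least one black pixel.
        height, length = block.shape
        area = height * length
        blackpix = int(block.sum())
        runs = [r for row in block for r in black_runs(row)]
        bw_trans = len(runs)  # one black-to-white transition per black run
        n = len(runs)
        return {
            "height": height,
            "length": length,
            "area": area,
            "eccen": length / height,
            "blackpix": blackpix,
            "bw_trans": bw_trans,
            "pblack": blackpix / area,
            "mean_tr": blackpix / bw_trans,
            "f1": sum(1 / r**2 for r in runs) / n,  # short run emphasis
            "f2": sum(r**2 for r in runs) / n,      # long run emphasis
            "f3": sum(r**4 for r in runs) / n,      # assumed extra long run emphasis
        }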

In a preliminary study, a decision tree was induced using a set of 5,473 examples of pre-classified blocks obtained from 53 documents of various kinds. An average accuracy above 97% (Esposito et al., 1997) was obtained with an extension of C4.5 (Quinlan, 1993). However, this learning system operates in batch mode: it is not possible to revise the decision tree when some blocks are misclassified, unless a new tree is generated from scratch using an extended training set.
In order to give the user the possibility of training WISDOM++ on-line, we considered the incremental decision tree learning system ITI (Utgoff, 1989), which is the only one able to handle numerical attributes such as those used to describe the blocks.
ITI can operate in three different ways:
 
Operational mode   Behaviour of the decision tree learning system
Batch              Builds a new tree from all available training examples.
Normal             Updates the frequency counts associated with each node of the tree as soon as a new instance is input, then restructures the decision tree according to the updated counts.
Error-correction   Analogous to the normal mode, except that the frequency counts are updated only when the new instance is misclassified.

The main difference between the two incremental modes is that the former is guaranteed to build the same decision tree independently of the order in which examples are presented, while the latter is not.
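ITI itself is not available as an off-the-shelf Python component; the sketch below uses the Hoeffding tree from the river library as a hypothetical stand-in, purely to illustrate the difference between the normal and error-correction update policies.

    from river import tree

    def train_incremental(stream, error_correction=False):
        # stream yields (features_dict, label) pairs, one block at a time.
        model = tree.HoeffdingTreeClassifier()
        for x, y in stream:
            y_pred = model.predict_one(x)  # None until the model has seen data
            if not error_correction or y_pred != y:
                # Normal mode: learn from every example.
                # Error-correction mode: learn only from misclassified ones.
                model.learn_one(x, y)
        return model

    blocks = [
        ({"eccen": 20.0, "pblack": 0.40, "mean_tr": 4.0}, "text"),
        ({"eccen": 0.9, "pblack": 0.75, "mean_tr": 60.0}, "picture"),
    ]
    model = train_incremental(blocks, error_correction=True)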

In order to test the performance of ITI as a block classifier, a set of 112 real, single-page documents was considered as input data. Three experiments were organized: ITI was trained in the batch mode in the first, in the pure error-correction mode in the second, and in a mixed normal/error-correction mode in the third.
The main characteristics of the learned trees are reported in the table below. The last column refers to the number of examples stored in the nodes of the induced trees.
 

Mode                    Size (Kb)   No. nodes   No. leaves   No. incorporated examples
Batch                   24,320      229         114          9,429
Pure error-correction   982         159         79           277
Mixed                   13,150      235         117          4,545+125

The decision tree built in the batch mode takes more than 24 Mb, since all instances have to be stored, while the decision tree obtained in the error-correction mode requires only 982 Kb. Nevertheless, this difference in tree size does not correspond to a significant difference in predictive accuracy, as estimated on an independent test set. This justifies the use of the decision tree built according to the error-correction operational mode within WISDOM++ 2.0.
Experimental results are reported in Altamura et al. (1999).
 
