
The block classification in WISDOM++

Blocks determined by the RLSA (Run Length Smoothing Algorithm) during the segmentation step enclose different types of content. In order to separate text from graphics, the type of content has to be recognized. This recognition task is entrusted to the so-called block classifiers.
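As an illustration, the following Python sketch shows the horizontal pass of a generic RLSA: white gaps shorter than a threshold are smeared into black, so that characters and words merge into solid blocks. This is a minimal sketch, not the exact WISDOM++ implementation; the threshold value and the treatment of runs at the row borders are assumptions, and the full algorithm also applies a vertical pass and combines the two results.

    import numpy as np

    def rlsa_horizontal(img, threshold):
        # img: 2-D array with 1 = black pixel, 0 = white pixel.
        # Fill every white gap between two black pixels whose length
        # is at most `threshold` (illustrative variant of the RLSA).
        out = img.copy()
        for row in out:
            run_start = None
            for j, pix in enumerate(row):
                if pix == 0 and run_start is None:
                    run_start = j
                elif pix == 1 and run_start is not None:
                    if j - run_start <= threshold:
                        row[run_start:j] = 1
                    run_start = None
        return out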

In WISDOM++, block classifiers are automatically built by means of inductive learning techniques, namely decision tree induction. The choice of a tree-based method over the more common generalized linear models is due to its inherent flexibility: decision trees can handle complicated interactions among features and give results that are easy to interpret.
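As a rough illustration of the approach (not the actual C4.5/ITI implementation used by WISDOM++), the sketch below trains scikit-learn's CART-style decision tree on feature vectors of the kind described in the table below; the numeric values are invented for the example.

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical pre-classified blocks, described by the eleven numeric
    # attributes listed in the table below (height, length, area, eccen,
    # blackpix, bw_trans, pblack, mean_tr, f1, f2, f3).
    X = [
        [20, 400, 8000, 20.0, 3200, 800, 0.40, 4.0, 0.60, 2.0, 1.1],      # text
        [200, 180, 36000, 0.9, 27000, 450, 0.75, 60.0, 0.10, 9.0, 25.0],  # picture
        [3, 500, 1500, 166.7, 1450, 3, 0.97, 483.3, 0.01, 400.0, 900.0],  # line
    ]
    y = ["text", "picture", "horizontal line"]

    clf = DecisionTreeClassifier().fit(X, y)
    print(clf.predict([[22, 380, 8360, 17.3, 3100, 760, 0.37, 4.1, 0.58, 2.1, 1.2]]))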

WISDOM++ discriminates between text blocks, horizontal lines, vertical lines, pictures (half-tone images), and graphics (e.g., line drawings).

Each block is represented by a set of numeric attribute-value pairs:

Attribute   Definition
Height      Height of the reduced block image.
Length      Length of the reduced block image.
Area        Area of the reduced block image (height*length).
Eccen       Eccentricity of the reduced block image (length/height).
Blackpix    Total number of black pixels in the reduced block image.
Bw_trans    Total number of black-white transitions in all rows of the reduced block image.
Pblack      Percentage of black pixels in the reduced block image (blackpix/area).
Mean_tr     Average number of black pixels per black-white transition (blackpix/bw_trans).
F1          Short run emphasis.
F2          Long run emphasis.
F3          Extra long run emphasis.
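A minimal sketch of how such a feature vector could be extracted from a reduced block image is given below. The run-emphasis formulas follow standard run-length texture statistics; the fourth-power weighting used here for F3 is an assumption, since the source does not spell out its definition.

    import numpy as np

    def black_runs(row):
        # Lengths of the maximal runs of black (1) pixels in one row.
        runs, n = [], 0
        for pix in row:
            if pix:
                n += 1
            elif n:
                runs.append(n)
                n = 0
        if n:
            runs.append(n)
        return runs

    def block_features(block):
        # block: 2-D array with 1 = black, 0 = white (reduced block image).
        # Assumes the block contains at least one black pixel.
        height, length = block.shape
        area = height * length
        blackpix = int(block.sum())
        runs = [r for row in block for r in black_runs(row)]
        bw_trans = len(runs)  # one black-to-white transition per black run
        n = len(runs)
        return {
            "height": height,
            "length": length,
            "area": area,
            "eccen": length / height,
            "blackpix": blackpix,
            "bw_trans": bw_trans,
            "pblack": blackpix / area,
            "mean_tr": blackpix / bw_trans,
            "f1": sum(1 / r**2 for r in runs) / n,  # short run emphasis
            "f2": sum(r**2 for r in runs) / n,      # long run emphasis
            "f3": sum(r**4 for r in runs) / n,      # assumed extra long run emphasis
        }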

In a preliminary study, a decision tree was induced using a set of 5,473 examples of pre-classified blocks obtained from 53 documents of various kinds. An average accuracy above 97% (Esposito et al., 1997) was obtained with an extension of C4.5 (Quinlan, 1993). However, this learning system operates in batch mode: it is not possible to revise the decision tree when some blocks are misclassified, unless a new tree is generated from scratch using an extended training set.
In order to give the user the possibility of training WISDOM++ on-line, we considered the incremental decision tree learning system ITI (Utgoff, 1989), which is the only one able to handle numerical attributes such as those used to describe the blocks.
ITI can operate in three different ways:
 
Operational mode   Behaviour of the decision tree learning system
Batch              Builds a new tree from all available training examples.
Normal             Updates the frequency counts associated with each node of the tree as soon as a new instance is input, then restructures the decision tree according to the updated counts.
Error-correction   Analogous to the normal mode, except that the frequency counts are updated only when the new instance is misclassified.

The main difference between the two incremental modes is that the former is guaranteed to build the same decision tree independently of the order in which examples are presented, while the latter is not.
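ITI itself is not available as an off-the-shelf Python component; the sketch below uses the Hoeffding tree from the river library as a hypothetical stand-in, purely to illustrate the difference between the normal and error-correction update policies.

    from river import tree

    def train_incremental(stream, error_correction=False):
        # stream yields (features_dict, label) pairs, one block at a time.
        model = tree.HoeffdingTreeClassifier()
        for x, y in stream:
            y_pred = model.predict_one(x)  # None until the model has seen data
            if not error_correction or y_pred != y:
                # Normal mode: learn from every example.
                # Error-correction mode: learn only from misclassified ones.
                model.learn_one(x, y)
        return model

    blocks = [
        ({"eccen": 20.0, "pblack": 0.40, "mean_tr": 4.0}, "text"),
        ({"eccen": 0.9, "pblack": 0.75, "mean_tr": 60.0}, "picture"),
    ]
    model = train_incremental(blocks, error_correction=True)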

In order to test the performance of ITI as a block classifier, a set of 112 real, single-page documents was considered as input data. Three experiments were organized: ITI was trained in the batch mode in the first, in the pure error-correction mode in the second, and in a mixed normal/error-correction mode in the third.
The main characteristics of the learned trees are reported in the table below. The last column refers to the number of examples stored in the nodes of the induced trees.
 

Mode                    Size (Kb)   No. nodes   No. leaves   No. incorporated examples
Batch                   24,320      229         114          9,429
Pure error-correction   982         159         79           277
Mixed                   13,150      235         117          4,545+125

The decision tree built in the batch mode takes more than 24 Mb, since all instances have to be stored, while the decision tree obtained in the error-correction mode requires only 982 Kb. Nevertheless, this difference in tree size does not correspond to a significant difference in predictive accuracy, as estimated on an independent test set. This justifies the use of the decision tree built according to the error-correction operational mode within WISDOM++ 2.0.
Experimental results are reported in Altamura et al. (1999).
 
