The representation problem
In order to apply machine learning techniques to induce rules for layout-based document classification and understanding, it is important to define a suitable representation of the layout structure of training documents.Describing the layout, at any level of the hierarchy, is not straightforward since the layout components are spatially related and the feature-vector representation adopted for the block classification problem can neither express these relations nor solve the related pattern recognition problems.
We preferred to describe both documents and rules by means of a first-order language.
In this language, unary function symbols, called attributes,are used to describe properties of a single layout component (e.g., height and length), while binary predicate and function symbols, called relations, are used to express spatial relationships among layout components (e.g., part-of and on-top). The semantics of both attributes and relations is reported in (Esposito et al., 1994).
The following table lists the descriptors together with their definition.
|width(block)||Real domain (1..640)|
|height(block)||Real domain (1..875)|
|x_pos_centre(block)||Real domain (1..640)|
|y_pos_centre(block)||Real domain (1..875)|
|type_of(block)||Nominal domain: text, hor_line, image, ver_line, graphic, mixed|
|part_of(block1,block2)||Boolean domain: true if block1 contains block2|
|on_top(block1,block2)||Boolean domain: true if block1 is above block2|
|to_right(block1,block2)||Boolean domain: true if block2 is to the right of block1|
|alignment(block1,block2)||Nominal domain: only_left_col, only_right_col, only_middle_col, only_upper_row,only_lower_row, only_middle_row|
|image_lenght(doc)||Real domain (1,4000)|
|image_width(doc)||Real domain (1,3000)|
|class(doc)||Nominal domain: domain-dependent class name|
Example Rules for the classification of ICML95 documents
class(X) = icml ¬part_of(X,Y), on_top(W,Z), on_top(Y,U),to_right(Y,V), alignment(Z,U)=only_left_col, alignment(V,Y)=only_middle_col
class(X) = icml ¬part_of(X,Y), x_pos_centre(Y)Î[301 .. 557],y_pos_centre(Y)Î[25 ..190],on_top(V,Y), on_top(Z,V), on_top(W,Y)
Example Rules for the understanding of ICML95 documents
page_number(X)=true ¬width(X) Î[8 ..16], height (X) Î[7.. 8]
title(X)=true ¬height(X) Î[13 .. 31], x_pos_centre(X) Î[280 .. 348]
author(X)=true ¬height(X) Î[42..79], y_pos_centre(X) Î[173 .. 279]
abstract(X)=true ¬y_pos_centre(X)Î[256 .. 526], on_top(Y,X), to_right(X,Z)
abstract(X)=true ¬x_pos_centre(X)Î[147.. 218], on_top(X,Y), to_right(Y,Z)
body(X)=true ¬width(X) Î[242..255], type_of(X)=text
body(X)=true ¬x_pos_centre(X) Î[368..477], on_top(Z,X), to_right(Y,Z)
body(X)=true ¬width(X) Î[78 .. 136], type_of(X)=text, on_top(X,Y), to_right(Y,Z)
body(X)=true ¬width(X) Î[237.. 255], alignment(Y,X)=only_right_col
body(X)=true ¬height(X) Î[422 .. 519]
Actually, learning rules for document understanding is more difficult than learning rules for document classification, since logical components refer to a part of the document rather than to the whole document. Indeed, it is noteworthy that often the system is able to discriminate a document belonging to a class rather than to another simply considering the information provided by the descriptors related to the whole page dimensions (image_width, image_lenght). Document understanding complexity is due to the fact that logical components may be related to each other, as in the case of standard scientific papers, where the author's affiliation is above the abstract and under the title. Thus, it would be more appropriate to learn rules that reflect these dependencies among logical components.
For instance, for ICML95 documents the following rule:
captures the fact that a layout component to the right of an abstract is a body. Indeed, ATRE is able to learn recursive theories from examples, where the recursive theory learning task is considered equivalent to the task of learning multiple dependent concept definitions. Experimental results confirm that by taking into account concept dependencies it is possible to improve the predictive accuracy for the document understanding problem (Malerba et al., 2001). Another degree of complexity in the document understanding task is due to the fact that the presence of some logical components may depend on the order page (e.g. author is in the first page). At this aim, the specification of the following domain specific knowledge permits to automatically associate information on page order to layout components: