Document classification and understanding

The document classification and understanding in WISDOM++

In many real applications, documents to be processed can be grouped into classes such that documents in the same class have a similar layout structure. Therefore, it is reasonable to perform a documen tclassification and understanding based on the within-class similarities and between-class differences as regards the layout structure.

The representation problem
In order to apply machine learning techniques to induce rules for layout-based document classification and understanding, it is important to define a suitable representation of the layout structure of training documents.Describing the layout, at any level of the hierarchy, is not straightforward since the layout components are spatially related and the feature-vector representation adopted for the block classification problem can neither express these relations nor solve the related pattern recognition problems.
We preferred to describe both documents and rules by means of a first-order language.
In this language, unary function symbols, called attributes,are used to describe properties of a single layout component (e.g., height and length), while binary predicate and function symbols, called relations, are used to express spatial relationships among layout components (e.g., part-of and on-top). The semantics of both attributes and relations is reported in (Esposito et al., 1994).
The following table lists the descriptors together with their definition.

Descriptor name Definition

width(block) Real domain (1..640)

height(block) Real domain (1..875)

x_pos_centre(block) Real domain (1..640)

y_pos_centre(block) Real domain (1..875)

type_of(block) Nominal domain: text, hor_line, image, ver_line, graphic, mixed

part_of(block1,block2) Boolean domain: true if block1 contains block2

on_top(block1,block2) Boolean domain: true if block1 is above block2

to_right(block1,block2) Boolean domain: true if block2 is to the right of block1

alignment(block1,block2) Nominal domain: only_left_col, only_right_col, only_middle_col, only_upper_row,only_lower_row, only_middle_row
image_lenght(doc) Real domain (1,4000)

image_width(doc) Real domain (1,3000)

class(doc) Nominal domain: domain-dependent class name

Given the description of the layout components of a set of documents by means of such a first-order language, the learning system embedded in WISDOM++, namely ATRE, generates rules for both document classification and understanding. ATRE is an inductive logic programming system that can learn logic theories from examples and which is able to handle symbolic as well as numerical descriptors. A full description of ATRE is reported in Malerba, 2003.
As concern the output of the system, it consists of a set of rules formulated as follows:

H ¬L₁, L₂, .., L_mwhere the conclusion part H(called head) represents either a document class or a logical component,while the premise part L₁,L₂, .., L_m (called body) is a conjunction of conditions involving either attributes or relations.
It is noteworthy that such rules express some spatial relationships between layout components, thus confirming the importance of adopting a first-order representation for document classification and understanding. Examples of learned rules are reported below.

Example Rules for the classification of ICML95 documents

class(X) = icml ¬part_of(X,Y), on_top(W,Z), on_top(Y,U),to_right(Y,V), alignment(Z,U)=only_left_col, alignment(V,Y)=only_middle_col
class(X) = icml ¬part_of(X,Y), x_pos_centre(Y)Î[301 .. 557],y_pos_centre(Y)Î[25 ..190],on_top(V,Y), on_top(Z,V), on_top(W,Y)

Example Rules for the understanding of ICML95 documents

page_number(X)=true ¬width(X) Î[8 ..16], height (X) Î[7.. 8]

title(X)=true ¬height(X) Î[13 .. 31], x_pos_centre(X) Î[280 .. 348]

author(X)=true ¬height(X) Î[42..79], y_pos_centre(X) Î[173 .. 279]

abstract(X)=true ¬y_pos_centre(X)Î[256 .. 526], on_top(Y,X), to_right(X,Z)
abstract(X)=true ¬x_pos_centre(X)Î[147.. 218], on_top(X,Y), to_right(Y,Z)

body(X)=true ¬width(X) Î[242..255], type_of(X)=text
body(X)=true ¬x_pos_centre(X) Î[368..477], on_top(Z,X), to_right(Y,Z)
body(X)=true ¬width(X) Î[78 .. 136], type_of(X)=text, on_top(X,Y), to_right(Y,Z)
body(X)=true ¬width(X) Î[237.. 255], alignment(Y,X)=only_right_col
body(X)=true ¬height(X) Î[422 .. 519]

Actually, learning rules for document understanding is more difficult than learning rules for document classification, since logical components refer to a part of the document rather than to the whole document. Indeed, it is noteworthy that often the system is able to discriminate a document belonging to a class rather than to another simply considering the information provided by the descriptors related to the whole page dimensions (image_width, image_lenght). Document understanding complexity is due to the fact that logical components may be related to each other, as in the case of standard scientific papers, where the author's affiliation is above the abstract and under the title. Thus, it would be more appropriate to learn rules that reflect these dependencies among logical components.
For instance, for ICML95 documents the following rule:

body(X)=true ¬to_right(Y,X), abstract(Y)=true

captures the fact that a layout component to the right of an abstract is a body. Indeed, ATRE is able to learn recursive theories from examples, where the recursive theory learning task is considered equivalent to the task of learning multiple dependent concept definitions. Experimental results confirm that by taking into account concept dependencies it is possible to improve the predictive accuracy for the document understanding problem (Malerba et al., 2001). Another degree of complexity in the document understanding task is due to the fact that the presence of some logical components may depend on the order page (e.g. author is in the first page). At this aim, the specification of the following domain specific knowledge permits to automatically associate information on page order to layout components:

at_page(X,first) ¬part_of(Y,X), page(Y,first)
at_page(X,intermediate) ¬part_of(Y,X),page(Y,intermediate)
at_page(X,last_but_one) ¬part_of(Y,X), page(Y,last_but_one)
at_page(X,last) ¬part_of(Y,X), page(Y,last).

Bibliography

D. Malerba (2003): Learning Recursive Theories in the Normal ILP Setting, Fundamenta Informaticae, 57(1) 39-77.

D. Malerba, F. Esposito, F.A. Lisi & O. Altamura (2001). Automated Discovery of Dependencies Between Logical Components in Document Image Understanding, Proceedings of the Sixth International Conference on Document Analysis and Recognition, 174-178, IEEE Computer Society Press, Los Vaqueros, CA.

F. Esposito, D. Malerba & G. Semeraro (1994). Multistrategy Learning for Document Recognition. Applied Artificial Intelligence 8(1),33-84.

Back to home page

berardi@di.uniba.it

Descriptor name	Definition
width(block)	Real domain (1..640)
height(block)	Real domain (1..875)
x_pos_centre(block)	Real domain (1..640)
y_pos_centre(block)	Real domain (1..875)
type_of(block)	Nominal domain: text, hor_line, image, ver_line, graphic, mixed
part_of(block1,block2)	Boolean domain: true if block1 contains block2
on_top(block1,block2)	Boolean domain: true if block1 is above block2
to_right(block1,block2)	Boolean domain: true if block2 is to the right of block1
alignment(block1,block2)	Nominal domain: only_left_col, only_right_col, only_middle_col, only_upper_row,only_lower_row, only_middle_row
image_lenght(doc)	Real domain (1,4000)
image_width(doc)	Real domain (1,3000)
class(doc)	Nominal domain: domain-dependent class name