Tutorial: Learning in Document Analysis and Understanding
Intelligent document processing techniques require a large amount of knowledge about both the layout structure and the logical structure of the target classes of publications. Many successful document processing systems have been developed by hand-coding the necessary knowledge in the form of syntactic grammars, rules, geometric trees, or sentences of a form definition language. However, each class of publications follows its own predetermined layout conventions, so changing the target publications often requires a great deal of manual labor. Machine learning and statistics offer a number of methods, techniques, and tools that can be profitably applied to automating the acquisition of document knowledge. Some interesting applications are already known for specific document processing tasks, such as text/graphics separation, layout structure extraction, logical structure understanding, document classification, and text categorization.
The goal of this tutorial is to provide attendees with an introductory overview of machine learning and statistical methods suitable for document analysis and document understanding. The tutorial is organized in two parts of roughly equal length. The first part is devoted to methods for attribute-based representations, namely linear classifiers, nonparametric methods, and decision trees. The second part concentrates on machine learning methods for structural representations, namely learning sets of first-order rules.
For use in practical applications, pointers to the available public domain systems will be provided and discussed. In addition, the most significant attempts to apply learning techniques in document analysis and understanding will be presented.
The tutorial is aimed at researchers and graduate students in the field of pattern recognition, in particular those specializing in document image processing, document image analysis, and map, text, and document understanding.
General knowledge of pattern recognition and document processing is assumed; introductory knowledge of machine learning and statistics is beneficial but not required. Some familiarity with logic is helpful for the second part of the tutorial.
Program outline
Duration: 4 hours
Fees (buffet lunch and coffee break included):
Registration | Before July 15 | After July 15 |
IAPR member | US $50 | US $60 |
Non-IAPR member | US $60 | US $70 |
Student | US $40 | US $45 |
Student (from India) | Rs. 1,100 | Rs. 1,100 |
Presenters:
email: esposito@di.uniba.it |
email: malerba@di.uniba.it |
Dipartimento di Informatica, University of Bari, Via Orabona 4, I-70125 Bari, Italy |