A Document Reconstruction System for Transferring Bengali Paper Documents into Rich Text Format
Date
2005-02-02
Publisher
INFLIBNET Centre
Abstract
The transformation of a scanned paper document into an editable form suitable for
further processing such as desktop publishing or archiving in a digital library is a
complex process. It requires solving several problems: document analysis, in which a
Page Layout Analyzer (PLA) acquires knowledge of the document layout, followed
by document recognition, which mainly comprises text recognition by Optical Character
Recognition (OCR). A third important problem is document reconstruction, which
transforms the content into an electronically editable format while keeping
the original layout intact. Core OCR modules exist for several Indian scripts, but no
document reconstruction system is available for these scripts. The document
reconstruction system reported in this paper is the first of its kind for Indian scripts,
and it addresses document reconstruction for Bengali document images. The system
combines the knowledge of document layout extracted by a PLA in a graphical
user interface (GUI) with the results of the text recognition steps performed by OCR to
transform paper documents into Rich Text Format.
Keywords
Indian Scripts, Desktop Publishing, Page Layout Analysis, Optical Character Recognition, Document Reconstruction, Encoding Standard, Indian Language