In previous work we showed that shape descriptor features can be used in Look Up Table (LUT) classifiers to
learn patterns of degradation and correction in historical document images. The algorithm encodes the pixel
neighborhood information effectively using a variant of shape descriptor. However, the generation of the shape
descriptor features was approached in a heuristic manner. In this work, we propose a system that learns the
shape features from the training data set using a neural network, specifically a Multilayer Perceptron (MLP), for
feature extraction. Given that the MLP may be restricted by a limited dataset, we apply a feature selection algorithm to
generalize, and thus improve, the feature set obtained from the MLP. We validate the effectiveness and efficiency
of the proposed approach via experimental results.
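As an illustration of such a pipeline (not the configuration used in the paper), the sketch below trains a small MLP on flattened binary pixel neighbourhoods, treats its hidden-layer activations as learned shape features, and then applies mutual-information-based feature selection; the placeholder data, the 7x7 neighbourhood, the single 32-unit hidden layer, and the choice of selector are all assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Placeholder data: each row is a flattened binary 7x7 neighbourhood,
# each label is the "clean" value of the centre pixel (0 or 1).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 49)).astype(float)
y = rng.integers(0, 2, size=5000)

# Step 1: train an MLP on the raw neighbourhoods.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
mlp.fit(X, y)

# Step 2: treat the hidden-layer activations as learned shape features.
def hidden_features(mlp, X):
    """Forward-propagate to the single hidden layer and return its activations."""
    z = X @ mlp.coefs_[0] + mlp.intercepts_[0]
    return np.maximum(z, 0.0)          # ReLU, the classifier's default activation

F = hidden_features(mlp, X)

# Step 3: select the most informative learned features, which helps a
# small training set generalise.
selector = SelectKBest(mutual_info_classif, k=16).fit(F, y)
F_selected = selector.transform(F)
print(F_selected.shape)                # (5000, 16)
```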
Large degradations in document images impede their readability and substantially deteriorate the performance
of automated document processing systems. Image quality metrics have been defined to correlate with
OCR accuracy. However, such metrics do not always correlate with human perception of image quality. When enhancing
document images with the goal of improving readability, it is important to understand human perception
of quality. The goal of this work is to evaluate human perception of degradation and correlate it to known
degradation parameters and existing image quality metrics. The information captured enables the learning and
estimation of human perception of document image quality.
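A minimal way to quantify such correlations, assuming per-image human ratings alongside a degradation parameter and an automatic quality score (all values below are made-up placeholders, not data from the study), is to compute linear and rank correlation coefficients:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder data: reader opinion scores, a degradation parameter
# (e.g. noise level), and an automatic quality metric for the same images.
human_scores   = np.array([4.5, 4.1, 3.2, 2.8, 2.0, 1.5])
noise_level    = np.array([0.00, 0.05, 0.10, 0.20, 0.35, 0.50])
quality_metric = np.array([0.95, 0.90, 0.74, 0.60, 0.41, 0.30])

# Linear and rank correlation between perception and each candidate predictor.
print(pearsonr(human_scores, noise_level))      # expected strongly negative
print(spearmanr(human_scores, quality_metric))  # expected strongly positive
```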
In previous work we showed that Look Up Table (LUT) classifiers can be trained to learn patterns of degradation
and correction in historical document images. The effectiveness of the classifiers is directly proportional to the
size of the pixel neighborhood they consider. However, the computational cost increases almost exponentially with
the neighborhood size. In this paper, we propose a novel algorithm that encodes the neighborhood information
efficiently using a shape descriptor. Using shape descriptor features, we are able to characterize the pixel
neighborhood of document images with far fewer bits and so obtain an efficient system with significantly
reduced computational cost. Experimental results demonstrate the effectiveness and efficiency of the proposed
approach.
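The paper's actual shape descriptor is not reproduced here; the toy sketch below only illustrates the general idea of replacing a full 5x5 binary neighbourhood (2^25 possible LUT keys) with a much shorter summary that still serves as a LUT index. The quadrant-count encoding and the majority-vote training rule are assumptions made purely for illustration.

```python
import numpy as np

def toy_descriptor(patch):
    """Summarise a 5x5 binary neighbourhood by the centre pixel and the
    black-pixel count in each 2x2 corner block: 1 bit plus four counts in
    0..4, i.e. roughly 13 bits instead of the full 25 bits."""
    assert patch.shape == (5, 5)
    c = int(patch[2, 2])
    counts = [int(patch[:2, :2].sum()), int(patch[:2, 3:].sum()),
              int(patch[3:, :2].sum()), int(patch[3:, 3:].sum())]
    key = c
    for n in counts:
        key = key * 5 + n          # each count has 5 possible values (0..4)
    return key                     # at most 2 * 5**4 = 1250 distinct keys

def train_lut(noisy_patches, clean_centres):
    """Count, per descriptor key, how often the clean centre pixel is black,
    and map each key to the majority clean value."""
    votes = {}
    for patch, label in zip(noisy_patches, clean_centres):
        k = toy_descriptor(patch)
        black, total = votes.get(k, (0, 0))
        votes[k] = (black + int(label), total + 1)
    return {k: int(black * 2 >= total) for k, (black, total) in votes.items()}
```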
The fast evolution of scanning and computing technologies has led to the creation of large collections of scanned paper documents. Examples of such collections include historical collections, legal depositories, medical archives, and business archives. Moreover, in many situations, such as legal litigation and security investigations, scanned collections are being used to facilitate systematic exploration of the data. It is almost always the case that
scanned documents suffer from some form of degradation. Large degradations make documents hard to read and substantially deteriorate the performance of automated document processing systems. Enhancement of degraded document images is normally performed assuming global degradation models. When the degradation is large,
global degradation models do not perform well. In contrast, we propose to estimate local degradation models and
use them in enhancing degraded document images. Using a semi-automated enhancement system we have labeled
a subset of the Frieder diaries collection.1 This labeled subset was then used to train an ensemble classifier. The
component classifiers are based on lookup tables (LUT) in conjunction with an approximate nearest neighbor algorithm. The resulting algorithm is highly efficient. Experimental evaluation results are provided using the Frieder diaries collection.1
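One component classifier of this kind might be organised as in the sketch below: an exact lookup table over previously seen neighbourhood patterns, with a nearest-neighbour fallback for unseen patterns. The class name, the use of scikit-learn's exact nearest-neighbour search (the paper uses an approximate variant), and the majority-vote rule are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class LUTWithNNFallback:
    """Exact LUT over seen neighbourhood patterns; nearest-neighbour fallback
    for patterns never observed during training."""

    def fit(self, patterns, labels):
        # patterns: (n, d) binary array; labels: clean centre-pixel values.
        self.lut = {}
        for p, y in zip(patterns, labels):
            self.lut.setdefault(p.tobytes(), []).append(int(y))
        self._labels = np.asarray(labels)
        self._nn = NearestNeighbors(n_neighbors=1).fit(patterns)
        return self

    def predict_one(self, pattern):
        hit = self.lut.get(pattern.tobytes())
        if hit:                                          # exact LUT hit: majority vote
            return int(sum(hit) * 2 >= len(hit))
        _, idx = self._nn.kneighbors(pattern[None, :])   # fallback for unseen patterns
        return int(self._labels[idx[0, 0]])
```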
Degraded documents are frequently obtained in various situations. Examples of degraded document collections include historical document depositories, documents obtained in legal and security investigations, and legal and medical archives. Degraded document images are hard to read and hard to analyze using computerized techniques. There is hence a need for systems that are capable of enhancing such images. We describe a language-independent semi-automated system for enhancing degraded document images that is capable of exploiting inter- and intra-document coherence. The system is capable of processing document images with high levels of degradation and can be used for ground truthing of degraded document images. Ground truthing of degraded document images is extremely important in several respects: it enables quantitative performance measurements
of enhancement systems and facilitates model estimation that can be used to improve performance. Performance evaluation is provided using the historical Frieder diaries collection.1
We address the problem of content-based image retrieval in the context of complex document images. Complex
documents typically start out on paper and are then electronically scanned. These documents have rich internal
structure and might only be available in image form. Additionally, they may have been produced by a combination
of printing technologies (or by handwriting); and include diagrams, graphics, tables and other non-textual
elements. Large collections of such complex documents are commonly found in legal and security investigations.
The indexing and analysis of large document collections is currently limited to textual features based on OCR data
and ignores the structural context of the document as well as important non-textual elements such as signatures,
logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored due to the
inherent complexity of offline handwriting recognition. We address important research issues concerning content-based
document image retrieval and describe a prototype we are developing for integrated retrieval and aggregation
of diverse information contained in scanned paper documents. Such complex document information
processing combines several forms of image processing together with textual/linguistic processing to enable
effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype
automatically generates rich metadata about a complex document and then applies query tools to integrate
the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are
developing a test collection containing millions of document images.
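The query side of such a prototype can be pictured as combining a plain text index with filters over image-derived metadata. The tiny in-memory sketch below is only an illustration of that integration: the document fields, metadata flags, and example query are invented and do not reflect the prototype's actual schema or query tools.

```python
# Toy in-memory collection: OCR text plus image-derived metadata flags.
documents = [
    {"id": 1, "text": "agreement signed by both parties", "has_signature": True,  "has_logo": False},
    {"id": 2, "text": "quarterly revenue table",          "has_signature": False, "has_logo": True},
    {"id": 3, "text": "signed memorandum with diagram",   "has_signature": True,  "has_logo": True},
]

def search(term, **metadata_filters):
    """Return ids of documents whose OCR text contains `term` and whose
    metadata matches every requested flag."""
    hits = []
    for doc in documents:
        if term in doc["text"] and all(doc.get(k) == v for k, v in metadata_filters.items()):
            hits.append(doc["id"])
    return hits

print(search("signed", has_signature=True))   # -> [1, 3]
```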
Poor quality documents are obtained in various situations such as historical document collections, legal archives,
security investigations, and documents found in clandestine locations. Such documents are often scanned for
automated analysis, further processing, and archiving. Due to the nature of such documents, degraded document
images are often hard to read, have low contrast, and are corrupted by various artifacts. We describe
a novel approach for the enhancement of such documents, based on probabilistic models, that increases the
contrast, and thus the readability, of such documents under various degradations. The enhancement produced by
the proposed approach can be viewed under different viewing conditions if desired. The proposed approach was
evaluated qualitatively and compared to standard enhancement techniques on a subset of historical documents
obtained from the Yad Vashem Holocaust museum. In addition, quantitative performance was evaluated based
on synthetically generated data corrupted under various degradation models. Preliminary results demonstrate
the effectiveness of the proposed approach.
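The specific probabilistic model is not detailed above; as a hedged stand-in, the sketch below fits a two-component Gaussian mixture to grayscale intensities (ink versus background) and remaps each pixel to its posterior probability of belonging to the brighter component, which stretches the contrast between the two classes. The function name, the mixture model, and the synthetic test patch are assumptions for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def probabilistic_contrast(gray):
    """Fit a two-component Gaussian mixture to the pixel intensities and map
    each pixel to the posterior probability of the brighter (background)
    component, pushing ink towards 0 and paper towards 255."""
    samples = gray.reshape(-1, 1).astype(float)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(samples)
    background = int(np.argmax(gmm.means_.ravel()))      # brighter component
    posterior = gmm.predict_proba(samples)[:, background]
    return (posterior.reshape(gray.shape) * 255).astype(np.uint8)

# Example on a synthetic low-contrast patch (faint dark stroke on grey paper).
patch = np.full((32, 32), 140, dtype=np.uint8)
patch[10:22, 8:24] = 100
enhanced = probabilistic_contrast(patch)
print(patch.min(), patch.max(), enhanced.min(), enhanced.max())
```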
Analysis of large collections of complex documents is an increasingly important need for numerous applications. Complex documents are documents that typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting); and include diagrams, graphics, tables and other non-textual elements. The state of the art today for a large document collection is essentially text search of OCR'd documents with no meaningful use of data found in images, signatures, logos, etc. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are also developing a roughly 42,000,000 page complex document test collection. The collection will include relevance judgments for queries at a variety of levels of detail and depending on a variety of content and structural characteristics of documents, as well as "known item" queries looking for particular documents.