Proceedings Article | 29 January 2007
KEYWORDS: Photography, Image segmentation, Image classification, Feature extraction, Image quality, Image processing, RGB color model, Image retrieval, Image resolution, Detection and tracking algorithms
We report an investigation into strategies, algorithms, and software tools for document image content extraction
and inventory, that is, the location and measurement of regions containing handwriting, machine-printed text,
photographs, blank space, etc. We have developed automatically trainable methods, adaptable to many kinds
of documents represented as bilevel, greylevel, or color images, that offer a wide range of useful tradeoffs of
speed versus accuracy by using exact and approximate k-Nearest Neighbor classification. We have
adopted a policy of classifying each pixel (rather than each region) by content type; we discuss the motivation
and engineering implications of this choice. We describe experiments on a wide variety of document-image and
content types, and discuss performance in detail in terms of classification speed, per-pixel classification accuracy,
per-page inventory accuracy, and subjective quality of page segmentation. These experiments show that even modest per-pixel
classification accuracies (e.g., 60-70%) support usefully high recall and precision rates (e.g., 80-90%)
for retrieval queries of document collections seeking pages that contain a given minimum fraction of a certain
type of content.
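
The pipeline summarized above (classify each pixel, total the labels into a per-page inventory, then answer minimum-fraction queries) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the two-component feature vectors (imagined as local mean intensity and local contrast), the toy training samples, and the function names are hypothetical and are not taken from the paper. Only brute-force exact k-Nearest Neighbor is shown; the approximate variants the abstract mentions are not.

```python
from collections import Counter

def knn_classify(sample, training, k=3):
    """Exact k-NN: majority label among the k nearest training samples."""
    # training is a list of (feature_vector, label) pairs; squared Euclidean
    # distance is enough for ranking neighbors.
    by_distance = sorted(
        training,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(sample, item[0])),
    )
    votes = Counter(label for _, label in by_distance[:k])
    # On a tied vote, most_common keeps first-seen order, so the label of the
    # nearer neighbor wins.
    return votes.most_common(1)[0][0]

def page_inventory(pixel_features, training, k=3):
    """Classify every pixel, then report each content type's share of the page."""
    labels = [knn_classify(f, training, k) for f in pixel_features]
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

def pages_with_minimum(inventories, content_type, min_fraction):
    """Retrieval query: ids of pages holding at least min_fraction of a type."""
    return [
        page_id
        for page_id, inventory in inventories.items()
        if inventory.get(content_type, 0.0) >= min_fraction
    ]

# Hypothetical training samples: (mean intensity, local contrast) -> label.
TRAINING = [
    ((0.95, 0.02), "blank"), ((0.90, 0.05), "blank"), ((0.97, 0.01), "blank"),
    ((0.50, 0.60), "text"), ((0.45, 0.70), "text"), ((0.55, 0.65), "text"),
    ((0.40, 0.20), "photo"), ((0.35, 0.25), "photo"), ((0.45, 0.15), "photo"),
]

# Two toy "pages", each reduced to four pixel feature vectors.
PAGES = {
    "page_a": [(0.93, 0.03), (0.52, 0.62), (0.48, 0.66), (0.96, 0.02)],
    "page_b": [(0.38, 0.22), (0.42, 0.18), (0.41, 0.21), (0.94, 0.04)],
}

inventories = {pid: page_inventory(px, TRAINING) for pid, px in PAGES.items()}
print(inventories)
print(pages_with_minimum(inventories, "photo", 0.5))  # ['page_b']
```

Because the query thresholds an aggregate over many pixels, scattered per-pixel errors tend to shift a page's fractions only slightly, which is one plausible reading of why modest per-pixel accuracy can still support useful retrieval.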