Copyright © 1998 Academic Press. All rights reserved.
Regular Article
Summarization of Imaged Documents without OCR
Received 11 February 1997;
Abstract
A system is presented for creating a summary indicating the contents of an imaged document. The summary is composed from selected regions extracted from the imaged document. The regions may include sentences, key phrases, headings, and figures. The extracts are identified without the use of optical character recognition. The imaged document is first processed to identify the word-bounding boxes, the reading order of words, and the location of sentence and paragraph boundaries in the text. The word-bounding boxes are grouped into equivalence classes to mimic the terms in a text document. Equivalence classes representing content words are identified, and key phrases are identified from the set of content words. Summary sentences are selected using a statistically based classifier applied to a set of discrete sentence features. Evaluation of sentence selection against a set of abstracts created by a professional abstracting company is given.
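The grouping of word-bounding boxes into equivalence classes, and the selection of classes likely to represent content words, can be illustrated with a minimal sketch. Nothing below is taken from the article itself: the `WordBox` and `EquivalenceClass` types, the size-only matching rule in `similar`, and the `min_width` and `min_count` thresholds are all hypothetical stand-ins. A real system would presumably compare the word images themselves rather than only their box dimensions.

```python
from dataclasses import dataclass, field


@dataclass
class WordBox:
    """Bounding box of one word image, listed in reading order (pixels)."""
    x: int
    y: int
    width: int
    height: int


@dataclass
class EquivalenceClass:
    """A group of word boxes assumed to depict the same word."""
    representative: WordBox
    members: list = field(default_factory=list)  # indices into the word list


def similar(a, b, tol=2):
    # Hypothetical matching rule: boxes match when their sizes agree
    # within `tol` pixels. This is an assumption made for illustration.
    return abs(a.width - b.width) <= tol and abs(a.height - b.height) <= tol


def group_words(words, tol=2):
    """Greedily assign each word box to the first class it matches."""
    classes = []
    for i, word in enumerate(words):
        for cls in classes:
            if similar(cls.representative, word, tol):
                cls.members.append(i)
                break
        else:
            # No existing class matched, so this word starts a new one.
            classes.append(EquivalenceClass(word, [i]))
    return classes


def content_classes(classes, min_width=40, min_count=2):
    """Keep classes that are wide enough and repeated often enough.

    The thresholds are assumed values: short boxes are treated as likely
    function words, and a class seen only once is treated as uninformative.
    """
    return [
        cls for cls in classes
        if cls.representative.width >= min_width and len(cls.members) >= min_count
    ]


words = [
    WordBox(0, 0, 60, 12),
    WordBox(70, 0, 20, 12),
    WordBox(100, 0, 61, 12),
    WordBox(170, 0, 21, 11),
    WordBox(200, 0, 90, 14),
]
classes = group_words(words)
content = content_classes(classes)
```

In this toy input the first and third boxes fall into one class, as do the second and fourth. Only the first class passes both thresholds, so it is the only one retained as a candidate content word.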











