Workshop on Electronic Texts: Proceedings, 9-10 June 1992 - Library of Congress

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DISCUSSION * Re retrieval software * "Digital file copyright" * Scanning rate during production * Autosegmentation * Criteria employed in selecting books for scanning * Compression and decompression of images * OCR not precluded *
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

During the question-and-answer period that followed her presentation,
PERSONIUS made these additional points:

* Re retrieval software, Cornell is developing a Unix-based server, together with clients for multiple platforms (Macintosh, IBM, and Sun workstations), in the hope that users on any of those platforms will be able to retrieve books. A further operating assumption is that standard interfaces will be used wherever standards can be put in place, because CLASS considers this retrieval software a library application and would like to be able to look at material not only at Cornell but also at other institutions.

* The phrase "digital file copyright by Cornell University" was added on the advice of Cornell's legal staff, with the caveat that it probably would not hold up in court. Cornell does not want people to copy its books and sell them but would like to keep them available for use in a library environment for library purposes.

* In production, the scanner handles about 300 pages per hour at a resolution of 600 dots per inch.

* The Xerox software has filters for scanning halftone material that avoid the moire patterns such material otherwise produces. Xerox has been working on hardware and software that would enable the scanner itself to recognize halftone material and deal with it appropriately, a kind of autosegmentation that would enable the scanner to handle halftone material as well as text on a single page.

* The books subjected to the elaborate process described above were selected because CLASS is a preservation project. The first 500 books came from Cornell's mathematics collection: they were still being heavily used, and although they were in need of preservation, the mathematics library and the mathematics faculty were uncomfortable having them microfilmed. (They wanted a printed copy.) Thus, these books became a logical choice for this project. Other books were chosen by the project's selection committees for experiments with the technology, as well as to meet a demand or need.

* Images will be decompressed before they are sent over the line. At this time, however, they are compressed and sent to the image filing system, and then sent to the printer as compressed images. They are returned to the workstation as compressed 600-dpi images, and the workstation decompresses and scales them for display. This is an inefficient way to access the material, though it works quite well for printing and other purposes.

* CLASS is also decompressing on Macintosh and IBM, a slow process right now. Eventually, compression and decompression will take place on an image conversion server. Trade-offs will be made, based on future performance testing, concerning where the file is compressed and what resolution image is sent.

* OCR has not been precluded; the images being stored have been scanned at a high resolution, which presumably would make them well suited to an OCR process. Because the material being scanned is about 100 years old and was printed with less-than-ideal technologies, preliminary tests have not produced good results. But the project is capturing an image that is of sufficient resolution to be subjected to OCR in the future. Moreover, the system architecture and the system plan have a logical place to store an OCR image if it has been captured. But that is not being done now.
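
The scanning-rate and display-scaling points above can be made concrete with a little arithmetic. The Python sketch below is illustrative only. Of its inputs, only the 300 pages per hour, the 600 dpi, and the 500 books come from the discussion; the 6 x 9 inch page, the 300-page book, the 100-dpi display resolution, and every function name are assumptions made for the example, not figures or software from the CLASS project.

```python
def scanning_hours(pages, pages_per_hour=300):
    """Hours of scanner time for a given page count at the quoted rate."""
    return pages / pages_per_hour


def page_pixels(width_in, height_in, dpi):
    """Number of pixels in one page image at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


def downscale_bitonal(rows, factor):
    """Scale a bitonal image (lists of 0/1, 1 = black) down for display.

    Each factor x factor block becomes one gray level from 0 to 255,
    where 255 means the block was entirely black. Partial blocks at the
    right and bottom edges are dropped to keep the sketch short.
    """
    height, width = len(rows), len(rows[0])
    out = []
    for top in range(0, height - height % factor, factor):
        out_row = []
        for left in range(0, width - width % factor, factor):
            black = sum(rows[y][x]
                        for y in range(top, top + factor)
                        for x in range(left, left + factor))
            out_row.append(black * 255 // (factor * factor))
        out.append(out_row)
    return out


if __name__ == "__main__":
    # Assumed: 500 books of 300 pages each (the page count is hypothetical).
    print("scanner hours:", scanning_hours(500 * 300))

    # Assumed: 6 x 9 inch page; 100 dpi stands in for a display resolution.
    stored = page_pixels(6, 9, 600)
    shown = page_pixels(6, 9, 100)
    print("pixels stored:", stored, "pixels shown:", shown)

    # A tiny 4 x 4 bitonal image reduced by a factor of 2.
    tiny = [[1, 1, 0, 0],
            [1, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 0]]
    print(downscale_bitonal(tiny, 2))
```

Under these assumptions, a workstation decompresses about 19.4 million pixels per page in order to display 540,000, a ratio of 36 to 1. That may help explain why decompressing and scaling on the workstation was described as inefficient for display while remaining suitable for printing at full resolution.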