Workshop on Electronic Texts: Proceedings, 9-10 June 1992 - Library of Congress

******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ WEIBEL * OCLC's approach to preparing electronic text: retroconversion, keying of texts, more automated ways of developing data * Project ADAPT and the CORE Project * Intelligent character recognition does not exist * Advantages of SGML * Data should be free of procedural markup; descriptive markup strongly advocated * OCLC's interface illustrated * Storage requirements and costs for putting a lot of information on line * +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Stuart WEIBEL, senior research scientist, Online Computer Library Center, Inc. (OCLC), described OCLC's approach to preparing electronic text. He argued that the electronic world into which we are moving must accommodate not only the future but the past as well, and to some degree even the present. Thus, starting out at one end with retroconversion and keying of texts, one would like to move toward much more automated ways of developing data.

For example, Project ADAPT had to do with automatically converting document images into a structured document database with OCR text as indexing and also a little bit of automatic formatting and tagging of that text. The CORE project hosted by Cornell University, Bellcore, OCLC, the American Chemical Society, and Chemical Abstracts, constitutes WEIBEL's principal concern at the moment. This project is an example of converting text for which one already has a machine-readable version into a format more suitable for electronic delivery and database searching. (Since Michael LESK had previously described CORE, WEIBEL would say little concerning it.) Borrowing a chemical phrase, de novo synthesis, WEIBEL cited the Online Journal of Current Clinical Trials as an example of de novo electronic publishing, that is, a form in which the primary form of the information is electronic.

Project ADAPT, then, which OCLC completed a couple of years ago and in fact is about to resume, is a model in which one takes page images either in paper or microfilm and converts them automatically to a searchable electronic database, either on-line or local. The operating assumption is that accepting some blemishes in the data, especially for retroconversion of materials, will make it possible to accomplish more. Not enough money is available to support perfect conversion.

WEIBEL related several steps taken to perform image preprocessing (processing on the image before performing optical character recognition), as well as image postprocessing. He denied the existence of intelligent character recognition and asserted that what is wanted is page recognition, which is a long way off. OCLC has experimented with merging of multiple optical character recognition systems that will reduce errors from an unacceptable rate of 5 characters out of every l,000 to an unacceptable rate of 2 characters out of every l,000, but it is not good enough. It will never be perfect.

Concerning the CORE Project, WEIBEL observed that Bellcore is taking the topography files, extracting the page images, and converting those topography files to SGML markup. LESK hands that data off to OCLC, which builds that data into a Newton database, the same system that underlies the on-line system in virtually all of the reference products at OCLC. The long-term goal is to make the systems interoperable so that not just Bellcore's system and OCLC's system can access this data, but other systems can as well, and the key to that is the Z39.50 common command language and the full-text extension. Z39.50 is fine for MARC records, but is not enough to do it for full text (that is, make full texts interoperable).

WEIBEL next outlined the critical role of SGML for a variety of purposes, for example, as noted by HOCKEY, in the world of extremely large databases, using highly structured data to perform field searches. WEIBEL argued that by building the structure of the data in (i.e., the structure of the data originally on a printed page), it becomes easy to look at a journal article even if one cannot read the characters and know where the title or author is, or what the sections of that document would be. OCLC wants to make that structure explicit in the database, because it will be important for retrieval purposes.

The second big advantage of SGML is that it gives one the ability to build structure into the database that can be used for display purposes without contaminating the data with instructions about how to format things. The distinction lies between procedural markup, which tells one where to put dots on the page, and descriptive markup, which describes the elements of a document.

WEIBEL believes that there should be no procedural markup in the data at all, that the data should be completely unsullied by information about italics or boldness. That should be left up to the display device, whether that display device is a page printer or a screen display device. By keeping one's database free of that kind of contamination, one can make decisions down the road, for example, reorganize the data in ways that are not cramped by built-in notions of what should be italic and what should be bold. WEIBEL strongly advocated descriptive markup. As an example, he illustrated the index structure in the CORE data. With subsequent illustrated examples of markup, WEIBEL acknowledged the common complaint that SGML is hard to read in its native form, although markup decreases considerably once one gets into the body. Without the markup, however, one would not have the structure in the data. One can pass markup through a LaTeX processor and convert it relatively easily to a printed version of the document.