Workshop on Electronic Texts: Proceedings, 9-10 June 1992 - Library of Congress

WEIBEL next showed an extremely cluttered screen dump of OCLC's system, in order to display as many of the system's capabilities as possible on a single screen. (He noted parenthetically that he had become a supporter of X-Windows as a result of the progress of the CORE Project.) WEIBEL also illustrated the two major parts of the interface: 1) a control box that allows one to generate lists of items, which resembles a small table of contents based on key words one wishes to search, and 2) a document viewer, which is a separate process in and of itself. He demonstrated how to follow links through the electronic database simply by selecting the appropriate button and bringing up the linked items. He also noted problems that remain to be accommodated in the interface (e.g., as pointed out by LESK, what happens when users do not click on the icon for the figure).

Given the constraints of time, WEIBEL omitted a large number of ancillary items in order to say a few words concerning storage requirements and what will be required to put a large body of material on line. Since it is extremely expensive to reconvert all of this data, especially if it is just in paper form (and even if it is in electronic form in typesetting tapes), he advocated building journals electronically from the start. In that case, if one only has text, graphics, and indexing (which is all that one needs with de novo electronic publishing, because there is no need to go back and look at bit-maps of pages), one can get 10,000 journals of full text, or almost 6 million pages per year. These pages can be put in approximately 135 gigabytes of storage, which is not all that much, WEIBEL said. For twenty years, something less than three terabytes would be required. WEIBEL calculated the costs of storing this information as follows: If a gigabyte costs approximately $1,000, then a terabyte costs approximately $1 million to buy in terms of hardware. One also needs a building to put it in and a staff like OCLC to handle that information. So, to support a terabyte, multiply by five, which gives $5 million per year for a supported terabyte of data.
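The arithmetic behind these figures can be checked with a short sketch. The input numbers are the ones quoted above; the use of decimal units (1 terabyte = 1,000 gigabytes), the function name, and the derived per-page and twenty-year figures are assumptions made for illustration and were not stated by WEIBEL.

```python
# Back-of-the-envelope check of the storage and cost figures quoted above.
# Assumption: decimal units (1 TB = 1,000 GB, 1 GB = 10**9 bytes).

PAGES_PER_YEAR = 6_000_000    # "almost 6 million pages per year"
STORAGE_PER_YEAR_GB = 135     # "approximately 135 gigabytes"
YEARS = 20                    # "For twenty years"
COST_PER_GB_USD = 1_000       # "a gigabyte costs approximately $1,000"
SUPPORT_MULTIPLIER = 5        # "to support a terabyte, multiply by five"


def storage_estimate():
    """Return the derived figures as a dictionary (illustrative helper)."""
    bytes_per_page = STORAGE_PER_YEAR_GB * 10**9 / PAGES_PER_YEAR
    total_tb = STORAGE_PER_YEAR_GB * YEARS / 1_000
    hardware_cost_per_tb = COST_PER_GB_USD * 1_000
    supported_cost_per_tb = hardware_cost_per_tb * SUPPORT_MULTIPLIER
    return {
        "bytes_per_page": bytes_per_page,
        "total_tb_over_period": total_tb,
        "hardware_cost_per_tb_usd": hardware_cost_per_tb,
        "supported_cost_per_tb_usd": supported_cost_per_tb,
    }


if __name__ == "__main__":
    for name, value in storage_estimate().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the twenty-year total comes to 2.7 terabytes, consistent with "something less than three terabytes," and the implied average is roughly 22.5 kilobytes per page of text, graphics, and indexing.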

******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DISCUSSION * Tapes saved by ACS are the typography files originally supporting publication of the journal * Cost of building tagged text into the database *
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

During the question-and-answer period that followed WEIBEL's presentation, these clarifications emerged. The tapes saved by the American Chemical Society are the typography files that originally supported the publication of the journal. Although they are not tagged in SGML, they are tagged in very fine detail. Every single sentence is marked, as are all the registry numbers, publication issues, dates, and volumes. No cost figures on tagging material on a per-megabyte basis were available. Because ACS's typesetting system runs from tagged text, there is no extra cost per article. It was unknown what it costs ACS to keyboard the tagged text rather than just keyboard the text in the cheapest process. In other words, since one intends to publish things and will need to build tagged text into a typography system in any case, if one does that in such a way that it can drive not only typography but an electronic system (which is what ACS intends to do in moving to SGML publishing), the marginal cost is effectively zero. The marginal cost represents only the cost of building tagged text into the database, which is small.

******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
SPERBERG-McQUEEN * Distinction between texts and computers * Implications of recognizing that all representation is encoding * Dealing with complicated representations of text entails the need for a grammar of documents * Variety of forms of formal grammars * Text as a bit-mapped image does not represent a serious attempt to represent text in electronic form * SGML, the TEI, document-type declarations, and the reusability and longevity of data * TEI conformance explicitly allows extension or modification of the TEI tag set * Administrative background of the TEI * Several design goals for the TEI tag set * An absolutely fixed requirement of the TEI Guidelines * Challenges the TEI has attempted to face * Good texts not beyond economic feasibility * The issue of reproducibility or processability * The issue of images as simulacra for the text redux * One's model of text determines what one's software can do with a text and has economic consequences *
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Prior to speaking about SGML and markup, Michael SPERBERG-McQUEEN, editor, Text Encoding Initiative (TEI), University of Illinois-Chicago, first drew a distinction between texts and computers: Texts are abstract cultural and linguistic objects while computers are complicated physical devices, he said. Abstract objects cannot be placed inside physical devices; with computers one can only represent text and act upon those representations.

The recognition that all representation is encoding, SPERBERG-McQUEEN argued, leads to the recognition of two things: 1) The topic description for this session is slightly misleading, because there can be no discussion of pros and cons of text-coding unless what one means is pros and cons of working with text with computers. 2) No text can be represented in a computer without some sort of encoding; images are one way of encoding text, ASCII is another, SGML yet another. There is no encoding without some information loss, that is, there is no perfect reproduction of a text that allows one to do away with the original. Thus, the question becomes, What is the most useful representation of text for a serious work? This depends on what kind of serious work one is talking about.

The projects demonstrated the previous day all involved highly complex information and fairly complex manipulation of the textual material. In order to use that complicated information, one has to calculate it slowly or manually and store the result. It needs to be stored, therefore, as part of one's representation of the text. Thus, one needs to store the structure in the text. To deal with complicated representations of text, one needs somehow to control the complexity of the representation of a text; that means one needs a way of finding out whether a document, and an electronic representation of a document, is legal or not; and that means one needs a grammar of documents.
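What it means to test a representation against a grammar of documents can be suggested with a minimal sketch. It is not SGML or the TEI tag set: the element names (article, title, section, heading, para), the Node class, and the is_legal function are all hypothetical, and the grammar is reduced to one regular expression per element describing the permitted sequence of its children.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element in a document tree (hypothetical structure)."""
    name: str
    children: list["Node"] = field(default_factory=list)


# Hypothetical grammar: each element name maps to a regular expression over
# the names of its children, each name followed by a single space.
GRAMMAR = {
    "article": r"title (section )+",   # one title, then one or more sections
    "section": r"heading (para )+",    # one heading, then one or more paragraphs
    "title": r"",                      # leaf elements allow no child elements
    "heading": r"",
    "para": r"",
}


def is_legal(node: Node) -> bool:
    """Return True if node and all of its descendants obey GRAMMAR."""
    model = GRAMMAR.get(node.name)
    if model is None:
        return False  # element not declared in the grammar
    child_names = "".join(child.name + " " for child in node.children)
    if re.fullmatch(model, child_names) is None:
        return False  # children appear in an order the grammar forbids
    return all(is_legal(child) for child in node.children)


if __name__ == "__main__":
    good = Node("article", [
        Node("title"),
        Node("section", [Node("heading"), Node("para"), Node("para")]),
    ])
    bad = Node("article", [
        Node("section", [Node("para"), Node("heading")]),
    ])
    print(is_legal(good), is_legal(bad))
```

The point of the sketch is only that legality becomes a mechanical question once the permitted structure is written down; a real document grammar would also have to describe text content, attributes, and optional or alternative elements.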