HOCKEY said that the more text that is tagged accurately, the more the tagging process can be refined, and thus the larger the body of linguistically tagged text one can build up. Hence, the more tagging or annotation a text contains, the more one can learn about language, and the more intelligent the OCR it makes possible. She recommended developing software tools that help one understand more about a text, understanding that can then be applied to scanning images of texts in that format and to interpreting the text more intelligently.

HOCKEY posited the need to think about common methods of text-encoding for a long time to come, because building these large bodies of text is extremely expensive and will only be done once.

In the more general discussion on approaches to encoding that followed, these points were made:

BESSER identified the underlying problem that everyone struggles with in adopting a standard, namely, the tension between a highly defined standard, which is very interchangeable but does not work for everyone because something is always lacking, and a less defined standard, which is more open and adaptable but less interchangeable. Contending that the way in which people use SGML is not sufficiently defined, BESSER wondered 1) whether people resist the TEI because they think it is too prescriptive in areas their material does not fit, and 2) how progress on interchangeability can be made without frightening people away.

SPERBERG-McQUEEN replied that the published drafts of the TEI had met with surprisingly little objection on the grounds that they do not allow one to handle X or Y or Z. Particular concerns of the affiliated projects have led, in practice, to discussions of how extensions are to be made; the primary concern of any project has to be how its material can be represented locally, which makes interchange secondary. The TEI has received much criticism based on the notion that everything in it is required or even recommended; this has been a misconception from the beginning, because none of it is required and very little is actively recommended for all cases, except that one document one's source.
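The one near-universal TEI recommendation mentioned here, documenting one's source, is carried by the TEI header. A minimal sketch of such a header follows; the element names match the TEI drafts' conventions, but the content is invented purely for illustration:

```sgml
<!-- Hypothetical minimal TEI header; content is illustrative only. -->
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>An Example Text: an electronic transcription</title>
    </titleStmt>
    <publicationStmt>
      <p>Distributed for scholarly use.</p>
    </publicationStmt>
    <sourceDesc>
      <p>Transcribed from the 1852 London edition.</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>
```

Everything beyond this kind of source documentation is optional, which is the point of the reply: a conformant text need not carry any particular analytic tagging.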

SPERBERG-McQUEEN agreed with BESSER about this trade-off: twenty TEI-conformant projects will not necessarily tag their material in the same way. One result of the TEI will be that the easiest problems will be solved—those dealing with the external form of the information; but the hardest problem in interchange is that one party has not encoded what the other wants, and vice versa. Thus, after the adoption of a common notation, the differences in the underlying conceptions of what is interesting about texts become more visible. The success of a standard like the TEI will lie in the ability of the recipient of interchanged texts to use some of what they contain and to add, in a layered way, the desired information that was not encoded, so that texts can be gradually enriched and one does not have to put in everything at once. Hence, having a well-behaved markup scheme is important.

STEVENS followed up on the paradoxical analogy BESSER alluded to in the example of MARC records, namely, formats that are the same except that they are different. STEVENS drew a parallel between document-type definitions and the MARC records for books, serials, and maps, where each has its own tagging structure yet text interchange is possible. STEVENS opined that the producers of information will set the terms of the standard (i.e., develop document-type definitions for the users of their products), creating a situation that will be problematical for an institution like the Library of Congress, which will have to deal with the DTDs if a multiplicity of them develops. Thus, numerous people are seeking a standard but cannot find a tag set acceptable to them and their clients. SPERBERG-McQUEEN agreed with this view, adding that the situation was in a way worse: attempting to unify arbitrary DTDs resembles attempting to unify a MARC record with a bibliographic record done according to the Prussian instructions. According to STEVENS, this situation arose very early in the process.

WATERS recalled from early discussions on Project Open Book the concern of many people that merely by producing images, POB was not really enhancing intellectual access to the material. Nevertheless, not wishing to overemphasize the opposition between imaging and full text, WATERS stated that POB views getting the images as a first step toward possibly converting to full text through character recognition, if the technology is appropriate. WATERS also emphasized that encoding is involved even with a set of images.

SPERBERG-McQUEEN agreed with WATERS that one can create an SGML document consisting wholly of images. At first sight, organizing graphic images with an SGML document may not seem to offer great advantages, but the advantage of the scheme WATERS described would be precisely the ability to move into something that is more of a multimedia document: a combination of transcribed text and page images. WEIBEL concurred in this judgment, offering evidence from Project ADAPT, where a page is divided into text elements and graphic elements, and the text elements are in fact organized by columns and lines. These lines may be used as the basis for distributing documents in a network environment. As software becomes intelligent enough to recognize what those elements are, it makes sense to apply SGML to a document that initially consists of images but may ultimately become more and more text, whether through OCR, edited OCR, or simply keying. For WATERS, the labor of composing the document—saying that this set of images belongs to this document—constitutes a significant investment.
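A sketch of what such a mixed image-and-text structure might look like in SGML follows; the element and attribute names are invented for illustration (they are not drawn from POB or ADAPT), and each page image is referenced as an external data entity in the usual SGML manner:

```sgml
<!-- Hypothetical DTD: a document of pages, each carrying a page image
     and, once recognized or keyed, columns of text lines. -->
<!ELEMENT doc    - - (page+)            >
<!ELEMENT page   - o (image?, column*)  >
<!ELEMENT column - o (line*)            >
<!ELEMENT line   - o (#PCDATA)          >
<!ELEMENT image  - o EMPTY              >
<!ATTLIST image      file ENTITY #REQUIRED >

<!-- Instance: starts as images only; text is layered in later. -->
<doc>
<page><image file=p001>
<page><image file=p002>
<column><line>First recognized line of page two
<line>Second recognized line
</doc>
```

On this scheme a document can begin its life as pure images and be gradually enriched with transcribed text, which is exactly the layered conversion path WATERS describes.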

WEIBEL also made the point that the AAP tag sets, while not excessively prescriptive, offer a common starting point; they do not, however, define the structure of documents. They include some example DTDs as recommendations, but essentially they just suggest tag sets. For example, the CORE project attempts to use the AAP markup as much as possible, though there are clearly areas where structure must be added; that in no way contradicts the use of the AAP tag sets.