Workshop on Electronic Texts: Proceedings, 9-10 June 1992 - Library of Congress

The final speaker in this session, Eric CALALUCA, vice president, Chadwyck-Healey, Inc., spoke from the perspective of a publisher re text-encoding, rather than as one qualified to discuss methods of encoding data, and observed that the presenters sitting in the room, whether they had chosen to or not, were acting as publishers: making choices, gathering data, gathering information, and making assessments. CALALUCA offered the hard-won conviction that in publishing very large text files (such as PLD), one cannot avoid making personal judgments of appropriateness and structure.

In CALALUCA's view, encoding decisions stem from prior judgments. Two notions have become axioms for him in the consideration of future sources for electronic publication: 1) electronic text publishing is as personal as any other kind of publishing, and questions of if and how to encode the data are simply a consequence of that prior decision; 2) all personal decisions are open to criticism, which is unavoidable.

CALALUCA rehearsed his role as a publisher or, better, as an intermediary between what is viewed as a sound idea and the people who would make use of it. Finding the specialist to advise in this process is the core of that function. The publisher must monitor and hug the fine line between giving users what they want and suggesting what they might need. One responsibility of a publisher is to represent the desires of scholars and research librarians as opposed to bullheadedly forcing them into areas they would not choose to enter.

CALALUCA likened the questions being raised today about data structure and standards to the decisions faced by the Abbe Migne himself during production of the Patrologia series in the mid-nineteenth century. Chadwyck-Healey's decision to reproduce Migne's Latin series whole and complete with SGML tags was also based upon a perceived need and an expected use. In the same way that Migne's work came to be far more than a simple handbook for clerics, PLD is already far more than a database for theologians. It is a bedrock source for the study of Western civilization, CALALUCA asserted.

In regard to the decision to produce and publish PLD, the editorial board offered direct judgments on the question of appropriateness of these texts for conversion, their encoding and their distribution, and concluded that the best possible project was one that avoided overt intrusions or exclusions in so important a resource. Thus, the general decision to transmit the original collection as clearly as possible with the widest possible avenues for use led to other decisions: 1) To encode the data or not, SGML or not, TEI or not. Again, the expected user community asserted the need for normative tagging structures of important humanities texts, and the TEI seemed the most appropriate structure for that purpose. Research librarians, who are trained to view the larger impact of electronic text sources on 80 or 90 or 100 doctoral disciplines, loudly approved the decision to include tagging. They see what is coming better than the specialist who is completely focused on one edition of Ambrose's De Anima, and they also understand that the potential uses exceed present expectations. 2) What will be tagged and what will not. Once again, the board realized that one must tag the obvious. But in no way should one attempt to identify through encoding schemes every single discrete area of a text that might someday be searched. That was another decision. Searching by a column number, an author, a word, a volume, permitting combination searches, and tagging notations seemed logical choices as core elements. 3) How does one make the data available? Tieing it to a CD-ROM edition creates limitations, but a magnetic tape file that is very large, is accompanied by the encoding specifications, and that allows one to make local modifications also allows one to incorporate any changes one may desire within the bounds of private research, though exporting tag files from a CD-ROM could serve just as well. Since no one on the board could possibly anticipate each and every way in which a scholar might choose to mine this data bank, it was decided to satisfy the basics and make some provisions for what might come. 4) Not to encode the database would rob it of the interchangeability and portability these important texts should accommodate. For CALALUCA, the extensive options presented by full-text searching require care in text selection and strongly support encoding of data to facilitate the widest possible search strategies. Better software can always be created, but summoning the resources, the people, and the energy to reconvert the text is another matter.

PLD is being encoded, captured, and distributed, because to Chadwyck-Healey and the board it offers the widest possible array of future research applications that can be seen today. CALALUCA concluded by urging the encoding of all important text sources in whatever way seems most appropriate and durable at the time, without blanching at the thought that one's work may require emendation in the future. (Thus, Chadwyck-Healey produced a very large humanities text database before the final release of the TEI Guidelines.)

******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ DISCUSSION * Creating texts with markup advocated * Trends in encoding * The TEI and the issue of interchangeability of standards * A misconception concerning the TEI * Implications for an institution like LC in the event that a multiplicity of DTDs develops * Producing images as a first step towards possible conversion to full text through character recognition * The AAP tag sets as a common starting point and the need for caution * +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

HOCKEY prefaced the discussion that followed with several comments in favor of creating texts with markup and on trends in encoding. In the future, when many more texts are available for on-line searching, real problems in finding what is wanted will develop, if one is faced with millions of words of data. It therefore becomes important to consider putting markup in texts to help searchers home in on the actual things they wish to retrieve. Various approaches to refining retrieval methods toward this end include building on a computer version of a dictionary and letting the computer look up words in it to obtain more information about the semantic structure or semantic field of a word, its grammatical structure, and syntactic structure.

HOCKEY commented on the present keen interest in the encoding world in creating: 1) machine-readable versions of dictionaries that can be initially tagged in SGML, which gives a structure to the dictionary entry; these entries can then be converted into a more rigid or otherwise different database structure inside the computer, which can be treated as a dynamic tool for searching mechanisms; 2) large bodies of text to study the language. In order to incorporate more sophisticated mechanisms, more about how words behave needs to be known, which can be learned in part from information in dictionaries. However, the last ten years have seen much interest in studying the structure of printed dictionaries converted into computer-readable form. The information one derives about many words from those is only partial, one or two definitions of the common or the usual meaning of a word, and then numerous definitions of unusual usages. If the computer is using a dictionary to help retrieve words in a text, it needs much more information about the common usages, because those are the ones that occur over and over again. Hence the current interest in developing large bodies of text in computer-readable form in order to study the language. Several projects are engaged in compiling, for example, 100 million words. HOCKEY described one with which she was associated briefly at Oxford University involving compilation of 100 million words of British English: about 10 percent of that will contain detailed linguistic tagging encoded in SGML; it will have word class taggings, with words identified as nouns, verbs, adjectives, or other parts of speech. This tagging can then be used by programs which will begin to learn a bit more about the structure of the language, and then, can go to tag more text.