To Ricky ERWAY, associate coordinator, American Memory, Library of Congress, the constant variety of conversion projects taking place simultaneously represented perhaps the most challenging aspect of working on AM. Thus, the challenge was not to find a solution for text conversion but a tool kit of solutions to apply to LC's varied collections that need to be converted. ERWAY limited her remarks to the process of converting text to machine-readable form, and the variety of LC's text collections, for example, bound volumes, microfilm, and handwritten manuscripts.

Two assumptions have guided AM's approach, ERWAY said: 1) A desire not to perform the conversion inhouse. Because of the variety of formats and types of texts, to capitalize the equipment and have the talents and skills to operate them at LC would be extremely expensive. Further, the natural inclination to upgrade to newer and better equipment each year made it reasonable for AM to focus on what it did best and seek external conversion services. Using service bureaus also allowed AM to have several types of operations take place at the same time. 2) AM was not a technology project, but an effort to improve access to library collections. Hence, whether text was converted using OCR or rekeying mattered little to AM. What mattered were cost and accuracy of results.

AM considered different types of service bureaus and selected three to perform several small tests in order to acquire a sense of the field. The sample collections with which they worked included handwritten correspondence, typewritten manuscripts from the 1940s, and eighteenth-century printed broadsides on microfilm. On none of these samples was OCR performed; they were all rekeyed. AM had several special requirements for the three service bureaus it had engaged. For instance, any errors in the original text were to be retained. Working from bound volumes or anything that could not be sheet-fed also constituted a factor eliminating companies that would have performed OCR.

AM requires 99.95 percent accuracy, which, though it sounds high, often means one or two errors per page. The initial batch of test samples contained several handwritten materials for which AM did not require text-coding. The results, ERWAY reported, were in all cases fairly comparable: for the most part, all three service bureaus achieved 99.95 percent accuracy. AM was satisfied with the work but surprised at the cost.

As AM began converting whole collections, it retained the requirement for 99.95 percent accuracy and added requirements for text-coding. AM needed to begin performing work more than three years ago before LC requirements for SGML applications had been established. Since AM's goal was simply to retain any of the intellectual content represented by the formatting of the document (which would be lost if one performed a straight ASCII conversion), AM used "SGML-like" codes. These codes resembled SGML tags but were used without the benefit of document-type definitions. AM found that many service bureaus were not yet SGML-proficient.

Additional factors influencing the approach AM took with respect to coding included: 1) the inability of any known microcomputer-based user-retrieval software to take advantage of SGML coding; and 2) the multiple inconsistencies in format of the older documents, which confirmed AM in its desire not to attempt to force the different formats to conform to a single document-type definition (DTD) and thus create the need for a separate DTD for each document.

The five text collections that AM has converted or is in the process of converting include a collection of eighteenth-century broadsides, a collection of pamphlets, two typescript document collections, and a collection of 150 books.

ERWAY next reviewed the results of AM's experience with rekeying, noting again that because the bulk of AM's materials are historical, the quality of the text often does not lend itself to OCR. While non-English speakers are less likely to guess or elaborate or correct typos in the original text, they are also less able to infer what we would; they also are nearly incapable of converting handwritten text. Another disadvantage of working with overseas keyers is that they are much less likely to telephone with questions, especially on the coding, with the result that they develop their own rules as they encounter new situations.

Government contracting procedures and time frames posed a major challenge to performing the conversion. Many service bureaus are not accustomed to retaining the image, even if they perform OCR. Thus, questions of image format and storage media were somewhat novel to many of them. ERWAY also remarked other problems in dealing with service bureaus, for example, their inability to perform text conversion from the kind of microfilm that LC uses for preservation purposes.

But quality control, in ERWAY's experience, was the most time-consuming aspect of contracting out conversion. AM has been attempting to perform a 10-percent quality review, looking at either every tenth document or every tenth page to make certain that the service bureaus are maintaining 99.95 percent accuracy. But even if they are complying with the requirement for accuracy, finding errors produces a desire to correct them and, in turn, to clean up the whole collection, which defeats the purpose to some extent. Even a double entry requires a character-by-character comparison to the original to meet the accuracy requirement. LC is not accustomed to publish imperfect texts, which makes attempting to deal with the industry standard an emotionally fraught issue for AM. As was mentioned in the previous day's discussion, going from 99.95 to 99.99 percent accuracy usually doubles costs and means a third keying or another complete run-through of the text.