2006: TOWARDS A WORLD PUBLIC DIGITAL LIBRARY
= [Overview]
Conceived by the Internet Archive to offer a universal public digital library, the Open Content Alliance (OCA) was launched in October 2005 as a group of cultural, technology, nonprofit and governmental organizations willing to build a permanent archive of multilingual digitized text and multimedia content. The project took off in 2006, with the digitization of public domain books around the world. Unlike Google Books, the OCA has made these books searchable through any web search engine, and has not scanned copyrighted books unless the copyright holder has expressly given permission. The first contributors to the OCA were the University of California, the University of Toronto, the European Archive, the National Archives in the United Kingdom, O’Reilly Media and the Prelinger Archives. The digitized collections are freely available in the Text Archive section of the Internet Archive. By December 2008, one million ebooks had been posted under OCA principles by the Internet Archive.
= [In Depth]
The Internet Archive and Yahoo! conceived the Open Content Alliance (OCA) in early 2005 to offer broad public access to the world's culture. The OCA also wanted to address two problems raised by the Google Books project: its handling of copyright, and the availability of its collections through one search engine only. The OCA was launched with the goal of digitizing only public domain books and making them searchable and downloadable through any search engine.
What exactly is the Internet Archive? Founded in April 1996 by Brewster Kahle, the Internet Archive is a nonprofit organization that has built an "internet library" to offer permanent access to historical collections in digital format for researchers, historians and scholars. A snapshot of the web is stored every two months or so. In late 1999, the Internet Archive started to include more collections of archived webpages on specific topics. It also became an online digital library of text, audio, software, image and video content. In October 2001, with 30 billion stored webpages, the Internet Archive launched the Wayback Machine, so that users could surf the archive of the web by date. In 2004, the archive held 300 terabytes of data, growing by 12 terabytes per month. There were 65 billion pages (from 50 million websites) in 2006 and 85 billion pages in 2008. The Internet Archive now defines itself as "a nonprofit digital library dedicated to providing universal access to human knowledge."
In October 2005, the Internet Archive launched the Open Content Alliance (OCA) with other contributors as a collective effort for "building a digital archive of global content for universal access" (subtitle of the OCA home page) that would be a permanent repository of multilingual text and multimedia content.
As explained on its website in 2007, the OCA "is a collaborative effort of a group of cultural, technology, nonprofit, and governmental organizations from around the world that helps build a permanent archive of multilingual digitized text and multimedia material. An archive of contributed material is available on the Internet Archive website and through Yahoo! and other search engines and sites. The OCA encourages access to and reuse of collections in the archive, while respecting the content owners and contributors."
The project aims at digitizing public domain books around the world and making them searchable through any web search engine and downloadable for free. Unlike Google Books, the OCA scans and digitizes only public domain books, except when the copyright holder has expressly given permission. The first contributors to the OCA were the University of California, the University of Toronto, the European Archive, the National Archives in the United Kingdom, O’Reilly Media and the Prelinger Archives. The digitized collections are freely available in the Text Archive section of the Internet Archive. There were 100,000 ebooks publicly available in December 2006 (with 12,000 new ebooks added per month), 200,000 ebooks in May 2007, and one million ebooks in December 2008.
Microsoft was one of the partners of the OCA, while also developing its own project. The beta version of Live Search Books was released in December 2006, offering keyword search across non-copyrighted books digitized by Microsoft in partner libraries. The British Library and the libraries of the universities of California and Toronto were the first to join, followed in January 2007 by the New York Public Library and Cornell University. Books were offered in full-text view and could be downloaded as PDF files. In May 2007, Microsoft announced agreements with several publishers, including Cambridge University Press and McGraw-Hill, for their books to be available in Live Search Books. After digitizing 750,000 books and indexing 80 million journal articles, Microsoft ended the Live Search Books program in May 2008 to focus on other activities, and closed the website. These books are available in the OCA collections of the Internet Archive.
A major issue for digital libraries is that digitized books are rarely proofread, although proofreading is what ensures an accurate text with no loss from the print version. The only digital library proofreading its books has been Project Gutenberg, with 28,000 high-quality ebooks available in January 2009. Good OCR (Optical Character Recognition) software, run on the image files obtained from scanning print pages, is said to ensure 99% accuracy. Proofreading is essential to Project Gutenberg, whose goal is to reach 99.99% accuracy for its ebooks, above the 99.95% accuracy set as a standard by the Library of Congress. This step is skipped by the Internet Archive, the OCA, Google and many others. Some R&D teams are working on better OCR technology, which means that digital libraries would have to go back to the original image files to produce a higher-quality book in the future, if they want to provide digital versions with no loss from the print version.
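To make these accuracy figures more concrete, the short Python sketch below converts each percentage into an expected number of wrong characters per page and per book. The three accuracy levels come from the text above. The page and book sizes (2,000 characters per page, 300 pages per book) and all names in the code are assumptions chosen only for illustration, and the calculation treats accuracy as a simple per-character rate, which may not match how each organization measures it.

```python
# Illustrative sizes only: these two values are assumptions, not from the text.
CHARS_PER_PAGE = 2_000
PAGES_PER_BOOK = 300

# Accuracy levels quoted in the text, read here as per-character rates.
ACCURACY_LEVELS = {
    "Good OCR without proofreading": 0.99,
    "Library of Congress standard": 0.9995,
    "Project Gutenberg goal": 0.9999,
}


def expected_errors(accuracy: float, chars: int) -> float:
    """Return the expected number of wrong characters among `chars` characters."""
    return (1.0 - accuracy) * chars


def error_table() -> list[tuple[str, float, float]]:
    """Return (label, errors per page, errors per book) for each accuracy level."""
    rows = []
    for label, accuracy in ACCURACY_LEVELS.items():
        per_page = expected_errors(accuracy, CHARS_PER_PAGE)
        per_book = per_page * PAGES_PER_BOOK
        rows.append((label, per_page, per_book))
    return rows


if __name__ == "__main__":
    for label, per_page, per_book in error_table():
        print(f"{label}: {per_page:.1f} errors/page, {per_book:.0f} errors/book")
```

Under these assumed sizes, 99% accuracy leaves about 20 wrong characters on every page, the 99.95% standard about one per page, and the 99.99% goal about one every five pages, which suggests why unproofread OCR text can differ noticeably from the print version.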