Human review of Web pages has the advantage of allowing more nuanced, if not more accurate, interpretations than automated classification systems are capable of making, but it suffers from its own sources of error. The filtering software companies involved here have limited staffs, of between eight and a few dozen people, available for hand reviewing Web pages. The reviewers employed by these companies base their categorization decisions on both the text and the visual depictions that appear on the sites or pages they are assigned to review. Human reviewers generally focus on English-language Web sites and are generally not required to be multilingual. Given the speed at which human reviewers must work to keep up with even a fraction of the approximately 1.5 million pages added to the publicly indexable Web each day, human error is inevitable. Errors are likely to result from boredom or lack of attentiveness, from overzealousness, or from a desire to "err on the side of caution" by screening out material that might be offensive to some customers, even if it does not fit within any of the company's category definitions. None of the filtering companies trains its reviewers in the legal definitions of what is obscene, child pornography, or harmful to minors, and none instructs reviewers to take community standards into account when making categorization decisions.
Perhaps because of limitations on the number of human reviewers and because of the large number of new pages that are added to the Web every day, filtering companies also widely engage in the practice of categorizing entire Web sites at the "root URL," rather than engaging in a more fine-grained analysis of the individual pages within a Web site. For example, the filtering software companies deposed in this case all categorize the entire Playboy Web site as Adult, Sexually Explicit, or Pornography. They do not differentiate between pages within the site that contain sexually explicit images or text and pages that contain no sexually explicit content, such as the text of interviews with celebrities or politicians. If the "root" or "top-level" URL of a Web site is given a category tag, then access to all content on that Web site will be blocked whenever the assigned category is enabled by a customer. In some cases, whole Web sites are blocked because the filtering companies focus only on the content of the home page that is accessed by entering the root URL. Entire Web sites containing multiple Web pages are commonly categorized without human review of each individual page on the site. Web sites that contain multiple Web pages and that require authentication or payment for access are commonly categorized based solely on a human reviewer's evaluation of the pages that may be viewed before reaching the authentication or payment page.
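The effect of tagging a site only at its root URL can be illustrated with a short sketch. The control-list entries, category names, and matching logic below are hypothetical simplifications offered for illustration; the record does not describe the vendors' actual data structures or code.

```python
# Illustrative only: a simplified control list keyed by root domain.
from urllib.parse import urlparse

CONTROL_LIST = {"playboy.com": "Adult"}   # entire site tagged at the root URL
ENABLED_CATEGORIES = {"Adult"}            # categories the customer has enabled

def is_blocked(url: str) -> bool:
    host = (urlparse(url).hostname or "").removeprefix("www.")
    # Every page under the root inherits the root's category tag,
    # whether or not the individual page is sexually explicit.
    return CONTROL_LIST.get(host) in ENABLED_CATEGORIES

print(is_blocked("http://www.playboy.com/gallery/explicit.html"))      # True
print(is_blocked("http://www.playboy.com/interviews/politician.html")) # True (overblocked)
```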
Because there may be hundreds or thousands of pages under a root URL, filtering companies make it their primary mission to categorize the root URL, and categorize subsidiary pages if the need arises or if there is time. This form of overblocking is called "inheritance," because lower-level pages inherit the categorization of the root URL without regard to their specific content. In some cases, "reverse inheritance" also occurs, i.e., parent sites inherit the classification of pages at a lower level of the site. This might happen when pages with sexual content appear in a Web site that is devoted primarily to non-sexual content. For example, N2H2's Bess filtering product classifies every page in the Salon.com Web site, which contains a wide range of news and cultural commentary, as "Sex, Profanity," based on the fact that the site includes a regular column that deals with sexual issues. Blocking by both domain name and IP address is another practice in which filtering companies engage, a function both of the architecture of the Web and of the exigencies of dealing with the rapidly expanding number of Web pages. The category lists maintained by filtering software companies can include URLs in their human-readable domain name form, in their numeric IP address form, or in both. Through "virtual hosting" services, hundreds of thousands of Web sites with distinct domain names may share a single numeric IP address. To the extent that filtering companies block the IP addresses of virtual hosting services, they will necessarily block a substantial amount of content without reviewing it, and will likely overblock a substantial amount of content.
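The consequence of blocking by numeric IP address for virtually hosted sites can likewise be sketched in a few lines. The domain names and addresses below are hypothetical placeholders, not entries from any vendor's actual category list.

```python
# Illustrative only: many distinct domain names hosted by one virtual-hosting
# service can resolve to the same numeric IP address.
DNS = {
    "adult-example.test":   "192.0.2.10",
    "quilting-club.test":   "192.0.2.10",  # co-hosted, unrelated content
    "school-district.test": "192.0.2.10",  # co-hosted, unrelated content
}

BLOCKED_IPS = {"192.0.2.10"}  # entry added because of adult-example.test

def is_blocked(domain: str) -> bool:
    # Blocking the shared address sweeps in every co-hosted site,
    # including sites whose content was never reviewed.
    return DNS.get(domain) in BLOCKED_IPS

for name in DNS:
    print(name, "blocked:", is_blocked(name))  # all three print True
```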
Another technique that filtering companies use to deal with a structural feature of the Internet is blocking the root-level URLs of so-called "loophole" Web sites. These are Web sites that provide access to a particular Web page but display in the user's browser a URL that is different from the URL with which that page is usually associated. Because of this feature, they provide a "loophole" that can be used to circumvent filtering software, i.e., they display a URL that differs from the one that appears on the filtering company's control list. "Loophole" Web sites include caches of Web pages that have been removed from their original location, "anonymizer" sites, and translation sites. Caches are archived copies that some search engines, such as Google, keep of the Web pages they index. The cached copy stored by Google has a URL that is different from the original URL. Because Web sites often change rapidly, caches may be the only way to access pages that have been taken down, revised, or moved to a different URL. For example, a magazine might place its current stories under a given URL and replace them monthly with new stories. If a user wanted to find an article published six months ago, he or she would be unable to access it if not for Google's cached version.
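The "loophole" problem stems from the fact that a control list keyed to a page's original URL will not match the different URL under which a cache or proxy serves the same content. The URLs and cache-URL format below are invented for illustration and are not meant to reproduce Google's actual addressing scheme.

```python
# Illustrative only: an exact-match lookup against the control list
# misses the cached copy because its URL differs from the original.
BLOCKED_URLS = {"http://explicit-example.test/page.html"}

def is_blocked(url: str) -> bool:
    return url in BLOCKED_URLS

original = "http://explicit-example.test/page.html"
cached   = "http://cache.search.test/view?copy=explicit-example.test/page.html"

print(is_blocked(original))  # True  -- the listed URL is blocked
print(is_blocked(cached))    # False -- the same content reaches the user (underblocking)
```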
Some sites on the Web serve as a proxy or intermediary between a user and another Web page. When using a proxy server, a user does not access the page from its original URL, but rather from the URL of the proxy server. One type of proxy service is an "anonymizer." Users may access Web sites indirectly via an anonymizer when they do not want the Web site they are visiting to be able to determine the IP address from which they are accessing the site, or to leave "cookies" on their browser. Some proxy servers can be used to attempt to translate Web page content from one language to another. Rather than directly accessing the original Web page in its original language, users can instead indirectly access the page via a proxy server offering translation features. As noted above, filtering companies often block loophole sites, such as caches, anonymizers, and translation sites. The practice of blocking loophole sites necessarily results in a significant amount of overblocking, because the vast majority of the pages that are cached, for example, do not contain content that would match a filtering company's category definitions. Filters that do not block these loophole sites, however, may enable users to access any URL on the Web via the loophole site, thus resulting in substantial underblocking.

3. The Process for "Re-Reviewing" Web Pages After Their Initial Categorization

Most filtering software companies do not engage in subsequent reviews of categorized sites or pages on a scheduled basis. Priority is placed on reviewing and categorizing new sites and pages, rather than on re-reviewing already categorized sites and pages. Typically, a filtering software vendor's previous categorization of a Web site is not re-reviewed for accuracy when new pages are added to the Web site. To the extent the Web site was previously categorized as a whole, the new pages added to the site usually share the categorization assigned by the blocking product vendor. This necessarily results in both over- and underblocking, because, as noted above, the content of Web pages and Web sites changes relatively rapidly.
In addition to the content on Web sites or pages changing rapidly, Web sites themselves may disappear and be replaced by sites with entirely different content. If an IP address associated with a particular Web site is blocked under a particular category and the Web site goes out of existence, then the IP address likely would be reassigned to a different Web site, either by an Internet service provider or by a registration organization, such as the American Registry for Internet Numbers, see http://www.arin.net. In that case, the site that received the reassigned IP address would likely be miscategorized. Because filtering companies do not engage in systematic re-review of their category lists, such a site would likely remain miscategorized unless someone submitted it to the filtering company for re-review, increasing the incidence of over- and underblocking. This failure to re-review Web pages primarily increases a filtering company's rate of overblocking. However, if a filtering company does not re-review Web pages after it determines that they do not fall into any of its blocking categories, then that would result in underblocking (because, for example, a page might add sexually explicit content).

3. The Inherent Tradeoff Between Overblocking and Underblocking
There is an inherent tradeoff between any filter's rate of overblocking and its rate of underblocking, rates that correspond inversely to what information scientists call a classification system's "precision" and "recall." Precision is the proportion of the things a classification system assigns to a certain category that are appropriately classified; the rate of overblocking is the proportion that are not. The plaintiffs' expert, Dr. Nunberg, provided the hypothetical example of a classification system that is asked to pick out pictures of dogs from a database consisting of 1000 pictures of animals, of which 200 were actually dogs. If it returned 100 hits, of which 80 were in fact pictures of dogs, and the remaining 20 were pictures of cats, horses, and deer, we would say that the system identified dog pictures with a precision of 80%. This would be analogous to a filter that overblocked at a rate of 20%. The recall measure involves determining what proportion of the actual members of a category the classification system has been able to identify; the rate of underblocking is the proportion it has missed. For example, because the hypothetical animal-picture database contained a total of 200 pictures of dogs, and the system identified 80 of them and failed to identify the remaining 120, it performed with a recall of 40%. This would be analogous to a filter that underblocked 60% of the material in a category. In automated classification systems, there is always a tradeoff between precision and recall. In the animal-picture example, recall could be improved by using a looser set of criteria to identify the dog pictures, such as treating any animal with four legs as a dog; all the dogs would then be identified, but cats and other animals would also be included, with a resulting loss of precision. The same tradeoff exists between rates of overblocking and underblocking in filtering systems that use automated classification. For example, an automated system that classifies any Web page containing the word "sex" as sexually explicit will underblock much less, but overblock much more, than a system that classifies as sexually explicit only those Web pages containing the phrase "free pictures of people having sex."
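Dr. Nunberg's hypothetical can be restated numerically. The short calculation below simply applies the standard definitions of precision and recall to the figures in the example; it is an arithmetic illustration, not part of any party's study.

```python
# Figures from the dog-picture hypothetical: 100 "dog" hits returned,
# 80 of them actually dogs, out of 200 dog pictures in the database.
true_positives = 80     # dog pictures correctly identified
false_positives = 20    # cats, horses, and deer returned by mistake
false_negatives = 120   # dog pictures the system failed to identify

precision = true_positives / (true_positives + false_positives)  # 80 / 100 = 0.80
recall    = true_positives / (true_positives + false_negatives)  # 80 / 200 = 0.40

overblocking_rate  = 1 - precision  # 20%: the share of hits wrongly included
underblocking_rate = 1 - recall     # 60%: the share of dogs missed

print(f"precision {precision:.0%}, recall {recall:.0%}")
print(f"overblocking {overblocking_rate:.0%}, underblocking {underblocking_rate:.0%}")
```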
This tradeoff between overblocking and underblocking applies not just to automated classification systems, but also to filters that rely only on human review. Given the approximately two billion pages that exist on the Web, the 1.5 million new pages that are added daily, and the rate at which content on existing pages changes, if a filtering company blocks only those Web pages that have been reviewed by humans, it will be impossible, as a practical matter, to avoid vast amounts of underblocking. Techniques used by human reviewers, such as blocking at the IP address level, domain name level, or directory level, reduce the rate of underblocking, but necessarily increase the rate of overblocking, as discussed above. To use a simple example, it would be easy to design a filter intended to block sexually explicit speech that completely avoids overblocking. Such a filter would have only a single sexually explicit Web site on its control list, which could be re-reviewed daily to ensure that its content does not change. While there would be no overblocking problem with such a filter, it would have a severe underblocking problem, as it would fail to block all the sexually explicit speech on the Web other than the one site on its control list. Similarly, it would also be easy to design a filter intended to block sexually explicit speech that completely avoids underblocking. Such a filter would operate by permitting users to view only a single Web site, e.g., the Sesame Street Web site. While there would be no underblocking problem with such a filter, it would have a severe overblocking problem, as it would block access to millions of non-sexually explicit sites on the Web other than the Sesame Street site.
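The two extreme designs described above can be expressed as trivially simple filters. The single entries on each list are hypothetical placeholders (an invented address stands in for the Sesame Street site); the point is only that each design eliminates one kind of error at the cost of maximizing the other.

```python
# Illustrative only: the two degenerate filter designs discussed above.
SINGLE_BLOCKED_SITE = "http://explicit-example.test/"   # hypothetical blocked site
SINGLE_ALLOWED_SITE = "http://children-example.test/"   # stand-in for the Sesame Street site

def blacklist_filter(url: str) -> bool:
    # Blocks only one hand-reviewed site: no overblocking,
    # but every other sexually explicit site is underblocked.
    return url.startswith(SINGLE_BLOCKED_SITE)

def whitelist_filter(url: str) -> bool:
    # Permits only one site: no underblocking, but every other
    # site on the Web, explicit or not, is overblocked.
    return not url.startswith(SINGLE_ALLOWED_SITE)
```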
While it is thus quite simple to design a filter that does not overblock, and equally simple to design a filter that does not underblock, it is currently impossible, given the Internet's size, rate of growth, rate of change, and architecture, and given the state of the art of automated classification systems, to develop a filter that neither underblocks nor overblocks a substantial amount of speech. The more effective a filter is at blocking Web sites in a given category, the more the filter will necessarily overblock. Any filter that is reasonably effective in preventing users from accessing sexually explicit content on the Web will necessarily block substantial amounts of non-sexually explicit speech.

4. Attempts to Quantify Filtering Programs' Rates of Over- and Underblocking

The government presented three studies that attempt to quantify the over- and underblocking rates of five different filtering programs: two from expert witnesses, and one from a librarian fact witness who conducted a study using Internet use logs from his own library. The plaintiffs presented one expert witness who attempted to quantify the rates of over- and underblocking for various programs. Each of these attempts to quantify rates of over- and underblocking suffers from various methodological flaws.
The fundamental problem with calculating over- and underblocking rates is selecting a universe of Web sites or Web pages to serve as the set to be tested. The studies that the parties submitted in this case took two different approaches to this problem. Two of the studies, one prepared by the plaintiffs' expert witness Chris Hunter, a graduate student at the University of Pennsylvania, and the other prepared by the defendants' expert, Chris Lemmons of eTesting Laboratories, in Research Triangle Park, North Carolina, approached it by compiling two separate lists of Web sites: one of URLs that they deemed should be blocked according to the filters' criteria, and another of URLs that they deemed should not be blocked. They compiled these lists by choosing Web sites from the results of certain keyword searches. The problem with this selection method is that the resulting sample is neither random nor necessarily representative of the universe of Web pages that library patrons actually visit.