N2H2 offers the following categories: Adults Only; Alcohol; Auction; Chat; Drugs; Electronic Commerce; Employment Search; Free Mail; Free Pages; Gambling; Games; Hate/Discrimination; Illegal; Jokes; Lingerie; Message/Bulletin Boards; Murder/Suicide; News; Nudity; Personal Information; Personals; Pornography; Profanity; Recreation/Entertainment; School Cheating Information; Search Engines; Search Terms; Sex; Sports; Stocks; Swimsuits; Tasteless/Gross; Tobacco; Violence; and Weapons. The "Nudity" category purports to block only "non-pornographic" images. The "Sex" category is intended to block only those depictions of sexual activity that are not intended to arouse. The "Tasteless/Gross" category includes content such as "tasteless humor" and "graphic medical or accident scene photos." Additionally, N2H2 offers seven "exception categories": Education, Filtered Search Engine, For Kids, History, Medical, Moderated, and Text/Spoken Only. When an exception category is enabled, access to any Web site or page whose URL is associated with both a category and an exception category, for example, both "Sex" and "Education," will be allowed, even if the customer has enabled the product to otherwise block the "Sex" category. As of November 15, 2001, of those Web sites categorized by N2H2 as "Sex," 3.6% were also categorized as "Education," 2.9% as "Medical," and 1.6% as "History."
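For illustration only, the following minimal sketch shows the override rule that such exception categories embody: a URL tagged with both a blocked category and an enabled exception category is allowed. The function name, category names, and data structures are hypothetical and do not describe N2H2's actual implementation.

```python
# Hypothetical sketch of exception-category logic; not any vendor's actual code.

def is_blocked(url_tags, blocked_categories, enabled_exceptions):
    """Return True if the URL should be blocked under the configured policy."""
    # A URL carrying an enabled exception category (e.g., "Education") is allowed
    # even if it also carries a category the customer has chosen to block (e.g., "Sex").
    if url_tags & enabled_exceptions:
        return False
    # Otherwise, block if any tag matches an enabled blocking category.
    return bool(url_tags & blocked_categories)

# A page tagged both "Sex" and "Education" is allowed when the "Education"
# exception is enabled, even though "Sex" blocking is enabled.
print(is_blocked({"Sex", "Education"}, {"Sex"}, {"Education"}))  # False
print(is_blocked({"Sex"}, {"Sex"}, {"Education"}))               # True
```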
Websense offers the following categories: Abortion Advocacy; Advocacy Groups; Adult Material; Business & Economy; Drugs; Education; Entertainment; Gambling; Games; Government; Health; Illegal/Questionable; Information Technology; Internet Communication; Job Search; Militancy/Extremist; News & Media; Productivity Management; Bandwidth Management; Racism/Hate; Religion; Shopping; Society & Lifestyle; Special Events; Sports; Tasteless; Travel; Vehicles; Violence; and Weapons. The "Adult" category includes "full or partial nudity of individuals," as well as sites offering "light adult humor and literature" and "sexually explicit language." The "Sexuality/Pornography" category includes, inter alia, "hard-core adult humor and literature" and "sexually explicit language." The "Tasteless" category includes "hard-to-stomach sites, including offensive, worthless or useless sites, grotesque or lurid depictions of bodily harm." The "Hacking" category blocks "sites providing information on or promoting illegal or questionable access to or use of communications equipment and/or software."

SmartFilter offers the following categories: Anonymizers/Translators; Art & Culture; Chat; Criminal Skills; Cults/Occult; Dating; Drugs; Entertainment; Extreme/Obscene/Violence; Gambling; Games; General News; Hate Speech; Humor; Investing; Job Search; Lifestyle; Mature; MP3 Sites; Nudity; On-line Sales; Personal Pages; Politics, Opinion & Religion; Portal Sites; Self-Help/Health; Sex; Sports; Travel; Usenet News; and Webmail.

Most importantly, no category definition used by filtering software companies is identical to CIPA's definitions of visual depictions that are obscene, child pornography, or harmful to minors. And category definitions and categorization decisions are made without reference to local community standards. Moreover, there is no judicial involvement in the creation of filtering software companies' category definitions, and no judicial determination is made before these companies categorize a Web page or site.
Each filtering software company associates each URL in its control list with a "tag" or other identifier that indicates the company's evaluation of whether the content or features of the Web site or page accessed via that URL meets one or more of its category definitions. If a user attempts to access a Web site or page that is blocked by the filter, the user is immediately presented with a screen that indicates that a block has occurred as a result of the operation of the filtering software. These "denial screens" appear only at the point that a user attempts to access a site or page in an enabled category. All four of the filtering programs on which evidence was presented allow users to customize the category lists that exist on their own PCs or servers by adding or removing specific URLs. For example, if a public librarian charged with administering a library's Internet terminals comes across a Web site that he or she finds objectionable that is not blocked by the filtering program that his or her library is using, then the librarian may add that URL to a category list that exists only on the library's network, and it would thereafter be blocked under that category. Similarly, a customer may remove individual URLs from category lists. Importantly, however, no one but the filtering companies has access to the complete list of URLs in any category. The actual URLs or IP addresses of the Web sites or pages contained in filtering software vendors' category lists are considered to be proprietary information, and are unavailable for review by customers or the general public, including the proprietors of Web sites that are blocked by filtering software.
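The sketch below illustrates, in simplified form, how such a product might resolve a requested URL against the vendor's proprietary category list, a customer's local additions and removals, and the categories the customer has enabled for blocking. The URLs, data structures, and function are invented for illustration and do not describe any vendor's actual software.

```python
# Illustrative sketch of URL lookup with local customization; not any vendor's actual code.

VENDOR_LIST = {"http://example.com/page": {"Sex"}}          # proprietary vendor categorization
LOCAL_ADDITIONS = {"http://local-example.org": {"Nudity"}}  # URLs the librarian has added
LOCAL_REMOVALS = {"http://example.com/page"}                # URLs the librarian has unblocked
ENABLED_CATEGORIES = {"Sex", "Nudity", "Pornography"}       # categories enabled for blocking

def check_request(url):
    """Decide whether to serve the page or show a denial screen."""
    if url in LOCAL_REMOVALS:
        return "ALLOW"
    tags = VENDOR_LIST.get(url, set()) | LOCAL_ADDITIONS.get(url, set())
    blocked = tags & ENABLED_CATEGORIES
    if blocked:
        # The user sees a "denial screen" identifying the filter as the cause of the block.
        return "DENIAL SCREEN: blocked under " + ", ".join(sorted(blocked))
    return "ALLOW"

print(check_request("http://example.com/page"))   # ALLOW (locally removed by the librarian)
print(check_request("http://local-example.org"))  # DENIAL SCREEN: blocked under Nudity
```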
Filtering software companies do not generally notify the proprietors of Web sites when they block their sites. The only way to discover which URLs are blocked and which are not blocked by any particular filtering company is by testing individual URLs with filtering software, or by entering URLs one by one into the "URL checker" that most filtering software companies provide on their Web sites. Filtering software companies will entertain requests for recategorization from proprietors of Web sites who discover that their sites are blocked. Because new pages are constantly being added to the Web, filtering companies provide their customers with periodic updates of category lists. Once a particular Web page or site is categorized, however, filtering companies generally do not re-review the contents of that page or site unless they receive a request to do so, even though the content on individual Web pages and sites changes frequently.

2. The Methods that Filtering Companies Use to Compile Category Lists
While the way in which filtering programs operate is conceptually straightforward (a requested URL is compared against a previously compiled list of URLs, and access to the content at that URL is blocked if the URL appears on the list), accurately compiling and categorizing URLs to form the category lists is a more complex process that is impossible to conduct with any high degree of accuracy. The specific methods that filtering software companies use to compile and categorize control lists are, like the lists themselves, proprietary information. We will therefore set forth only general information on the various types of methods that all filtering companies deposed in this case use, and the sources of error that are at once inherent in those methods and unavoidable given the current architecture of the Internet and the current state of the art in automated classification systems. We base our understanding of these methods largely on the detailed testimony and expert report of Dr. Geoffrey Nunberg, which we credit. The plaintiffs offered, and the Court qualified, Nunberg as an expert witness on automated classification systems. When compiling and categorizing URLs for their category lists, filtering software companies go through two distinct phases. First, they must collect or "harvest" the relevant URLs from the vast number of sites that exist on the Web. Second, they must sort through the URLs they have collected to determine under which of the company's self-defined categories (if any) they should be classified. These tasks necessarily result in a tradeoff between overblocking (i.e., blocking content that does not meet the category definitions established by CIPA or by the filtering software companies) and underblocking (i.e., failing to include on a control list a URL whose content would meet those category definitions).

1. The "Harvesting" Phase
Filtering software companies, given their limited resources, do not attempt to index or classify all of the billions of pages that exist on the Web. Instead, the set of pages that they attempt to examine and classify is restricted to a small portion of the Web. The companies use a variety of automated and manual methods to identify a universe of Web sites and pages to "harvest" for classification. These methods include: entering certain key words into search engines; following links from a variety of online directories (e.g., generalized directories like Yahoo or various specialized directories, such as those that provide links to sexually explicit content); reviewing lists of newly-registered domain names; buying or licensing lists of URLs from third parties; "mining" access logs maintained by their customers; and reviewing other submissions from customers and the public. The goal of each of these methods is to identify as many URLs as possible that are likely to contain content that falls within the filtering companies' category definitions.
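As a rough illustration of this harvesting step, the sketch below simply merges candidate URLs drawn from several such sources into one deduplicated set for later categorization. The source names and URLs are invented, and no vendor's actual pipeline is this simple.

```python
# Schematic sketch of the "harvesting" phase; the sources and URLs are invented placeholders.

def harvest_candidate_urls(sources):
    """Merge URLs gathered from several sources into one deduplicated candidate set."""
    candidates = set()
    for urls in sources.values():
        candidates.update(urls)
    return candidates

sources = {
    "search engine keyword queries": ["http://site-a.example/adult", "http://site-b.example"],
    "specialized Web directories":   ["http://site-b.example", "http://site-c.example"],
    "newly registered domains":      ["http://new-domain.example"],
    "purchased/licensed URL lists":  ["http://site-d.example"],
    "customer access logs":          ["http://site-e.example/page"],
    "customer and public reports":   ["http://site-a.example/adult"],
}

print(len(harvest_candidate_urls(sources)))  # 6 distinct candidate URLs await categorization
```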
The first method, entering certain keywords into commercial search engines, suffers from several limitations. First, the Web pages that may be "harvested" through this method are limited to those pages that search engines have already identified. However, as noted above, a substantial portion of the Web is not even theoretically indexable (because it is not linked to by any previously known page), and only approximately 50% of the pages that are theoretically indexable have actually been indexed by search engines. We are satisfied that the remainder of the indexable Web, and the vast "Deep Web," which cannot currently be indexed, include materials that meet CIPA's categories of visual depictions that are obscene, child pornography, and harmful to minors. These portions of the Web cannot presently be harvested through the methods that filtering software companies use (except through reporting by customers or by observing users' log files), because they are not linked to other known pages. A user can, however, gain access to a Web site in the unindexed Web or the Deep Web if the Web site's proprietor or some other third party informs the user of the site's URL. Some Web sites, for example, send out mass email advertisements containing the site's URL, via the spamming process we have described above. Second, the search engines that software companies use for harvesting are able to search text only, not images. This is of critical importance, because CIPA, by its own terms, covers only "visual depictions." 20 U.S.C. Sec. 9134(f)(1)(A)(i); 47 U.S.C. Sec. 254(h)(5)(B)(i). Image recognition technology is immature, ineffective, and unlikely to improve substantially in the near future. None of the filtering software companies deposed in this case employs image recognition technology when harvesting or categorizing URLs. Due to the reliance on automated text analysis and the absence of image recognition technology, a Web page with sexually explicit images and no text cannot be harvested using a search engine. This problem is compounded by the fact that Web site publishers may use image files rather than text to represent words, i.e., they may use a file that computers understand to be a picture, like a photograph of a printed word, rather than regular text, making automated review of their textual content impossible. For example, if the Playboy Web site displays its name using a logo rather than regular text, a search engine would not see or recognize the Playboy name in that logo.
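The effect of text-only indexing can be seen in a toy example: the two pages and the keyword check below are invented, but they show why a page whose content appears only in image files is invisible to the text searches used for harvesting.

```python
# Minimal illustration (hypothetical pages, not real harvesting code) of why text-only
# indexing misses pages whose content, or even whose name, appears only inside images.

import re

TEXT_PAGE  = "<html><body><p>Explicit adult content here</p></body></html>"
IMAGE_PAGE = '<html><body><img src="logo.gif" alt=""><img src="photo1.jpg" alt=""></body></html>'

def matches_keywords(html, keywords):
    """Crude keyword check against a page's visible text; image files are opaque to it."""
    visible_text = re.sub(r"<[^>]+>", " ", html).lower()  # strip tags, keep text only
    return any(word in visible_text for word in keywords)

keywords = ["adult", "explicit"]
print(matches_keywords(TEXT_PAGE, keywords))   # True  -- page can be found by keyword search
print(matches_keywords(IMAGE_PAGE, keywords))  # False -- image-only page is invisible to text search
```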
In addition to collecting URLs through search engines and Web directories (particularly those specializing in sexually explicit sites or other categories relevant to one of the filtering companies' category definitions), and by mining user logs and collecting URLs submitted by users, the filtering companies expand their list of harvested URLs by using "spidering" software that can "crawl" the lists of pages produced by the previous four methods, following their links downward to bring back the pages to which they link (and the pages to which those pages link, and so on, but usually down only a few levels). This spidering software uses the same type of technology that commercial Web search engines use. While useful in expanding the number of relevant URLs, the ability to retrieve additional pages through this approach is limited by the architectural feature of the Web that page-to-page links tend to converge rather than diverge. That means that the more pages from which one spiders downward through links, the smaller the proportion of new sites one will uncover; if spidering the links of 1000 sites retrieved through a search engine or Web directory turns up 500 additional distinct adult sites, spidering an additional 1000 sites may turn up, for example, only 250 additional distinct sites, and the proportion of new sites uncovered will continue to diminish as more pages are spidered. These limitations on the technology used to harvest a set of URLs for review will necessarily lead to substantial underblocking of material with respect to both the category definitions employed by filtering software companies and CIPA's definitions of visual depictions that are obscene, child pornography, or harmful to minors.
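A toy sketch of this spidering step follows; the link graph is invented and far smaller than the real Web, but it shows both the depth-limited crawl and the diminishing returns that result when links converge.

```python
# Toy sketch of the "spidering" step: starting from seed URLs, follow links a few levels
# down and collect the pages reached. The link graph is invented; real crawlers fetch
# pages over the network, and converging links mean later seeds add fewer new pages.

from collections import deque

LINKS = {   # hypothetical page -> pages it links to
    "seed1": ["a", "b"],
    "seed2": ["b", "c"],   # note the overlap with seed1's links ("b")
    "a": ["d"],
    "b": ["d", "e"],
    "c": ["e"],
}

def spider(seeds, max_depth=2):
    """Breadth-first crawl of LINKS, stopping after max_depth levels of links."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue
        for linked in LINKS.get(page, []):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, depth + 1))
    return seen

print(sorted(spider(["seed1"])))           # pages reachable from seed1 alone
print(sorted(spider(["seed1", "seed2"])))  # adding a second seed yields only a few new pages
```

2. The "Winnowing" or Categorization Phase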
Once the URLs have been harvested, some filtering software companies use automated key word analysis tools to evaluate the content and/or features of the Web sites or pages accessed via those URLs and to tentatively prioritize or categorize them. This process may be characterized as "winnowing" the harvested URLs. The automated systems currently used by filtering software vendors to prioritize and to categorize, or tentatively categorize, the content and/or features of a Web site or page accessed via a particular URL operate by means of (1) simple key word searching and (2) statistical algorithms that rely on the frequency and structure of various linguistic features in a Web page's text. The automated systems used to categorize pages do not include image recognition technology. All of the filtering companies deposed in this case also employ human review of some or all collected Web pages at some point during the process of categorizing Web pages. As with the harvesting process, each technique employed in the winnowing process is subject to limitations that can result in both overblocking and underblocking.
First, simple key-word-based filters are subject to the obvious limitation that no string of words can identify all sites that contain sexually explicit content, and most strings of words are likely to appear in Web sites that are not properly classified as containing sexually explicit content.

As noted above, filtering software companies also use more sophisticated automated classification systems for the statistical classification of texts. These systems assign weights to words or other textual features and use algorithms to determine whether a text belongs to a certain category. These algorithms sometimes make reference to the position of a word within a text or its relative proximity to other words. The weights are usually determined by machine learning methods (often described as "artificial intelligence"). In this procedure, which resembles an automated form of trial and error, a system is given a "training set" consisting of documents preclassified into two or more groups, along with a set of features that might be useful in classifying the documents. The system then "learns" rules that assign weights to those features according to how well they work in classification, and assigns each new document to a category with a certain probability. Notwithstanding their "artificial intelligence" description, automated text classification systems are unable to grasp many distinctions between types of content that would be obvious to a human. And of critical importance, no presently conceivable technology can make the judgments necessary to determine whether a visual depiction fits the legal definitions of obscenity, child pornography, or harmful to minors.

Finally, all the filtering software companies deposed in this case use some form of human review in their process of winnowing and categorizing Web pages, although one company admitted to categorizing some Web pages without any human review. SmartFilter states that "the final categorization of every Web site is done by a human reviewer." Another filtering company asserts that of the 10,000 to 30,000 Web pages that enter the "work queue" to be categorized each day, two to three percent are automatically categorized by its PornByRef system (which applies only to materials classified in the pornography category), and the remainder are categorized by human review. SurfControl also states that no URL is ever added to its database without human review.
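To make the statistical approach concrete, the following toy sketch "learns" word weights from a small invented training set and scores new documents against them. The training texts, weights, and threshold are fabrications for illustration; no deposed company's proprietary system is described here.

```python
# Toy illustration of statistical text classification: weights are "learned" from a small
# invented training set by comparing word frequencies across classes, then used to score
# new documents. Real vendors' systems are proprietary and far more sophisticated.

from collections import Counter
import math

training_set = [
    ("explicit adult photos and adult videos", "sex"),
    ("hot explicit content for adults only", "sex"),
    ("breast cancer screening and medical advice", "other"),
    ("university course on human anatomy and health", "other"),
]

def train(examples):
    """Assign each word a weight: log of its relative frequency in 'sex' vs 'other' texts."""
    counts = {"sex": Counter(), "other": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    vocab = set(counts["sex"]) | set(counts["other"])
    return {w: math.log((counts["sex"][w] + 1) / (counts["other"][w] + 1)) for w in vocab}

def score(text, weights):
    """Sum the learned weights of a document's words; a positive score looks 'sex'-like."""
    return sum(weights.get(word, 0.0) for word in text.split())

weights = train(training_set)
print(score("explicit adult stories", weights) > 0)           # True: likely categorized as "sex"
print(score("medical advice on breast health", weights) > 0)  # False: likely left uncategorized
```

Even this toy example reflects the limitation noted above: the classifier sees only words, and can say nothing about the images that appear on a page.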