Over the past 12 months I have been contacted by a surprising number of new information technology (IT) companies and startups. Most of them plan to offer some variant of electronic commerce (online shopping, bartering, information gathering, etc.). Given the rather poor performance of current non-research-level natural language processing technology (when was the last time you easily and accurately found a correct answer to a question on the Web, without having to spend too much time sifting through irrelevant information?), this is a bit surprising. But I think everyone feels that the new developments in automated text summarization, question analysis, and so on, are going to make a significant difference. I hope so! But that level of performance is not available yet.
It seems to me that we will not get a big breakthrough, but we will get a somewhat acceptable level of performance, and then see slow but sure incremental improvement. The reason is that it is very hard to make your computer really "understand" what you mean: this requires us to build into the computer a network of "concepts" and their interrelationships that (at some level) mirror those in your own mind, at least in the subject areas of interest. The surface (word) level is not adequate: when you type in "capital of Switzerland", current systems have no way of knowing whether you mean "capital city" or "financial capital". Yet the vast majority of people would choose the former reading, based on phrasing and on knowledge about what kinds of things one is likely to ask the Web, and in what way.
Several projects are now building, or proposing to build, such large "concept" networks. This is not something one can do in two years, and not something that has a single correct result. We have to develop both the network and the techniques for building it semi-automatically and self-adaptively. This is a big challenge.
= What do you think about the debate concerning copyright on the Web? What practical solutions would you suggest?
As an academic, I am of course one of the parasites of society, and hence all in favor of free access to all information. But as a part-owner of a small startup company, I am aware of how much it costs to assemble and format information, and of the need to charge for it somehow.
To balance these two wishes, I like the model by which raw information, along with some "raw" resources (such as programming languages and basic access capabilities like the Web search engines), is made available for free. This creates a market and allows people to do at least something. But processed information, and the systems that help you get and structure exactly what you need, I think should be paid for. That allows developers of new and better technology to be rewarded for their effort.
Take an example: a dictionary, today, is not free. Dictionary companies refuse to make them available to research groups and others for free, arguing that they have centuries of work invested. (I have had several discussions with dictionary companies on this.) But dictionaries today are stupid products: you have to know the word before you can find the word! I would love to have something that allows me to give an approximate meaning, or perhaps a sentence or two with a gap where I want the word I am looking for, or even the equivalent in another language, and returns the word(s) I am looking for. This is not hard to build, but you need the core dictionary to start with. I think we should have the core dictionary freely available, and pay for the engine (or the service) that allows you to enter partial or only somewhat accurate information and helps you find the best result.
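[Editor's illustration, not part of the interview.] The lookup described above, going from an approximate meaning to a word, can be sketched very roughly as follows. This is only a toy under stated assumptions: CORE_DICTIONARY is a hypothetical four-entry stand-in for the freely available core dictionary, and the matching is plain word overlap between the description and each definition. The names content_words and reverse_lookup are invented for this sketch. A real engine would need far more than this (synonyms, word senses, other languages).

```python
import re

# Hypothetical stand-in for a freely available core dictionary (assumption).
CORE_DICTIONARY = {
    "capital": "city where the government of a country is located",
    "dictionary": "book that lists words and explains their meaning",
    "encyclopedia": "reference work with articles on many subjects",
    "summary": "short statement of the main points of a longer text",
}

# Very small stopword list, chosen only for this toy example.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "is",
             "that", "and", "with", "where", "their"}


def content_words(text):
    """Return the set of lowercase words in text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS}


def reverse_lookup(description, dictionary=CORE_DICTIONARY, top_n=3):
    """Rank headwords by how many content words their definition
    shares with the user's approximate description."""
    query = content_words(description)
    scored = []
    for headword, definition in dictionary.items():
        overlap = len(query & content_words(definition))
        if overlap:
            scored.append((overlap, headword))
    # Highest overlap first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [headword for _, headword in scored[:top_n]]


if __name__ == "__main__":
    print(reverse_lookup("a book that explains the meaning of words"))
```

The sketch keeps the data (the core dictionary) separate from the engine (reverse_lookup), which mirrors the split proposed above: the data could be free, while the engine that tolerates partial or inexact input is the part one would pay for.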
A second example: you should have free access to all the Web, and to basic search engines like those available today. No copyrights, no license fees. But if you want an engine that provides a good targeted answer, pinpointed and evaluated for trustworthiness, then I think it is not unreasonable to pay for that.
Naturally, an encyclopedia builder will not like my proposal. But to him or her I say: package your encyclopedia inside a useful access system, because without it the raw information you provide is just more data, and can easily get lost in the sea of data available and growing every hour.
*Interview of September 2, 2000