Multilingualism on the Web - Marie Lebert

"UNL (Universal Networking Language) is a language that - with its companion 'enconverter' and 'deconverter' software - enables communication among peoples of differing native languages. It will reside, as a plug-in for popular World Wide Web browsers, on the Internet, and will be compatible with standard network servers. The technology will be shared among the member states of the United Nations. Any person with access to the Internet will be able to 'enconvert' text from any native language of a member state into UNL. Just as easily, any UNL text can be 'deconverted' from UNL into native languages. United Nations University's UNL Center will work with its partners to create and promote the UNL software, which will be compatible with popular network servers and computing platforms."

The Natural Language Group (NLG) at the Information Sciences Institute (ISI) of the University of Southern California (USC) is currently involved in various aspects of computational/natural language processing. The group's projects are: machine translation; automated text summarization; multilingual verb access and text management; development of large concept taxonomies (ontologies); discourse and text generation; construction of large lexicons for various languages; and multimedia communication.

Eduard Hovy, Head of the Natural Language Group, explained in his e-mail of August 27, 1998:

"Your presentation outline looks very interesting to me. I do wonder, however, where you discuss the language-related applications/functionalities that are not translation, such as information retrieval (IR) and automated text summarization (SUM). You would not be able to find anything on the Web without IR! — all the search engines (AltaVista, Yahoo!, etc.) are built upon IR technology. Similarly, though much newer, it is likely that many people will soon be using automated summarizers to condense (or at least, to extract the major contents of) single (long) documents or lots of (any length) ones together. […]

In this context, multilingualism on the Web is another complexifying factor. People will write their own language for several reasons — convenience, secrecy, and local applicability — but that does not mean that other people are not interested in reading what they have to say! This is especially true for companies involved in technology watch (say, a computer company that wants to know, daily, all the Japanese newspaper and other articles that pertain to what they make) or some Government Intelligence agencies (the people who provide the most up-to-date information for use by your government officials in making policy, etc.). One of the main problems faced by these kinds of people is the flood of information, so they tend to hire 'weak' bilinguals who can rapidly scan incoming text and throw out what is not relevant, giving the relevant stuff to professional translators. Obviously, a combination of SUM and MT (machine translation) will help here; since MT is slow, it helps if you can do SUM in the foreign language, and then just do a quick and dirty MT on the result, allowing either a human or an automated IR-based text classifier to decide whether to keep or reject the article.

For these kinds of reasons, the US Government has over the past five years been funding research in MT, SUM, and IR, and is interested in starting a new program of research in Multilingual IR. This way you will be able to one day open Netscape or Explorer or the like, type in your query in (say) English, and have the engine return texts in *all* the languages of the world. You will have them clustered by subarea, summarized by cluster, and the foreign summaries translated, all the kinds of things that you would like to have.

You can see a demo of our version of this capability, using English as the user language and a collection of approx. 5,000 texts of English, Japanese, Arabic, Spanish, and Indonesian, by visiting MuST Multilingual Information Retrieval, Summarization, and Translation System.

Type your query word (say, 'baby', or whatever you wish) in and press 'Enter/Return'. In the middle window you will see the headlines (or just keywords, translated) of the retrieved documents. On the left you will see what language they are in: 'Sp' for Spanish, 'Id' for Indonesian, etc. Click on the number at left of each line to see the document in the bottom window. Click on 'Summarize' to get a summary. Click on 'Translate' for a translation (but beware: Arabic and Japanese are extremely slow! Try Indonesian for a quick word-by-word 'translation' instead).

This is not a product (yet); we have lots of research to do in order to improve the quality of each step. But it shows you the kind of direction we are heading in."
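
The triage workflow Eduard Hovy describes (summarize in the foreign language first, run a quick and rough translation on the summary only, then let a classifier keep or reject the article) can be illustrated with a minimal sketch. This is not the MuST system or any part of it: every function name, the tiny Indonesian glossary, and the keyword list below are hypothetical stand-ins chosen for illustration, and each stage is deliberately naive (first sentences as the "summary", a word-by-word gloss as the "translation", a keyword match as the "classifier").

```python
from typing import Dict, List

# Hypothetical toy data, invented for this illustration only.
GLOSSARY: Dict[str, str] = {
    "komputer": "computer", "baru": "new", "dijual": "sold",
    "hari": "day", "ini": "this", "harga": "price", "turun": "falls",
}
KEYWORDS: List[str] = ["computer"]


def summarize(text: str, max_sentences: int = 2) -> str:
    """Stand-in for SUM: keep only the first few sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."


def rough_translate(text: str, glossary: Dict[str, str]) -> str:
    """Stand-in for quick MT: gloss word by word, leaving unknown words as-is."""
    glossed = []
    for word in text.split():
        core = word.strip(".,;:!?").lower()
        glossed.append(glossary.get(core, core))
    return " ".join(glossed)


def is_relevant(text: str, keywords: List[str]) -> bool:
    """Stand-in for an IR-based classifier: match any keyword."""
    tokens = set(text.lower().split())
    return any(keyword in tokens for keyword in keywords)


def triage(article: str, glossary: Dict[str, str], keywords: List[str]) -> bool:
    """Summarize in the source language, gloss the summary, then keep or reject."""
    summary = summarize(article)                # cheap step, on the full text
    gloss = rough_translate(summary, glossary)  # slow step, on the summary only
    return is_relevant(gloss, keywords)


if __name__ == "__main__":
    articles = [
        "Komputer baru dijual hari ini. Harga turun. Cuaca cerah.",
        "Cuaca cerah hari ini. Hujan besok.",
    ]
    for article in articles:
        decision = "keep" if triage(article, GLOSSARY, KEYWORDS) else "reject"
        print(decision, "-", article)
```

The point of the ordering is the one made in the e-mail: translation is the expensive step, so it is applied to the short summary rather than to the whole document, and only the articles that survive the filter would go on to a professional translator.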

"How do you see the future of Internet-related activities as regards languages?"