Have a problem with information overload now? Just wait a while

Many knowledge workers feel as if they are drowning in information.

It's not hard to understand why. The Web, which two years ago had an estimated 21 terabytes' worth of static HTML pages, is doubling in size each year. We are inundated with e-mails, online documents, photographs, images and videos.

According to a study by the University of California at Berkeley, at www.sims.berkeley.edu/how-much-info, the world produces between 1 exabyte and 2 exabytes of information each year, or about 250 megabytes for every man, woman and child on earth. An exabyte is a billion gigabytes.

The study also said photographs are accumulating at an annual rate of 410 petabytes (410 million gigabytes), while video files add up to 300 petabytes annually.
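As a rough check on the per-person arithmetic, the numbers work out as shown below. The world population of about 6 billion is an assumption made here for illustration; the byte quantities come from the study.

```python
# Rough check of the per-person arithmetic. The ~6 billion world population
# is an assumption for illustration; the byte quantities come from the study.
EXABYTE = 10**18   # bytes
PETABYTE = 10**15  # bytes
GIGABYTE = 10**9   # bytes
MEGABYTE = 10**6   # bytes

world_population = 6.0e9  # assumed, roughly the figure around 2000

for exabytes_per_year in (1, 2):
    per_person_mb = exabytes_per_year * EXABYTE / world_population / MEGABYTE
    print(f"{exabytes_per_year} exabyte(s)/year is about {per_person_mb:.0f} megabytes per person")

# And the photograph figure: 410 petabytes expressed in gigabytes.
print(f"410 petabytes = {410 * PETABYTE / GIGABYTE:,.0f} gigabytes")
```

The midpoint of the two per-person results lands right around the study's figure of 250 megabytes.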

No one can browse that much information, so why can't it be summarized better? Web search engines return too many irrelevant hits and put too much of the burden on the user, who typically has to know key words, browse the returned documents and then enter new words to refine the search.

Cross-language document searches, for example using English key words to find relevant documents in Chinese, are very difficult. Finding a certain piece of nontext material, such as a speech or video file, can turn into a major research project.

The Information Access Division of the National Institute of Standards and Technology's IT Laboratory is looking for better ways to access unstructured multimedia and other complex information: text, Web pages, images, video, voice, audio, and 2-D and 3-D graphics. NIST doesn't create such technologies, but we do contribute to them by developing performance metrics, evaluation methods, test suites and standards. We also work with industry to speed up commercial transition.

Our human language technology program focuses on three areas: the huge growth of non-English content, delivering focused information rather than lists of documents, and human-computer interfaces.

Evaluation program

Since 1992, NIST has held annual text-retrieval conferences to evaluate systems and algorithms developed by participating organizations. They have three goals: to improve the state of the art in Web retrieval, to locate answers rather than merely documents and to bridge language barriers.
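To give a sense of how such retrieval evaluations work, here is a minimal sketch of two standard measures, precision and recall, scored against human relevance judgments. The document IDs and judgments are invented for illustration; this is not the conferences' actual scoring software.

```python
# Minimal sketch of relevance-based scoring for a retrieval system:
# precision (how many returned documents were relevant) and
# recall (how many relevant documents were returned).

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant_docs = {"d1", "d4", "d7", "d9"}          # judged relevant by human assessors
system_results = ["d1", "d2", "d4", "d5", "d7"]   # what one system returned

p, r = precision_recall(system_results, relevant_docs)
print(f"precision = {p:.2f}, recall = {r:.2f}")   # 0.60, 0.75 for these made-up data
```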

NIST is working with the Defense Advanced Research Projects Agency on several human language projects. One involves summarization, or reducing the number of words that must be read to understand the main points. Our goal is a reliable, comprehensive evaluation program for summarization tools.
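A simple way to see what an automatic summarization score can capture is word overlap between a machine summary and a human-written reference. The sketch below illustrates only that basic idea, with made-up sentences; it is not the evaluation program itself.

```python
# Illustrative word-overlap score between a machine summary and a human
# reference summary. Sentences are invented for the example.

def overlap_recall(system_summary, reference_summary):
    """Fraction of reference words that also appear in the system summary."""
    sys_words = set(system_summary.lower().split())
    ref_words = set(reference_summary.lower().split())
    return len(sys_words & ref_words) / len(ref_words) if ref_words else 0.0

reference = "web content is doubling every year and overwhelming users"
candidate = "web content doubles each year overwhelming many users"
print(f"word-overlap recall = {overlap_recall(candidate, reference):.2f}")
```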

Another project involves tracking broadcast news. We are trying to develop ways to evaluate technologies that claim to search and organize multilingual text from broadcast news media.

Since 1988, the division has been benchmarking speech-to-text transcription systems. We started with speech read aloud at various vocabulary sizes; then we moved into spontaneous speech, radio and TV speech, telephone conversations and, finally, interactive meetings with unlimited vocabularies. NIST is working with DARPA to drive down the transcription error rates while producing useful metadata.
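Benchmarks of this kind generally rest on word error rate: the number of word substitutions, deletions and insertions needed to turn a system's transcript into the reference transcript, divided by the number of reference words. A minimal sketch, using an invented sentence pair:

```python
# Minimal sketch of word error rate (WER): word-level edit distance between
# the reference and hypothesis transcripts, divided by the reference length.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the committee will meet at noon on tuesday"
hyp = "the committee meets at noon tuesday"
print(f"WER = {word_error_rate(ref, hyp):.2f}")   # 0.38 for this invented pair
```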

Multimedia sources

A multimedia project is evaluating MPEG-7, a metadata standard for content-based, audio-visual information access. We are also developing metrics to test compliance of MPEG-7 encoders.

A program in user interaction was motivated by the enormous growth in Web information and applications, as well as by the government's mandate to make digital information more accessible.

One area deals with commercial software usability. Although poor usability drives up the total cost of owning productivity software, it is not generally treated as a cost in its own right.

Industry approached NIST to help buyers gauge usability in their procurement decisions, which would cut the cost of training and maintenance. The result is the Common Industry Format for reporting usability results.
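To illustrate the kinds of summary measures such a usability report can carry, the sketch below tallies task completion, time on task and satisfaction for a handful of hypothetical test users. The field names and figures are assumptions for illustration, not the Common Industry Format itself.

```python
# Illustrative tally of common usability measures for a set of test sessions.
# The data and field names are made up for this example.

tasks = [
    {"user": "u1", "completed": True,  "seconds": 95,  "satisfaction": 4},
    {"user": "u2", "completed": False, "seconds": 210, "satisfaction": 2},
    {"user": "u3", "completed": True,  "seconds": 130, "satisfaction": 5},
]

completion_rate = sum(t["completed"] for t in tasks) / len(tasks)
mean_time = sum(t["seconds"] for t in tasks) / len(tasks)
mean_satisfaction = sum(t["satisfaction"] for t in tasks) / len(tasks)

print(f"completion rate   = {completion_rate:.0%}")
print(f"mean time on task = {mean_time:.0f} s")
print(f"mean satisfaction = {mean_satisfaction:.1f} / 5")
```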

A related project deals with evaluating the usability of Web sites in a consistent way. NIST's Common Industry Format Testing, Evaluation and Reporting project will set up benchmarks for Web usability evaluation and eventually create a 'usability ground truth' for each Web site. This ground truth could serve as a benchmark for comparing different accessibility evaluation methods.

Other programs are developing protocols and software to evaluate the accuracy of biometric identifiers such as fingerprints and facial recognition. As more individual biometrics are collected, much more data will have to be stored and retrieved, both to identify persons and to control their access to facilities. High accuracy here is critical; low-accuracy biometric systems require extra information to confirm identity and should be avoided.
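Accuracy evaluations of this sort commonly report two error rates at a chosen match-score threshold: how often impostors are wrongly accepted and how often genuine users are wrongly rejected. A minimal sketch with made-up match scores:

```python
# Minimal sketch of two standard biometric accuracy measures: false accept
# rate (impostors wrongly matched) and false reject rate (genuine users
# wrongly turned away) at a chosen threshold. Scores are invented.

genuine_scores = [0.91, 0.85, 0.78, 0.64, 0.95]   # same-person comparisons
impostor_scores = [0.12, 0.33, 0.72, 0.20, 0.08]  # different-person comparisons
threshold = 0.70                                  # accept if score >= threshold

false_rejects = sum(s < threshold for s in genuine_scores)
false_accepts = sum(s >= threshold for s in impostor_scores)

frr = false_rejects / len(genuine_scores)
far = false_accepts / len(impostor_scores)
print(f"false reject rate = {frr:.2f}, false accept rate = {far:.2f}")
```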

Eventually, as computing becomes pervasive throughout our environment, there will be multimodal interfaces for speech and visual input, multiple kinds of sensors, and new mobile devices and collaboration technologies. We will need ways to handle this information overload, too.

Martin Herman is chief of NIST's Information Access Division in Gaithersburg, Md.
