Another View: Is Web scrubbing an exercise in futility?

Gabe Goldberg

When cuneiform clay tablets represented state-of-the-art printing, the best records retention policy was "Don't drop them." Each successive form of IT--printing press, filing cabinet, copier, e-mail and Internet--has made it harder to control information flow and access, and especially to reclaim freely disseminated material.

Sept. 11 changed the rules, forcing government agencies to re-evaluate data they'd put online, such as infrastructure layouts.

But anything available on the Internet--Web pages, Usenet newsgroup postings and even some e-mail--is difficult to remove from public view. You may not realize how widely replicated and long-lived content can be.

Today's leading search engines, such as google.com, return links to current Web pages. But most results also include a cached option, letting users view pages even after they've been removed from their host sites.

The Wayback Machine at www.archive.org/index.html archives many years of data. For example, it displays the www.whitehouse.gov site from December 1996.

Surprise! Scrubbing pages off Web servers doesn't make them unavailable. At best, it delays finding them.

Search engines locate information by crawling the Web with software robots. You can use the robots exclusion protocol--a robots.txt file at your server's root--to keep compliant crawlers out of all or parts of your sites. But that works only with constant vigilance in maintaining proper access controls.
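To illustrate, a robots.txt file at a server's root might look like the following; the directory names here are hypothetical. Compliant crawlers fetch this file first and skip the listed paths.

    # Keep all compliant crawlers out of sensitive directories
    User-agent: *
    Disallow: /infrastructure/
    Disallow: /internal-reports/

    # Keep one particular crawler out of the entire site
    User-agent: BadBot
    Disallow: /

A <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> tag in individual pages serves a similar purpose. Both are advisory: a crawler honors them only if its operator chooses to.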

Another risk is mirror sites, which can retain content removed from the main site. Locating and dealing with unwanted mirrored content can resemble playing Whack-a-Mole, with two sites popping up for each one you knock down. Foreign sites that have copped your content might comply with removal requests only slowly, if at all.

A Google search often returns index pages of retrievable material posted years ago. Groups.google.com, formerly DejaNews and Deja, offers a 20-year backlog of material. Because industrial suppliers and their customers often use newsgroups as communities, this site may host persistent material.

It's worth conducting periodic vanity searches to find out where your content is indexed, linked, cached and mirrored. To be truly prepared, agency webmasters must understand the procedures each search engine and archive provides for removing content and suppressing indexing by robots.
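One way to automate part of such a search is to ask the Internet Archive's public availability interface whether a page has an archived copy. The Python sketch below assumes that interface (archive.org/wayback/available) and uses placeholder URLs; substitute your own pages.

    # Minimal sketch of an automated check against the Wayback Machine's
    # availability interface. The URLs listed are hypothetical placeholders.
    import json
    import urllib.parse
    import urllib.request

    PAGES_TO_CHECK = [
        "http://www.example.gov/infrastructure/map.html",
        "http://www.example.gov/reports/2001-budget.html",
    ]

    def wayback_snapshot(url):
        """Return the closest archived snapshot of url, or None."""
        query = urllib.parse.urlencode({"url": url})
        api = "https://archive.org/wayback/available?" + query
        with urllib.request.urlopen(api, timeout=30) as response:
            data = json.load(response)
        return data.get("archived_snapshots", {}).get("closest")

    if __name__ == "__main__":
        for page in PAGES_TO_CHECK:
            snapshot = wayback_snapshot(page)
            if snapshot and snapshot.get("available"):
                print(page, "is archived at", snapshot["url"])
            else:
                print(page, "has no archived copy on record")

A run like this finds only what the archive admits to holding; it is a monitoring aid, not proof that a page is gone.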

But recognize that the robot protocol and information removal procedures rely on adherence to technically agreed-upon but legally unenforceable procedures.

You can gain additional security and control by not presenting content as static, easily captured Web pages but instead requiring processing on the host--that is, retrieval from a database in response to each request. Content stored this way has been called the invisible or dark Web.

Gary Price, a Web content and searching consultant, distinguishes information that's on the Net (typical unrestricted server content, with little bibliographic control, usually free or low-cost) from information retrieved via the Net (well-indexed databases and resources, often proprietary, sometimes billable, perhaps in complex formats).

Such data is substantially easier for a webmaster to control and redact.
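A minimal sketch of the idea, using Python's standard library: documents live in a database (the file, table and field names here are hypothetical), and a page exists only for the moment it is generated in response to a specific request, so there is no static file for a crawler or mirror to sweep up.

    # Minimal sketch: serve documents from a database on demand instead of
    # as static pages. Database, table and field names are hypothetical.
    import sqlite3
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    class DocumentHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            params = parse_qs(urlparse(self.path).query)
            doc_id = params.get("id", [None])[0]
            conn = sqlite3.connect("documents.db")
            row = conn.execute(
                "SELECT body FROM documents WHERE id = ?", (doc_id,)
            ).fetchone()
            conn.close()
            if row is None:
                self.send_error(404, "No such document")
                return
            body = row[0].encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            # Advisory hint asking compliant crawlers not to index the reply.
            self.send_header("X-Robots-Tag", "noindex, nofollow")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), DocumentHandler).serve_forever()

The benefit is architectural rather than cryptographic: access rules and redactions can be applied at the database layer, and withdrawn records simply stop being served.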

Be careful that after-the-fact content modifications don't generate accusations of rewriting history. Scrubbing may be necessary and justifiable, but doing it clandestinely, heavy-handedly or too broadly will reduce agency credibility and cause needless controversy.

Gabe Goldberg is an Alexandria, Va., Internet and enterprise computing writer and consultant. E-mail him at gabe@gabegold.com.
