Railroaded by Web bots? Take control and make 'em work for you

 If you want nothing on the site to be indexed, the robots.txt file should read:

# go away
User-agent: *
Disallow: /

 Most visiting bots look for this file first and go away when told to. But bots run
by hackers may ignore this file, so password-protect each directory on a development
server and watch traffic reports to see who is trying to hit these directories.
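A well-behaved bot performs this check before fetching anything. As a rough sketch of what that looks like (using Python's standard urllib.robotparser module and a made-up URL, not any particular bot's actual code), the "go away" file above shuts out every path:

```python
from urllib.robotparser import RobotFileParser

# The "go away" robots.txt from above, as a list of lines.
rules = [
    "# go away",
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# A compliant bot asks before fetching; every path is off-limits here.
allowed = parser.can_fetch("*", "http://example.com/private/page.html")
print(allowed)  # False, because the site disallows everything
```

The catch, as noted above, is that nothing forces a bot to run this check; that's why the file is no substitute for password protection.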

 If you want bots to index specific directories on the live server, then your
robots.txt file must be more specific. Visit http://info.webcrawler.com/mak/projects/robots/exclusion.html
for details on how to invite but limit bot visits.
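For example, a robots.txt along these lines (the directory names here are hypothetical) keeps bots out of selected areas while leaving the rest of the site open to indexing:

```
# keep bots out of work-in-progress and private areas
User-agent: *
Disallow: /dev/
Disallow: /private/
```

Anything not listed under a Disallow line is fair game for compliant bots, so list every directory you want excluded.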

 Alex Cunningham, lead technical administrator at Excite Inc. of Mountain View,
Calif., told me there are instances in which a Web bot misses a robots.txt file at a
server's root level.

 If multiple domain names resolve to different directories on one server, the bot
may enter via one IP address and think it's at the root level. It looks for the robots.txt
file, finds nothing and proceeds to index.

 To stay on the safe side, post a robots.txt file in the top-level directory at each
IP address. You might also place a meta tag at the top of each Hypertext Markup
Language file.

 Bots that look for meta tags learn whether they may index an HTML document or mine
it for further links. A tag that reads <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
tells bots not to index the document or analyze it for links.
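In context, the tag belongs in the document head. A minimal, hypothetical page might look like this:

```html
<HTML>
<HEAD>
<TITLE>Internal test page</TITLE>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</HEAD>
<BODY>
Work-in-progress content goes here.
</BODY>
</HTML>
```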

You can build the meta tag into the template you use for development files. When you move
these over to the live site, take out the tags via batch edit. For more info, see http://info.webcrawler.com/mak/projects/robots/meta-user.html.
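One way to do that batch edit, sketched here in Python (the regular expression and sample page are illustrative assumptions, not a GCN-tested procedure), is to strip the robots meta tag from each HTML file's contents as it moves to the live tree:

```python
import re

# Matches the robots meta tag, case-insensitively, with flexible spacing.
ROBOTS_TAG = re.compile(
    r'<META\s+NAME="ROBOTS"\s+CONTENT="NOINDEX,\s*NOFOLLOW">\s*',
    re.IGNORECASE,
)

def strip_robots_tag(html: str) -> str:
    """Return the HTML with any robots meta tag removed."""
    return ROBOTS_TAG.sub("", html)

page = '<HEAD><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></HEAD>'
print(strip_robots_tag(page))  # <HEAD></HEAD>
```

Run something like this over every file in the development template before the move, and the live copies go up without the tag.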


 If you discover that bots are finding pages you wanted to keep private, there's a
quick but risky fix. Delete or move the accidentally indexed files, then submit a request
for the bot to return to your site. In theory, it will look for the directories you
specify, find nothing there and update its database, effectively deleting the old
references.

 The risk is that you're inviting a bot into the very machine you want to keep
private, and you cannot be certain it won't find something else sensitive once it
gets there.

 Now let's talk about how to get noticed.

 Cunningham suggested that one way to show up high on a Web search list is to limit
the number of concepts on each page. This seems to go against the idea of cramming many
keywords onto a page to increase the number of hits.

 But such search programs as Excite assign a weight to specific words and phrases. If
you have too many concepts on one page, the search engine must balance the weight of each.
As a result, your page won't turn up as high on a list of matches.

 Because these suggestions are general and every bot service is slightly different,
what works for one site may not work for all. Experiment. Visit the sites of each bot
service to see where you can submit a uniform resource locator, and read the documentation
for ideas and suggestions.

 Start at http://home.netscape.com/home/internet-search.html
and work your way out from there. The list of bots is growing every month. 

Shawn P. McCarthy is a computer journalist, webmaster and Internet programmer for GCN's
parent, Cahners Publishing Co.

