Teleport Pro webcrawler is the spider, man

Most people have heard of the mysterious World Wide Web spiders that crawl from site to
site cataloging information.


Did you know you can set your own personal spider to do the same thing?


Teleport Pro from Tennyson Maxwell Information Systems is a fully automated,
multithreaded, link-following, file-retrieving Web spider. It's simple to use and
configure. You can have it retrieve all the files you want, and only the ones you want,
from any Web site accessible to your browser.


It automates your work in several ways. Say you download files via File Transfer
Protocol and Hypertext Transfer Protocol clients. What if you need to grab multiple files
from multiple sites? Suppose you know the name of a file but not where to find it. Or you
know only the type of file, not its exact name or location. Teleport Pro can help.


If you're going to travel, use Teleport Pro to download an entire Web site, selected
directories or individual files, so you can browse offline on your portable. A webmaster
can use it to create an exact duplicate, or mirror, of a Web site complete with
subdirectory structure and HTML, image and script files.


The program saves each search session as a project. You create a project profile with
one or more starting addresses and a set of rules for how far your spider goes, how long
it's gone, the types of links it follows and what it brings home.


For example, you might direct Teleport Pro to retrieve Joint Photographic Experts Group
or Graphics Interchange Format files, the image standards for the Web. You might restrict
the spider's range by telling it to follow only links within the same domain as the
starting address. You also can control the search depth or number of levels the spider
descends.
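
The rules described above can be pictured as a small data structure. The following is only a sketch in Python: the class name CrawlProfile, its fields and its defaults are hypothetical and do not reflect Teleport Pro's actual project format. It also treats "same domain" as "same host name," which is a simplification.

```python
from dataclasses import dataclass
from typing import List, Tuple
from urllib.parse import urlparse


@dataclass
class CrawlProfile:
    """Hypothetical stand-in for a project profile; not Teleport Pro's format."""
    start_urls: List[str]
    file_extensions: Tuple[str, ...] = (".jpg", ".jpeg", ".gif")
    same_domain_only: bool = True
    max_depth: int = 3

    def wants_file(self, url: str) -> bool:
        """True if the URL's path ends with one of the wanted extensions."""
        return urlparse(url).path.lower().endswith(self.file_extensions)

    def in_range(self, url: str, depth: int) -> bool:
        """True if the spider may follow this link under the profile's rules."""
        if depth > self.max_depth:
            return False
        if not self.same_domain_only:
            return True
        # Simplification: "same domain" is taken to mean "same host name".
        start_hosts = {urlparse(u).hostname for u in self.start_urls}
        return urlparse(url).hostname in start_hosts
```

A profile built this way answers the two questions a spider has to ask about every link: is this a file I should bring home, and am I allowed to go there at all?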


Once you activate the spider, it reads your starting address and retrieves any files it
finds there that match the profile. Then it reads all the links there and follows them. It
gets any matching files from those pages, reads and follows their links, and so on until
it runs out of places to go or reaches the limits you set.
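
The read, retrieve and follow cycle can be sketched as a breadth-first walk. To keep the example runnable without a network connection, the "site" here is a made-up Python dictionary that maps each page to the links on it. The function name, the depth rule and the file extensions are assumptions for illustration, not a description of Teleport Pro's internals.

```python
from collections import deque


def crawl(site, start, wanted=(".gif", ".jpg"), max_depth=2):
    """Breadth-first walk over `site`, a dict mapping each page to its links.

    Returns the matching files in the order the spider would find them.
    """
    seen = {start}                 # addresses already queued or visited
    queue = deque([(start, 0)])    # (address, depth from the start page)
    retrieved = []

    while queue:                   # stops when there is nowhere left to go
        page, depth = queue.popleft()
        for link in site.get(page, []):
            if link in seen:
                continue           # never handle the same address twice
            seen.add(link)
            if link.lower().endswith(wanted):
                retrieved.append(link)           # a file matching the profile
            elif depth < max_depth:
                queue.append((link, depth + 1))  # a page whose links we follow
    return retrieved


# A made-up three-page site with a link cycle between /a.html and /index.html.
SITE = {
    "/index.html": ["/a.html", "/logo.gif"],
    "/a.html": ["/index.html", "/b.html", "/photo.jpg"],
    "/b.html": ["/deep.gif"],
}
```

Calling crawl(SITE, "/index.html") walks all three pages; lowering max_depth to 1 stops the walk before it reaches /b.html. The set of seen addresses is what keeps the link cycle from trapping the spider.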


The spider has 10 legs called retrieval threads. When Teleport Pro encounters a profile
match, it launches one of these independent subprograms to retrieve the file; the thread
terminates once the retrieval is done. All 10 threads can run concurrently.
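
For readers who want to see the idea of concurrent retrieval in code, here is a rough Python sketch. It swaps in a thread pool, which reuses up to 10 worker threads, rather than launching and ending a thread per file as the program is described as doing. The fetch function is a placeholder that sleeps instead of touching the network, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import time

MAX_THREADS = 10  # the article's figure for concurrent retrieval threads


def fetch(url):
    """Placeholder retrieval: sleeps briefly instead of touching the network."""
    time.sleep(0.01)
    return url, len(url)  # pretend the "file size" is the URL length


def retrieve_all(urls):
    """Retrieve every URL using at most MAX_THREADS concurrent workers."""
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        # map() keeps results in input order even though the work overlaps.
        return list(pool.map(fetch, urls))
```

Given 25 addresses, the pool works on no more than 10 at a time and hands the next waiting address to whichever worker finishes first.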


To monitor and control individual threads, there's a row of button indicators called a
thread bar. The buttons light up in different colors or change into pie charts to reflect
each thread's changing status.


When you place the cursor over a button, it displays the thread's current activity
status. If you click the button, Teleport Pro will display a menu that you can use to
abort the thread.


I tested Teleport Pro by mirroring the GCN Web site on a local server and experienced
only minor problems. The program downloaded and rebuilt more than 730 HTML and image files
in about 20 minutes over a 56-kilobit/sec connection. One server-side image map didn't
convert properly, but otherwise the site worked the same locally as online.


I then used Teleport Pro to search the copied site for image files and made a file
archive. I searched for .gif image files larger than 50K. In minutes, I had a list of
files exceeding that size limit.
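
The same kind of search can be run against any local copy of a site with a few lines of Python. This is a sketch of the idea rather than what Teleport Pro does internally; the function name is hypothetical, and it assumes "50K" means 50 x 1,024 bytes.

```python
from pathlib import Path


def large_files(root, suffix=".gif", min_bytes=50 * 1024):
    """Return (path, size) pairs for files under root above the size limit."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() == suffix:
            size = path.stat().st_size
            if size > min_bytes:   # assumes 50K means 50 * 1024 bytes
                hits.append((path, size))
    return hits
```

Pointing large_files at the top directory of a mirrored site returns every oversized .gif file along with its size in bytes.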


Teleport Pro does more than snatch files, though. If you've done Web research, you know
how time-consuming it can be. Just arm your pet spider with a list of keywords, give it a
starting point and go home. Teleport Pro will crawl around the Internet, looking for text
that matches your keyword searches.


In the morning, your spider will have a report waiting for you with working hyperlinks
to the appropriate sites, or you can program it to fetch the complete documents.
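
A keyword report of this kind boils down to checking each page's text against a word list and writing out links to the pages that match. The Python sketch below works on pages already held in memory and uses plain case-insensitive substring matching; the function names and the report layout are assumptions, not Teleport Pro's actual output.

```python
from html import escape


def keyword_report(pages, keywords):
    """Map each page address to the keywords found in its text.

    `pages` is a dict of address -> page text. Matching is case-insensitive
    substring matching, a deliberate simplification.
    """
    report = {}
    for url, text in pages.items():
        lowered = text.lower()
        found = [kw for kw in keywords if kw.lower() in lowered]
        if found:
            report[url] = found
    return report


def as_html(report):
    """Render the report as an HTML list of working hyperlinks."""
    items = []
    for url, found in report.items():
        link = escape(url)
        words = escape(", ".join(found))
        items.append('<li><a href="%s">%s</a>: %s</li>' % (link, link, words))
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Pages that match none of the keywords are simply left out of the report.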


The spider remembers where it has been and never visits the same address twice. It also
remembers everything it retrieves and won't bring back files twice unless you so order.


Be sure to study Teleport Pro's netiquette properties settings before launching your
spider. There are firm standards for Internet robotic agents, or bots.


That's because agents such as Teleport Pro can impose overwhelming transaction burdens
on Web sites. The default netiquette settings keep Teleport Pro from overtaxing sites and
going into areas that are off-limits to bots.
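
The article does not say which mechanisms Teleport Pro's netiquette settings use. One widely used convention is the robots.txt file, in which a site lists the areas that bots should stay out of. The Python sketch below honors such a file and pauses between requests. The agent name, the one-second delay and the sample rules are all assumptions made for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSpider"   # hypothetical name, not Teleport Pro's
MIN_DELAY = 1.0                # seconds between requests; an assumed value

# Sample robots.txt content, parsed from a string so no network is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_rules(robots_text):
    """Build a parser from the text of a site's robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def polite_targets(rules, urls, delay=MIN_DELAY, sleep=time.sleep):
    """Yield only the URLs the rules allow, pausing between each one."""
    first = True
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue               # off-limits to bots
        if not first:
            sleep(delay)           # avoid hammering the server
        first = False
        yield url
```

A crawler would loop over polite_targets and fetch each address it yields, so both the off-limits check and the pause apply to every request.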


Teleport Pro is the best and most economical utility I've found for site mirroring. It
can make an exact copy of a Web site, with original file names and subdirectory structure.
Most programs copy retrieved files to a flat directory and rename them, reducing their
usefulness. But if you want to retrieve all or part of a site for local browsing,
Teleport Pro can also build flat-file copies.
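
The difference between a true mirror and a flat copy comes down to how each Web address is turned into a local file name. The Python sketch below shows one way to do each. The function names, the root directory names and the index.html default are assumptions, not Teleport Pro's naming scheme, and a real tool would also need to guard against unsafe path components.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def mirror_path(url, root="mirror"):
    """Local path that preserves the site's subdirectory structure."""
    parsed = urlparse(url)
    path = parsed.path
    if path == "" or path.endswith("/"):
        path += "index.html"       # assumed default document name
    return PurePosixPath(root, parsed.hostname, path.lstrip("/"))


def flat_path(url, root="flat"):
    """Local path in a single directory, with slashes folded into the name."""
    parsed = urlparse(url)
    path = parsed.path.strip("/") or "index.html"
    return PurePosixPath(root, path.replace("/", "_"))
```

With the mirror layout, relative links between pages keep working because every file sits where the original site put it. With the flat layout, the links inside each page would have to be rewritten to match the new names.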


The package will pass your user name and password information to secure sites, but you
should have a talk with your security officer before accessing secure servers this way.


Some minor shortcomings: Teleport Pro crawls only Web servers and can't yet search FTP
sites.


If you're concerned about security, you can have Teleport Pro download Java applets
without executing them. But remember that when an applet calls another applet as part of
its routine, the retrieved applet won't work because Teleport Pro doesn't know to
retrieve the second one. I'm not holding this limitation against the program.


Serious webmasters and Internet users will want to add this package to their toolboxes.

