Have you ever wondered how search engines like Google and Dogpile find all the sites they list? Yes, some sites are submitted by their authors, but most are found through automated processes. These processes are called robots, or 'bots. A robot comes to your site, parses all the pages it can find, stores the data in a database, and moves on.
Why Control the Robots
Sometimes you don't want the robots just roaming anywhere they like on your site.
- You may have semi-private areas that are not password protected, but that you still don't want scanned.
- Some areas of your Web site may contain programs or other non-content (like the cgi-bin directory) that don't need to be scanned.
- Or perhaps, you just don't have a lot of bandwidth and don't want what you have wasted on a robot.
It's Easy to Communicate with Web 'Bots
The first thing a robot does when it comes to a new Web site is look for a file called "robots.txt" in the root of the Web server. If there is no such file, it assumes that robots are allowed anywhere they can find on the site.
This file consists of two or more lines:
- The name of the robot or user-agent that is not allowed on the site. Usually this is set to "*", meaning all robots:
User-agent: *
- The area of the site that the agent is not allowed into. All files and sub-directories under that directory will not be scanned by the robot. If there is more than one directory you want to disallow, then duplicate this line as many times as you need it.
Disallow: /private/
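As a rough sketch of how those two kinds of lines fit together, the Python below builds a robots.txt with one Disallow line per directory and then checks it with the standard library's urllib.robotparser. The helper name build_robots_txt, the robot name "ExampleBot", and the file paths are made up for illustration; /cgi-bin/ is simply a second example directory.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(agent, directories):
    """Return robots.txt text: one User-agent line, one Disallow line per directory.

    Hypothetical helper for illustration only.
    """
    lines = [f"User-agent: {agent}"]
    lines += [f"Disallow: {directory}" for directory in directories]
    return "\n".join(lines) + "\n"


# "*" means all robots; the Disallow line is repeated for each directory.
text = build_robots_txt("*", ["/private/", "/cgi-bin/"])
print(text)

# Feed the text to the standard-library parser and ask what a robot may fetch.
parser = RobotFileParser()
parser.parse(text.splitlines())

print(parser.can_fetch("ExampleBot", "/private/notes.html"))  # False
print(parser.can_fetch("ExampleBot", "/index.html"))          # True
```

Everything under a disallowed directory is blocked, while pages outside it remain open to the robot.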
So, if you wanted to prevent all robots from going to any area of your site, your robots.txt
file would read:
User-agent: *
Disallow: /
If you want to prevent only a specific Web crawler from crawling your site, you need to
list it by name in the User-agent line. For example, to prevent Google from spidering
your site, you would write:
User-agent: Googlebot
Disallow: /
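To see that a named User-agent line affects only that robot, here is a small check using Python's standard urllib.robotparser. It is only a sketch: "OtherBot" and /index.html are invented names, and real crawlers may match user-agent names somewhat differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The Googlebot-only rules from the example above.
rules = """\
User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The named robot is shut out of the whole site...
googlebot_allowed = parser.can_fetch("Googlebot", "/index.html")
# ...but a robot that is not named (hypothetical "OtherBot") is unaffected.
otherbot_allowed = parser.can_fetch("OtherBot", "/index.html")

print(googlebot_allowed)  # False
print(otherbot_allowed)   # True
```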
Some Caveats
- The robots.txt file name is case-sensitive. If you create a file called Robots.txt or robots.TXT, the spiders will ignore whatever it says.
- The robots.txt file has to be in the root of your Web server. This means that if you have a Web page like http://www.webhostingcompany.com/~jenniferkyrnin/ you will need to ask your administrator to add your disallows to the robots.txt file in their root-level directory.
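- The original robots.txt convention has no way to "allow" a spider; you can only disallow. Many newer robots, including Googlebot, do recognize an Allow line, but you can't count on every robot to honor it. So if you have one page in a group of 160 others that you want spidered, you should move it out of the disallowed directory. Or, you can explicitly name every file you want disallowed.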
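To illustrate the caveat about the root of the Web server, the sketch below works out where a robot would look for robots.txt given any page address. The function name robots_url is made up for this example; it uses only Python's standard urllib.parse.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the server that hosts page_url.

    Hypothetical helper: keeps the scheme and host, replaces the path.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_url("http://www.webhostingcompany.com/~jenniferkyrnin/")
print(location)  # http://www.webhostingcompany.com/robots.txt
```

The user directory drops out of the address entirely, which is why a robots.txt placed inside it would never be read.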

