1. Home
  2. Computing & Technology
  3. Web Design / HTML

Controlling Web Robots
Using the robots.txt File

By , About.com Guide

Have you ever wondered how search engines like Google and Dogpile find all the pages they list? Yes, some are submitted by their authors, but most are found through automated processes. These programs are called robots, or 'bots. A robot will come to your site, parse all the pages it can find, store the data in a database, and move on.

Why Control the Robots

Sometimes you don't want the robots just roaming anywhere they like on your site.

  • You may have semi-private areas that are not password protected, but that you still don't want scanned.
  • Some areas of your Web site may contain programs or other non-content (like the cgi-bin directory) that don't need to be scanned.
  • Or perhaps you just don't have much bandwidth and don't want it wasted on a robot.

It's Easy to Communicate with Web 'Bots

The first thing a robot does when it comes to a new Web site is look for a file called "robots.txt" in the root of the Web server. If there is no such file, the robot assumes it is allowed anywhere it can reach on the site.
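
To make the "root of the Web server" rule concrete, here is a minimal Python sketch of how a robot might work out where to look. The function name robots_txt_url and the www.example.com address are illustrative assumptions, not part of any particular crawler.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, and replace the path with /robots.txt.
    # Any sub-directory, query string, or fragment is dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# www.example.com is a placeholder host.
print(robots_txt_url("http://www.example.com/articles/page.html"))
```

Whatever page the robot starts from, the result points at the top level of the host, which is why a robots.txt file placed in a sub-directory is never consulted.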

This file consists of two or more lines:

  1. The name of the robot, or user-agent, that the rule applies to. Usually this is set to "*", meaning all robots:
    User-agent: *
  2. The area of the site that the agent is not allowed into. All files and sub-directories under that directory are off-limits to the robot. If you want to disallow more than one directory, repeat this line once for each directory.
    Disallow: /private/
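
If you want to check how a set of rules will be read before you upload them, Python's standard urllib.robotparser module can parse the text for you. The sketch below is one way to do that. The bot name "ExampleBot", the host www.example.com, and the second Disallow line are assumptions added for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample rules: one User-agent line followed by one Disallow line per directory.
RULES = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)
# "ExampleBot" and www.example.com are placeholders.
for path in ("/index.html", "/private/notes.html", "/cgi-bin/search.cgi"):
    allowed = parser.can_fetch("ExampleBot", "http://www.example.com" + path)
    print(path, "allowed" if allowed else "disallowed")
```

Everything under /private/ and /cgi-bin/ should be reported as disallowed, while /index.html stays open.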

So, if you wanted to prevent all robots from going to any area of your site, your robots.txt file would read:

  User-agent: *
  Disallow: /

If you want to prevent only a specific Web crawler from crawling your site, you need to list it by name in the User-agent line. For example, to prevent Google from spidering your site, you would write:

  User-agent: Googlebot
  Disallow: /
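
To confirm that a named rule affects only the robot it names, you can run the same kind of check with Python's urllib.robotparser. This is a rough sketch. "ExampleBot" and www.example.com are made-up placeholders standing in for any other crawler and your own site.

```python
from urllib.robotparser import RobotFileParser

# Rules that name a single crawler.
RULES = """\
User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

page = "http://www.example.com/index.html"  # placeholder address
googlebot_allowed = parser.can_fetch("Googlebot", page)
other_allowed = parser.can_fetch("ExampleBot", page)  # hypothetical second robot

print("Googlebot allowed:", googlebot_allowed)
print("ExampleBot allowed:", other_allowed)
```

Because the file has no "User-agent: *" section, robots other than Googlebot are not restricted at all.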

Some Caveats

  • The robots.txt file name is case-sensitive and must be all lowercase. If you create a file called Robots.txt or robots.TXT, the spiders will not find it and will ignore whatever it says.
  • The robots.txt file has to be in the root of your Web server. This means that if you have a Web page like http://www.webhostingcompany.com/~jenniferkyrnin/ you will need to ask your administrator to add your disallows to the robots.txt file in their root-level directory.
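  • The original robots.txt standard has no way to "allow" a spider; you can only disallow. (Some major crawlers, including Googlebot, recognize an "Allow" line as an extension, but you cannot count on every robot to honor it.) So if you have one page in a group of 160 others that you want spidered, you should move it out of the disallowed directory. Or, you can explicitly name every file you want disallowed.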
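
The last option above, naming each file you want disallowed, can be tried out the same way with Python's urllib.robotparser. In this sketch the /docs/ directory, the file names, "ExampleBot", and www.example.com are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Each unwanted file is listed on its own Disallow line.
# /docs/keep.html is simply left out, so it stays open to robots.
RULES = """\
User-agent: *
Disallow: /docs/draft1.html
Disallow: /docs/draft2.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

base = "http://www.example.com"  # placeholder host
results = {
    path: parser.can_fetch("ExampleBot", base + path)
    for path in ("/docs/draft1.html", "/docs/draft2.html", "/docs/keep.html")
}
for path, allowed in results.items():
    print(path, "allowed" if allowed else "disallowed")
```

This works with Disallow lines alone, but the list has to be updated every time a file is added to the directory.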

©2009 About.com, a part of The New York Times Company.

All rights reserved.