Sample robots.txt Files

Learn How to Write a robots.txt File for Your Site

A robots.txt file stored in the root of your website tells web robots, such as search engine spiders, which directories and files they are allowed to crawl. A robots.txt file is easy to use, but there are some things you should remember:

  1. Black-hat web robots will ignore your robots.txt file. The most common offenders are malware bots and robots that harvest email addresses.
  2. Inexperienced programmers sometimes write robots that ignore the robots.txt file, usually by mistake.
  3. Anyone can see your robots.txt file. It is always named robots.txt and always stored at the root of the website. For example, About.com’s robots.txt file is at http://www.about.com/robots.txt.
  4. Finally, if a page that is not excluded links to a file or directory that your robots.txt file excludes, search engines may find it anyway.

Don’t use a robots.txt file to hide anything important. Instead, put important information behind secure passwords or leave it off the web entirely.

How to Use These Sample Files

Copy the text from the sample that is closest to what you want to do, and paste it into your robots.txt file. Change the robot, directory, and file names to match your preferred configuration.

Two Basic robots.txt Files

User-agent: *
Disallow: /

This file says that any robot (User-agent: *) that accesses it should ignore every page on the site (Disallow: /).

User-agent: *
Disallow:

This file says that any robot (User-agent: *) that accesses it is allowed to view every page on the site (Disallow:).

You can also allow everything by leaving your robots.txt file blank or by not having one on your site at all.

Protect Specific Directories from Robots

User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/

This file says that any robot (User-agent: *) that accesses it should ignore the directories /cgi-bin/ and /temp/ (Disallow: /cgi-bin/ Disallow: /temp/).

Protect Specific Pages from Robots

User-agent: *
Disallow: /jenns-stuff.htm
Disallow: /private.php

This file says that any robot (User-agent: *) that accesses it should ignore the files /jenns-stuff.htm and /private.php (Disallow: /jenns-stuff.htm Disallow: /private.php).

Prevent a Specific Robot from Accessing Your Site

User-agent: Lycos/x.x
Disallow: /

This file says that the Lycos bot (User-agent: Lycos/x.x) is not allowed to access anything on the site (Disallow: /).

Allow Only One Specific Robot Access

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

This file first disallows all robots, as we did above, and then explicitly lets Googlebot (User-agent: Googlebot) have access to everything (Disallow:). Note the blank line separating the two groups of rules; Googlebot follows the group addressed to it by name rather than the wildcard group.

Combine Multiple Lines to Get Exactly the Exclusions You Want

While it’s better to use a very inclusive User-agent line, like User-agent: *, you can be as specific as you like. Remember that each robot obeys only the group of rules under the User-agent line that most closely matches its own name; that’s why the Googlebot example above works. Avoid writing two groups with conflicting rules for the same robot, since different robots may resolve the conflict differently. You can combine the techniques above, as in the sample below.
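
For example, here is a sketch of a combined file. The robot name ExampleBot is just a placeholder; substitute the name of the robot you actually want to block.

User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/

This file says that ExampleBot (User-agent: ExampleBot) should ignore the whole site (Disallow: /), while every other robot (User-agent: *) should ignore only the /cgi-bin/ and /temp/ directories and may crawl everything else.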

If you’re not sure whether you’ve written your robots.txt file correctly, you can use Google’s Webmaster Tools to check your robots.txt file or write a new one.
