What is a Web Robot?

How to Handle Robots on Your Site and Server

A Web robot is a program that automatically and recursively traverses a Web site, retrieving document content and information. The most common Web robots are search engine spiders. These robots visit Web sites and follow their links to add more information to the search engine's database.

Web robots often go by different names. You may hear them called:

  • spiders
  • bots
  • crawlers

All these terms mean the same thing, but "robot" is the clearest: it does not suggest that the program wanders through the Web site on its own, but rather that it is programmed to move systematically through a site.
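
To make "programmed to move systematically" concrete, here is a minimal sketch of the traversal in Python. It is an illustration under stated assumptions, not how any particular search engine works: the three-page site, its URLs, and the names LinkExtractor, crawl, and SITE are all made up for the example, and the pages live in memory so the sketch runs without a network connection. It also uses a queue (breadth-first) rather than literal recursion, which visits the same pages.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Visit every same-host page reachable from start_url.

    fetch is a caller-supplied function that maps a URL to its HTML,
    or returns None when the page cannot be retrieved.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            target = urljoin(url, href)  # resolve relative links
            # Stay on the starting host and skip pages already queued.
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical site held in memory (an assumption for this example).
SITE = {
    "http://www.example.com/":
        '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "http://www.example.com/about.html":
        '<a href="/">Home</a> <a href="http://other.example.org/">Elsewhere</a>',
    "http://www.example.com/news.html":
        '<a href="about.html">About</a>',
}

pages = crawl("http://www.example.com/", SITE.get)
print(pages)
```

A real robot would replace SITE.get with a function that downloads each page, and would also pause between requests so it does not overload the server.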

Web Robots Follow Rules

While it is possible to write a robot that ignores them, most Web robots are written to obey rules set down in a specific text file on your site: the robots.txt file. Robots look for it in the root of your Web server, where it acts as a gateway, telling them which areas of the site they may and may not traverse.

Keep in mind that while most Web robots follow the rules you lay out in your robots.txt file, some do not. If you have sensitive information, protect it with a password or keep it on an intranet rather than relying on robots not to spider it.
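
As a rough illustration of how a well-behaved robot reads these rules, the sketch below feeds a sample robots.txt to Python's standard urllib.robotparser module and asks whether particular URLs may be fetched. The file contents, the robot names ExampleBot and BadBot, and the paths are all invented for the example; your own file will differ.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every robot is kept out of /private/ and /tmp/,
# and one robot (the hypothetical "BadBot") is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the text directly, no download

home_ok = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
private_ok = parser.can_fetch("ExampleBot", "http://www.example.com/private/report.html")
badbot_ok = parser.can_fetch("BadBot", "http://www.example.com/index.html")

print(home_ok, private_ok, badbot_ok)
```

Note that this check happens inside the robot. Nothing on the server enforces it, which is why a robot that chooses to skip the check can still request anything that is not otherwise protected.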

How Web Robots Are Used

The most common use for Web robots is to index a site for a search engine. But robots can be used for other purposes as well. Some of the more common uses are:

  • Link validation - Robots can follow all the links on a site or a page, testing them to make sure each one returns a valid HTTP status code. The advantage of doing this programmatically is clear: the robot can visit all the links on a page in a minute or two and report the results much more quickly than a human could by checking them manually.
  • HTML validation - Similar to link validation, robots can be sent to various pages on your site to check the HTML for errors.
  • Change monitoring - There are services available on the Web that will tell you when a Web page has changed. These services work by sending a robot to the page periodically to check whether the content has changed. When it has, the robot files a report.
  • Web site mirroring - Similar to the change monitoring robots, these robots evaluate a site, and when there is a change, they transfer the changed information to the mirror site location.
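
Two of the uses above, link validation and change monitoring, come down to very small checks once the pages have been retrieved. The sketch below shows one possible shape for each, assuming the status codes and page contents are supplied by the caller; the function names, URLs, and sample data are hypothetical. Change detection here compares SHA-256 fingerprints of the page content, which is one simple approach among several.

```python
import hashlib


def find_broken_links(urls, get_status):
    """Return (url, status) pairs for links that did not resolve cleanly.

    get_status is a caller-supplied function that maps a URL to its HTTP
    status code, or None when the server could not be reached.
    """
    broken = []
    for url in urls:
        status = get_status(url)
        # Treat anything outside the 2xx/3xx range as a broken link.
        if status is None or not 200 <= status < 400:
            broken.append((url, status))
    return broken


def fingerprint(content: bytes) -> str:
    """Summarize page content as a SHA-256 hex digest."""
    return hashlib.sha256(content).hexdigest()


def has_changed(previous_fingerprint: str, content: bytes) -> bool:
    """Report whether the content differs from the last recorded visit."""
    return fingerprint(content) != previous_fingerprint


# Made-up status codes standing in for real HTTP responses.
STATUSES = {
    "http://www.example.com/": 200,
    "http://www.example.com/old.html": 404,
}
links = [
    "http://www.example.com/",
    "http://www.example.com/old.html",
    "http://www.example.com/unreachable.html",
]
broken = find_broken_links(links, STATUSES.get)
print(broken)

# Made-up page contents from two visits to the same page.
first_visit = fingerprint(b"<h1>Welcome</h1>")
print(has_changed(first_visit, b"<h1>Welcome</h1>"))
print(has_changed(first_visit, b"<h1>Welcome back</h1>"))
```

A mirroring robot could reuse the same has_changed check to decide which pages need to be copied to the mirror.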

©2014 About.com. All rights reserved.