Glossary/Robots.txt

Robots.txt is a plain-text file that website owners place at the root of their web servers to give instructions to robots -- that is, computer programs that travel around the web, looking for specific kinds of web pages and content.
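
The file itself is short and human-readable. As a minimal illustration (the directory name is made up), a robots.txt that welcomes every robot but keeps all of them out of one directory looks like this:

  User-agent: *
  Disallow: /private/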

People usually use robots.txt to prevent search engines from indexing - that is, including in their databases - web contact forms, print versions of web pages, and other pages whose content is duplicated elsewhere on the site. That's because you want search engines to index only the pages that matter to your business - for example, pages that showcase your products or services.
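
For example, a site owner who keeps printer-friendly duplicates under /print/ and a contact form at /contact.html (hypothetical paths) could ask all robots to skip just those areas:

  User-agent: *
  Disallow: /print/
  Disallow: /contact.html

Everything not matched by a Disallow line remains open to crawling.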

Robots.txt can also be used to ask specific robots to refrain from indexing a site. A website owner might want to allow search engines to crawl the site, but not robots sent out by companies gathering information that could help competitors.
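
For instance, the following rules (the robot name is invented for illustration) shut one specific robot out of the entire site while leaving it open to everyone else:

  User-agent: ExampleDataScraper
  Disallow: /

  User-agent: *
  Disallow:

A robot obeys only the block addressed to its own name; the empty Disallow under User-agent: * means nothing is off limits for all other robots.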

Crawlers, robots, agents, bots and spiders

These five terms all describe basically the same thing: an automated software program used to locate, collect, or analyze data from web pages. Search engines like Google use a spider to collect data on web pages for inclusion in their databases. The spider also follows links on web pages to find new pages.
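
As a rough sketch of what such a spider does, here is a minimal Python crawler built only on the standard library (the start URL is hypothetical, and a real spider would also honor robots.txt, throttle its requests, and handle errors):

  import urllib.request
  from urllib.parse import urljoin
  from html.parser import HTMLParser

  class LinkParser(HTMLParser):
      """Collects the href targets of <a> tags found on a page."""
      def __init__(self):
          super().__init__()
          self.links = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  def crawl(start_url, max_pages=10):
      queue, seen = [start_url], set()
      while queue and len(seen) < max_pages:
          url = queue.pop(0)
          if url in seen:
              continue
          seen.add(url)
          html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
          # A real spider would analyze or index the page content here.
          parser = LinkParser()
          parser.feed(html)
          # Follow the links on this page to discover new pages.
          queue.extend(urljoin(url, link) for link in parser.links)
      return seen

  # pages = crawl("http://www.example.com/")  # hypothetical start page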

AboutUs.org uses a robot to analyze web pages for its Online Visibility Audit.

How robots.txt works

When a legitimate robot wants to visit a web page like www.example.com/good-content, it first checks for www.example.com/robots.txt to make sure the site owner is willing to let the robot examine the page.

The robot looks for several things:

  • Is there a robots.txt file?
  • Is the robot explicitly excluded in the robots.txt file?
  • Is the robot excluded from this web page, the directory that contains this page, or both?

If the robot isn't explicitly excluded from the page, it will collect data, analyze it, or do whatever else it was designed to do. Note that robots.txt is purely advisory: well-behaved robots honor it, but it cannot physically block a robot that chooses to ignore it.
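
Python's standard library includes a parser that performs exactly this kind of check. As a small sketch (the URLs and robot name are hypothetical), a well-behaved robot could gate each visit like this:

  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("http://www.example.com/robots.txt")
  rp.read()  # fetch and parse the site's robots.txt

  # can_fetch() answers: may this user agent visit this page?
  if rp.can_fetch("MyBot", "http://www.example.com/good-content"):
      print("Allowed - go ahead and examine the page")
  else:
      print("Excluded by robots.txt - skip this page")

If the site has no robots.txt file at all, can_fetch() reports every page as allowed, which mirrors the first check in the list above.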
