How To Use Robots.txt

By Martin Laetsch on December 2, 2010

What is the robots.txt file?


Robots.txt is a file that a website owner can put on his or her web server to give instructions to computer programs that travel around the web, looking for specific kinds of web pages and content. These computer programs are often called "robots," which is why this file is called robots.txt.
People usually use robots.txt to prevent search engines from including in their indexes any web pages that really shouldn't be indexed - for example, web contact forms, print versions of web pages and other content that's duplicated elsewhere on the site. Robots.txt can also be used to request that specific robots not index a site.

Crawlers, robots, agents, bots and spiders


These five terms all describe basically the same thing: an automated software program used to locate, collect, or analyze data from web pages. Search engines like Google use a spider to collect data on web pages for inclusion in their databases. The spider also follows links on web pages to find new pages.

AboutUs uses a robot to analyze web pages for its Website Analysis. Our bot's name is "AboutUsBot".

How robots.txt works


When a legitimate robot wants to visit a web page like www.example.com/good-content, it first checks for www.example.com/robots.txt to make sure the site owner is willing to let the robot examine the page.

The robot looks for several things:

  • Does a robots.txt file exist?
  • Is the robot explicitly excluded in the robots.txt file?
  • Is the robot excluded from the web page, the directory the page is in, or both?

If the robot isn't explicitly excluded from the page, it will do whatever it was designed to do: collect data, analyze it, and so on.
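
To see this check in action, here is a minimal sketch using Python's standard urllib.robotparser module; the www.example.com URLs are the placeholders from the example above:

import urllib.robotparser

# Download and parse the site's robots.txt, the same way a
# well-behaved robot would before crawling any page.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # if the file doesn't exist, everything is treated as allowed

# Ask whether a robot named "AboutUsBot" may fetch the page.
if rp.can_fetch("AboutUsBot", "http://www.example.com/good-content"):
    print("Allowed - go ahead and crawl the page")
else:
    print("Disallowed - skip this page")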

How to create a robots.txt file


Robots.txt is a standard text file that you can create in Notepad or any other text editor. You can use Word or another word processor, but be sure to save the file as raw text (.txt) when you are done.

The file name should be in lower case: "robots.txt," not "Robots.Txt."

When the file is ready, upload it to the top-level directory of your web server. Robots and spiders should be able to find it at www.YourDomain.com/robots.txt. Don't forget to go there - or have your web designer go there - and verify that the file works.
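
One quick way to verify it is with a short script that fetches the file and prints exactly what robots will see. This sketch uses Python's standard urllib.request module, with www.YourDomain.com standing in for your own domain:

import urllib.request

# Fetch robots.txt and show what the web server actually returns.
url = "http://www.YourDomain.com/robots.txt"  # replace with your domain
with urllib.request.urlopen(url) as response:
    print(response.status)  # 200 means the file is being served
    print(response.read().decode("utf-8", errors="replace"))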

What belongs in a robots.txt file?


Robots.txt files usually contain a single record. They look something like this:

User-agent: *
Disallow: /print/
Disallow: /temp/

In this example, all robots have been excluded from the /print/ and /temp/ directories.
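
If you want to double-check how a record like this will be interpreted, Python's standard urllib.robotparser can parse the lines directly. The robot name "AnyBot" below is just an illustration:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /print/",
    "Disallow: /temp/",
])

print(rp.can_fetch("AnyBot", "/print/page.html"))     # False - excluded
print(rp.can_fetch("AnyBot", "/articles/page.html"))  # True - not disallowed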

How do I prevent robots from scanning my site?


There is no easy way to prevent all robots from crawling your site. However, you can request that well-behaved robots not visit your site by adding these two lines to your robots.txt file:

User-agent: *
Disallow: /

This asks all robots to refrain from crawling any pages on the site. Keep in mind that this will also stop search engine robots from indexing your site and including it in their search results.

Things to keep in mind


  • The robots.txt file is a polite request to robots, not a mandate they must follow. Malicious robots - for example, bots that scan the web for security vulnerabilities, or the email-address harvesters used by spammers - will ignore it entirely.
  • The robots.txt file is public. Anyone can see which sections of your server you don't want robots to examine. If you want to hide information, protect it with a password.
  • You need a separate "Disallow" line for every page, file, or directory you want to exclude.
  • Everything not explicitly disallowed is considered fair game for a robot to retrieve.
  • You can't have blank lines within a record. Blank lines are used to separate multiple records.
  • Regular expressions and wildcards cannot be used in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif" won't work (see the sketch after this list).
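
You can see the wildcard limitation for yourself with Python's standard urllib.robotparser, which follows the original specification and treats the Disallow path as a plain prefix:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /tmp/*",  # intended as a wildcard, but read literally
])

# The rule never matches a real path, so nothing is actually blocked.
print(rp.can_fetch("AnyBot", "/tmp/file.html"))  # True - the rule had no effect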

Uses for robots.txt


To exclude all robots from the entire website
User-agent: *
Disallow: /

"User-agent: *" means this section applies to all robots. "Disallow: /" asks the robot not to visit any pages on the site.

To allow all robots complete access
User-agent: *
Disallow:

This is exactly the same as having an empty robots.txt file or not having one at all.

To exclude all robots from part of the website
User-agent: *
Disallow: /directory1/
Disallow: /directory2/
Disallow: /directory3/

This asks all robots to avoid everything in /directory1/, /directory2/, and /directory3/. Robots are welcome to look at content in any other directory on the site.

To exclude a single robot
User-agent: Robot-Name
Disallow: /

This asks the robot named Robot-Name to refrain from crawling any part of the website.

To allow a single robot
User-agent: Robot-Name
Disallow:

User-agent: *
Disallow: /

This tells the robot named Robot-Name that it's welcome to examine the entire website, but asks all other robots to refrain from crawling any part of the site. (Note the blank line separating the two records.)
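
You can confirm that precedence with the same standard-library parser. "Robot-Name" and "OtherBot" are placeholder names:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: Robot-Name",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
])

print(rp.can_fetch("Robot-Name", "/page.html"))  # True - its own record wins
print(rp.can_fetch("OtherBot", "/page.html"))    # False - falls back to "*"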

To exclude all pages except one

The original robots.txt standard has no way to explicitly "allow" a page (some search engines support a nonstandard "Allow" directive, but not every robot honors it). Therefore, you need to "disallow" all the pages except the one you want robots to find. The easiest way to do this is to put all the files you want disallowed into a separate directory, and leave the one page you want crawled in the directory above it:

User-agent: *
Disallow: /good/bad/

This tells all robots they can examine everything in the /good directory but they shouldn't look in the /bad directory that lives under the /good directory.

You can also explicitly disallow all pages you don't want the robots to examine. For example:

User-agent: *
Disallow: /good/page1.html
Disallow: /good/page2.html
Disallow: /good/page3.html

This tells all robots that you don't want them to look at page1.html, page2.html, or page3.html in the /good directory.

To specify the location of your sitemap

You can tell search engine bots where to find your XML sitemap by adding a Sitemap line. It is independent of the User-agent records and can appear anywhere in the file:

Sitemap: http://www.yourwebsite.com/sitemap.xml