Learn/404-Errors-Drive-Visitors-Away

Revision as of 23:14, 6 October 2010 by Suzi Ziegler (talk | contribs) (When and why do these messages occur?)

What does a "page not found" message mean?

Have you ever typed a URL into the address bar only to receive a "404 error" message or learn that the page no longer exists? It can be frustrating and feel like a waste of time, especially when repeated attempts to find a page produce the same results.

When and why do these messages occur?

"Page not found", or "404 Not Found", is the standard response a server sends when it receives a request for a page that is dead, broken, or no longer exists. In other words, when a server receives a request for a URL and cannot find the requested page, it returns a 404 status code, and the browser displays a "404 error" or "Page not found" notice indicating the page cannot be loaded or opened. "Link rot" is an informal term for links that decay into this state as their targets disappear.
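As a rough sketch of the server side of this exchange (not any particular server's implementation), the decision reduces to a lookup: if the requested path matches a known resource the server answers 200, otherwise 404. The set of pages below is hypothetical.

```python
# Hypothetical set of paths this site actually serves.
KNOWN_PAGES = {"/", "/about", "/contact"}

def respond(path: str) -> tuple[int, str]:
    """Return an HTTP status code and reason phrase for a request path."""
    if path in KNOWN_PAGES:
        return 200, "OK"
    # Any path the server cannot map to a resource gets "404 Not Found".
    return 404, "Not Found"

print(respond("/about"))     # (200, 'OK')
print(respond("/old-page"))  # (404, 'Not Found')
```

A real server does the same thing against its filesystem or routing table; the browser then turns the 404 status into the "page not found" message the visitor sees.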

There are many reasons a link might be broken: the page may have been moved, renamed, or deleted, and in some cases access is blocked by content filters or firewalls, which can produce similar errors.

Another type of dead link occurs when the server that hosts the target page stops working, or the site relocates to a new domain name. In this case, the browser may return a Domain Name System (DNS) error, or it may display a site unrelated to the content sought. The latter can occur when a domain name is allowed to lapse and is subsequently re-registered by another party. Domain names acquired in this manner are attractive to those who wish to take advantage of the stream of unsuspecting visitors, which inflates hit counters and PageRank.

How do robots decide which pages to visit?

When a legitimate robot wants to visit a web page like www.example.com/good-content, it first checks for www.example.com/robots.txt. The robot wants to make sure the site owner is OK with it examining the content.

The robot looks for several things:

  • Does a robots.txt file exist?
  • Is the robot explicitly excluded in the robots.txt file?
  • Is the page or the directory the page is in explicitly excluded?

If neither the robot itself nor the page it is interested in is explicitly excluded, the robot will continue on to the page and do whatever it was set up to do.
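The checks above can be tried out with Python's standard-library robots.txt parser. The rules and URLs here are hypothetical examples, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, supplied as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /temp/",
]

parser = RobotFileParser()
parser.parse(rules)

# Pages outside /temp/ are allowed; pages under it are not.
print(parser.can_fetch("MyBot", "http://www.example.com/good-content"))  # True
print(parser.can_fetch("MyBot", "http://www.example.com/temp/draft"))    # False
```

In practice the parser would fetch the file itself (via `set_url` and `read`), but feeding it lines directly makes the allow/disallow decision easy to see.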

How to create a Robots.txt file

Robots.txt is a standard text file that you can create in Notepad or any other text editor. You can use Word or another word processor, but be sure to save the file as raw text (.txt) when you are done.

The file name should be all lower case: "robots.txt", not "Robots.Txt".

When the file is ready, you need to upload it to the top-level directory of your web server, so that it is accessible at the root of your domain, e.g. www.example.com/robots.txt.
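If you prefer to generate the file with a script rather than a text editor, a minimal sketch looks like this; the rules and directory names are just examples.

```python
# Build the robots.txt content line by line (example rules only).
lines = [
    "User-agent: *",
    "Disallow: /print/",
    "Disallow: /temp/",
]

# Write it as plain text with Unix line endings, which every robot accepts.
with open("robots.txt", "w", newline="\n") as f:
    f.write("\n".join(lines) + "\n")
```

You would then upload the resulting file to your server's top-level directory by whatever means you normally deploy files.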

What belongs in a robots.txt file?

Robots.txt files usually contain a single record. They look something like this:

User-agent: *
Disallow: /print/
Disallow: /temp/

In this example, the /print/ and /temp/ directories are excluded for all robots.

How do I prevent robots from scanning my site?

There is no easy way to prevent all robots from visiting your site. However, you can request that well-behaved robots not visit your site by adding these two lines into your robots.txt file:

User-agent: *
Disallow: /

This asks all robots to please not visit any pages on the site.
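You can confirm what a well-behaved robot would conclude from this record using Python's standard-library parser; the URL below is a hypothetical example.

```python
from urllib.robotparser import RobotFileParser

# The "disallow everything" record from above.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])

# With "Disallow: /", no page on the site may be fetched.
print(parser.can_fetch("AnyBot", "http://www.example.com/index.html"))  # False
```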

Things to be aware of

  • You need a separate "Disallow" line for every page, file, or directory that you want to exclude.
  • You can't have blank lines in a record. Blank lines are used to separate multiple records.
  • Regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif" won't work.
  • Everything not explicitly disallowed is considered fair game to retrieve.
  • The robots.txt file is a polite request to robots and not a mandate they have to follow. Robots that scan the web for security vulnerabilities, email address harvesters used by spammers, and other malicious bots will pay no attention to the request.
  • The robots.txt file is publicly available. Anyone can see what sections of your server you don't want robots to examine. If you want to hide information, password-protect the section instead of relying on robots.txt.


Retrieved from "http://aboutus.com/index.php?title=Learn/404-Errors-Drive-Visitors-Away&oldid=20672999"