< [[Glossary]]
  
==Robots.txt==
----
'''Robots.txt''' is a file that website owners put on their web servers to give instructions to robots, crawlers or spiders: computer programs that travel around the web, looking for specific kinds of web pages and content. It resides directly after the [[Glossary/Domain-name|domain name]] - /robots.txt - e.g. http://www.aboutus.org/robots.txt

People usually use robots.txt to prevent search engines from indexing - that is, including in their databases - web contact forms, print versions of web pages, and other pages whose content is duplicated elsewhere on the site. That's because you want search engines to index only the pages that are important to your business - for example, pages that showcase your products or services.

Robots.txt can also be used to ask specific robots to refrain from indexing a web page. A website owner might want to allow search engines to crawl the site, but not robots sent out by companies gathering information that could help competitors.
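For example, a small robots.txt covering both situations might look like the sketch below. The paths and the robot name are made-up placeholders, not directives any real site requires; only the User-agent and Disallow lines are part of the standard format.

<pre>
# Rules for every robot that honors robots.txt
User-agent: *
# Hypothetical contact-form and print-version paths to keep out of search results
Disallow: /contact
Disallow: /print

# Rules for one specific robot (a made-up name for illustration)
User-agent: ExampleScraperBot
Disallow: /
</pre>

The first group only keeps well-behaved crawlers out of the listed paths, while "Disallow: /" in the second group asks that one robot to stay away from the entire site.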
  
===Crawlers, robots, agents, bots and spiders===
These five terms all describe basically the same thing: an automated software program used to locate, collect, or analyze data from web pages. Search engines like [[Google.com|Google]] use a spider to collect data on web pages for inclusion in their databases. The spider also follows links on web pages to find new pages.  
 
  
[[AboutUs.org]] uses a robot to analyze web pages for its [[Online Visibility Audit]].
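To make that concrete - this is only a rough Python sketch with a placeholder starting address, not AboutUs's actual robot - the snippet below fetches a single page and collects the links a spider would follow next.

<pre>
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects href targets from <a> tags - the links a spider would follow."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

start_url = "http://www.example.com/"   # placeholder starting page
html = urlopen(start_url).read().decode("utf-8", errors="replace")

collector = LinkCollector(start_url)
collector.feed(html)

# A real spider would repeat the same fetch-and-parse step for each new link.
for link in collector.links:
    print(link)
</pre>

A polite spider also checks robots.txt before each fetch, as described in the next section.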
  
===How robots.txt works===
When a legitimate robot wants to visit a web page like '''www.example.com/good-content''', it first checks '''www.example.com/robots.txt''' to make sure the site owner is willing to let the robot examine the page.
 
 
The robot looks for several things:
 
* Is there a robots.txt file?
* Is the robot explicitly excluded in the robots.txt file?
* Is the robot excluded from this web page, the directory that contains this page, or both?
 
 
 
If the robot isn't explicitly excluded from the page, the robot will collect data, analyze it, or do whatever else it was designed to do.
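Python's standard library includes a parser for this file, which makes the sequence above easy to sketch. The robot name ExampleBot and the URLs below are placeholders; the point is only the order of operations: fetch robots.txt first, then ask whether the page may be visited.

<pre>
from urllib import robotparser

# The page the robot wants to visit, and the robots.txt that governs it
page_url = "http://www.example.com/good-content"    # placeholder page
robots_url = "http://www.example.com/robots.txt"

parser = robotparser.RobotFileParser()
parser.set_url(robots_url)
parser.read()  # fetch the file; if none exists, every page is treated as allowed

# Is this robot (identified by its user-agent name) excluded from the page?
if parser.can_fetch("ExampleBot", page_url):
    print("Allowed: the robot may collect or analyze this page.")
else:
    print("Excluded: a polite robot skips this page.")
</pre>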
 

For more information on robots.txt, read:
* [[Learn/How-To-Use-Robots.txt|How To Use Robots.txt]]
* [[Learn/Don't-Block-Search-Engine-Crawlers|Don't Block Search Engine Crawlers]]