There are a number of reasons you might want to block bots from all, or part, of your site. For example, if your site is not complete, if you have broken links, or if you haven’t prepared your site for a search engine visit, you probably don’t want to be indexed yet. You may also want to protect parts of your site from being indexed if those parts contain sensitive information or pages that you know cannot be accurately traversed or parsed.
Note
Google requests that you block URLs that will give the bot hiccups: for example, dynamic URLs that include calendar information, which have the potential for infinite expansion. You can block an individual URL by adding a rel="nofollow" attribute value to the anchor tag that links to it. For example:
<a rel="nofollow" href="botcantgohere">No follow me</a>
If you need to, you can make sure that part of your site does not get indexed by any search engine.
Note
Following the no-robots protocol is voluntary and based on the honor system, so all you can really be sure of is that a legitimate search engine that follows the protocol will not index the prohibited parts of your site when crawling from your site's root (if there are external links to excluded pages, those pages may still be traversed regardless of your robots.txt file). Don't rely on search engine exclusion for security. Information that needs to be protected should be kept in password-protected locations, served by software hardened for security purposes.
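For instance, on an Apache server, password protection can be set up with HTTP basic authentication. The following is a minimal sketch of an .htaccess file for a protected directory; the realm name and password-file path are hypothetical, and the password file itself would be created beforehand with Apache's htpasswd utility:

# Hypothetical .htaccess for a protected directory (Apache)
AuthType Basic
AuthName "Private area"
# Password file created beforehand with: htpasswd -c /etc/apache2/.htpasswd username
AuthUserFile /etc/apache2/.htpasswd
# Any user listed in the password file may view the directory
Require valid-user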
To block bots from traversing your site, place a text file named robots.txt in your site’s web root directory (where the HTML files for your site are placed). The following syntax in the robots.txt file blocks all compliant bots from traversing your entire site:
User-agent: *
Disallow: /
You can exercise more granular control over which bots you ban and which parts of your site are off-limits: the User-agent line specifies the bot that is to be banished, and the Disallow line specifies a path, relative to your root directory, that is banned territory.
Note
A single robots.txt file can include multiple User-agent bot bannings, each disallowing different paths.
For example, you would tell the Google search bot not to look in your cgi-bin directory (assuming the cgi-bin directory is right beneath your web root directory) by placing the following two lines in your robots.txt file:
User-agent: googlebot
Disallow: /cgi-bin/
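As the note above says, a single file can ban several bots at once. Here is a sketch of such a robots.txt file; googlebot and Slurp (Yahoo!'s crawler) are real bot identifiers, but the paths shown are hypothetical examples:

# Keep Google's bot out of a staging area
User-agent: googlebot
Disallow: /staging/

# Keep Yahoo!'s bot out of a drafts directory
User-agent: Slurp
Disallow: /drafts/

# All other compliant bots: stay out of cgi-bin
User-agent: *
Disallow: /cgi-bin/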
Warning
As I’ve mentioned, the robots.txt mechanism relies on the honor system. By definition, it is a text file that can be read by anyone with a browser. Don’t rely on every bot honoring the request within a robots.txt file, and don’t use robots.txt in an attempt to protect sensitive information from being uncovered on your site by humans (this is a different issue from using it to avoid publishing sensitive information in honest search engine indexes like Google). In fact, someone trying to hack your site might specifically read your robots.txt file in an attempt to uncover site areas that you deem sensitive.
For more information about working with the robots.txt file, see the Web Robots FAQ. You can also find tools for managing and generating custom robots.txt files and robot meta tags (explained later) at http://www.rietta.com/robogen/ (an evaluation version is available for free download).
The Googlebot and many other web robots can be instructed not to index specific pages (rather than entire directories), not to follow links on a specific page, and to index but not cache a specific page, all via the HTML meta tag placed inside the head tag.
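As a sketch of where such a tag goes, here is a minimal page whose content is purely a placeholder; the content attribute values used are explained below:

<html>
<head>
<title>A hypothetical page</title>
<!-- Tell all compliant bots not to index this page or follow its links -->
<meta name="robots" content="noindex, nofollow">
</head>
<body>
<p>Page content that should stay out of search indexes...</p>
</body>
</html>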
Note
Google maintains a cache of documents it has indexed. The Google search results provide a link to the cached version in addition to the version on the Web. The cached version can be useful when the Web version has changed and also because the cached version highlights the search terms (so you can easily find them).
The meta tag used to block a robot has two attributes: name and content. The name attribute is the name of the bot you are excluding. To exclude all robots, you'd include the attribute name="robots" in the meta tag.
To exclude a specific robot, the robot's identifier is used. The Googlebot's identifier is googlebot, and it is excluded by using the attribute name="googlebot". You can find the entire database of registered and excludable robots and their identifiers (currently about 300) at http://www.robotstxt.org/db.html.
Note
The more than 300 robots in the official database are the tip of the iceberg. There are at least 200,000 robots and crawlers “in the wild.” Some of these software programs have malicious intent; all of them eat up valuable web bandwidth. For more information about wild (and rogue) robots, visit Bots vs. Browsers.
The possible values of the content attribute are shown in Table 4-1. You can use multiple attribute values, separated by commas, but you should not use contradictory attribute values together (such as content="follow, nofollow").
Table 4-1. Content attribute values and their meanings
Attribute value | Meaning
---|---
follow | Bot can follow links on the page
index | Bot can index the page
noarchive | Only works with the Googlebot; tells the Googlebot not to cache the page
nofollow | Bot should not follow links on the page
noindex | Bot should not index the page
For example, you can block Google from indexing a page, following
links on a page, and caching the page using this meta
tag:
<meta name="googlebot" content="noindex, nofollow, noarchive">
More generally, the following tag tells legitimate bots (including the Googlebot) not to index a page or follow any of the links on the page:
<meta name="robots" content="noindex, nofollow">
For more information about Google’s page-specific tags that exclude bots, and about the Googlebot in general, see http://www.google.com/bot.html.