Sometimes search engines overlook your web pages because few links point to them, because they’re new, or because of the kind of content they contain (for example, dynamic content, Flash content, AJAX, or other media). Search engines can find out about these pages through a sitemap, a file that lists the pages on your site that you want search engines to index. Sitemap files can be in XML or plain-text format, and each format has a specific syntax that must be followed for the sitemap to work properly.
XML, or eXtensible Markup Language, is somewhat similar to HTML and is used to represent data in a simple, organized, universally accessible way. XML files are simply text files that contain XML code, saved with the .xml extension.
Plain-text files are files created in a text editor that contain only text, and have a .txt extension.
Along with using a sitemap to tell search engines about your pages, you can optionally add extra information about each page. The information can include when the page was last modified, how often the page is updated, and how the page’s importance ranks in relation to the other pages on your site.
Many search engines, including Google, Yahoo!, and MSN, follow the sitemap standards at http://www.sitemaps.org. You can find more information on sitemaps and sitemap standards on that site.
Sitemaps are great for pages that are more difficult for search engines to index. This includes pages with Flash, AJAX, and dynamic content. If you have pages that utilize any of these technologies, it’s a good idea to create a sitemap. Although there’s no guarantee that search engines will index every page in your sitemap, creating one won’t hurt your rankings.
All you need to create a sitemap is a plain-text editor, such as TextEdit on the Mac or Notepad on the PC. As I mentioned earlier, there are two types of sitemaps: XML and plain text. The creation process is slightly different depending on which type of sitemap you decide to create.
Some websites, such as http://www.xml-sitemaps.com/, create sitemaps for you by giving you cut-and-paste text to put into an XML file and upload to your site. This is a great option, especially if you want to create a sitemap quickly.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.yoursite.com/page1.html</loc>
    <lastmod>2008-01-01</lastmod>
    <changefreq>hourly</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>http://www.yoursite.com/page2.html</loc>
    <lastmod>2008-02-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://www.yoursite.com/page3.html</loc>
    <lastmod>2008-03-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.2</priority>
  </url>
</urlset>
Here’s a walkthrough of the preceding code.
<?xml version="1.0" encoding="UTF-8"?>
This line is standard as the first line in an XML file, and it includes information about the XML version used in the file and how the text data is encoded.
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> ...(code not shown here) </urlset>
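This code defines the set of URLs for your sitemap, all of which go inside the
urlset element. The
xmlns attribute declares an XML namespace, which identifies the
vocabulary of elements used in the file. This namespace is the
one defined by the sitemap protocol at sitemaps.org (version 0.9).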
<url> ...(code not shown here) </url>
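Each url element describes one page on your site. The
loc element, which is required, contains the page’s URL.
The next element, lastmod, contains the date that the URL in the
loc element was last modified.
The format for the date is YYYY-MM-DD: a four-digit year, a
two-digit month, and a two-digit day, separated by hyphens. In the
first url entry of the example, 2008-01-01 represents January 1, 2008.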
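As a minimal sketch, assuming you generate your sitemap with a Python script, the standard library can produce this date format directly. The helper name format_lastmod is hypothetical and used only for illustration.

```python
from datetime import date


def format_lastmod(modified: date) -> str:
    """Return a date in the YYYY-MM-DD form used by the lastmod element."""
    # date.isoformat() always yields a four-digit year, two-digit month,
    # and two-digit day, separated by hyphens.
    return modified.isoformat()


print(format_lastmod(date(2008, 1, 1)))  # prints 2008-01-01
```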
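The changefreq element tells search engines how often the page is
likely to change. The valid values are always, hourly, daily,
weekly, monthly, yearly, and never.
It’s important to note that declaring a page’s
change frequency doesn’t necessarily mean search engine
spiders will crawl the page as often as it’s updated.
This is more of a guideline for them. In fact, spiders will
occasionally crawl pages that have a value of
never, just in case any changes have been made.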
The priority element
dictates a page’s importance relative to other pages in your
sitemap. This element has a default value of 0.5, and it ranges
from 0.0 to 1.0. Giving your pages higher-priority ratings
doesn’t mean the pages will rank higher than other sites in
search engine results. Rather, it tells search engines
which pages on your site you consider more important than the other pages on
your site. Search engines may use this hint when selecting between
URLs on the same site, but it isn’t guaranteed to change where a
page appears in the results pages.
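Because these two elements accept only a limited set of values, it can help to check them before writing a sitemap. The following is a small sketch, not part of any sitemap tool. The function name check_entry is hypothetical, and the list of change frequencies is the one defined by the sitemap protocol at sitemaps.org.

```python
from typing import List

# Change frequencies defined by the sitemap protocol.
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly",
                    "monthly", "yearly", "never"}


def check_entry(changefreq: str, priority: float) -> List[str]:
    """Return a list of problems; an empty list means both values are valid."""
    problems = []
    if changefreq not in VALID_CHANGEFREQ:
        problems.append(f"unknown changefreq: {changefreq!r}")
    if not 0.0 <= priority <= 1.0:
        problems.append(f"priority out of range: {priority}")
    return problems


print(check_entry("weekly", 0.5))     # prints []
print(check_entry("sometimes", 1.5))  # reports two problems
```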
In summary, remember to declare the XML version and the
encoding, wrap your url elements in a
urlset element, make sure to declare
the namespace, and include at least the
loc element inside each url element.
You can declare a maximum of 50,000 URLs in a sitemap.
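The structure walked through above can also be generated with a short script instead of being typed by hand. This is a minimal sketch using Python’s standard xml.etree.ElementTree module. The function name build_sitemap and the tuple layout of the input are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (loc, lastmod, changefreq, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes reserved characters in text automatically.
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = priority
    # encoding="unicode" returns a str with no XML declaration, so add one.
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


pages = [
    ("http://www.yoursite.com/page1.html", "2008-01-01", "hourly", "1.0"),
    ("http://www.yoursite.com/page2.html", "2008-02-01", "weekly", "0.5"),
]
print(build_sitemap(pages))
```

When saving the result to a file, write it with UTF-8 encoding so that it matches the declaration on the first line.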
If you have a lot of URLs in your sitemap, you may want to consider creating multiple sitemaps and linking them together. You can find instructions for doing that at http://www.sitemaps.org.
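The sitemap protocol links multiple sitemaps together through a sitemap index file, which uses a sitemapindex element containing one sitemap entry per file. The sketch below splits a long URL list into groups of at most 50,000 and builds such an index. The function names and the sitemap1.xml-style filenames are hypothetical, chosen only for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50000  # per-sitemap limit


def chunk_urls(urls, size=MAX_URLS):
    """Split a list of URLs into lists of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Build a sitemap index that points at each child sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = sitemap_url
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


urls = [f"http://www.yoursite.com/page{n}.html" for n in range(1, 120001)]
chunks = chunk_urls(urls)
print(len(chunks))  # prints 3

# Hypothetical filenames for the child sitemaps, one per chunk.
names = [f"http://www.yoursite.com/sitemap{n}.xml"
         for n in range(1, len(chunks) + 1)]
print(build_sitemap_index(names))
```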
When creating an XML sitemap, or any XML
file for that matter, certain characters can’t be used literally
because they’re reserved XML characters. To use one of these
characters, such as
<, you must escape it by using special syntax to
represent it. Table 2.1,
“Characters that must be escaped,” shows which
characters must be escaped and how to escape them.
Table 2.1. Characters that must be escaped
Character            Escape code
& (ampersand)        &amp;
' (single quote)     &apos;
" (double quote)     &quot;
> (greater than)     &gt;
< (less than)        &lt;
Most likely, the only character escaping you’ll need to do when creating a sitemap is for a dynamic URL. For example, you may want to include a page that keeps track of a person’s username and ID, so your page URL may look like this:
To separate URL parameters (the username and
ID values in this case), you need to use
an ampersand (
&). To escape an
ampersand, use the escape code
&amp;. In your sitemap code, every & in the URL would then need to be written as &amp;.
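If you build sitemap entries with a script, the escaping can be left to a library. The following sketch uses escape from Python’s standard xml.sax.saxutils module. The URL and its parameter names are hypothetical, made up purely to show the effect.

```python
from xml.sax.saxutils import escape

# Hypothetical dynamic URL; the parameter names are for illustration only.
raw_url = "http://www.yoursite.com/profile.php?username=jane&id=42"

# escape() replaces &, <, and > by default; the extra mapping covers quotes.
escaped_url = escape(raw_url, {"'": "&apos;", '"': "&quot;"})

print(f"<loc>{escaped_url}</loc>")
# prints <loc>http://www.yoursite.com/profile.php?username=jane&amp;id=42</loc>
```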
Following is an example of a plain-text sitemap.
http://www.yoursite.com/page1.html
http://www.yoursite.com/page2.html
http://www.yoursite.com/page3.html
Your plain-text sitemap should contain only the URLs for the pages on your site, one URL per line. Don’t include header, footer, or any other text in your plain-text sitemap.
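Since a plain-text sitemap is nothing more than a list of URLs, generating one takes only a few lines. This is a minimal sketch, and the function name build_text_sitemap is hypothetical.

```python
def build_text_sitemap(urls):
    """Return a plain-text sitemap: one URL per line and nothing else."""
    # Strip stray whitespace so each line holds exactly one URL.
    return "\n".join(url.strip() for url in urls) + "\n"


urls = [
    "http://www.yoursite.com/page1.html",
    "http://www.yoursite.com/page2.html",
    "http://www.yoursite.com/page3.html",
]
print(build_text_sitemap(urls), end="")
```

To save the result, write the returned string to a text file using UTF-8 encoding.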
Your sitemap should be in the highest
directory level that you want to be indexed. For example, if you
want your entire site to be indexed, you’d put your sitemap
in your root directory. If your domain was Yoursite.com and your
sitemap file was called sitemap.xml,
the URL to your sitemap would be http://www.yoursite.com/sitemap.xml.
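Placing a sitemap called
sitemap.xml at the root directory of
your server makes it easy to find, but the sitemap protocol doesn’t
require search engines to look for it there on their own. To be safe,
submit your sitemap as well, or list its URL in a Sitemap line in your robots.txt file.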
Sometimes you may only want part of your site to be indexed. For
example, you may have several folders on your web server that are
password-protected and one folder for the public to view. In this
case, you’d put your sitemap in the public folder.
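Under the sitemap protocol, a sitemap’s location determines which URLs it may list: only pages at or below the sitemap’s own directory, on the same host. The sketch below checks that rule. The function name in_sitemap_scope and the folder names in the example are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def in_sitemap_scope(sitemap_url: str, page_url: str) -> bool:
    """Return True if page_url is at or below the sitemap's directory."""
    sitemap, page = urlsplit(sitemap_url), urlsplit(page_url)
    # The page must use the same scheme and host as the sitemap.
    if (sitemap.scheme, sitemap.netloc) != (page.scheme, page.netloc):
        return False
    # Directory containing the sitemap, always ending in a slash.
    base_dir = posixpath.dirname(sitemap.path).rstrip("/") + "/"
    return page.path.startswith(base_dir)


# Hypothetical folder names, for illustration only.
sitemap = "http://www.yoursite.com/public/sitemap.xml"
print(in_sitemap_scope(sitemap, "http://www.yoursite.com/public/page1.html"))   # prints True
print(in_sitemap_scope(sitemap, "http://www.yoursite.com/private/page2.html"))  # prints False
```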
Here’s an example of what the URL to your sitemap would be if
you were to call the sitemap file
sitemap.xml and the public folder
For Google, you can log in to your Webmaster Tools account (assuming you’ve created an account) at http://www.google.com/webmasters and give Google the URL to your sitemap. The process is the same for MSN. Log in to Webmaster Tools at http://webmaster.live.com, and submit your website and sitemap URL. To submit your sitemap to Yahoo!, go to https://siteexplorer.search.yahoo.com/submit, submit your site, and submit your sitemap in the Submit Site Feed area.
Once you’ve submitted your sitemap, the search engines will do the rest of the work for you. Although it’s not guaranteed that submitting a sitemap will force spiders to crawl all of your pages, it still gives you more control over what pages are indexed and how they rank in relation to other pages on your site.
Copyright © 2009 O'Reilly Media, Inc.