I am sure that a lot of you have heard of the file named robots.txt (also called
a "robot exclusion file") before. But what does this file really do? You can
think of a robots.txt file as a list of rules that well-behaved search engine
spiders follow when they crawl your site. A robots.txt file gives you, the
Webmaster, a say in what does and does not get indexed when spiders come to your
little corner of the web.
Okay, I can hear a few people asking why anyone would want to keep some things
from being indexed. I thought the goal was to get indexed, right? Well, yes and
no. There are quite a few instances when blocking spider access to certain areas
or pages is almost a must. Here are several examples of what a person might
want to restrict access to: temporary files or directories, presentations, information
with a specific sequential order, testing directories, or cgi-bin. As you can
see from just these few examples, there are definitely files that you would
want to keep from being indexed. While there is a Meta tag (<meta
name="Robots" content="attributes">) available that does
in essence the same thing as a robots.txt file, it is not currently supported
by every search engine. Another drawback is that the tag needs to go on every page
you do not want indexed, as opposed to one central point of control.
Writing 101
All right, I have given you a few vague examples of what might be included
in such a file. There is never going to be a set list of things that should and
should not be indexed; a robots.txt file needs to be tailored to your site and
your content. There is, however, a very specific format that needs to be
followed when creating a robots.txt file.
Step 1: First, a robots.txt file needs to be created in Unix format, that is,
with Unix line endings. The reason for this is to ensure that there are no carriage
returns inserted into your file. I would suggest looking at Notepad++, my personal
favorite text editor due to the number of languages and formatting options it supports.
Notepad++ is able to save a document directly in Unix format by selecting
"Convert to Unix Format" from the "Format" menu. Other
plain text editors should be able to achieve the same results; however, stay away
from editors like WordPad or Microsoft Word when creating your robots.txt file.
I also do not recommend using HTML editors for this task.
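If you would rather script this step than rely on an editor setting, here is a
minimal Python sketch of the same idea. The rules, the function names, and the
file path are my own placeholders for illustration, not part of any standard tool.

```python
from pathlib import Path

# Hypothetical rules for illustration; tailor them to your own site.
RULES = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /temp/temp.html",
]


def write_unix_robots(path, lines):
    """Write the lines joined with bare line feeds (LF), never CRLF."""
    text = "\n".join(lines) + "\n"
    # newline="\n" stops Python from translating "\n" into "\r\n" on Windows.
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        handle.write(text)


def has_carriage_returns(path):
    """Return True if the file contains any carriage-return byte."""
    return b"\r" in Path(path).read_bytes()
```

Calling write_unix_robots("robots.txt", RULES) and then has_carriage_returns("robots.txt")
is a quick way to confirm the file is clean before you upload it.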
Step 2: Now let's begin adding some content to our file. A robots.txt file is
made up of two fields. The first is the User-agent line, which specifies
the spider/robot that we intend to limit or allow. An example of this
would be:
User-agent: googlebot
In addition to allowing or restricting specific spiders, you can use a wildcard
to target all spiders coming to your site. To do this, simply place
an asterisk (*) in your User-agent line. Example:
User-agent: *
Step 3: Now we will begin to disallow our desired content; either a file or
a whole directory can be kept from being indexed with a robots.txt file. We will
do this with the second line of our file, the Disallow: directive line. Here
is an example:
Disallow: /cgi-bin/
Or for a file:
Disallow: /temp/temp.html
Moreover, you are not limited to just one Disallow per User-agent; in fact,
you can get pretty granular as to what you give spiders access to. Just make
sure that you give each Disallow its own line. If you leave the Disallow field
empty (i.e. Disallow: ), you are giving permission for all files and directories
to be indexed.
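To see how a spider might read these rules, here is a small sketch using
Python's standard urllib.robotparser module. It is only one parser's
interpretation, and real search engines may differ in the details. The
build_parser helper and the sample paths are my own placeholders.

```python
from urllib.robotparser import RobotFileParser

# The two Disallow examples from above, applied to every spider.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/temp.html
"""


def build_parser(text):
    """Parse robots.txt text held in memory instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# Hypothetical paths: two that match a Disallow rule and two that do not.
for path in ("/cgi-bin/form.pl", "/temp/temp.html", "/temp/other.html", "/index.html"):
    verdict = "allowed" if parser.can_fetch("googlebot", path) else "blocked"
    print(path, verdict)
```

The first two paths come back blocked and the last two allowed, which shows
that a directory rule covers everything beneath it while a file rule covers
only that one file. Parsing "User-agent: *" followed by an empty "Disallow:"
the same way reports every path as allowed.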
One word of caution when writing your robots exclusion file: if you are not
careful, you can shut off one or all spiders' access to your site completely.
This would be done by prohibiting access at the root level by using a slash
(/). Example:
Disallow: /
If you were to use the asterisk wildcard to specify your User-agent with the
above example, you would block all search engines from every part of your site.
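Because this mistake is so easy to make, a quick check before uploading can
help. The sketch below again leans on Python's urllib.robotparser; the function
name, the sample paths, and the spider name "otherbot" are hypothetical and
only there for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample of paths that should stay reachable on your site.
SAMPLE_PATHS = ["/", "/index.html", "/images/logo.gif"]


def blocks_whole_site(robots_text, agent, paths=SAMPLE_PATHS):
    """Return True if `agent` is refused every one of the sample paths."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not any(parser.can_fetch(agent, path) for path in paths)


ALL_SPIDERS = "User-agent: *\nDisallow: /\n"
ONE_SPIDER = "User-agent: googlebot\nDisallow: /\n"

print(blocks_whole_site(ALL_SPIDERS, "googlebot"))  # wildcard: every spider is shut out
print(blocks_whole_site(ONE_SPIDER, "googlebot"))   # only googlebot is named and shut out
print(blocks_whole_site(ONE_SPIDER, "otherbot"))    # a spider that is not named still gets in
```

If the function returns True for a spider you actually want on your site, it is
worth rereading your Disallow lines before the file goes live.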
Step 4: That is all there is to writing a robots.txt file. The next step
is to upload it to the root directory of your site: www.yoursite.com/. Make
sure that you upload it as ASCII, just like all other text files, and the file
is in place.
Step 5: Writing a robots.txt file is pretty straightforward once you get comfortable
with the file's format. Once your file is complete and uploaded, it is
good practice to have it validated; you can do this through www.searchengineworld.com.
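For a rough local check before you reach for a validator, something like the
following Python sketch can catch the most common slips. It is not a
substitute for a real validator: it only knows the two fields covered in this
article, and the function name and messages are my own.

```python
# Only the two fields discussed in this article; real files may use others.
KNOWN_FIELDS = {"user-agent", "disallow"}


def lint_robots(text):
    """Return a list of (line_number, message) problems found in text."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.split("\n"), start=1):
        if "\r" in raw:
            problems.append((number, "carriage return found; save in Unix format"))
        # Drop anything after a pound sign, then surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        field, separator, _value = line.partition(":")
        field = field.strip().lower()
        if not separator:
            problems.append((number, "expected 'Field: value'"))
        elif field not in KNOWN_FIELDS:
            problems.append((number, "unknown field " + repr(field)))
        elif field == "user-agent":
            seen_user_agent = True
        elif not seen_user_agent:
            problems.append((number, "Disallow appears before any User-agent line"))
    return problems
```

An empty list means the sketch found nothing to complain about. Each entry
otherwise gives the line number and a short description of the problem.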
Notes: Aside from search engine specific information, you are also able to comment
your robots.txt file. This is achieved by using the pound sign (#). Though you
can place a comment after the Disallow field, it is not recommended. Instead,
make sure that you begin your comments on a new line starting with the pound
sign. Example:
# Just making a comment
User-agent: googlebot
Disallow: /cgi-bin/
If you are hesitant about the different steps involved in creating a robots.txt
file, there are applications available that will help you through the creation
process. One application that does this is RoboGen from Rietta Solutions. RoboGen
provides you with an Explorer-like view that lets you browse the files and directories
that you want to restrict access to and creates the robot exclusion file as
you go.
In Closing
As with all things, there are going to be some drawbacks you will need to contend
with. With the robots.txt file it is the road map effect: for those who want
to see what you do not want made publicly available, the file provides a prime
place to begin looking. Since all robot exclusion files are named the same and
are always in the same place, anyone probing your site will know where to find
it.
Still, the pros outweigh the cons. By having a robots.txt file present
on your site, you keep important or private information from ending up in a search
engine's cache, where it would be publicly available to a mass audience. This is what
the file is there for. If, on the other hand, you have something that not only
needs to be kept private but also needs to be protected, you should make sure
that access is restricted through much more secure and appropriate means. Robot
exclusion files were designed as a method for Webmasters to delimit the access
robots have to their sites, providing robots with one central place to look
when they begin the task of indexing. To this end the file serves its purpose
extremely well, and when used properly it makes the job of a Webmaster much easier.
About the Author:
Matt Benya is a co-owner of Primate Studios (www.primatestudios.com), an independent
development house focusing on CGI illustration, Web design and multimedia. With
20+ years of art experience and a degree in Network administration, Matt is well
suited to translate your needs to the Web.