|So you want to know about robots.txt files, eh? Possibly because you're seeing 404 errors for that file piling up...don't worry, you're not alone and there's hope for you yet :o)
Simply put, a robots.txt file is the first thing (almost all) robots (aka 'bots' or 'spiders') look for when visiting your site. The "Robots Exclusion Protocol" is a fancy name for a very simple concept: a universal way of telling bots which files they're not allowed to look at on your
site, as well as which directories and/or pages they're allowed to add to their indexes for display in Search Engine Results Pages (SERPs). The file should be a plain text file (.txt), and should reside in your root "public_html/" or "www/" directory on your server (the
same place you put the index.html file displayed at "www.yourdomain.com"), so bots can find it at "www.yourdomain.com/robots.txt".
Why would you want to keep Search Engine (SE) bots from seeing or indexing something? I'm glad you asked!
There are a lot of reasons:
--Most people (including me) believe that you should make any site as robot-friendly as possible. A robot wants to see a robots.txt file; it's the first thing it looks for, and if the file isn't there then the first thing the robot gets is a 404 error.
--If you keep a folder on the server for works in progress and the like, you don't want those pages coming up in search results, so you don't want that folder indexed.
--Some robots crawl the web gobbling up images as thumbnails for image-based searching (like Google's image search). I'd just as soon the robots not index all my individual image files, so I disallow them. Sure, bandwidth is cheap but it's MINE, dammit, MINE!!
--Some bots roam around gobbling up email addresses for sale to spammers, so if you have a CGI-based email list that uses text data files, you DEFINITELY don't want robots seeing that. (you would exclude your cgi-bin)
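To make that concrete, a file covering the last three reasons might look like this (the directory names here are just examples; use whatever yours are actually called):

```
User-agent: *
Disallow: /wip/
Disallow: /images/
Disallow: /cgi-bin/
```

Each Disallow line keeps bots out of that directory and everything under it.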
There could be tons of reasons you might not want bots indexing something, and the robots.txt exclusion protocol is simple enough and versatile enough to give you all the options you want.
Plus, you can exclude certain bots and not others if you want. The basic format allows all bots to look at everything on the site, but with one extra record you could, say, allow every spider except googlebot to look at your cgi-bin.
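Concretely, assuming "googlebot" is the user-agent you want to match (it's the name Google's spider announces itself with), a file that keeps googlebot out of your cgi-bin while letting everyone else go everywhere would be:

```
User-agent: googlebot
Disallow: /cgi-bin/

User-agent: *
Disallow:
```

A bot uses the record that matches its own user-agent name and falls back to the "*" record; an empty Disallow line means nothing is off limits.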
A lot of this stuff may not matter to you or make any difference to you; it's just by way of explanation. If you don't care what the bots index, just use the basic allow-all file,
which lets every bot look at and index everything on the site. But, in my opinion, even if you're letting them look at everything you still want the file there for them to read.
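That basic allow-all file is just two lines:

```
User-agent: *
Disallow:
```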
It's important to remember that robots.txt is NOT a security measure (you'll want .htaccess for that). Bots aren't required to read and abide by the contents of your robots.txt file, but as a rule SE spiders (especially those from the major SEs) always will.
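Just as a sketch (assuming an Apache server, since .htaccess is an Apache thing), actually locking down that email-list data file looks something like this; robots.txt only asks nicely, while this actually refuses the request:

```apache
# .htaccess sketch: deny everyone web access to the mailing-list data file.
# "members.txt" is a made-up name here; substitute your real file.
# Apache 2.4+ syntax:
<Files "members.txt">
    Require all denied
</Files>
```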
You can find a full explanation of the proper syntax and use of robots.txt at this wonderful page http://www.robotstxt.org/wc/exclusion-admin.html.
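And if you'd like to double-check how a well-behaved bot will actually interpret your file before trusting it, Python's standard library ships a parser for this very protocol. A quick sketch (the rules fed in are just the googlebot example from above):

```python
# Sketch: how a well-behaved bot checks robots.txt before fetching a page,
# using Python's standard-library parser for the Robots Exclusion Protocol.
from urllib import robotparser

rules = robotparser.RobotFileParser()
# Instead of fetching robots.txt over the network, feed it rules directly:
rules.parse("""
User-agent: googlebot
Disallow: /cgi-bin/

User-agent: *
Disallow:
""".splitlines())

print(rules.can_fetch("googlebot", "/cgi-bin/mail.cgi"))      # False
print(rules.can_fetch("SomeOtherBot", "/cgi-bin/mail.cgi"))   # True
print(rules.can_fetch("googlebot", "/index.html"))            # True
```

can_fetch() answers the same question a polite spider asks itself: given my user-agent name, may I fetch this URL?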