Once folks start administering their own web server, they quickly understand the sheer size of the task. There are so many new things to learn, and it can all seem daunting. You keep running into things that you didn’t know about. One such thing is robots.txt. What is this file and what does it do?
To understand what the robots.txt file does, it’s important to take a quick step backward. The internet contains many robots. These are programs that scour the internet. They are also called spiders, crawlers, etc. Basically, they go out and try to find what’s out there. A search engine like Google, for example, uses them to see what content is on the web so that it can index it. It can then present that information, in an organized fashion, to people searching the internet. I hope that makes sense.
Well, the robots.txt file tells participating robots what to index and what not to index. So, you can even tell a robot that you don’t want it to see anything at all. The problem with relying on this to completely hide parts of your site is that the robots.txt file is easy to find. It sits at the root of your site and anyone can read it. But it’s a good way to keep certain parts of your site out of search engines. And that is part of the battle of securing a site. Now, let’s look at a sample robots.txt file:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
This robots.txt file tells all robots to stay out of the cgi-bin, images, tmp, and private directories. There are many things you can do with a robots.txt file, including specifying where a sitemap is located. Hmmm, maybe we should cover that topic sometime in the near future. So, the robots.txt file is a simple text file that tells internet spiders where they can and cannot go. It’s that simple. Well, unless you get more involved.
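To see how a well-behaved robot might read the sample file above, here is a minimal sketch using Python's standard urllib.robotparser module. The robot name "ExampleBot" and the test paths are made up for illustration, and the rules are parsed from a string instead of being fetched from a real site.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from above, held in a string so nothing is fetched.
SAMPLE_RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
"""


def build_parser(rules_text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


parser = build_parser(SAMPLE_RULES)

# "ExampleBot" is a hypothetical robot name; the paths are illustrative only.
for path in ("/index.html", "/private/notes.html", "/images/logo.png"):
    verdict = "allowed" if parser.can_fetch("ExampleBot", path) else "blocked"
    print(path, verdict)
```

Notice that the robot itself has to run this check before it requests a page. Nothing in robots.txt enforces the rules on the server side, which is why the file only works for participating robots.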
February 23rd, 2008 at 3:58 am
How do we know that spiders actually follow these rules? Is it possible to design a spider that is focussed on searching those parts that are forbidden?