# Copyright www.website-ownersclub.com May 2009
# The file name robots.txt must be in lowercase, and the file must sit in the root directory of the site
# All names should be in lowercase
# There is no way to "allow" a spider, you can only disallow. *** ALWAYS REMEMBER THIS ***
# You can explicitly name every file you want disallowed.
#
# Any line starting with '#' is a comment line - ignored by robots.
# Use a space AFTER any # and keep each comment on a line by itself
#
# Some comments come from - http://www.robotstxt.org/faq/robotstxt.html
# and http://www.robotstxt.org/orig.html
# Look BELOW the LIVE SECTION for correct style samples
#
# *** LIVE SECTION BELOW ***
#
User-Agent: *
Disallow: /sen
Disallow: /images
Disallow: /stats
#
# *** LIVE SECTION ABOVE ***
#
# *************************************************************************
# Sample style to help you do it correctly - Below here
# *************************************************************************
# Two common errors:
# Wildcards * are NOT SUPPORTED for directories/folders or file names
#   instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp/'
# You shouldn't put more than one path on a Disallow line
# The wildcard * is ONLY USED for robot names (the User-agent line) - not for directories/folders and file names
#
# There is no way to "allow" a spider, you can only disallow. *** ALWAYS REMEMBER THIS ***
#
# 1
# User-agent: webcrawler
# Disallow:
# This first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.
# 2
# User-agent: lycra
# Disallow: /
# This second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed.
# Because all relative URLs on a server start with '/', this means the entire site is closed off.
# 3
# User-agent: *
# Disallow: /tmp
# Disallow: /logs
#
# The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs.
# Note the '*' is a special token that can ONLY be used for user-agents, that is, robots/spiders etc.
# You CANNOT USE wildcard "*" patterns or regular expressions in Disallow lines.
#
# If you want to exclude particular robots - list the EXCLUDED search engines/robots FIRST
#   - together with what each of them is disallowed
#   - then list the other robots below
#   - so the restricted robot finds its instructions FIRST (before the unrestricted robots)
#   - so ALWAYS do the * (for ALL robots) LAST
#   - Put your most specific directives first, and your more inclusive ones (with wildcards) last
#
# === Examples with comments === Start
#
# This next part is WRONG ============ WRONG
# User-Agent: *
# Disallow: /express*
# Disallow: /*.zip
# Disallow: /images
# Disallow: /stats
# The above part is WRONG ============ WRONG
#
# Below is CORRECT ========= CORRECT
# User-Agent: *             this record applies to every robot
# Disallow: /exp            matches every URL starting with /exp, so it also covers /express
# Disallow: /*.zip          WRONG - WILDCARDS ARE NOT SUPPORTED
# Disallow: /express.zip    name the file in full, or use /exp (if you need to hide the file name)
# Disallow: /images         cannot go into this folder or its sub-folders
# Disallow: /stats          cannot go into this folder or its sub-folders
# Disallow: /help           disallows both /help.html and /help/index.html, whereas
# Disallow: /help/          would disallow /help/index.html but allow /help.html
# Disallow: /food.html      would disallow /food.html (in the root folder)
# Disallow:                 An empty value indicates that all URLs can be accessed/retrieved.
# Disallow:                 At least one Disallow field needs to be present in a record.
# Disallow: /               All relative URLs on a server start with '/', so this means the entire site is closed off
#
# An empty "robots.txt" file has NO PURPOSE: all robots will consider themselves WELCOME.
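#
# Ordering sketch - illustration only, NOT live rules
# 'examplebot' and '/private' are made-up names used to show the layout; they are not real entries.
# This is one way the ordering advice above could look: the named (restricted) robot first, the '*' record LAST.
# In a live file, separate the two records with a blank line.
#
# User-agent: examplebot
# Disallow: /
#
# User-agent: *
# Disallow: /private
#
# Read this way, 'examplebot' is shut out of the whole site,
# and every other robot is only kept out of URLs starting with /private.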
#
# SUMMARY
# =======
# There is no way to "allow" a spider, you can only disallow.
#
# To prevent all robots from going to any area of your site, your robots.txt file would read:
# Tells EVERY robot go away - access denied - do NOT look here or report/index anything
# User-agent: *             every robot
# Disallow: /               All relative URLs on a server start with '/', so this means the entire site is closed off
#
# To prevent only a specific Web crawler from crawling your site, list it by name in the User-agent line.
# For example, to prevent Google from spidering your site, you would write
# User-agent: Googlebot     Of course you must use the robot's EXACT name
# Disallow: /
#
# IMPORTANT CAVEATS
#
# The robots.txt file name is case-sensitive.
# If you create a file called Robots.txt or robots.TXT the spiders will ignore whatever it says.
# The robots.txt file has to be named robots.txt and located in the root of your Web server.
#
# This means that if you have a Web page like http://www.webhostingcompany.com/private/ that you wish to keep private,
# you will need to have your administrator add your disallows to the robots.txt in their root level directory.
#
# There is no way to "allow" a spider, you can only disallow.
# If you have one page in a group of 100 others that you want spidered, move it out of the disallowed directory.
# Or, you can explicitly name every file you want disallowed.
#
# === Examples with comments === Finish
# *************************************************************************
# Sample style to help you do it correctly - Above here
# *************************************************************************