Tuesday, 29 September 2015

What Is Robots.txt and How to Create It?

 What is Robots.txt

Robots.txt is a standard plain-text file used to communicate with web crawlers (also called spiders or bots). It tells them which areas of a website may or may not be crawled. Robots.txt is publicly available, so anyone can easily see which parts or URLs of a website crawlers are asked to skip. By default, search engine crawlers crawl everything they can reach.

To see the robots.txt of any website, such as http://abc.com, add /robots.txt after the domain name: http://abc.com/robots.txt.
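
As a small illustration, the robots.txt address can be derived from any page URL by keeping the scheme and host and replacing the path. The sketch below uses Python's standard urllib.parse module; the function name robots_txt_url is a made-up helper, and abc.com is the placeholder domain used above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://abc.com/blog/post.html?id=7"))
```

The result can then be opened in a browser or downloaded like any other URL.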





 How to create robots.txt?

There are many online tools for creating a robots.txt file, and you can also create the file manually. Before creating one, you need to understand a few rules, which are explained here.
In a robots.txt file, the User-agent: line identifies the web crawler and the Disallow: lines define which parts of the site that crawler may not crawl.
[1] Here is the basic robots.txt file
      User-agent: *
      Disallow: /
In the above declaration, “*” matches all crawlers/spiders/bots and “/” means all pages are disallowed. In other words, the whole site is disallowed for all crawlers.
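
To check how a parser reads these two lines, they can be fed to Python's standard urllib.robotparser module. This is only a sketch: can_crawl and AnyBot are hypothetical names chosen for the illustration, and real crawlers may interpret edge cases differently from this module.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_crawl(RULES, "AnyBot", "http://abc.com/index.html"))  # False
```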

[2] If you want to stop one specific web crawler or spider from crawling your site, the robots.txt file will be
      User-agent: Yahoobot
      Disallow: /
In the above example I used Yahoobot as an example name; replace it with the user-agent name of the crawler you do not want crawling your site.
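[3] If you want to disallow all crawlers from a specific folder or web page, then
       User-agent: *
       Disallow: /cgi-bin/
       Disallow: /abc.html
Each Disallow value is a path that starts with “/” and is matched against the beginning of the URL path.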
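
The matching behind these lines is a prefix comparison against the URL path, which is why each value should begin with “/”. Here is a deliberately simplified sketch of that idea; is_disallowed is a hypothetical helper, and it ignores wildcards and Allow lines that some crawlers also support.

```python
def is_disallowed(path, disallow_rules):
    """Simplified check: a path is blocked if it starts with any rule value.

    Empty rule values are skipped because an empty Disallow blocks nothing.
    """
    return any(bool(rule) and path.startswith(rule) for rule in disallow_rules)


rules = ["/cgi-bin/", "/abc.html"]
for path in ["/cgi-bin/form.cgi", "/abc.html", "/index.html"]:
    print(path, is_disallowed(path, rules))
```

Under this prefix logic a value such as cgi-bin, written without the leading slash, never matches, because every URL path begins with “/”.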

[4] If you want to disallow all crawlers from the whole site but allow one specific crawler, then
      User-agent: *
      Disallow: /

      User-agent: Googlebot
      Disallow: /cgi-bin
      Disallow: /abc.php
In the above example all crawlers are disallowed except Google's crawler, which is allowed to crawl the entire site apart from the abc.php page and the cgi-bin folder.
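
The effect of the two groups can be checked with Python's standard urllib.robotparser module. In this sketch OtherBot is a made-up crawler name standing in for any crawler other than Googlebot, and the URLs are placeholders on the abc.com example domain.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin
Disallow: /abc.php
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

checks = [
    ("Googlebot", "http://abc.com/index.html"),
    ("Googlebot", "http://abc.com/cgi-bin/form.cgi"),
    ("Googlebot", "http://abc.com/abc.php"),
    ("OtherBot", "http://abc.com/index.html"),
]
for agent, url in checks:
    # A crawler with its own group uses that group instead of the "*" group.
    print(agent, url, parser.can_fetch(agent, url))
```

Only the first check prints True; the other three print False.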

[5] In the next case, you can turn a Disallow rule into an allow rule by leaving the value after the colon (:) empty, without even a “/”.
      User-agent: *
      Disallow: /

      User-agent: Googlebot
      Disallow:
The above example means that all crawlers are disallowed from the entire site except Googlebot, which can crawl the entire site.
[6] Some crawlers now support an additional field known as “Allow”
      User-agent: *
      Disallow: /

      User-agent: Googlebot
      Allow: /
All crawlers are disallowed from the entire site except Googlebot.
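
Examples [5] and [6] are meant to give the same result for Googlebot. A quick way to compare them is the sketch below, which uses Python's standard urllib.robotparser module; googlebot_allowed is a hypothetical helper, the Allow rule is written here with an explicit “/” value, and other parsers may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(googlebot_rule):
    """Block every crawler, give Googlebot one rule line, and test a page."""
    lines = [
        "User-agent: *",
        "Disallow: /",
        "",
        "User-agent: Googlebot",
        googlebot_rule,
    ]
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch("Googlebot", "http://abc.com/page.html")


print(googlebot_allowed("Disallow:"))  # empty Disallow, as in example [5]
print(googlebot_allowed("Allow: /"))   # explicit Allow, as in example [6]
```

Both calls print True with this parser, while a rule of Disallow: / would print False.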

[7] A better way to keep a particular page out of search results completely is the robots noindex meta tag. If you also want crawlers not to follow the links on the page, add nofollow to the same meta tag, which goes in the head of the page. The page must remain crawlable for this to work, because a crawler that is blocked by robots.txt never sees the tag.
The meta tags would be:
<meta name="robots" content="noindex"> <-- the page is not indexed, but links may be followed
<meta name="robots" content="noindex,nofollow"> <-- the page is not indexed and the links are not followed
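
To see which directives a page actually declares, the meta tag can be read with Python's standard html.parser module. This is a minimal sketch: RobotsMetaParser and robots_directives are made-up names, and the sample page is a stand-in for a real document.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma separated, e.g. "noindex,nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots meta directives declared in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print(sorted(robots_directives(page)))
```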

Where to put Robots.txt

After creating the robots.txt file, the next question is where to put it.
Always upload the robots.txt file to the top-level directory, in other words the root directory of the website, so that it is reachable at /robots.txt.
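
The file that gets uploaded is plain text, so it can be typed by hand or generated. As a rough sketch of the second option, the hypothetical helper build_robots_txt below turns a list of (user agent, disallowed paths) pairs into the file contents; saving the result as robots.txt in the site root is left to whatever upload method your host provides.

```python
def build_robots_txt(groups):
    """Build robots.txt text from (user_agent, [disallowed paths]) pairs."""
    blocks = []
    for user_agent, disallowed in groups:
        lines = [f"User-agent: {user_agent}"]
        # An empty list becomes a bare "Disallow:", which blocks nothing.
        lines += [f"Disallow: {path}" for path in disallowed] or ["Disallow:"]
        blocks.append("\n".join(lines))
    # Separate groups with a blank line and end the file with a newline.
    return "\n\n".join(blocks) + "\n"


text = build_robots_txt([("*", ["/"]), ("Googlebot", ["/cgi-bin/", "/abc.php"])])
print(text)
```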
