What is Robots.txt
Robots.txt is a standard plain text file that tells web crawlers (also called spiders or bots) which parts of a website they may or may not crawl. The file is publicly available, so anyone can see which parts or URLs of a site crawlers are asked to skip. By default, search engine crawlers crawl everything they can reach. To see the robots.txt of any website, for example http://abc.com, add /robots.txt after the domain name: http://abc.com/robots.txt
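As a small illustration of the "add /robots.txt after the domain name" rule, here is a minimal Python sketch that derives the robots.txt address from any page URL. The helper name robots_url is made up for this example, and abc.com is the placeholder domain used above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the page path, query and fragment,
    # because robots.txt is looked up directly under the domain name.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://abc.com/blog/post.html?id=7"))  # http://abc.com/robots.txt
```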
How to create robots.txt?
There are many online tools for creating a robots.txt file, and you can also create the file manually. Before creating one, you need to understand a few rules, which are explained here. In a robots.txt file, the User-agent line identifies the web crawler, and each Disallow line defines a part of the site that this crawler must not crawl.
[1] Here is a basic robots.txt file:
User-agent: *
Disallow: /
In the declaration above, “*” stands for all crawlers, spiders, and bots, and “/” means that every page is disallowed. In other words, the whole site is disallowed for all crawlers.
[2] If you want to stop one specific crawler or spider from crawling your site, the robots.txt file will be:
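One way to check what a set of rules means is Python's standard urllib.robotparser module. The sketch below feeds it the basic file from [1] and asks whether a crawler may fetch a page. The helper is_allowed and the crawler name ExampleBot are invented for this illustration, and Python's parser is only one interpretation of the rules; real search engine crawlers may differ in details.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The basic file from [1]: every crawler, every path disallowed.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

print(is_allowed(BLOCK_ALL, "ExampleBot", "http://abc.com/index.html"))  # False
```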
User-agent: Yahoobot
Disallow: /
The example above uses Yahoobot; replace it with the user-agent name of the crawler that you do not want to crawl your site.
[3] If you want to disallow all crawlers from a specific folder or web page, then:
User-agent: *
Disallow: /cgi-bin/
Disallow: /abc.html
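Folder and page rules are matched against the start of the URL path, so they are normally written with a leading slash. The following sketch, again using Python's urllib.robotparser, checks a few paths against rules of this kind. The crawler name ExampleBot, the helper check, and the file names form.cgi and index.html are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules for all crawlers: one folder and one page are off limits.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /abc.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def check(path: str) -> bool:
    """Return True if the hypothetical ExampleBot may fetch this path."""
    return parser.can_fetch("ExampleBot", "http://abc.com" + path)


for path in ("/cgi-bin/form.cgi", "/abc.html", "/index.html"):
    print(path, "allowed" if check(path) else "blocked")
```

With these rules, only /index.html is reported as allowed.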
[4] If you want to disallow the whole site for all crawlers but allow one specific crawler, then:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin
Disallow: /abc.php
In the example above, all crawlers are disallowed except the Google crawler, which is allowed to crawl the entire site apart from the abc.php page and the cgi-bin folder.
[5] You can also turn a Disallow rule into an allow rule by leaving the value after the colon (:) empty, with no / or path.
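The way a crawler picks its own group of rules can be sketched with Python's urllib.robotparser. A crawler that finds a group with its own name follows that group and ignores the “*” group. OtherBot and the helper can_crawl are hypothetical names for this example, and the blank line between the two groups is added only for readability.

```python
from urllib.robotparser import RobotFileParser

# Everyone is blocked, except Googlebot, which has its own group.
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin
Disallow: /abc.php
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def can_crawl(user_agent: str, path: str) -> bool:
    """Ask the parsed rules whether user_agent may fetch the given path."""
    return parser.can_fetch(user_agent, "http://abc.com" + path)


print(can_crawl("Googlebot", "/index.html"))  # True: only two paths are closed to it
print(can_crawl("Googlebot", "/abc.php"))     # False
print(can_crawl("OtherBot", "/index.html"))   # False: falls back to the "*" group
```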
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
The example above disallows the entire site for all crawlers except Googlebot, which can crawl the entire site.
[6] Some crawlers now support an additional field known as “Allow”:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
All crawlers are disallowed from the entire site except Googlebot.
[7] A better way to keep a particular page out of search results completely is the robots noindex meta tag. If you also want crawlers not to follow the links on the page, add nofollow to the same meta tag in the head of the page.
The meta tags would be:
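To compare the empty Disallow line from [5] with an Allow line, here is a short sketch using Python's urllib.robotparser. It builds the same file twice, once with each rule for Googlebot, and checks the result. The helper names build_rules and googlebot_only_site, the crawler name OtherBot, and the page name are invented for this illustration; "Allow: /" is used as the explicit form, and other parsers may treat edge cases differently from Python's.

```python
from urllib.robotparser import RobotFileParser


def build_rules(googlebot_rule: str) -> str:
    """Block every crawler, then give Googlebot the single rule passed in."""
    return f"User-agent: *\nDisallow: /\n\nUser-agent: Googlebot\n{googlebot_rule}\n"


def googlebot_only_site(googlebot_rule: str) -> bool:
    """True if Googlebot may crawl a page while another crawler may not."""
    parser = RobotFileParser()
    parser.parse(build_rules(googlebot_rule).splitlines())
    url = "http://abc.com/index.html"
    return parser.can_fetch("Googlebot", url) and not parser.can_fetch("OtherBot", url)


print(googlebot_only_site("Disallow:"))  # True: empty Disallow blocks nothing
print(googlebot_only_site("Allow: /"))   # True: explicit Allow for the whole site
```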
<meta name="robots" content="noindex"> <-- the page is not indexed, but links may be followed
<meta name="robots" content="noindex,nofollow"> <-- the page is not indexed & the links are not followed
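To see which of these directives a page actually carries, the meta tag can be read with Python's standard html.parser module. This is only a rough sketch: the class RobotsMetaParser and the function robots_directives are names made up for this example, and it looks only at meta tags whose name is "robots".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # content is a comma-separated list such as "noindex,nofollow"
        for part in (attr.get("content") or "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```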
Where to put Robots.txt
After creating the robots.txt file, the next question is where to put it. Always upload robots.txt to the top-level directory, in other words the root directory of the website, so that it can be reached at an address like http://abc.com/robots.txt.