What is a robots.txt file?


A robots.txt file is a simple text file that can be created with any plain-text editor, such as Notepad. It gives search engines instructions on how to crawl your website. It is supported by all major search engines.

A robots.txt file contains allow and disallow statements that define which sections of a website search engines should or shouldn’t crawl. An XML Sitemap declaration can also be added to give search engines an additional signal about your XML Sitemaps or Sitemap Index file.

Make sure the "robots.txt" file is placed in the root directory of your website, so that it can be accessed directly under your domain, for example at http://www.example.com/robots.txt.
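
Because the file always lives at the root of the site, its address can be derived from any page URL on the same site. The following is a minimal Python sketch of that idea; the function name robots_txt_url and the page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL on the example domain.
print(robots_txt_url("http://www.example.com/category/shoes.html?page=2"))
# prints http://www.example.com/robots.txt
```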


User agent

A user agent is the name by which browsers and search engine bots identify themselves to web servers. For example, googlebot is the user agent of Google's web crawler. By using user-agent statements, we can provide specific allow and disallow statements to a particular search engine. It’s important to note that if you add a section for a specific user agent, such as a googlebot section, googlebot will ignore all other sections in the robots.txt file.

User-agent: *
Disallow: /user_login
Disallow: /cart/
Allow: /index.html

# googlebot section

User-agent: googlebot
Disallow: /catalogs
Disallow: /cart/
Allow: /index.html

There are two user-agent sections defined in the above example. The first section applies to all search engines, which is indicated by the wildcard "*" (a plain wildcard, not a regular expression). The second section applies only to googlebot. In this case, googlebot ignores every section except the googlebot section.
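
One way to see this section selection in action is Python's standard urllib.robotparser module. This is only a sketch using one particular parser, not how any search engine is implemented internally, and "otherbot" is a made-up crawler name standing in for any bot without its own section.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /user_login
Disallow: /cart/
Allow: /index.html

# googlebot section
User-agent: googlebot
Disallow: /catalogs
Disallow: /cart/
Allow: /index.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# googlebot follows only its own section, so /user_login is not blocked for it.
print(parser.can_fetch("googlebot", "/user_login"))  # True
print(parser.can_fetch("googlebot", "/catalogs"))    # False

# "otherbot" (hypothetical) has no section of its own and falls back to "*".
print(parser.can_fetch("otherbot", "/user_login"))   # False
print(parser.can_fetch("otherbot", "/catalogs"))     # True
```

Note that /user_login is blocked only for bots using the "*" section. If googlebot should stay away from it too, the rule has to be repeated inside the googlebot section.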

Disallow Statements

Disallow statements are used to tell search engines not to crawl certain parts of your website.

User-agent: *
Disallow: /cart/
Disallow: /user_login/
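
A disallow path is matched as a prefix of the URL path, so the trailing slash matters. The sketch below checks a few made-up paths against the rules above with Python's standard urllib.robotparser; it illustrates prefix matching with one parser and is not a statement about any specific search engine.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /cart/",
    "Disallow: /user_login/",
])

# Hypothetical paths; only those starting with a disallowed prefix are blocked.
for path in ["/cart/checkout", "/cart", "/cartoon", "/user_login/reset", "/user_login"]:
    print(path, parser.can_fetch("*", path))
```

Here /cart/checkout and /user_login/reset are blocked, while /cart, /cartoon and /user_login remain crawlable because they do not start with "/cart/" or "/user_login/".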

Allow Statements

Allow statements are used to tell search engines that they may crawl certain parts of your website.

User-agent: *
Allow: /index.html
Allow: /category/

If a folder is covered by a disallow statement, you can still write an allow statement for a specific file in that folder. With allow and disallow statements, the more specific statement wins, so search bots will respect the allow statement even though the file is covered by the disallow.

Disallow: /folder1/
Allow: /folder1/demo.html
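
The "more specific statement wins" rule can be sketched as a longest-prefix match. The helper below is a simplified illustration written by hand: the name is_allowed and the rule format are assumptions, wildcards such as "*" and "$" are not handled, and a tie between equally long rules is resolved in favor of allow. It is written by hand because Python's standard urllib.robotparser applies the first matching rule in file order instead.

```python
from typing import List, Tuple

# (directive, path prefix), e.g. ("disallow", "/folder1/")
Rule = Tuple[str, str]


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Return True if path may be crawled under longest-match precedence."""
    best_length = -1
    allowed = True  # no matching rule means crawling is allowed
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on a tie, allow wins.
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


rules = [("disallow", "/folder1/"), ("allow", "/folder1/demo.html")]
print(is_allowed(rules, "/folder1/demo.html"))   # True
print(is_allowed(rules, "/folder1/other.html"))  # False
print(is_allowed(rules, "/index.html"))          # True
```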

XML Sitemap Declarations

This is an additional feature of the "robots.txt" file. Since search engine bots start crawling a site by checking the robots.txt file, it gives you an opportunity to notify them of your XML Sitemap(s).

Sitemap: http://www.example.com/sitemap-categories.xml
Sitemap: http://www.example.com/sitemap-blogposts.xml
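
To check which sitemaps a robots.txt file declares, the declarations can be read back programmatically. The sketch below uses the site_maps() method of Python's standard urllib.robotparser, which assumes Python 3.8 or newer; the Disallow rule is included only to make the sample file look realistic.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
    "",
    "Sitemap: http://www.example.com/sitemap-categories.xml",
    "Sitemap: http://www.example.com/sitemap-blogposts.xml",
])

# site_maps() returns the declared sitemap URLs, or None if there are none.
for url in parser.site_maps() or []:
    print(url)
```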

Here is a sample robots.txt file:

User-agent: *
Disallow: /search/
Disallow: /private/

Sitemap: http://www.example.com/sitemap.xml
Sitemap: http://www.example.com/sitemap1.xml