Robots.txt - General information
Robots.txt and SEO
Hotfixes and workarounds
Robots.txt for WordPress
Robots.txt - General information
Robots.txt is a text file located in a website’s root directory that specifies which website pages and files you want (or don’t want) search engine crawlers and spiders to visit. Usually, website owners want to be noticed by search engines; however, there are cases when it’s not needed, for instance, if you store sensitive data or if you want to save bandwidth by excluding heavy, image-rich pages from indexing.
Search engines index websites using their keywords and metadata in order to provide the most relevant results to Internet users looking for something online. Reaching the top of the search results is especially important for e-commerce shop owners, since customers rarely browse further than the first few pages of suggested matches.
For indexing purposes, so-called spiders or crawlers are used. These are bots that the search engine companies use to fetch and index the content of all the websites that are open to them.
When a crawler accesses a website, it first requests the file named /robots.txt. If such a file is found, the crawler checks it for website indexation instructions. A bot that does not find any directives follows its own algorithm and basically indexes everything. Not only does this overload the website with needless requests, but the indexing itself also becomes far less effective.
NOTE: There can be only one robots.txt file for the website. The robots.txt file for an addon domain name should be located in the corresponding document root. For example, if your domain name is www.domain.com, it should be found at https://www.domain.com/robots.txt.
It’s also very important that your robots.txt file is actually called robots.txt. The name is case sensitive, so make sure to get that right, or it won’t work.
Google's official stance on the robots.txt file
The robots.txt file consists of records, each containing two fields:
- A User-agent line naming the search engine crawler(s) the record applies to. Find the list with all user-agents’ names here
- Line(s) starting with the Disallow: directive to block indexing.
Robots.txt should be created in the UNIX text format. It’s possible to create such a .txt file directly in the File Manager in cPanel. More detailed instructions can be found here.
Basics of robots.txt syntax
Usually, the robots.txt file contains a code like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/
In this example, three directories: /cgi-bin/, /tmp/ and /~different/ are excluded from indexation.
PLEASE NOTE:
- Every directory is written on a separate line. You should not write all the directories on one line, nor should you break up one directive into several lines. Instead, use a new line to separate directives from each other.
- Star (*) in the User-agent field means “any web crawler.” Consequently, directives such as Disallow: *.gif or User-agent: Mozilla* are not supported. Pay attention to these logical mistakes as they are the most common ones.
- Another common mistake is an accidental typo: misspelled directories or user-agents, missing colons after User-agent and Disallow, etc. As your robots.txt file gets more complicated, it’s easier for an error to slip in, so there are some validation tools that come in handy.
Examples of usage
Here are some useful examples of robots.txt usage:
Example 1
Prevent the whole site from indexation by all web crawlers:
User-agent: *
Disallow: /
Fully blocking crawling in this way might be necessary when the website is under a heavy load of requests, or when its content is being updated and should not come up in the search results. Sometimes the settings of an SEO campaign are too aggressive, and the bots basically overload the website with requests to its pages.
Example 2
Allow all web crawlers to index the whole site:
User-agent: *
Disallow:
There is actually no need to crawl the whole website. It’s unlikely that visitors will be looking up terms of use or login pages via Google Search, for example. Excluding some pages or types of content from indexing is beneficial for the security, speed, and ranking relevance of the given website.
Below are examples of how to control what content is indexed on your website.
Example 1
Prevent only several directories from indexation:
User-agent: *
Disallow: /cgi-bin/
Example 2
Prevent a specific page from indexation:
User-agent: *
Disallow: /page_url
The page is usually specified not by its full URL but by the part of the path that follows http://www.yourdomain.com/. When such a rule is used, any page with a matching name is blocked from indexing. For example, both /page_url and /page_url_new will be excluded. In order to avoid this, the following code should be used:
User-agent: *
Disallow: /page_url$
Example 3
Prevent the website indexation by a specific web crawler:
User-agent: Bot1
Disallow: /
Keep in mind that bot names (user-agents) might change over time, so the list mentioned above may not always be accurate. When the load on the website is extremely high and it’s not possible to find out the exact bot overusing the resources, it’s better to block all of them temporarily.
Example 4
Allow indexation to a specific web crawler and prevent indexation by others:
User-agent: Opera 9
Disallow:
User-agent: *
Disallow: /
Example 5
Prevent all the files from indexation except a single one.
There is also the Allow: directive. However, it is not recognized by all the crawlers and might be ignored by a number of them. Currently, it’s supported by Bing and Google. The following example of the rule to allow only one file from a specific folder can be used at your own risk:
User-agent: *
Allow: /docs/file.jpeg
Disallow: /docs/
Instead, you can move all the files that should not be indexed into a certain subdirectory, keep the single file you want indexed outside of it, and prevent that subdirectory from indexation:
User-agent: *
Disallow: /docs/
This setup requires a specific website structure. It’s also possible to create a separate landing page that redirects to the actual home page. This way you can block the directory containing the website and allow only the landing index page. It’s better when such changes are performed by a website developer to avoid any issues.
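As a rough sketch of such a setup (the /site/ directory name here is only an illustration; your actual structure will differ), the website files could live in a blocked subdirectory while the landing index page stays in the root:
User-agent: *
Disallow: /site/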
You can also use an online robots.txt file generator here. Keep in mind that it performs a default setup that does not take into account the sophisticated structure of custom-coded websites.
Robots.txt and SEO
The default robots.txt file in some CMS versions is set up to exclude your images folder. This issue doesn’t occur in the latest CMS versions, but older versions should be checked.
This exclusion means your images will not be indexed and included in Google’s Image Search. Having images appear in search results is something you normally want, as it improves your SEO rankings. However, you need to look out for an issue called “hotlinking”: when someone reposts an image uploaded to your website elsewhere, your server gets loaded with the requests. To prevent hotlinking, read more in our corresponding Knowledgebase article.
If you would like to change this, open your robots.txt file and remove the line that says:
Disallow: /images/
If your website has a lot of private content, or if the media files are not stored permanently but uploaded and deleted daily, it’s better to exclude the images from the search results. The first case is a matter of personal privacy; the latter concerns the extra crawler load created by checking each new image again and again.
You can also point crawlers to your sitemap by adding the following line to robots.txt:
Sitemap: http://www.domain.com/sitemap.xml
Do not forget to replace the http://www.domain.com/sitemap.xml path with your actual information.
For guidelines on how to create sitemap.xml for your website, please refer here.
Miscellaneous remarks
- Don't block CSS, JavaScript, and other resource files by default. This prevents Googlebot from properly rendering the page and understanding that your site is mobile-optimized.
- You can also use the file to prevent specific pages from being indexed, such as login or 404 pages, but it is better to do this using the robots meta tag.
- Adding disallow statements to the robots.txt file does not remove content; it simply blocks access for spiders. If there is content that you want removed from the index, it’s better to use a meta noindex tag.
- As a rule, the robots.txt file should never be used to handle duplicate content. There are better ways, such as the rel=canonical tag, which is placed in the HTML head of a webpage.
- Always keep in mind that robots.txt should be accurate in order for your website to be indexed correctly by search engines.
Hotfixes and workarounds
Combining 'noindex' with link following
A noindex meta tag prevents the whole page from being indexed by a search engine. This might not be desirable, since you may still want the links on that page to be followed by bots for better results. To ensure this happens, you can edit your page header with the following line:
<meta name="robots" content="noindex, follow">
This line will prevent the page itself from being indexed by a search engine, but thanks to the follow part of the code, the links posted on this page will still be retrieved. This allows the spider to move around the website and its linked content. The benefit of this type of integration is called Link Juice: the connection between different pages and the relevance of their content to each other.
If nofollow is added, the crawler will stop when it reaches this page and will not move further to the interlinked content:
<meta name="robots" content="noindex, nofollow">
From an SEO perspective, this is not recommended but it’s up to you to decide.
Some pages might be removed from the website permanently and therefore no longer have any real value. Any outdated content should be removed from the robots.txt and .htaccess files; the latter might contain redirects for pages that are no longer relevant.
Simply blocking expired content is not effective. Instead, 301 redirects should be applied, either in the .htaccess file or via plugins. If there is no adequate replacement for the removed page, it can be redirected to the homepage.
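As a minimal sketch (the path and domain below are only placeholders; adjust them to your actual pages), such a redirect in the .htaccess file could look like this:
Redirect 301 /removed-page/ http://www.domain.com/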
It’s better to prevent pages with sensitive data on them from being indexed. The most common examples are:
- Login pages
- Administration area
- Personal accounts information
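A hedged example of such rules, assuming the login page and administration area live under /login/ and /admin/ (your actual paths may differ):
User-agent: *
Disallow: /login/
Disallow: /admin/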
To improve website security, please keep in mind the following:
- The fact that a URL appears in the search results does not mean that anyone without the credentials can access it. Still, you may want to have a custom administrative dashboard and login URLs that are known only to you.
- It’s recommended that you not only exclude certain folders but also protect them using passwords.
- If certain content on your website should be available to registered users only, make sure to apply these settings to the corresponding pages. Password-only access can be set up as described here. Examples are websites with premium membership, where certain pages and articles are available only after logging in.
- The robots.txt file and its content can be viewed by anyone online. This is why it’s advisable to avoid including any names or data that might give away unwanted information about your business.
For example, if you have pages for your colleagues, each residing in a separate folder, and you want to exclude them from the search results, they should not be named "johndoe", "janedoe", etc. Disallowing these folder names would basically publicize your colleagues’ names. Instead, you can create a folder called “profiles” and place all the personal accounts there. The URL in the browser would be https://yourdomain.com/profiles/johndoe, and the robots.txt rule will look like this:
User-agent: *
Disallow: /profiles/
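As mentioned above, disallowed folders can additionally be password-protected. A minimal sketch using Apache basic authentication, assuming a hypothetical .htpasswd file located at /home/username/.htpasswd (your path will differ); these lines would go into an .htaccess file inside the protected folder:
AuthType Basic
AuthName "Restricted area"
AuthUserFile /home/username/.htpasswd
Require valid-user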
Not only as a security measure, but also to save your hosting resources, you might want to exclude content that is irrelevant to your website visitors from the search results. For example, this might be theme and background images, buttons, seasonal banners, etc. However, using the Disallow directive for a whole /theme directory is not advised.
This is why it’s advised that you implement the theme and layout fully through CSS instead of inserting backgrounds via an HTML tag, for example. Hiding a specific style folder might cause issues with crawlers fetching the content and properly presenting it to users in the respective search results.
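As a simple illustration of this approach (the file path below is hypothetical), a background defined in a stylesheet rather than in the HTML markup could look like this:
body {
    background-image: url("/images/theme-background.jpg");
}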
Some search engines are too eager to look for content with the slightest update. They do it too often and create a heavy load on the website. Nobody wants to see their pages loading slowly because of hungry crawlers, but blocking them completely every time might be too extreme. Instead, it’s possible to slow them down by using the Crawl-delay directive:
User-agent: *
Crawl-delay: 10
In this case, search bots are asked to wait 10 seconds between requests.
Robots.txt for WordPress
WordPress creates a virtual robots.txt file once you publish your first post. However, if you already have a real robots.txt file on your server, WordPress won’t add the virtual one.
The virtual robots.txt doesn’t exist on the server, and you can only access it via the following link:
http://www.yoursite.com/robots.txt
By default, it will have Google’s Mediabot allowed, a bunch of spambots disallowed, and some standard WordPress folders and files disallowed.
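As a rough illustration only (the exact contents vary between WordPress versions), the virtual file may look something like this:
User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/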
So, in case you haven't created a real robots.txt yet, create one with any text editor and upload it to the root directory of your server via FTP.
As a best practice, you can also use one of the many SEO plugins on offer. For the most up-to-date and trustworthy plugins, check out WordPress’ official SEO guide.
Blocking main WordPress directories
There are three standard directories in every WordPress installation (wp-content, wp-admin, wp-includes) that don’t require indexing.
Don’t disallow the whole wp-content folder, though, as it contains an 'uploads' subfolder with your site’s media files that you don’t want to be blocked. That’s why you need to proceed as follows:
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Blocking on the basis of your site structure
Every blog can be structured in various ways:
a) On the basis of categories
b) On the basis of tags
c) On the basis of both or none of those
d) On the basis of date-based archives
a) If your site is category-structured, you don’t need to have the tag archives indexed. Find your tag base on the Permalinks options page under the Settings menu. If the field is left blank, the tag base is simply 'tag':
Disallow: /tag/
b) If your site is tag-structured, you need to block the category archives. Find your category base and use the following directive:
Disallow: /category/
c) If you use both categories and tags, you don’t need to use any directives. In case you use none of them, you need to block both of them:
Disallow: /tag/
Disallow: /category/
d) If your site is structured on the basis of date-based archives, you can block those in the following ways:
Disallow: /2010/
Disallow: /2011/
Disallow: /2012/
Disallow: /2013/
PLEASE NOTE: You can’t use Disallow: /20*/ here as such a directive will block every single blog post or page that starts with the number '20'.
Duplicate content issues in WordPress
By default, WordPress has duplicate pages which do no good to your SEO rankings. To fix this, we would advise you not to use robots.txt but instead to go with a subtler option: the rel="canonical" tag, which you use to place the only correct canonical URL in the <head> section of your site. This way, web crawlers will only crawl the canonical version of a page.
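A minimal sketch of such a tag (the URL below is just a placeholder), placed inside the page's <head> section:
<link rel="canonical" href="https://www.domain.com/preferred-page/">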
A more detailed description from Google about what a canonical tag is and why you should be using it can be found here.
That's it!
Need any help? Contact our Helpdesk