4 Dec 2015
robots.txt

Stop setting up the robots.txt file badly

Google recommends the use of tools such as the robots.txt file or the sitemap (in XML format, for example) to obtain better crawling of the information that makes up our websites. In theory this simplifies its life considerably, because these files “guide” its spiders or crawlers to the relevant content in less time than if they only had to follow links or work out on their own which content should not be indexed. So far so good.

In practice, the truth is that it is not unusual to find Google riding roughshod over the contents of a robots.txt file, indexing pages and files it covers, in the same way that it ignores, in a seemingly random fashion, the robots meta tags we include in the code of our pages. Not always, but much more often than it should (which should be never). That said, we have to settle for the tools we have and learn to use them better.

On occasion the robots file is also set up incorrectly because we take certain things for granted. Hopefully this article will clear up some important concepts that often affect the treatment Google gives to our sites. We will start with the simplest…

Basic setup of robots.txt

The famous “robots” is a file in .txt format that we can create and edit with a simple text editor (Notepad, WordPad). Once created, we have to give it this specific name (robots.txt) and upload it to the root directory of the website, as this is the only place where search engines will find it without problems. The basic setup of the robots file is fairly simple: it just takes two parameters, the robot we are addressing and an instruction giving or removing permission to index something (a file, a folder, everything, nothing).

In other words, this is about telling each bot (a kind of crawler, or spider, that Google has teeming around the millions of existing web pages in order to keep its content up to date and show it correctly in the search results) what information it expressly CAN (or CANNOT) access, read and index. It goes without saying that, by default, if you don’t say anything to Google it will devour all the information in its path and include it in its mammoth index of web content.

A simple and basic example of the content of a robots.txt file would be as follows:

User-agent: *
Disallow: /

As can be seen, first the “user-agent”, or bot being addressed by the command, is defined (be it Google web search, Google Images, Bing…) and then the pages to be ignored (with the Disallow command) or explicitly indexed (Allow). In this case it states that all the robots (the asterisk) should ignore, i.e. not index, all the pages (the forward slash on its own, with nothing after it, indicates that the rule applies to the entire root directory).

User-agent: Bing
Disallow: /

With this example we are telling Bing’s robot not to index anything on our site. By omission, the rest of the robots, including Google’s, can crawl and index all the content on our site.

User-agent: Bing
Disallow: /documents/

One step further: we have told Bing not to index the “documents” folder. In this case Google will index all the content on the site by default, while Bing will index everything except the content of the “documents” folder. This makes it clear that to define a folder in the robots file, its name must be followed by an additional forward slash (after the initial forward slash that is always there).

User-agent: *
Disallow: /page1.html
Disallow: /page2.php
Disallow: /documents/page3.html

With this robots.txt we are telling all the robots not to index three specific pages: “page1.html”, “page2.php” and “page3.html”, the last of which is located in the “documents” folder.

User-agent: Googlebot
Allow:

We have introduced the “Allow” command, whose function is the opposite of “Disallow”, but we have removed the forward slash “/” after it, so in theory we are saying that NOTHING can be indexed. At first sight it seems that this should work in exactly the same way as “Disallow: /”, blocking the indexing of the website completely, but in reality that is not the case.

This is one of the easiest errors to make when setting up the robots file, because there are certain important considerations to take into account with the “Allow” command, which is quite tricky.

  • It only makes sense when accompanied by a “Disallow”: it is a non-restrictive command, so when it is “alone” it has no effect. That is to say, by default Google can index everything, so merely saying that it CAN index changes nothing: it will continue to index everything. Even in the case of the previous example, where we are telling it that it CAN index NOTHING, it will ignore the order and continue to index everything. Careful with this.
  • In theory the rules are applied in order, beginning with the first, so the “Allow” rules, being exceptions to the “Disallow” rules, should go first (see the sketch after this list). In practice, though, the main search engines will interpret them correctly even if you don’t do it this way.
  • The “Allow” command is not an official part of the standard; even though Google and the other “big ones” support it perfectly, for some robots it could be problematic.
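
A minimal sketch of that ordering, assuming a hypothetical “/documents/” folder in which only one public page should remain indexable (the exception goes first so that even the clumsier robots read it before the broader block):

User-agent: *
# “/documents/public.html” is a hypothetical example path
Allow: /documents/public.html
Disallow: /documents/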

Frequently Asked Questions on the robots file: advanced setup

It all seems easy up to now, but over time, with continuous tinkering, you are going to need new configurations and the doubts start to arise: what exactly can you do with the robots file? Here are a few questions I have had to work through, either first-hand or via the frequently asked questions posed by my online marketing colleagues:

What happens when the page ends in a forward slash “/”?

It often happens, especially on websites built with platforms such as WordPress, that our site has a page with a URL structure like “mydomain.com/services/”. In this case the URL is a page where all the services of the company are shown, while underneath there may be pages of the type “mydomain.com/services/service-name1” or similar. But how do we tell the robots to exclude only the top page, without blocking all those below it in the process? The temptation would be:

User-agent: *
Disallow: /services/

But as we have said, in this case the agent will understand that the order affects the whole of this folder or directory. And that’s not what we want! To tell the robots that we are only referring to this specific page, we have to use the dollar operator (“$”), which is used to specify the end of the URL. So:

User-agent: *
Disallow: /services/$

In this way we tell the robot not to index only the URLs that end in exactly this way, which is the one URL of this type we want to keep out of the index.
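
To make it concrete, here is the same rule annotated with the URLs from that example (assuming a crawler that, like Google’s, supports the “$” operator):

User-agent: *
Disallow: /services/$
# Blocked: /services/ (the exact URL, nothing more)
# Still crawlable: /services/service-name1 and anything else below /services/

And that takes us to the wildcard operators of robots.txt…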

Uses of operators in robots.txt (dollar and asterisk)

Although the previous example serves to explain the use of the “$” in the robots file, the truth is that it should be used along with the asterisk “*” to get the most out of it. The asterisk acts as a wildcard: it stands for “whatever could appear in my place”. Best to look at an example.

User-agent: *
Disallow: /*.htm$

We have already explained that the dollar sign is used to say that the URL ends there, that nothing else can come after the path to which we want to apply the “Allow” or the “Disallow”.

In the case of the asterisk, we are saying that it can be replaced with anything at all, as long as it is followed by “.htm”. That is to say, there can be several levels of folders in between (for example “/folder/subfolder/page.htm” would also be excluded).

So in this example we are telling all the robots not to index any .htm file while, thanks to the dollar sign, we let them index all files with the .html extension (those URLs do not end exactly in “.htm”).
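
As a quick sketch, here is the same rule annotated with a few hypothetical URLs (the paths are invented purely for illustration):

User-agent: *
Disallow: /*.htm$
# Blocked: /index.htm, /folder/subfolder/page.htm
# Still crawlable: /index.html, /folder/page.html

This takes us to another recurring question…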

How to avoid the indexation of URLs with parameters?

Our CMS often generates routes with parameters, such as “mydomain.com/index.php?user=1”, which we do not want indexed so as not to incur duplicate content. Following the previous pattern, and knowing that the parameters are preceded by a question mark, something like this would have to be applied:

User-agent: *
Disallow: /*?

So we are telling them not to index anything that starts with “whatever” and then has a question mark, followed by whatever else. I am sure that at this last step someone will have been tempted to write “Disallow: /*?*” to make sure that the parameters after the question mark are also covered. There is no need: these patterns assume by default that anything can follow what we specify. That is why, when we say “Disallow: /services/”, the robot understands that everything underneath (e.g. /services/audit) will not be indexed either, because it matches the defined pattern. But be careful, this default is very dangerous! An example follows:

What happens when the page URL doesn’t have an extension (e.g.: it doesn’t end in “.html”)?

Let’s say there is a page that we don’t want indexed whose URL is exactly “www.mydomain.com/service”. Here we might fall into what is possibly the biggest error committed in the use of the robots.txt file in the world! :O

User-agent: *
Disallow: /service

Some smart guy will say: “Not like that, that restricts the whole ‘service’ folder.”

Well, not exactly that either. In reality, as we explained previously, the robot is going to understand that anything can come after this, which means it is going to exclude pages like:

  • /service
  • /services
  • /service-audit
  • /service-consultancy/
  • /service-consultancy/digital.html
  • /services/seo/yandex.php
  • etc.

So, how do I exclude this page that doesn’t have an extension? Like this:

User-agent: *
Disallow: /service$

In this way we define where the URL ends and avoid a major problem that is often overlooked when creating robots files.
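
Sticking with the same hypothetical “/service” URL, this is what the rule above does and does not catch (again assuming a crawler that supports the dollar sign):

User-agent: *
Disallow: /service$
# Blocked: /service (this exact URL only)
# Still crawlable: /services, /service-audit, /service/anything.html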

Do you have to put a forward slash “/” after the name of the folder? What happens if I don’t put it?

This has been explained in the previous point: if you don’t put a forward slash, the robots will exclude everything that starts that way, whether or not it corresponds specifically to that subdirectory.
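
Shown side by side purely for comparison (the two alternatives would not normally live together in one file), using the same hypothetical “/service” path:

User-agent: *
# Without the trailing slash: blocks /service, /services, /service-audit, /service/page.html…
Disallow: /service
# With the trailing slash: blocks the /service/ directory and everything inside it, but not /services or /service-audit
Disallow: /service/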

Can Disallow and Allow commands be included in the same robots?

They can. In fact, combining them can be a way to define more precisely what should and should not be indexed within a certain folder (or across the whole site).

An example…

User-agent: *
Allow: /services/$
Disallow: /services/

In this way we are saying that the general services page IS to be indexed (“mydomain.com/services/”) but that the pages below it with the specific services are NOT (“mydomain.com/services/audit”, “mydomain.com/services/consultancy”, etc.).

Ideally the “Allow”, which is not restrictive (by default everything is understood to be indexable), should come first, with the “Disallow” added afterwards. This makes things easier for the more “clumsy” robots.

How should upper and lower case be treated?

Bear in mind that upper and lower case letters are distinguished, so you cannot simply use lower case for everything. That is to say, a rule such as “Disallow: /page.html” would still allow the page “mydomain.com/Page.html” to be indexed.
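
A minimal sketch of that behaviour (the page names are hypothetical):

User-agent: *
Disallow: /page.html
# Blocked: /page.html
# Still crawlable: /Page.html, /PAGE.HTML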

How to set up robots.txt for WordPress?

Although WordPress is a widely used platform, and Google understands better and better what it should and should not index, in practice things slip through that tarnish the quality of the information indexed from our site. Given that the structure of WordPress is common across installations, it is possible to define a template robots file for WordPress covering the folders where the search engine should not poke its nose. Bear in mind that these are minimums: as soon as we make use of themes, plugins and customisations, there will be new folders that we will have to “restrict”.

User-agent: *
Disallow: /wp-content/ 
Disallow: /wp-includes/ 
Disallow: /trackback/ 
Disallow: /wp-admin/ 
Disallow: /archives/ 
Disallow: /category/ 
Disallow: /tag/ 
Disallow: /wp-* 
Disallow: /login/ 
Disallow: /*.js$ 
Disallow: /*.inc$ 
Disallow: /*.css$ 
Disallow: /*.php$

This excludes the possible indexation of system folders and of files with extensions that are of no interest. Handle with care.

Other considerations: meta robots and sitemap

Remember that, in addition to robots.txt, it is possible to specify whether or not a page should be indexed through the “robots” meta tag, which can be included individually on each page of the site. It is simply a case of including something like this in the <head> of each page concerned:

<meta name="robots" content="noindex">

By default everything is indexable, so the tag makes more sense when we use the “noindex” value, even though “index” can also be specified.

Regarding the sitemap, two considerations:

  • It is possible to indicate in the robots.txt file the route where the sitemap(s) of the site can be found; it is simply a question of adding a line such as “Sitemap: http://www.mydomain.com/sitemap.xml” (or wherever your sitemap actually lives), as in the snippet after this list.
  • Including a sitemap for our site is not restrictive; that is to say, Google is going to index everything it can, regardless of whether or not it is in your sitemap. With it we are only helping it to find the pages; the means for keeping pages out of the index are the ones described above.
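
A minimal sketch of how that line fits into an existing robots.txt (the “/private/” folder is a hypothetical example; the sitemap URL is the one used above):

User-agent: *
# “/private/” is a hypothetical folder we do not want crawled
Disallow: /private/

Sitemap: http://www.mydomain.com/sitemap.xml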

 

Google bots

As hinted at before, there are different bots / robots / crawlers / spiders that spend their time roaming the web of webs gulping down information like crazy. To be practical, and given that Google accounts for around 97% of searches in our country, we are going to detail the different Google bots and what they are for:

  • Googlebot: Google’s “general” bot, so restricting it restricts all the others. That is to say, if we limit ourselves to restricting Googlebot we will also be restricting Googlebot-News, Googlebot-Image, Googlebot-Video and Googlebot-Mobile.
  • Googlebot-News: Lets us restrict access to the pages or posts intended for indexation in Google News. As above, restricting Googlebot means that you will appear neither in Google web search nor in Google News. If we only want to appear in Google News we would have to define something like:
    User-agent: Googlebot
    Disallow: /
    User-agent: Googlebot-News
    Disallow:
  • Googlebot-Image: Used to block access to folders containing images that we do not want indexed. Example:
    User-agent: Googlebot-Image
    Disallow: /summer-photos/
  • Googlebot-Video: The same as the previous one, but applied to restricting the indexing of videos.
  • Googlebot-Mobile: Although there is some controversy and mysticism around it, it supposedly manages the content indexable for searches made from mobile devices. Supposedly.
  • Mediapartners-Google: Controls the pages that Google’s advertising crawler can read in order to display ads on them, without affecting their indexability. For example, with this robots.txt, AdSense adverts could still be displayed on our website even though it would not be indexed by Google:
    User-agent: Googlebot
    Disallow: /
    User-agent: Mediapartners-Google
    Disallow:
  • Adsbot-Google: Manages access for the AdWords robot in charge of assessing the quality of landing pages.

 

As important as achieving a high indexing rate, with Google’s robots frequently visiting our most recently updated pages, is making sure that private information, content that could be considered duplicate and other such material is not indexed. So take the trouble to set up the robots.txt file and, if necessary, the “meta robots” tags, communicate with Google through Webmaster Tools, and do everything else that is in our hands. After that, all that’s left to do is pray! XD