Robots.txt - The Ultimate Guide

12/14/2023 by SEO Admin in Search Engine Optimization


What is Robots.txt?

Robots.txt is a plain text file that tells bot crawlers which pages they may or may not crawl and index. It controls crawler access to certain areas of your site and is used to manage web crawler activity so crawlers don’t overwork your website or index pages not meant for public view. A bot crawler’s first objective is to find and read the robots.txt file before accessing your sitemap or any pages or folders; crawlers will always look for your robots.txt file in the root of your website.

With robots.txt, you can more specifically:                

  •  Direct how search engine bots crawl your site

  • Allow certain access

  • Help search engine spiders index the content of the page

  • Show how to serve content to users

Robots.txt is part of the Robots Exclusion Protocol (REP), which consists of site-, page-, and URL-level directives. While search engine bots can still crawl your entire site, it’s up to you to help them decide whether certain pages are worth the time and effort.

Why do you Need Robots.txt?

Your site does not need a robots.txt file in order to work properly, and bots will still crawl your site without one. The main reason to have a robots.txt file is that it gives crawlers explicit instructions about what they may crawl when they retrieve information about your pages for indexing. A website without a robots.txt file is effectively telling bot crawlers to index the site as they see fit.

The location of your robots.txt file is also important because all bots will look for www.yoursite.com/robots.txt. If they don’t find anything there, they will assume that the site does not have a robots.txt file and index everything. The file must be an ASCII or UTF-8 text file. It is also important to note that rules are case-sensitive.
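As a quick illustration of how case sensitivity plays out (the folder name here is purely hypothetical), a rule such as

Disallow: /Photos/

would not block a folder that is actually served at /photos/, because paths are matched exactly as written.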

robots.txt must-knows

  • The file is able to control access of crawlers to certain areas of your website. You need to be very careful when setting up robots.txt as it is possible to block the entire website from being indexed.

  • It helps keep duplicate content and non-public pages from being crawled and appearing in search engine results.

  • It can specify a crawl delay to prevent your servers from being overloaded when crawlers load multiple pieces of content at once.

  • It can exclude certain files, such as PDFs, videos, and images, from search results, either to keep them private or to have Google focus on more important content.

  • It can specify the location of your sitemap(s).

After arriving at a website (for example, www.123.com), the search crawler will look for a robots.txt file at www.123.com/robots.txt. If it finds one, the crawler will read that file first before continuing through the site.

If the bot finds:

User-agent: *

Disallow: /

The example above instructs all search engine bots (User-agent: *) not to crawl or index any part of the website (Disallow: /).

If you removed the forward slash from Disallow, like in the example below,

User-agent: *

Disallow:

The bots would be able to crawl and index everything on the website. This is why it is important to understand the syntax of robots.txt.

robots.txt Syntax

Robots.txt syntax can be thought of as the “language” of robots.txt files. There are 5 common terms you’re likely to come across in a robots.txt file. They are:

  • User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). A list of most user agents can be found here.

  • Disallow: The command used to tell a user agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.

  • Allow (Only applicable for Googlebot): The command tells Googlebot that it can access a page or subfolder even though its parent page or subfolder may be disallowed.

  • Crawl-delay: The number of seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.

  • Sitemap: Used to call out the location of any XML sitemap(s) associated with a URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
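Putting these terms together, a complete robots.txt file might look like the sketch below. The /private/ folder, the PDF path, and the sitemap URL are placeholders for illustration only:

User-agent: *
Disallow: /private/
Allow: /private/annual-report.pdf
Crawl-delay: 10

Sitemap: https://www.yoursite.com/sitemap.xml

Here, all crawlers are asked to stay out of /private/, the Allow line carves out a single file inside that folder (honored by Googlebot), crawlers that respect Crawl-delay wait 10 seconds between requests, and the Sitemap line points bots to the XML sitemap.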

Robots.txt instruction outcomes

You can expect one of three outcomes when you issue robots.txt instructions:

  • Full allow

  • Full disallow

  • Conditional allow

Let’s investigate each below.

Full allow

This outcome means that all content on your website may be crawled. Since robots.txt files exist to block crawling by search engine bots, allowing everything is a decision worth making deliberately. It is also the default outcome when you don’t have a robots.txt file at all: search engine robots will have full access to your site and index anything they find on the website.

 You can put this into your robots.txt file to allow all:

User-agent: *

Disallow:

Full disallow

With this outcome, no content on the website will be crawled or indexed; every compliant search engine bot is instructed to stay away from the entire site.

This is the code you should put in your robots.txt to disallow all:

User-agent: *

Disallow: /

When we talk about no content, we mean that nothing from the website (content, pages, etc.) can be crawled. This is never a good idea.

Conditional Allow

This means that only certain content on the website can be crawled.

A conditional allow has this format:

User-agent: *

Disallow: /

User-agent: Mediapartners-Google

Allow: /

 Find the full robots.txt syntax here.

Note that blocked pages can still be indexed even if you disallowed the URL.

If a disallowed URL is linked from other sites (for example, in the anchor text of links), it can still get indexed. The solution is to 1) password-protect the files on your server, 2) use the noindex meta tag, or 3) remove the page entirely.

Some of the best SEO practices when using robots.txt

  • Place the robots.txt file in your website’s root directory so it can be found.

  • The file name is case sensitive: it must be named “robots.txt” (no other variations).

  • Make sure all important pages are crawlable, and that content that won’t provide any real value if found in search is blocked.

  • Google Search Console has warned site owners not to block CSS and JS files. If you block CSS and JavaScript files in your robots.txt file, Google can’t render your website as intended and therefore can’t fully understand it, which might result in lower rankings.

  • If you have changed the robots.txt file and you want Google to update it more quickly, submit it directly to Google. For instructions on how to do that, click here. It is important to note that search engines cache robots.txt content and update the cached content at least once a day.

  • Don’t rely on robots.txt as a way to keep pages out of search results. Use it to block the sections of your website that are not meant for the public, for instance, login pages like wp-admin.

  • Ensure that you are not blocking any content or sections of your site that you want crawled.

  • Don’t use robots.txt to try to keep sensitive data, such as private user information, out of search engine results. Other pages may still link to the blocked page, which can cause it to be indexed and the robots.txt directive to be bypassed. If you want to keep a page out of search results, use another method such as password protection or a noindex meta tag.

  • Wildcards (*) can be used not only to apply directives to all user-agents but also to match URL patterns when declaring directives. For example, if you wanted to prevent search engines from accessing parameterized product-category URLs on your site, you could block them like this:
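(The /products/ prefix below is a placeholder; substitute the actual path your category pages use.)

User-agent: *
Disallow: /products/*?

This tells all crawlers to skip any URL under /products/ that contains a query string, such as a hypothetical /products/shoes?sort=price, while leaving the clean category URLs crawlable.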

Robots.txt: Basic Guidelines

Format and Location

Use any text editor to create your robots.txt file, as long as it can save standard ASCII or UTF-8 text files. Avoid using a word processor, since it may add characters that affect crawling.

While almost any text editor can be used to create your robots.txt file, this tool is highly recommended as it allows for testing against your site.

  • You must name the file that you create “robots.txt” because the file is case sensitive. No uppercase characters are used.

  • You can only have one robots.txt file on the entire site.

  • The robots.txt file is hosted on your server, just like any other file on your website. You can view any site’s robots.txt file by appending /robots.txt to its domain, for example: https://sanseotools.com/robots.txt

How to Optimize robots.txt for SEO?

To optimize your robots.txt file for SEO, use it sparingly and strategically to block pages that are low quality or duplicate content, or to block sections of your website not meant for the public, for instance, login pages like wp-admin. Don’t rely on it as a way to keep pages out of search results.

What are some of the pages that you may want to exclude from being indexed?

  • Unintentional duplicate content: it’s generally a good idea to avoid indexing multiple versions of the same content, such as printer-friendly versions or slight variations of a page. You can use robots.txt to block crawling of the printer-friendly version of identical content.

  • Thank-you pages: These are typically displayed after a user submits a form or makes a purchase, and they are not meant for indexing.

To block such a page, specify the path of the thank-you page after the slash and close with another slash. For instance:

User-agent: *

Disallow: /page/thank-you/

  • Login and checkout pages: These pages are typically not relevant to search users, and they can contain sensitive information that should not be indexed (see the sketch below).
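A minimal sketch of how this could look, assuming your login and checkout pages live at /login/ and /checkout/ (adjust the paths to match your site):

User-agent: *
Disallow: /login/
Disallow: /checkout/

Keep in mind this only discourages crawling; if these pages must never appear in search results, combine it with password protection or a noindex meta tag.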

NoIndex and NoFollow

Using robots.txt is not a 100% guarantee that your page will not get indexed. Let’s look at two ways to help ensure that a blocked page is indeed not indexed.

The NoIndex Directive

This directive offered the benefit of eliminating snippetless, title-less listings from the search results, but it was limited to Google, and Google announced in 2019 that it no longer supports noindex rules in robots.txt, so it should not be relied on today. Its syntax exactly mirrored Disallow. In the words of Matt Cutts:

“Google allows a NOINDEX directive in robots.txt and it will completely remove all matching site URLs from Google. (That behavior could change based on this policy discussion, of course, which is why we haven’t talked about it much.)”

It was used in conjunction with the Disallow command, with both lines appearing in the directive, as in:

Disallow: /thank-you/

Noindex: /thank-you/

The Nofollow directive

A page carrying the noindex directive will not go into the search engine’s index and therefore cannot be shown in search results. The nofollow directive, by contrast, tells search engines not to follow the links on a page; it does not keep the page itself out of the index.

To apply the nofollow directive, open the source code of the specific page whose links you don’t want followed.

Paste this in between the opening and closing head tags:

<meta name="robots" content="nofollow">

You can use both “nofollow” and “noindex” simultaneously. Use this line of code:

<meta name="robots" content="noindex,nofollow">

Generating robots.txt

If you find it difficult to write robots.txt using all the necessary formats and syntax that you need to understand and follow, you can use tools that simplify the process. A good example is our free robots.txt generator.

Once you can see your robots.txt file with the content you added, you’re ready to test the robots.txt markup.

Testing your robots.txt file

You need to test your robots.txt file to ensure that it is working as expected.

Use Google’s robots.txt tester.

  • Sign in to your Webmaster (Google Search Console) account.

  • Next, select your property. In this case, it is your website.

  • Click on “crawl” on the left-hand sidebar.

  • Click on “robots.txt tester.”

  • Replace any existing code with your new robots.txt file.

  • Click “test.”

You should see an “Allowed” result if the file is valid. For more information, check out this in-depth guide to the Google robots.txt tester.

If your file is valid, it is now time to upload it to your root directory or, if a robots.txt file is already there, to save over the existing one.

FAQs

How many robots.txt can a website have?

Your site can have only one robots.txt file.

What is the maximum size of a robots.txt file?

Short answer: 500 KiB (roughly 500 KB); Google ignores rules beyond that limit.

How do I create a custom robots.txt for Blogger?

First, get a robots.txt file for your Blogger site; a commonly used template is sketched below.
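As an illustration only (assuming a blogspot.com address; swap in your own domain and sitemap URL), a commonly used Blogger robots.txt looks like this:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: https://yourblog.blogspot.com/sitemap.xml

The /search rule keeps Blogger’s label and search-result pages out of crawling, while the AdSense crawler (Mediapartners-Google) is allowed everywhere.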

Here is how to implement the custom robots.txt file in Blogger:

  1. Go to the Blogger Dashboard and click on the Settings option.
  2. Scroll down to the crawlers and indexing section.
  3. Enable custom robots.txt with the switch button.
  4. Click on custom robots.txt; a window will open. Paste in the robots.txt content and update.

Where is robots.txt in WordPress?

Same place: yoursite.com/robots.txt.

How do I edit robots.txt in WordPress?

Either manually, or using one of the many WordPress SEO plugins like Yoast that let you edit robots.txt from the WordPress backend.