Search engines generally aren't in the habit of breaking an entry. That said, some aren't shy about picking a few metaphorical locks. Google isn't one of them; it obeys the instructions in a robots.txt file. Just know that some search engines ignore it completely.

Here's the basic format of a robots.txt file:
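(A generic sketch; the bracketed values are placeholders, not literal text.)

Sitemap: [URL of your XML sitemap]

User-agent: [bot identifier]
[directive 1]
[directive 2]

User-agent: [another bot identifier]
[directive 1]
[directive 2]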
If you've never seen one of these files before, that might seem daunting. In short, you assign rules to bots by stating their user-agent followed by directives. Let's explore these two components in more detail.

User-agents
Each search engine identifies itself with a different user-agent. You can set custom instructions for each of these in your robots.txt file. There are hundreds of user-agents, but some useful ones for SEO include Googlebot (Google), Bingbot (Bing), Slurp (Yahoo), Baiduspider (Baidu), DuckDuckBot (DuckDuckGo), and YandexBot (Yandex).

Directives
Directives are the rules you want the declared user-agents to follow. The main ones Google supports are Disallow (block crawling of paths matching a prefix), Allow (override a disallow for a more specific path), and Sitemap. A couple of matching quirks are worth knowing. Sidenote. Disallowing /blog/ (with the trailing slash) doesn't block /blog; here, /blog (without the trailing slash) is still accessible and crawlable. And when allow and disallow rules conflict, Google and Bing follow the most specific (longest) matching directive. Crucially, this is only the case for Google and Bing; other search engines listen to the first matching directive.

Sitemap
Use this directive to specify the location of your sitemap(s) to search engines. If you're unfamiliar with sitemaps, they generally include the pages that you want search engines to crawl and index. Here's an example of a robots.txt file using the sitemap directive:
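(A sketch; the domain and path are made up.)

Sitemap: https://www.domain.com/sitemap.xml

User-agent: *
Disallow: /admin/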
How important is including your sitemap(s) in your robots.txt file? If you've already submitted your sitemap through Search Console, then it's somewhat redundant for Google. However, it does tell other search engines like Bing where to find your sitemap, so it's still good practice. Note that you don't need to repeat the sitemap directive multiple times for each user-agent; it applies to all of them, so you're best to include sitemap directives at the beginning or end of your robots.txt file. Google supports the sitemap directive, as do Ask, Bing, and Yahoo.

Sidenote. You can include as many sitemaps as you like in your robots.txt file.

Unsupported directives
Here are the directives that are no longer supported by Google (some of which technically never were).

Crawl-delay
Previously, you could use this directive to specify a crawl delay in seconds. For example, if you wanted Googlebot to wait 5 seconds after each crawl action, you'd set the crawl-delay to 5 like so:
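User-agent: Googlebot
Crawl-delay: 5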
Google no longer supports this directive, but Bing and Yandex do. That said, be careful when setting this directive, especially if you have a big site. If you set a crawl-delay of 5 seconds, then you're limiting bots to crawl a maximum of 17,280 URLs a day. That's not very helpful if you have millions of pages, but it could save bandwidth if you have a small website.

Noindex
This directive was never officially supported by Google. However, until recently, it's thought that Google had some "code that handles unsupported and unpublished rules (such as noindex)." So if you wanted to prevent Google from indexing all posts on your blog, you could use the following directive:
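(Assuming the posts live under /blog/.)

User-agent: Googlebot
Noindex: /blog/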
However, on September 1st, 2019, Google made it clear that this directive is not supported. If you want to exclude a page or file from search engines, use the meta robots tag or x-robots HTTP header instead.

Nofollow
This is another directive that Google never officially supported. It was used to instruct search engines not to follow links on pages and files under a specific path. For example, if you wanted to stop Google from following all links on your blog, you could use the following directive:
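(Again assuming the blog lives under /blog/.)

User-agent: Googlebot
Nofollow: /blog/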
Google announced that this directive is officially unsupported on September 1st, 2019. If you want to nofollow all links on a page now, you should use the robots meta tag or x-robots header. If you want to tell Google not to follow specific links on a page, use the rel="nofollow" link attribute.

Do you need a robots.txt file?
Having a robots.txt file isn't crucial for a lot of websites, especially small ones. That said, there's no good reason not to have one. It gives you more control over where search engines can and can't go on your website, and that can help with things like:

Preventing the crawling of duplicate content.
Keeping sections of a website private (e.g., your staging site).
Preventing the crawling of internal search results pages.