"Submitted URL blocked by robots.txt": cause of the error and how to fix it

When using Google Search Console, you may receive an email stating that "a new 'Index coverage' issue has been detected."

In this article, I will explain the cause of the error "Submitted URL blocked by robots.txt" and how to fix it.

See also this article for more on index coverage.

Submitted URL blocked by robots.txt

This error usually means that a URL you submitted in your sitemap is blocked by a rule in robots.txt.

To summarize briefly:

Even though you asked the search engine to crawl the URL (by submitting your sitemap), robots.txt tells the crawler not to visit it, so the URL could not be crawled.

That is the situation.

What is robots.txt? It is a file that controls which content you do not want search engines to crawl. With it, you can guide crawlers toward the content that matters on your site and block crawling of content that does not need to be crawled.

See also this article about crawling.

How to fix "Submitted URL blocked by robots.txt"

If you are using WordPress, first open [Settings] > [Reading] in the left-hand menu of the WordPress admin screen.
Then look for the "Discourage search engines from indexing this site" checkbox shown in the image.
(Image: the search engine visibility setting screen)

When a site has just been launched or still has little content, this box is often checked with the intention of keeping the site out of search until the content and articles are ready.

If this box is checked, uncheck it.
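For reference, here is a rough sketch of the kind of output this setting can lead to. Depending on your WordPress version, checking "Discourage search engines from indexing this site" has historically caused the (virtual) robots.txt to block the whole site, while newer versions output a noindex robots meta tag instead; the exact behavior depends on your version and plugins, so treat this purely as an illustration.

▼ Illustration (assumed example): a site-wide block in robots.txt
# Example only - actual output depends on your WordPress version and plugins
User-agent: *
Disallow: /

If your live robots.txt looks like this while your sitemap submits URLs from the same site, Search Console will report those URLs as blocked.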

Next, for readers comfortable with the technical side, the thing to check is the "Disallow" lines in robots.txt.
The Disallow directive is used to deny crawler access.

▼ Disallow entry example

Disallow: / → Blocks every page on the site ("/" means everything under the root)
Disallow: → Blocks nothing
Disallow: /directory1/page1.html → Blocks only that page (/directory1/page1.html)
Disallow: /directory2/ → Blocks all pages under the directory (/directory2/)
Allow: /directory2/page1.html → Allows only that page (/directory2/page1.html) to be crawled

Write a Disallow rule for pages like these that do not need to be crawled, and you can control crawling. A common example is a login URL: many sites block crawling of the WordPress admin URL (e.g. 123.com/wp-admin), as in the sketch below.
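As a concrete illustration, here is a minimal robots.txt along those lines. The paths and sitemap URL are assumptions for a typical WordPress site, not taken from any particular site; /wp-admin/admin-ajax.php is commonly left crawlable so that front-end AJAX features keep working.

▼ Illustration (assumed example): blocking the admin area while allowing AJAX
# Example only - adjust the paths to your own site
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://123.com/sitemap.xml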

Make sure the contents of robots.txt are what you intended, then fix either the sitemap or robots.txt so that the two no longer conflict.

What is robots.txt in the first place?

robots.txt is a file that tells search engines such as Google which content you do not want them to crawl.

In general, being crawled is considered a good thing, so you might wonder, "Shouldn't every page on the site be crawled?" There is no single right or wrong answer here.
However, letting crawlers into members-only content, shopping carts, or automatically generated duplicate pages can hurt the SEO of the entire site.

Description of robots.txt

robots.txt mainly consists of the following elements.

The main directives are User-agent (which crawler the rules apply to), Disallow (paths you do not want crawled), Allow (exceptions to a Disallow), and Sitemap (the location of your sitemap file).

Be careful: if a specific path, or "/" (which represents the entire site), appears after the "Disallow" directive mentioned earlier, that page (or the entire site) is set not to be crawled by search engines.
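To make those elements concrete, here is a minimal annotated sketch. The paths and sitemap URL are placeholders for illustration, not recommendations for any specific site.

▼ Illustration (assumed example): the main elements in one file
# User-agent: which crawler the rules below apply to ("*" means all crawlers)
User-agent: *
# Disallow: a path you do not want crawled
Disallow: /private/
# Allow: an exception within a disallowed area
Allow: /private/public-page.html
# Sitemap: where your sitemap lives
Sitemap: https://example.com/sitemap.xml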

Points to note when creating robots.txt

robots.txt is convenient, and it is clearly important, but what should you be careful about?

Do not use crawl blocking for noindex purposes

Blocking crawling should not be used as a substitute for noindex. Disallow only controls crawling; it cannot prevent a page from being indexed.

If you specify Disallow in robots.txt, crawler access is blocked, so in most cases the page will not end up in the index.

However, if another page links to the disallowed page, it may still be indexed.

Even pages that block crawlers can be indexed if they are linked from other sites.
Google does not crawl or index content blocked by robots.txt, but if a blocked URL is linked from elsewhere on the web, Google may still find and index it. As a result, the URL address and, in some cases, other publicly available information (such as the anchor text of links to the page) may appear in Google search results. To keep a URL out of Google search results entirely, you need to password-protect the files on your server, use the noindex meta tag or response header, or remove the page altogether.

Reference: Google Search Central

So, for pages that you do not want indexed, add the following tag inside the page's head element.

▼ robots meta tag that specifies noindex
<meta name="robots" content="noindex">

Write it like this, or set noindex with a plugin or similar tool.
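The Google passage quoted above also mentions response headers: for non-HTML files such as PDFs, where you cannot add a meta tag, the same noindex instruction can be sent as an HTTP header. The snippet below is a sketch for Apache with mod_headers enabled; the file pattern is an assumption, so adapt it to your own server setup.

▼ Illustration (assumed example): sending noindex as a response header (Apache, mod_headers)
# Example only - applies noindex to PDF files
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>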

Do not use robots.txt to canonicalize duplicate content

This is a similar story: when faced with duplicate content, some people think that blocking one of the duplicate pages in robots.txt will consolidate them, but that is not the right approach.
Duplicate content should be handled properly with methods such as the canonical tag, 301 redirects, or rewriting the text.
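As a quick sketch, the canonical approach looks like this: in the head of the duplicate page, point a canonical link element at the page you want treated as the original. The URL below is a placeholder for illustration.

▼ Illustration (assumed example): canonical tag placed in the head of the duplicate page
<link rel="canonical" href="https://example.com/original-page/">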

Processing priority is not top to bottom

It is a bit technical, but keep in mind the following two rules for how robots.txt is processed.

・ The most specific (deepest) matching path wins
・ When rules are equally specific, Allow takes precedence over Disallow

Let me give you an example.

User-agent: *
Disallow: /shop/tokyo
Allow: /shop/

Reading this top to bottom, it looks like the following will happen, but it does not:

・ Access to /shop/tokyo is blocked...
・ ...but then Allow: /shop/ overrides it, so everything under /shop/ can be crawled after all.

In reality, "/shop/tokyo" matches the deeper (more specific) rule.
So even though the "Allow" rule for the higher-level directory (/shop/) is written later, it matches a shallower path, has lower priority, and does not override the Disallow. The result: /shop/tokyo stays blocked while the rest of /shop/ can be crawled.
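To see how this plays out, here is how a crawler that follows the longest-match rule (such as Googlebot) would evaluate a few URLs against the example above. The URLs are placeholders for illustration.

▼ Illustration (assumed example): which rule wins for each URL
# https://example.com/shop/tokyo/item1 → longest match is "Disallow: /shop/tokyo" → blocked
# https://example.com/shop/osaka/item1 → only "Allow: /shop/" matches → crawled
# https://example.com/about/ → no rule matches → crawled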

Summary

I have covered a lot, but the key point is to handle robots.txt carefully.
If crawlers lose access to all of your pages, your rankings can plummet...

robots.txt is easy to create, which is exactly what makes it a scary file.
However, if you are in charge of a website, it is a file you will touch quite often, so it is worth understanding it reasonably well.

Please feel free to contact us if you have any questions.

I'm sure some of you may be worried when you suddenly receive a warning email.

I hope this article will be of some help to you.

If you have any questions after reading this article, feel free to ask on Twitter (@kaznak_com).

Thank you for reading.

Kazuhiro Nakamura
Representative of Cocorograph Inc. 13 years of SEO experience, with more than 970 sites handled. We provide SUO, an upward-compatible extension of SEO that optimizes not only for search engines but also for search users. Developer of Sachiko Report, an original SEO/SUO reporting tool. Author of "The Latest Common Sense of SEO Taught by Professionals in the Field".