When using Google Search Console, you may receive an email saying that "a new 'Index coverage' issue has been detected."
This article explains the cause of the error "Submitted URL blocked by robots.txt" and how to fix it.
See also our article on index coverage.
What this page covers
Submitted URL blocked by robots.txt
This error usually means the URL is listed in robots.txt.
To put it briefly:
You asked the search engine to crawl your site (by submitting a sitemap), but the site is also configured to refuse crawling (by a rule in robots.txt), so the crawler could not visit the page normally.
That is the situation.
What is robots.txt? It is a file that controls which content you do not want search engines to collect by crawling. With it, you can have search engine crawlers focus on the content that matters to your site, and block crawling where it is not needed.
See also this article about crawling.
How to fix "Submitted URL blocked by robots.txt"
If you are using WordPress, first open [Settings] > [Reading] in the left-hand menu of the WordPress admin screen.
Then look for the "Discourage search engines from indexing this site" checkbox shown in the image.
When a website was built in a hurry or still has little content, this box is often checked with the intention of "keeping the site out of search until the content and articles are ready."
If this item is checked, uncheck it.
Next, a point I would like readers comfortable with the technical side to check is the "Disallow" lines in robots.txt.
The Disallow directive is used to deny crawler access.
▼ Disallow entry examples
Disallow: / → blocks every page on the site (/ stands for everything under the top level)
Disallow: → blocks nothing
Disallow: /directory1/page1.html → blocks only the page /directory1/page1.html
Disallow: /directory2/ → blocks every page under the directory /directory2/
Allow: /directory2/page1.html → allows crawling of only the page /directory2/page1.html
Write Disallow rules like these to keep crawlers away from pages that do not need to be crawled. A common example is a login URL: many sites refuse crawling of the WordPress admin URL (e.g. 123.com/wp-admin).
Make sure the contents of robots.txt are what you intended, and fix the sitemap or robots.txt so the two no longer conflict.
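As a sketch, a robots.txt for a typical WordPress site, combining the directives above, might look like this (the domain in the Sitemap line is a placeholder):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap.xml
```

The Allow line keeps admin-ajax.php reachable, since some themes and plugins rely on it for front-end features.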
What is robots.txt in the first place?
Robots.txt is a file that controls which content you do not want collected by the crawlers of search engines such as Google.
In general, being crawled is considered a good thing, which leads some to ask, "shouldn't every page on the web just be crawled?" There is no single right or wrong answer here.
However, letting crawlers into members-only content, shopping carts, or duplicate pages that are auto-generated by the system can affect the SEO of the entire site.
How to write robots.txt
Robots.txt mainly consists of the following elements: User-agent (which crawler the rules apply to), Disallow and Allow (which paths to block or permit), and Sitemap (where the sitemap is located).
If the "Disallow" directive mentioned earlier is followed by a specific page path, or by "/" (which stands for the entire site), that page (or the whole site) is set not to be crawled by search engines, so be careful.
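Putting those elements together, a minimal robots.txt might look like this (the paths and the sitemap URL are placeholders):

```
User-agent: *                → the rules below apply to all crawlers (* = all)
Disallow: /members/          → block crawling of everything under /members/
Allow: /members/faq.html     → but permit this one page
Sitemap: https://example.com/sitemap.xml
```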
Points to note when creating robots.txt
It's convenient, and you can sense that it's important, but what should you be careful about?
Do not use crawl blocking for noindex purposes
Crawl blocking should not be used as a substitute for noindex. Blocking a crawl does exactly that and no more: it cannot prevent indexing.
If you specify Disallow in robots.txt, you control crawl access, so in most cases the page will not be indexed.
However, if another page links to the disallowed page, it may still be indexed.
Even a page that blocks crawlers can be indexed if it is linked from other sites.
Google does not crawl or index content blocked by robots.txt, but if a blocked URL is linked from elsewhere on the web, Google can still discover it and may index it. As a result, the URL, and in some cases other publicly available information such as the anchor text of links to the page, can appear in Google search results. To keep a URL out of Google search results entirely, you need to password-protect the files on your server, use a noindex meta tag or response header, or remove the page permanently.
So, for pages that you do not want indexed, place the following in the page's head.
▼ robots meta tag that requests noindex
<meta name="robots" content="noindex">
Write it like this, or set noindex with a plugin or the like.
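The "response header" route mentioned in Google's explanation uses the X-Robots-Tag HTTP header. As a sketch, assuming an Apache server with mod_headers enabled and a hypothetical file name, an .htaccess entry could look like this:

```
<Files "private-page.html">
  Header set X-Robots-Tag "noindex"
</Files>
```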
Do not use robots.txt to canonicalize duplicate content
It is a similar story: when faced with duplicate content, some people think that blocking one of the duplicate pages in robots.txt will canonicalize the content, but that is not the right approach.
Duplicate content should be consolidated the proper way, such as with rel=canonical, a 301 redirect, or rewriting the text.
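For example, to point a duplicate page at the original, you can place a canonical link element in the duplicate page's head (the URL is a placeholder):

```
<link rel="canonical" href="https://example.com/original-page/">
```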
Processing order is not top to bottom
It is a bit technical, but keep in mind the following two rules about how robots.txt is processed:
・The rule with the deepest (most specific) path is applied first
・Allow takes precedence over Disallow
Let me give an example.
Disallow: /shop/tokyo
Allow: /shop/
At first glance this looks like the following sequence, but that is not what happens:
・Access to /shop/tokyo is forbidden!
・But then everything under /shop/ is allowed after all.
In this case, /shop/tokyo sits at the deeper level of the hierarchy.
So even though the "Allow" rule for the higher level (/shop/) is written afterwards, its path is shallower, so it has lower priority and does not override the block.
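You can check how rules like these are interpreted with Python's standard-library urllib.robotparser. This is only a sketch: Python's parser applies rules in file order (first match wins) rather than Google's longest-path rule, but for this particular file both give the same answer. The domain and the page paths are placeholders.

```python
import urllib.robotparser

# The robots.txt from the example above.
rules = """\
User-agent: *
Disallow: /shop/tokyo
Allow: /shop/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# /shop/tokyo is blocked: the deeper Disallow rule wins.
print(rp.can_fetch("*", "https://example.com/shop/tokyo"))  # False
# Other pages under /shop/ are still crawlable.
print(rp.can_fetch("*", "https://example.com/shop/osaka"))  # True
```

Running a quick check like this before deploying a robots.txt change is a cheap way to avoid accidentally blocking the whole site.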
I have covered a lot here, but please handle robots.txt with care.
If the crawler loses access to every page, your rankings could plummet...
Robots.txt itself is easy to create, and that is exactly what makes it a dangerous file.
Still, if you are in charge of a website, it is a file you will touch quite often, so it is worth understanding to a reasonable degree.
Please feel free to contact us if you have any questions.
Some of you may have been worried when the warning email suddenly arrived.
I hope this article has been of some help.
If you have any questions after reading, feel free to ask on Twitter (@kaznak_com).
Thank you for reading.