How to scrape with the Screaming Frog SEO Spider

What is scraping?

Scraping is the process of crawling a website and then extracting and processing its data.

The advantage of scraping is that you can collect large amounts of information that would be very tedious to gather manually by eye.
Tasks such as "extract the author name of every post on a blog with thousands of articles" or "extract all rents in a specific area from multiple rental listing sites" take far too much effort and time if you check and tally the data by hand.
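
To make the idea of "extracting" concrete, here is a minimal sketch in Python that pulls the text out of every element with a given class. It is not how the Screaming Frog SEO Spider works internally; the sample HTML, the class name "author", and the function names are all assumptions made up for this illustration.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element that carries a given class.

    Sketch only: void tags such as <br> inside a matching element are
    not handled.
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # > 0 while we are inside a matching element
        self.results = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1   # nested tag inside the matching element
        elif self.target_class in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def extract_by_class(html, target_class):
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical article markup, invented for this example.
SAMPLE_HTML = """
<article id="post-12345">
  <header>
    <h1>Sample article</h1>
    <div class="header-meta"><span class="author">Jane Doe</span></div>
  </header>
</article>
"""

print(extract_by_class(SAMPLE_HTML, "author"))  # ['Jane Doe']
```

Doing this for thousands of pages means also writing the crawler, which is the part a crawling tool takes care of for you.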

There are some caveats:

  • Unlike an API, scraping is not officially supported by the site, so a change in the site structure can suddenly make it impossible to collect the information.
  • If you scrape too frequently, the site may detect it from its access logs and block you.
  • Check the site's terms of use carefully, because scraping may be prohibited or expose you to legal risk.
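
As a rough illustration of the second caveat, the sketch below uses Python's standard urllib.robotparser to decide which URLs a crawler may fetch and how long to wait between requests. The robots.txt content, the user agent "my-crawler", the URLs, and the helper names are assumptions invented for this example, and robots.txt is not a substitute for reading the terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def plan_crawl(rules, user_agent, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) for a polite crawl."""
    allowed = [url for url in urls if rules.can_fetch(user_agent, url)]
    # Fall back to default_delay when the site declares no Crawl-delay.
    delay = rules.crawl_delay(user_agent) or default_delay
    return allowed, float(delay)


rules = build_robot_rules(SAMPLE_ROBOTS_TXT)
allowed, delay = plan_crawl(
    rules,
    "my-crawler",
    ["https://example.com/blog/post-1", "https://example.com/private/draft"],
)
print(allowed)  # ['https://example.com/blog/post-1']
print(delay)    # 2.0
```

In a real crawl you would wait for the returned delay between requests (for example with time.sleep) instead of fetching the pages back to back.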

Scraping with the Screaming Frog SEO Spider

Even by default, the Screaming Frog SEO Spider (known as "the frog" among SEO practitioners) can extract many important elements, such as titles, heading tags (h1, h2, etc.), and descriptions, which makes it a convenient tool.

However, in my daily data analysis work, I often find myself thinking "I want more detailed information" or "it would be convenient if I could extract this piece of information as well".

If you have ever thought "it would be easier if I could crawl just the specific information I am looking for and export it to CSV", this may be a small solution.
So in this article I will introduce scraping with the Screaming Frog SEO Spider.

Again, there are some caveats:

  • This feature is not available in the free version of the Screaming Frog SEO Spider.
  • It assumes the pages can be crawled, so choose pages that return a status code of 200.
  • This article does not cover scraping with XPath or regular expressions.

Copy the selector of the text you want to extract

First, find the text you are interested in and copy its selector.
Open the developer tools on the page where the text appears: press Command + Option + I on a Mac, or F12 on Windows.

Look for the page element that contains only the text you want.
Once you have found its code in the developer tools, right-click it and choose Copy > Copy selector.
In this example I want to list the authors of article pages, so I copy the selector of the author name.

If the element is hard to find, click the element picker icon in the developer tools and, while the icon is blue, move the cursor over the page. The code for the element under the cursor is then displayed.

Open the Screaming Frog SEO Spider

Next, open the tool.

From the menu bar at the top left, go to Configuration > Custom > Extraction.
The Custom Extraction settings screen is displayed.

Each numbered field on the screen is explained below.

  1. Enter a name for the data you want to extract. I want to extract the article author here, so I used "author name".
    Choose a name that fits the data you are after.
  2. Select "CSS Path".
  3. Paste the selector you copied earlier.
    One important note: if you copied it from a templated article page, as in this example,
    the selector starts with the ID of that particular article, such as "#post-12345 > header > div.header-meta-...",
    so delete that leading part.
    It is fine if it looks like "header > div.header-meta-...".
  4. Select "Extract Text".
  5. When you have finished the steps above, press OK.

* If you want to extract more than one piece of data, press "+ Add" to add more extractors,
and repeat the same steps for each one.
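
The cleanup in step 3 is easy to do by hand, but if you handle many copied selectors, it can be sketched as a small Python helper. This is only an illustration: the function name is made up, the selector is shortened for the example, and the pattern assumes the selector begins with a single "#id" step.

```python
import re

# Matches a leading "#some-id" step plus the ">" combinator that follows it.
_LEADING_ID = re.compile(r"^\s*#[\w-]+\s*(?:>\s*)?")


def strip_leading_id(selector):
    """Drop a page-specific ID at the start of a copied CSS selector."""
    return _LEADING_ID.sub("", selector, count=1).strip()


print(strip_leading_id("#post-12345 > header > div.header-meta"))
# header > div.header-meta
```

Without the article ID, the same selector matches the author element on every article page that shares the template, not only on the page you copied it from.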

Crawl the site

When the setup is done, crawl the site that contains the pages as you normally would.
After the crawl, check that the data was extracted correctly. (* The results are hidden in the screenshot because they are personal names.)

You can check the results in the tool, and they also appear correctly when you export them to CSV.
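
Once the data is in CSV, summarizing it takes only a few lines. The Python sketch below counts articles per author. The sample rows are invented, and the column names ("Address", "Status Code", "author name 1") are assumptions about the export layout, so check the header row of your own file before reusing it.

```python
import csv
import io
from collections import Counter

# Hypothetical export, invented for this example.
# The column names are assumptions; check them against your own CSV.
SAMPLE_EXPORT = """\
Address,Status Code,author name 1
https://example.com/blog/post-1,200,Jane Doe
https://example.com/blog/post-2,200,John Smith
https://example.com/blog/post-3,200,Jane Doe
https://example.com/blog/missing,404,
"""


def count_by_column(csv_file, column):
    """Count the non-empty values in one column of a CSV file object."""
    reader = csv.DictReader(csv_file)
    values = ((row.get(column) or "").strip() for row in reader)
    return Counter(value for value in values if value)


counts = count_by_column(io.StringIO(SAMPLE_EXPORT), "author name 1")
for author, total in counts.most_common():
    print(author, total)
# Jane Doe 2
# John Smith 1
```

For a real export, replace io.StringIO(SAMPLE_EXPORT) with an opened file, for example open("export.csv", newline="", encoding="utf-8"), where the file name is again just a placeholder.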

That's all for this time.
If I get the chance to write another article, I will explain regular expressions and XPath.

Summary

In this article I explained a feature that fills in what the default extraction does not cover.

It is probably not something you use very often in SEO work, but I think it is worth remembering, so
if you use the Screaming Frog SEO Spider regularly, please give it a try.

I hope this article is of some help to you.
If you have any questions after reading it, feel free to ask me on Twitter (@kaznak_com) or elsewhere.

See you next time.

Kazuhiro Nakamura
Representative of Cocorograph Inc. 13 years of SEO experience, with work on more than 970 sites. The company provides SUO, an extension of SEO that optimizes not only for search engines but also for search users. Developer of Sachiko Report, an original SEO/SUO reporting tool. Author of the book "The latest common sense of SEO taught by professionals in the field".

cocorograph inc.
〒150-0002
Miyamasuzaka Building 203, 2-19-15 Shibuya, Shibuya-ku, Tokyo
tel: +81-50-1748-9550
url: https://cocorograph.co/

© 2021 cocorograph Inc.