What is scraping?
Scraping is the process of crawling a website and then extracting and processing its data.
The advantage of scraping is that it lets you collect large amounts of information that would be extremely tedious to gather by hand.
Tasks such as "extract the author name from every post on a blog with thousands of articles" or "extract all the rents for a specific area from multiple rental listing sites" are far too much work, and a waste of time, to check and summarize by eye.
A couple of caveats:
- Since scraping is not officially supported the way an API is, a change in the site's structure can suddenly make it impossible to collect the information.
- If you scrape too frequently, the site operator may notice you in the access logs and block you.
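The second caveat above can be mitigated by pacing your requests. As a minimal sketch (the `polite_fetch` helper is hypothetical, not part of any library; in practice `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()`):

```python
import time

def polite_fetch(urls, fetch, delay=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between requests so the
    target server is not flooded (a common reason for being blocked)."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no pause before the very first request
            sleep(delay)
        results.append(fetch(url))
    return results
```

Both `fetch` and `sleep` are injectable here, so the pacing logic can be exercised without any network access.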
Scraping with Screaming Frog SEO Spider
Screaming Frog SEO Spider (known simply as "the frog" among SEO practitioners) can already extract many important elements out of the box, such as titles, heading tags (h1, h2, and so on), and meta descriptions.
However, in my day-to-day data analysis work I often run into situations like "I want more detailed information" or "it would be convenient to extract this part of the page as well".
"It would be easier if I could crawl just the specific information I'm looking for and drop it into a CSV..." — this article is a small solution to that.
So here I would like to introduce scraping using Screaming Frog SEO Spider.
Here too, a few caveats:
- Custom extraction is not available in the free version of Screaming Frog SEO Spider.
- The target pages are assumed to be crawlable; choose pages that return a 200 status code.
- This article does not cover extraction using XPath or regular expressions.
Copy the selector of the text you want to extract
First, find the text you are interested in and copy its selector.
Open the developer tools on the page containing the text: press Command + Option + I on a Mac, or F12 on Windows.
Look for the page element that contains only the text you want.
When you find the element in the developer tools, right-click > Copy > Copy selector, and the selector is copied.
This time I want to list the authors of article pages, so I will copy that selector.
If it is hard to find, click the element picker in the developer tools and, while it is highlighted blue, hover the cursor over the page; the code corresponding to each element on the site will be shown.
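To make "an element that contains only the text" concrete: what an Extract Text step conceptually does is pull the text content out of the element your selector points at. Here is a minimal standard-library sketch of that idea (the class name "author-name" and the sample HTML are made up for illustration, and matching a single class is only a tiny stand-in for real CSS selector matching):

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text of every element whose class attribute matches
    the target class. Assumes well-nested tags, for brevity."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0          # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1
            if self.depth == 1:  # entering a new matching element
                self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data

def extract_text(html, target_class):
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    return [t.strip() for t in parser.results]
```

For example, `extract_text('<span class="author-name">Jane Doe</span>', "author-name")` returns the author name as a one-element list.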
Open Screaming Frog SEO Spider
Next, let's open the tool.
In the menu bar at the top left, go to Configuration > Custom > Extraction.
A screen like the following image will then be displayed.
I will explain each numbered item.
- Give the data you want to extract a name. This time I want to extract the author of each article, so I named it "author name".
Enter whatever name fits the data you are looking for.
- Select "CSS Path" here.
- Paste the selector you copied earlier here.
One thing to note: if you copied the selector from a mass-produced article page like the one in this example, it will begin with the ID of that specific article,
like "#post-12345 > header > div.header-meta -...", so
delete that leading part.
If it looks like "header > div.header-meta -...", you're OK.
- Select "Extract Text" here.
- When you have done all of the above, press OK.
* If there are multiple pieces of data you want to extract, you can add rows by pressing "+ Add".
In that case, repeat the same steps for each one.
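The trimming described in step 3 is something you do by hand in the text field, but the rule can be sketched as a small helper (the function name `trim_selector` is hypothetical — Screaming Frog has no such command):

```python
def trim_selector(selector):
    """Drop leading '#id' segments from a copied CSS selector so it
    generalises across pages instead of matching one article only."""
    parts = [p.strip() for p in selector.split(">")]
    while parts and parts[0].startswith("#"):
        parts.pop(0)
    return " > ".join(parts)
```

So `trim_selector("#post-12345 > header > div.header-meta")` yields `"header > div.header-meta"`, which matches the same element on every article page.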
Crawl the site
When the setup is complete, crawl the site containing the target pages as you normally would.
After the crawl, check whether the extraction worked properly. (* The values are blacked out because they are personal names.)
You can confirm it in the tool as in the image, and it also comes through properly when you export to CSV.
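Once exported, you can also sanity-check the extraction column programmatically. A minimal standard-library sketch (the column header "author name" follows the example above; Screaming Frog exports normally put the URL in an "Address" column, but adjust both names to match your own CSV):

```python
import csv

def missing_extractions(csv_path, column="author name"):
    """Return the URLs of rows where the custom-extraction column is
    empty, i.e. pages the selector failed to match."""
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if not row.get(column, "").strip():
                missing.append(row.get("Address", ""))
    return missing
```

An empty result means the selector matched on every crawled page; a long list usually means the leading "#post-..." ID segment was left in the selector.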
That's all for this time.
In the future, I hope to write articles explaining regular expressions and XPath as well.
This time, I covered a feature that fills in what the default extraction lacks.
It may not come up very often in SEO work, but I think it is a feature worth knowing, so
if you normally use Screaming Frog SEO Spider, please give it a try.
I hope this article has been of some help to you.
If you have any questions after reading the article, feel free to ask me on Twitter (@kaznak_com).