【Scraping(BeautifulSoup)】Automatically extract information from websites in Python


Hello, I’m Yuki (@engineerblog_Yu), a student engineer.

People who hate troublesome things.

People who want to collect data quickly by programming.

People who want to take on projects in Python.

People who want to learn Python automation.

This article is for people who fit even one of these categories.


Advantages of Web Scraping

Automatically collect information from a large number of websites without manual work

Information can be automatically collected from the Web without manual copying and pasting.

Information can be collected from websites that do not provide APIs.

Not every website provides an API, so scraping lets you collect information even from sites that do not offer one.

Disadvantages of Web Scraping

Scraping may be rejected by the website where the information is collected or may be in violation of the law.

Even when a site is publicly accessible, its terms of use may prohibit scraping.

Before scraping a site, be sure to check its terms of use carefully.
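Besides the terms of use, many sites publish a robots.txt file describing which paths crawlers may access. Python's standard library can check it for you; here is a minimal sketch using `urllib.robotparser` (the sample rules and `example.com` URLs below are made up for illustration, and in practice you would point the parser at the real site's robots.txt):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# In practice: rp.set_url('https://www.python.org/robots.txt'); rp.read()
# Here we parse a sample robots.txt inline so the sketch works offline.
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

# can_fetch(user_agent, url) tells you whether that agent may fetch the URL
print(rp.can_fetch('*', 'https://example.com/public/page'))   # True
print(rp.can_fetch('*', 'https://example.com/private/page'))  # False
```

This is not a substitute for reading the terms of use, but it is a quick automated first check.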

Also, hammering the same page with repeated requests puts a burden on the server and is not allowed, so use the methods described below to keep the load light.

Web scraping with BeautifulSoup

First, install BeautifulSoup (distributed as the beautifulsoup4 package) together with requests and the lxml parser used below:

pip install beautifulsoup4 requests lxml

Now let’s write the code.

This time, let’s scrape the Python website (https://www.python.org).

import requests

# Fetch the page; the response body is the raw HTML
html = requests.get('https://www.python.org')
print(html.text)

If you run this code, the HTML that makes up the Python website will be printed to the terminal.

Next, let’s extract only the information we want.

This time, we will extract only the information in the title tag of the official Python website.

from bs4 import BeautifulSoup
import requests

html = requests.get('https://www.python.org')
soup = BeautifulSoup(html.text, 'lxml')

titles = soup.find_all('title')
print(titles)

Output

[<title>Welcome to Python.org</title>]

This means the official Python website uses only one title tag.

If there were more than one, they would all be returned in the list.

Similarly, if you want to extract all elements with a given class, use:

variable_name = soup.find_all('tag_name', {'class': 'class_name'})
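To see this pattern in action without depending on any live page, here is a small self-contained sketch; the tag and class names ("li", "news") are made up for illustration, and it uses the built-in html.parser so no extra install is needed:

```python
from bs4 import BeautifulSoup

# A small inline HTML snippet with made-up class names
html = """
<ul>
  <li class="news">Python 3.13 released</li>
  <li class="news">PyCon dates announced</li>
  <li class="event">Sprint weekend</li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all with a class filter returns only the matching elements
news_items = soup.find_all('li', {'class': 'news'})
print([li.text for li in news_items])  # ['Python 3.13 released', 'PyCon dates announced']
```

The same call works on a real page once you know which tag and class the site uses for the data you want.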

How to avoid burdening the server

To avoid burdening the server, use the sleep function.

from time import sleep

sleep(3)

When accessing a site repeatedly, it is generally recommended to pause for at least 1 to 3 seconds between requests, so when looping with a for statement, call sleep as appropriate.
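Put together, a polite crawling loop looks roughly like this; the URLs are just placeholder examples, and the actual fetch is left as a comment so the sketch runs offline:

```python
from time import sleep, monotonic

# Placeholder URLs; replace with the pages you actually want to scrape
urls = [
    'https://www.python.org/about/',
    'https://www.python.org/downloads/',
]

start = monotonic()
for url in urls:
    # fetch and parse the page here, e.g. requests.get(url)
    sleep(1)  # pause at least 1 second between requests
elapsed = monotonic() - start
```

The sleep call goes inside the loop so every request, not just the first, is spaced out.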

At the end

In this article, we introduced web scraping for those who find it tedious to collect data from the web.

Automation is one of the strengths of Python, and I think it is something you should learn if you want to study Python.

If you are interested, you can take a course on Udemy.

