【Scraping(BeautifulSoup)】Automatically extract information from websites in Python


Hello, I’m Yuki (@engineerblog_Yu), a student engineer.

People who hate troublesome things.

People who want to collect data quickly by programming.

People who want to take on projects in Python.

People who want to learn Python automation.

This article is for people who fit even one of these categories.


Advantages of Web Scraping

Automatically collect information from a large number of websites without manual work

Information can be automatically collected from the Web without manual copying and pasting.

Information can be collected from websites that do not provide APIs.

Not every website provides an API, so scraping lets you collect information even from sites that do not offer one.

Disadvantages of Web Scraping

Scraping may be rejected by the website where the information is collected or may be in violation of the law.

Even when a site is publicly accessible, its terms of use may prohibit scraping.

Before scraping a site, be sure to check its terms of use carefully.
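Besides the terms of use, many sites publish a robots.txt file describing which paths crawlers may access. Python's standard library can check it for you; here is a minimal sketch using `urllib.robotparser` (the sample rules and `example.com` URLs below are made up for illustration, and in practice you would point the parser at the real site's robots.txt):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# In practice: rp.set_url('https://www.python.org/robots.txt'); rp.read()
# Here we parse a sample robots.txt inline so the sketch works offline.
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

# can_fetch(user_agent, url) tells you whether that agent may fetch the URL
print(rp.can_fetch('*', 'https://example.com/public/page'))   # True
print(rp.can_fetch('*', 'https://example.com/private/page'))  # False
```

This is not a substitute for reading the terms of use, but it is a quick automated first check.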

Also, hammering the same page with repeated requests puts a burden on the server and is not allowed, so use the methods described below to keep the load light.

Web scraping with BeautifulSoup

First, install BeautifulSoup (distributed as the beautifulsoup4 package) together with requests and the lxml parser used below:

pip install beautifulsoup4 requests lxml

Now let’s write the code.

This time, let’s scrape the Python website (https://www.python.org).

import requests

# Fetch the page; the response body is the raw HTML
html = requests.get('https://www.python.org')
print(html.text)

If you run this code, the HTML that makes up the Python website will be printed to the terminal.

Next, let’s extract only the information we want.

This time, we will extract only the information in the title tag of the official Python website.

from bs4 import BeautifulSoup
import requests

html = requests.get('https://www.python.org')
soup = BeautifulSoup(html.text, 'lxml')

titles = soup.find_all('title')
print(titles)

Output

[<title>Welcome to Python.org</title>]

This means the official Python website uses only one title tag.

If there were more than one, they would all be returned in the list.

Similarly, if you want to extract all elements with a given class, use:

variable_name = soup.find_all('tag_name', {'class': 'class_name'})
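To see this pattern in action without depending on any live page, here is a small self-contained sketch; the tag and class names ("li", "news") are made up for illustration, and it uses the built-in html.parser so no extra install is needed:

```python
from bs4 import BeautifulSoup

# A small inline HTML snippet with made-up class names
html = """
<ul>
  <li class="news">Python 3.13 released</li>
  <li class="news">PyCon dates announced</li>
  <li class="event">Sprint weekend</li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all with a class filter returns only the matching elements
news_items = soup.find_all('li', {'class': 'news'})
print([li.text for li in news_items])  # ['Python 3.13 released', 'PyCon dates announced']
```

The same call works on a real page once you know which tag and class the site uses for the data you want.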

How to avoid burdening the server

To avoid burdening the server, use the sleep function.

from time import sleep

sleep(3)

When accessing a site repeatedly, it is generally recommended to pause for at least 1 to 3 seconds between requests, so when looping with a for statement, call sleep as appropriate.
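Put together, a polite crawling loop looks roughly like this; the URLs are just placeholder examples, and the actual fetch is left as a comment so the sketch runs offline:

```python
from time import sleep, monotonic

# Placeholder URLs; replace with the pages you actually want to scrape
urls = [
    'https://www.python.org/about/',
    'https://www.python.org/downloads/',
]

start = monotonic()
for url in urls:
    # fetch and parse the page here, e.g. requests.get(url)
    sleep(1)  # pause at least 1 second between requests
elapsed = monotonic() - start
```

The sleep call goes inside the loop so every request, not just the first, is spaced out.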

At the end

In this article, we introduced web scraping for those who find it tedious to collect data from the web.

Automation is one of the strengths of Python, and I think it is something you should learn if you want to study Python.

If you are interested, you can take a course on Udemy.

