Hello, I’m Yuki (@engineerblog_Yu), a student engineer.
People who hate tedious tasks.
People who want to collect data quickly by programming.
People who want to take on projects in Python.
People who want to learn Python automation.
This article is for people who fit even one of these categories.
Advantages of Web Scraping
・Automatically collect information from a large number of websites without manual work
Information can be collected from the web automatically, with no manual copying and pasting.
・Information can be collected from websites that do not provide APIs.
Since not all websites provide an API, scraping lets you collect information even from sites that do not offer one.
Disadvantages of Web Scraping
・Scraping may be prohibited by the website you collect from, or may even violate the law.
Some websites forbid scraping in their terms of use, or otherwise do not allow it.
Before scraping a site, be sure to read its terms of use carefully.
Also, repeatedly accessing the same page puts an unnecessary load on the server, so use the technique described below to keep that load down.
Web scraping with BeautifulSoup
pip install bs4 requests lxml
Now let’s write the code.
This time, let’s scrape the Python website (https://www.python.org).
import requests

# Fetch the page; the raw HTML is available as the response's .text attribute
html = requests.get('https://www.python.org')
print(html.text)
If you run this code, the HTML that makes up the Python website will be printed to the terminal.
Next, let’s extract only the information we want.
This time, we will extract only the information in the title tag of the official Python website.
from bs4 import BeautifulSoup
import requests

html = requests.get('https://www.python.org')
soup = BeautifulSoup(html.text, 'lxml')  # parse the HTML with the lxml parser
titles = soup.find_all('title')  # find_all returns a list of matching tags
print(titles)
Output
[<title>Welcome to Python.org</title>]
This means that the official Python website uses only this one title tag.
If there are many of them, all the title tags will be displayed in list form.
Similarly, if you want to extract every element with a particular class, write:
variable_name = soup.find_all('tag_name', {'class': 'class_name'})
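As a concrete sketch, here is a self-contained example on a small HTML snippet. The tag and class names ('li' and 'menu-item') are made up for illustration, not taken from python.org:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet; 'menu-item' is an invented class name
html = """
<ul>
  <li class="menu-item">Docs</li>
  <li class="menu-item">Downloads</li>
  <li class="other">About</li>
</ul>
"""

# html.parser is Python's built-in parser, so no extra install is needed
soup = BeautifulSoup(html, 'html.parser')
items = soup.find_all('li', {'class': 'menu-item'})
print([item.text for item in items])  # ['Docs', 'Downloads']
```

Note that find_all matches any element whose class attribute contains the given class, so the third li (class "other") is not returned.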
How to avoid burdening the server
To avoid burdening the server, use the sleep function from the time module.
from time import sleep
sleep(3)
When accessing a site repeatedly, it is generally recommended to pause for at least 1~3 seconds between requests, so when looping with for statements and the like, insert sleep calls as appropriate.
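As a sketch, a loop with a pause between iterations might look like this. The page list here is a placeholder; in real scraping each entry would be a URL fetched with requests.get inside the loop:

```python
from time import sleep, monotonic

# Placeholder page list; replace with the actual URLs you need to fetch
pages = ['page-1', 'page-2', 'page-3']

start = monotonic()
for page in pages:
    # ... requests.get(page) and parsing would go here ...
    sleep(1)  # pause at least 1~3 seconds between consecutive requests
elapsed = monotonic() - start
print(f'Processed {len(pages)} pages in {elapsed:.1f} s')
```

Putting the sleep inside the loop guarantees a gap between every pair of consecutive requests, which is the whole point.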
In Closing
In this article, we introduced web scraping for those who find it tedious to collect data from the web.
Automation is one of the strengths of Python, and I think it is something you should learn if you want to study Python.
If you are interested, you can take a course on Udemy.