Hi, I’m Yuki (@engineerblog_Yu), a student engineer.
Are you interested in getting a scraping project in Python?
In this article, I would like to explain Selenium for those who want to do scraping in Python.
With Selenium, you can automatically open a browser and collect information on a site.
If you save the code for operations you perform often, the next time you need the same operation you can simply run the program and have it done in an instant, which is very convenient.
If you want to take on Python projects or cut down your workload with Python, this article is worth a look.
Scraping projects in particular are plentiful, so if you want to land a project in Python, this is a good place to start.
Basic Scraping Flow
The basic flow of scraping is as follows:
1. Inspect the website's HTML using the browser's developer tools.
2. Collect the desired information into a list using find_elements_by_*.
3. Build a table with pandas.
4. Output it as a CSV or Excel file.
Each of these steps is explained below.
import
from selenium import webdriver
How to open a browser
This code will automatically open Google Chrome.
browser = webdriver.Chrome()
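As a side note, you can also pass options when starting the browser. Here is a minimal sketch, assuming Google Chrome and a matching ChromeDriver are installed, that opens Chrome in headless mode (without a visible window):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a visible window
browser = webdriver.Chrome(options=options)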

Open the website
Put the URL of the website you want to open inside the parentheses of browser.get() and execute it.
browser.get('URL')
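For example, here is a small sketch that opens a placeholder page and tells Selenium to wait for elements to appear (https://example.com is just a stand-in URL):

browser.get('https://example.com')  # placeholder URL
browser.implicitly_wait(10)  # wait up to 10 seconds for elements to appear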
Some websites may prohibit scraping, so please confirm this on your own.
Getting an element by its id
elem = browser.find_element_by_id('id')
An id is an attribute attached to an HTML tag on a website.
This may be difficult because it requires knowledge of HTML as well as knowledge of Python.
If you are using Google Chrome, you can view a website's HTML by right-clicking and selecting "Inspect".
From that HTML, find the id of the element that contains the information you want to extract and type it into the code.
(If there are multiple elements with the same id, only the first one will be stored in elem.)
If you want to store all the elements that share the same id as a list, write it as follows.
elems = browser.find_elements_by_id('id')
The matching elements will be stored in the list in order, so you can access them as elems[0], elems[1], and so on.
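For example, assuming the page has several elements with the (hypothetical) id 'item', the sketch below shows how to check how many were found and read the first one:

elems = browser.find_elements_by_id('item')  # 'item' is a hypothetical id
print(len(elems))      # number of matching elements
print(elems[0].text)   # text of the first matching element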
Getting elements by class name
If you want to get elements by class name instead, it works the same way.
elem = browser.find_element_by_class_name('class')
elems = browser.find_elements_by_class_name('class')
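Note that in recent versions of Selenium 4, the find_element_by_* helpers have been removed. If the code above raises an AttributeError, the equivalent calls use the By class:

from selenium.webdriver.common.by import By

elem = browser.find_element(By.CLASS_NAME, 'class')
elems = browser.find_elements(By.CLASS_NAME, 'class')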
Output element as text
After storing the element in elem as shown above, you can use the text attribute to output its text.
elem.text
To collect the text of every element in order, create an empty list named values and use the append method to add each element's text to it, like this:
values = []
for elem in elems:
    value = elem.text
    values.append(value)
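The same extraction can also be written as a one-line list comprehension:

values = [elem.text for elem in elems]  # same result as the loop above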
How to make a list into a table
import pandas as pd
df = pd.DataFrame()
df['value'] = values
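If you collected more than one list (for example, names and prices scraped in the same way), you can add each as its own column. This is just a sketch with hypothetical variable names:

df = pd.DataFrame()
df['name'] = names      # 'names' is a hypothetical list collected like 'values'
df['price'] = prices    # 'prices' is another hypothetical list
print(df.head())        # preview the first few rows of the table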
Output as CSV or Excel file
df.to_csv('output.csv',index=False)
df.to_excel('output.xlsx',index=False)
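Putting all the steps together, here is a minimal end-to-end sketch. The URL and the id 'content' are placeholders, and note that df.to_excel() requires the openpyxl package to be installed:

from selenium import webdriver
import pandas as pd

# 1. Open the browser
browser = webdriver.Chrome()

# 2. Open the website (placeholder URL)
browser.get('https://example.com')

# 3. Collect the elements you want (placeholder id)
elems = browser.find_elements_by_id('content')

# 4. Extract the text of each element into a list
values = [elem.text for elem in elems]

# 5. Make a table and output it as CSV and Excel files
df = pd.DataFrame()
df['value'] = values
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)  # requires openpyxl

# 6. Close the browser when finished
browser.quit()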
In closing
In this article, we introduced Selenium, which is used for web scraping.
If you are interested, you can take a course on Udemy.
Thank you for reading!