Modern Web Scraping With BeautifulSoup and Selenium

HTML is almost intuitive. CSS is a great advancement that cleanly separates the structure of a page from its look and feel. JavaScript adds some pizazz. That’s the theory. The real world is a little different.

In this tutorial, you’ll learn how the content you see in the browser actually gets rendered and how to go about scraping it when necessary. In particular, you’ll learn how to scrape dynamic comments. Our tools will be Python and awesome packages like requests, BeautifulSoup, and Selenium.

When Should You Use Web Scraping?

Web scraping is the practice of automatically fetching the content of web pages designed for interaction with human users, parsing them, and extracting some information (possibly navigating links to other pages). It is sometimes necessary if there is no other way to extract the necessary information. Ideally, the application provides a dedicated API for accessing its data programmatically. There are several reasons web scraping should be your last resort:

  1. It is fragile (the web pages you’re scraping might change frequently).
  2. It might be forbidden (some web apps have policies against scraping).
  3. It might be slow and expensive (if you need to fetch and wade through a lot of noise).
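On the second point, it is worth checking a site's robots.txt before you scrape it. Here is a minimal sketch using Python's standard urllib.robotparser; the URL and user-agent string are placeholders, not from this article:

from urllib import robotparser

# Check whether a given URL may be fetched according to the site's robots.txt.
rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('my-scraper', 'https://example.com/some/page'):
    print('Allowed by robots.txt')
else:
    print('Disallowed by robots.txt')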

Understanding Real-World Web Pages

Let’s understand what we are up against by looking at the output of some common web application code. We’ll use the article Introduction to Vagrant as an example.

To scrape any content on this page, we need to find the HTML elements containing the content on the page first.

View Page Source

Every browser since the dawn of time (the 1990s) has supported the ability to view the HTML of the current page. Here is a snippet from the view source of Introduction to Vagrant that starts with a huge chunk of minified and uglified JavaScript unrelated to the article itself. Here is a small portion of it:

Page Source

Here is some actual HTML from the page:

HTML From the Page

Static Scraping vs. Dynamic Scraping

Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in “view page source”, and then you slice and dice it. If the content you’re looking for is available, you need to go no further. However, if the content is loaded by JavaScript after the page arrives (for example, pulled in from a separate src URL, as with an iframe), you need dynamic scraping.

Dynamic scraping uses an actual browser (or a headless browser) and lets JavaScript do its thing. Then, it queries the DOM to extract the content it’s looking for. Sometimes you need to automate the browser by simulating a user to get the content you need.

Static Scraping With Requests and BeautifulSoup

Let’s see how static scraping works using two awesome Python packages: requests for fetching web pages and BeautifulSoup for parsing HTML pages.

Installing Requests and BeautifulSoup

Install pipenv first, and then: pipenv install requests beautifulsoup4

This will create a virtual environment for you too. If you’re using the code from GitLab, you can just run pipenv install.

Fetching Pages

Fetching a page with requests is a one-liner: r = requests.get(url)

The response object has a lot of attributes. The most important ones are ok and content. If the request fails, then r.ok will be False and r.content will contain the error. The content is a stream of bytes. It is usually better to decode it to UTF-8 when dealing with text:

>>> r = requests.get('https://www.c2.com/no-such-page')
>>> r.ok
False
>>> print(r.content.decode('utf-8'))
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /ggg was not found on this server.</p>
<hr>
<address>
Apache/2.0.52 (CentOS) Server at www.c2.com Port 80
</address>
</body></html>

If everything is OK then r.content will contain the requested web page (same as view page source).
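As a small, hedged refinement of that one-liner (not from the original article), you can check r.ok before parsing, so failures don’t silently produce garbage:

import requests

def fetch_html(url):
    # Return the decoded page HTML, or None if the request failed.
    r = requests.get(url, timeout=10)
    if not r.ok:  # same check described above
        print(f'Request failed with status {r.status_code}')
        return None
    return r.content.decode('utf-8')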

Finding Elements With BeautifulSoup

The get_page() function below fetches a web page by URL, decodes it to UTF-8, and parses it into a BeautifulSoup object using the HTML parser.

import requests
from bs4 import BeautifulSoup

def get_page(url):
    r = requests.get(url)
    content = r.content.decode('utf-8')
    return BeautifulSoup(content, 'html.parser')

Once we have a BeautifulSoup object, we can start extracting information from the page. BeautifulSoup provides many find functions to locate elements inside the page and drill down into deeply nested elements.
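To make the find functions concrete, here is a small self-contained sketch (the HTML snippet is invented for illustration) showing find(), find_all(), and attribute access:

from bs4 import BeautifulSoup

html = '''
<article><a href="/tutorials/first-post">First</a></article>
<article><a href="/tutorials/second-post">Second</a></article>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('article').a.text)              # First
print([a['href'] for a in soup.find_all('a')])  # ['/tutorials/first-post', '/tutorials/second-post']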

Envato Tuts+ author pages contain multiple tutorials. Here is my author page. Each page shows up to 12 tutorials; if an author has more than 12, you can navigate to the next page (see the pagination sketch after the example below). The HTML for each article is enclosed in an <article> tag. The following function finds all the article elements on the page, drills down to their links, and extracts the href attribute to get the URL of the tutorial:

def get_page_articles(page):
    elements = page.findAll('article')
    articles = [e.a.attrs['href'] for e in elements]
    return articles

The following code gets all the articles from my page and prints them (without the common prefix):

page = get_page('https://tutsplus.com/authors/gigi-sayfan')
articles = get_page_articles(page)
prefix = 'https://code.tutsplus.com/tutorials'
for a in articles:
    print(a[len(prefix):])

Output:

docker-from-the-ground-up-building-images--cms-28166
how-to-implement-your-own-data-structure-in-python--cms-28723
professional-error-handling-with-python--cms-25950
write-professional-unit-tests-in-python--cms-25835
how-to-write-package-and-distribute-a-library-in-python--cms-28693
serialization-and-deserialization-of-python-objects-part-1--cms-26183
understand-how-much-memory-your-python-objects-use--cms-25609
fetching-data-in-your-react-application--cms-30670
rest-vs-grpc-battle-of-the-apis--cms-30711
regular-expressions-with-go-part-2--cms-30406
regular-expressions-with-go-part-1--cms-30403
8-things-that-make-jest-the-best-react-testing-framework--cms-30534
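If you wanted every tutorial rather than just the first page, a loop like the following could keep fetching pages until one comes back empty. Note that the ?page= query parameter is an assumption about how the site paginates, not something confirmed by the article:

def get_all_articles(author_url, max_pages=10):
    # Hypothetical pagination loop; assumes the site accepts a ?page=N parameter.
    articles = []
    for n in range(1, max_pages + 1):
        page = get_page(f'{author_url}?page={n}')
        found = get_page_articles(page)
        if not found:
            break
        articles.extend(found)
    return articles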
 

Dynamic Scraping With Selenium

Static scraping was good enough to get the list of articles, but if we need to automate the browser and interact with the DOM, one of the best tools for the job is Selenium.

Selenium is primarily geared towards automated testing of web applications, but it is great as a general-purpose browser automation tool.
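For example, beyond testing, you can script ordinary user actions. The sketch below (the site, field name, and query are placeholders; it assumes Selenium and a Chrome driver are installed, as described next) fills in a search box and submits it:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get('https://example.com')
box = driver.find_element(By.NAME, 'q')  # 'q' is a placeholder field name
box.send_keys('web scraping')
box.send_keys(Keys.RETURN)
print(driver.title)
driver.quit()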

Installing Selenium

Type this command to install Selenium: pipenv install selenium

Choose Your Web Driver

Selenium needs a web driver to control the browser it automates. For web scraping, it usually doesn’t matter which driver you choose. I prefer the Chrome driver. Follow the instructions in this Selenium guide.

Chrome vs. PhantomJS

In some cases you may prefer to use a headless browser, which means no UI is displayed. Theoretically, PhantomJS is just another web driver. But, in practice, people reported incompatibility issues where Selenium works properly with Chrome or Firefox and sometimes fails with PhantomJS. I prefer to remove this variable from the equation and use an actual browser web driver.
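If you do want to avoid a visible browser window, Chrome itself can run headless through Selenium. Here is a minimal sketch; the exact headless flag depends on your Chrome and Selenium versions, so treat it as an assumption and adjust if needed:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # older Chrome versions use plain '--headless'
driver = webdriver.Chrome(options=options)
driver.get('https://example.com')
print(driver.title)  # no window is shown, but the page is fully rendered
driver.quit()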

Scraping Dynamic Comments

Let’s do some dynamic scraping and use Selenium to scrape the comments from a CodeCanyon JavaScript plugin located here. Here are the necessary imports.

from selenium import webdriver

The next step is to create a webdriver instance and get the comments URL.

driver = webdriver.Chrome()
driver.get('https://codecanyon.net/item/whatshelp-whatsapp-help-and-support-plugin-for-javascript/42202303/comments')

Once we get the URL, we have to find the element where the comments are located. To do that, right-click on the web page and choose Inspect, as shown below.

The comments are enclosed in the element with the commentList id. Right-click on that element and copy its XPath, as shown below.

Next, use Selenium to extract the contents of the element containing the comments.

driver = webdriver.Chrome()
driver.get('https://codecanyon.net/item/whatshelp-whatsapp-help-and-support-plugin-for-javascript/42202303/comments')
comments = driver.find_elements('xpath', '//*[@id="content"]/div/div[1]/div[4]')
for comment in comments:
    print(comment.text)

Here is the output:

sunsuk
20 days ago
How to integrate to CI scripts?
3 other replies
ThemeAtelier AUTHOR
20 days ago
Yes. You can add 5 staffs. We have proper documentation how can you add it in your website. Please follow the instruction. You have to call usual js and CSS files then just need to put actual markup in the footer of your site.
sunsuk
20 days ago
How to add wa number? Any admin dashboard ?
ThemeAtelier AUTHOR
20 days ago
No admin. It just a script. So everything is code. You just need to change demo values to your own value.
HTML knowledge is require for installing it.
Deezia12
29 days ago
Hi Author,
Does it work for a php website?
ThemeAtelier AUTHOR
24 days ago
Yes. As It will work in php websites.
gigantium
about 1 month ago
This is support for multiuser or for SAAS function?
ThemeAtelier AUTHOR
about 1 month ago
It’s not a SAAS product to use for multiple users. You can use this in your single website with for giving chat support to your customers with single and multi user agents.
gigantium
about 1 month ago
Thanks for your kind response. Can you provide a demo to access the admin dashboard?
ThemeAtelier AUTHOR
about 1 month ago

There are some scenarios where the browser has to load some dynamic content; this can make the scraping process a bit more complicated. Luckily, Selenium provides wait objects (WebDriverWait) that wait up to a given amount of time for an element to appear before you work with it programmatically.

To use the wait functionality, you first need the following imports:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Then get the URL and apply the wait to the element you’re looking for.

driver = webdriver.Chrome()
url = 'http://www.c2.com/loading-page'
driver.get(url)

element = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((By.ID, "loaded_element"))
)
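Applied to the comments example above, a hedged sketch might wait for the commentList element (the id mentioned earlier) before reading the comment text; the 10-second timeout is arbitrary, and the imports from the previous block are assumed:

driver = webdriver.Chrome()
driver.get('https://codecanyon.net/item/whatshelp-whatsapp-help-and-support-plugin-for-javascript/42202303/comments')

# Wait up to 10 seconds for the comment list to be present, then print its text.
comment_list = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'commentList'))
)
print(comment_list.text)
driver.quit()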
 

Conclusion

Web scraping is a useful practice when the information you need is accessible through a web application that doesn’t provide an appropriate API. It takes some non-trivial work to extract data from modern web applications, but mature and well-designed tools like requests, BeautifulSoup, and Selenium make it worthwhile.

This post has been updated with contributions from Esther Vaati. Esther is a software developer and writer for Envato Tuts+.


This content originally appeared on Envato Tuts+ Tutorials and was authored by Gigi Sayfan
