Simple web scraping project using Python and Beautiful Soup


Web scraping a shopping site

Introduction

Web scraping is an automated method of extracting large amounts of data from websites. The data on websites is usually unstructured; web scraping helps collect this unstructured data and store it in a structured form.

In this project I will show you how to scrape data from a Kenyan e-commerce website called Jumia (https://www.jumia.co.ke/). The data we gather can be used for price comparison.

Website Inspection

The aim of this project is to scrape all products, their prices, and their ratings. First, we need to inspect the website. This is done by:

1. Visit this site: https://www.jumia.co.ke/all-products/

2. Right-click and select Inspect, or press Ctrl+Shift+I, to open the browser's developer tools.

3. Move the cursor around until a product is highlighted, then search for the div tag that holds the name, price, and rating of the product.

Write the code
We start by importing the necessary libraries:

from bs4 import BeautifulSoup
import requests

The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us.

jumia = requests.get('https://www.jumia.co.ke/all-products/')
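
Before parsing, it can help to confirm the request actually succeeded. Here is a minimal sketch of such a check (the User-Agent header is an illustrative value, not part of the original code; some sites serve different content to the default requests client):

headers = {'User-Agent': 'Mozilla/5.0'}  # illustrative header, adjust if needed
jumia = requests.get('https://www.jumia.co.ke/all-products/', headers=headers)
jumia.raise_for_status()   # raises an HTTPError for 4xx/5xx responses
print(jumia.status_code)   # 200 means the page was downloaded successfully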

Parsing a page using BeautifulSoup

soup = BeautifulSoup(jumia.content, 'html.parser')
products = soup.find_all('div', class_='info')

Use the find_all method, which will find all the instances of the div tag that has a class called 'info' on the page.
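
A quick, optional sanity check to confirm the selector actually matched something before going further:

print(len(products))                  # number of product cards found on the page
print(products[0].prettify()[:300])   # peek at the first card's markup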

We now extract the name, price, and rating. If you want to find the first instance of a tag, you can use the find method, which returns a single matching element. Here, product refers to one element from products (we will loop over them shortly):

Name = product.find('h3', class_='name').text.replace('\n', '')
Price = product.find('div', class_='prc').text.replace('\n', '')
Rating = product.find('div', class_='stars _s').text.replace('\n', '')

replace() is a built-in string method in Python that returns a copy of the string in which all occurrences of a substring are replaced with another substring.
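
For example (the price string here is made up purely for illustration):

text = 'KSh 1,299\n'
print(text.replace('\n', ''))   # prints 'KSh 1,299' with the newline removed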

We can now loop over all products on the page to extract the name, price and rating.

for product in products:
    Name = product.find('h3', class_='name').text.replace('\n', '')
    Price = product.find('div', class_='prc').text.replace('\n', '')
    Rating = product.find('div', class_='stars _s').text.replace('\n', '')

    info = [Name, Price, Rating]
    print(info)

Note that we are storing all these in a list called info.

Loop over all pages
We have only scraped data from the first page. The site has 50 pages, and when you click on the second page you will notice that the URL changes. To build the URL for each page we do this:

url = "https://www.jumia.co.ke/all-products/" + "?page=" +str(page)+"#catalog-listing"
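
The same URL can also be built with an f-string, which does exactly the same thing with slightly tidier syntax:

page = 2   # example page number
url = f"https://www.jumia.co.ke/all-products/?page={page}#catalog-listing"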

Both versions build the same URL by converting the page number to a string and joining it onto the base URL. The code to loop through all the pages is:

for page in range(1,51):
  url = "https://www.jumia.co.ke/all-products/" + "?page=" +str(page)+"#catalog-listing"
  furl = requests.get(url)
  jsoup = BeautifulSoup(furl.content , 'html.parser')
  products = jsoup.find_all('div' , class_ = 'info')

  for product in products:
      Name = product.find('h3' , class_="name").text.replace('\n', '')
      Price = product.find('div' , class_= "prc").text.replace('\n', '')
      try:
        Rating = product.find('div', class_='stars _s').text.replace('\n', '')
      except AttributeError:  # find() returned None because this product has no rating
        Rating = 'None'

      info = [ Name, Price,Rating]
      print(info)

The range() function goes up to but does not include the last number; since the website has 50 pages, the range goes up to 51.
Some products have no rating element, so find() returns None and calling .text on it raises an AttributeError. We therefore wrap that line in a try/except block and store 'None' for the rating in that case.
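
An alternative to try/except, relying on the same fact that find() returns None when no matching tag exists, is a simple conditional:

rating_tag = product.find('div', class_='stars _s')
Rating = rating_tag.text.replace('\n', '') if rating_tag else 'None'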

Saving to csv

To save the results, we collect each product's name, price, and rating into three lists (names, prices, and ratings) as we loop, then build a pandas DataFrame and write it to a CSV file:

import pandas as pd

df = pd.DataFrame({'Product Name': names, 'Price': prices, 'Rating': ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')
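
To confirm the file was written correctly, you can read it back:

print(pd.read_csv('products.csv').head())   # preview the first five rows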

The whole code

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Lists that will hold the scraped values from every page
names = []
prices = []
ratings = []

for page in range(1, 51):
    url = "https://www.jumia.co.ke/all-products/" + "?page=" + str(page) + "#catalog-listing"
    furl = requests.get(url)
    jsoup = BeautifulSoup(furl.content, 'html.parser')
    products = jsoup.find_all('div', class_='info')

    for product in products:
        Name = product.find('h3', class_='name').text.replace('\n', '')
        Price = product.find('div', class_='prc').text.replace('\n', '')
        try:
            Rating = product.find('div', class_='stars _s').text.replace('\n', '')
        except AttributeError:  # this product has no rating element
            Rating = 'None'

        names.append(Name)
        prices.append(Price)
        ratings.append(Rating)

        info = [Name, Price, Rating]
        print(info)

# Save everything to a CSV file
df = pd.DataFrame({'Product Name': names, 'Price': prices, 'Rating': ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')

Conclusion

This is a simple beginner web scraping project and a small step into data analytics. All the best in your journey.


This content originally appeared on DEV Community and was authored by Betty Kamanthe

