This content originally appeared on Level Up Coding - Medium and was authored by Ebuka (Gaus Octavio)
Web scraping is a powerful technique for collecting data from websites. Scrapy is a Python framework built for web scraping; BeautifulSoup4 is a popular alternative.
In this article, I will show you how to use Scrapy to extract the available information about data science books from Amazon.com. The article walks through the implementation step by step.
The Process:
- Installing Scrapy
- Setting up a Scrapy project
- Creating a Spider
- Extracting and saving the information needed (i.e., data science books)
The First Process:
Before you can use Scrapy, you need to have Python installed on your computer. Scrapy is then installed by executing this command:
pip install scrapy
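To confirm the installation succeeded, you can print the installed version (this check is my own suggestion, not part of the original steps):
scrapy version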
The Second Process:
Create a new Scrapy project by executing this command:
scrapy startproject data_books
A new directory called “data_books” will be created with the required files for this Scrapy project.
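For reference, startproject generates the standard Scrapy scaffolding, which looks like this:

data_books/
    scrapy.cfg            # deploy configuration file
    data_books/           # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spiders go here
            __init__.py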
The Third Process:
Create a Scrapy Spider. A Scrapy spider is a class that tells Scrapy how to navigate a website and which information to extract.
In the data_books/spiders directory, create a Python file called book_bot.py (you can name the file whatever you like).
You need to define a Python class:
import scrapy

class BookBotSpider(scrapy.Spider):
    name = "data_books"
    start_urls = ['https://www.amazon.com/s?k=data+science+books']
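Before crawling a large site like Amazon, it is also worth reviewing a few options in data_books/settings.py. The values below are illustrative suggestions, not part of the original article:

# data_books/settings.py
USER_AGENT = 'data_books (+https://example.com)'  # identify your crawler (placeholder URL)
ROBOTSTXT_OBEY = True   # respect the site's robots.txt
DOWNLOAD_DELAY = 1      # pause between requests to avoid hammering the server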
The Fourth Process:
Create a parse function. Scrapy calls the parse method with the downloaded response for each URL in the start_urls list; it is where you navigate through the page and extract the information you need.
Here's an example of how to extract the title, price, and rating of each data science book:
import scrapy

class BookBotSpider(scrapy.Spider):
    name = "data_books"
    start_urls = ['https://www.amazon.com/s?k=data+science+books']

    def parse(self, response):
        # Each search result card in Amazon's result grid
        for book in response.css('div.sg-col-4-of-12'):
            yield {
                'title': book.css('span.a-size-medium::text').get(),
                'price': book.css('span.a-price-whole::text').get(),
                'rating': book.css('span.a-icon-alt::text').get(),
            }
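The search results span several pages. If you want more than the first page, you can follow the pagination link at the end of parse; the selector below is an assumption about Amazon's current markup and may need adjusting:

        # Follow the "Next" link, if present (the class name is an assumption)
        next_page = response.css('a.s-pagination-next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)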
Final Process:
Scrapy has built-in exporters for formats such as JSON, CSV, and XML, so the extracted data can be saved and exported without any extra code.
To save the extracted data as a CSV file, use this syntax:
scrapy crawl data_books -o databooks.csv
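The output format is inferred from the file extension, so the same command covers the other built-in formats. In recent Scrapy versions, -O (capital) overwrites the output file instead of appending to it:

scrapy crawl data_books -o databooks.json
scrapy crawl data_books -O databooks.xml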
If you intend to save the extracted data to a database or to cloud storage, you can write an item pipeline, or build a custom exporter on top of the classes in the scrapy.exporters module.
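As a minimal sketch of the pipeline approach, here is one that writes each scraped book to a local SQLite database; the file name, table name, and class name are my own illustrative choices:

# data_books/pipelines.py
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the database connection
        self.conn = sqlite3.connect('databooks.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS books (title TEXT, price TEXT, rating TEXT)'
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields
        self.conn.execute(
            'INSERT INTO books VALUES (?, ?, ?)',
            (item.get('title'), item.get('price'), item.get('rating')),
        )
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: commit and clean up
        self.conn.commit()
        self.conn.close()

Enable the pipeline by registering it in settings.py:

ITEM_PIPELINES = {'data_books.pipelines.SQLitePipeline': 300}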
Web scraping is a powerful tool for extracting data.
Scrapy is efficient at extracting information from websites and saving it in various formats.
By applying the steps I outlined in this article, you can:
- Create a Scrapy spider.
- Navigate a website.
- Extract the information you need using Scrapy's built-in functions.
Web Scraping is useful for collecting required data for data science projects such as:
- Analyzing book prices.
- Analyzing customer reviews.
In conclusion, Scrapy is a great tool to have in your data science toolkit.
Connect with me on Twitter, LinkedIn, and GitHub.
How To Build Your Data Science Books Reading List Using Scrapy was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.