How I scrape lots of sites with one python script



This content originally appeared on Level Up Coding - Medium and was authored by Mykhailo Kushnir

The power of configurable execution of code.

Have you ever wanted to scrape a website but didn’t want to pay for a scraping tool like Octoparse? Or maybe you only needed to scrape a few pages from a site and didn’t want to go through the hassle of setting up a scraping script. In this blog post, I will show you how I created a tool capable of scraping 90% of websites for free using only Python and a bit of Docker.

Types of data that can be scraped

Most scraping bots are built to scrape tabular data or lists. In terms of markup, tables and lists are essentially the same: a container holds rows, and each row holds cells filled with values. Hence the algorithm of the script:

[Figure: flowchart of the application]

The process of scraping a website

To widen the range of sites I can scrape, I decided to use the old-fashioned combination of Python and Selenium. I do enjoy working with Scrapy, and its configurable design strongly influenced my own parsing script, but I found it limiting on sites with pagination, so I opted for Selenium instead.

For the sake of stability, I also decided to use a dockerized version of chromedriver. It saves me some pain when my local Chrome updates, and it is always there, ready for me, unlike a version installed on your OS, which can be broken by system updates or newly installed software.

Assuming the Docker service is already running on your machine, starting a new container with chromedriver is as easy as running two commands:

$ docker pull selenium/standalone-chrome
$ docker run -d -p 4444:4444 -p 7900:7900 --shm-size="2g" selenium/standalone-chrome

My python script for scraping websites

The core of this post is the code-sharing part. First, I’ll introduce the helper methods:

These two helpers let me switch between the dockerized version of Selenium and a local one when I need to debug something during development.
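The original gist is not reproduced here, so the following is only a minimal sketch of what such a pair of helpers might look like, assuming Selenium 4 and the container started with the docker commands above. The function names and the hub URL path are assumptions made for illustration; only port 4444 comes from the docker command.

```python
def get_remote_driver(hub_url="http://localhost:4444/wd/hub"):
    """Connect to the Chrome running inside the selenium/standalone-chrome container."""
    # Imported lazily so this module can still be loaded where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    return webdriver.Remote(command_executor=hub_url, options=options)


def get_local_driver():
    """Start Chrome on the local machine, handy for watching the browser while debugging."""
    from selenium import webdriver

    return webdriver.Chrome(options=webdriver.ChromeOptions())


def get_driver(use_remote=True):
    """Use the dockerized browser by default and the local one on request."""
    return get_remote_driver() if use_remote else get_local_driver()
```

Whichever driver you get back, call its quit() method when the run is over so the browser session does not linger in the container.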

There’s also a straightforward helper that I use to extract text from HTML elements. In the near future, I plan to add helpers that extract links and images automatically. If there is interest in the subject, I can share an updated version of the script.
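A text-extraction helper of this kind could be as small as the sketch below. It is an illustration rather than the original code: the name extract_text and the default argument are assumptions. It relies on find_elements, which returns an empty list instead of raising when nothing matches, and on the fact that "css selector" is the string behind Selenium's By.CSS_SELECTOR.

```python
CSS_SELECTOR = "css selector"  # same value as selenium.webdriver.common.by.By.CSS_SELECTOR


def extract_text(parent, selector, default=""):
    """Return the stripped text of the first element matching selector under parent.

    parent can be a driver or any element; if nothing matches, default is returned.
    """
    matches = parent.find_elements(CSS_SELECTOR, selector)
    if not matches:
        return default
    return matches[0].text.strip()
```

Returning a default instead of raising keeps one missing cell from aborting a whole run, at the cost of silently producing empty values, so it is worth checking the output for unexpectedly blank columns.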

The essence of this Selenium-based spider is in the gist below. Please read through the comments, and if you have any questions about how it works, let me know in the comments.
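For readers who cannot see the gist, the container, rows and cells idea described earlier can be sketched roughly as follows. This is not the author's spider: scrape_rows and its parameters are hypothetical names, and the sketch does no waiting for dynamically loaded content, which a real run on a JavaScript-heavy site would probably need.

```python
CSS_SELECTOR = "css selector"  # same value as selenium.webdriver.common.by.By.CSS_SELECTOR


def scrape_rows(driver, urls, row_selector, data_selectors):
    """Yield one list of cell texts for every row found on every page.

    driver         -- a Selenium WebDriver (local or remote)
    urls           -- the page URLs to visit, in order
    row_selector   -- CSS selector matching each row (a <tr>, an <li>, a card, ...)
    data_selectors -- CSS selectors, relative to a row, one per output column
    """
    for url in urls:
        driver.get(url)
        for row in driver.find_elements(CSS_SELECTOR, row_selector):
            cells = []
            for selector in data_selectors:
                found = row.find_elements(CSS_SELECTOR, selector)
                # A missing cell becomes an empty string so columns stay aligned.
                cells.append(found[0].text.strip() if found else "")
            yield cells
```

Because the function is a generator, rows can be written to disk as they arrive instead of being held in memory until the last page is done.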

How to use the script to scrape websites

In this part, I’ll demonstrate how to use the script. First, you create a YAML configuration file, and then you run your spider. For example, let us scrape the good old quotes.toscrape.com. An example config for it would look like this:

First of all, notice that $p$ is a placeholder for the page number. Most sites serve paginated content with a noticeable change in the URL, so your task is to identify how the URL changes from page to page and describe that to your spider with this mask.
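To make the mask idea concrete, here is a small sketch of how a $p$ placeholder could be expanded into page URLs. The function name, its parameters and the example URL pattern for quotes.toscrape.com are assumptions for illustration, not taken from the original config.

```python
def build_page_urls(url_mask, first_page, last_page, placeholder="$p$"):
    """Replace the placeholder in url_mask with each page number in the range.

    Both first_page and last_page are included.
    """
    if placeholder not in url_mask:
        raise ValueError(f"URL mask {url_mask!r} does not contain {placeholder!r}")
    return [
        url_mask.replace(placeholder, str(page))
        for page in range(first_page, last_page + 1)
    ]


urls = build_page_urls("https://quotes.toscrape.com/page/$p$/", 1, 3)
# urls[0] is "https://quotes.toscrape.com/page/1/"
```

Failing loudly on a mask without the placeholder avoids the quiet mistake of fetching the same page over and over.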

Be aware that order matters in data_selectors and data_column_titles: the value found by the n-th selector goes into the n-th column. For example, the text of quotes would be parsed from selector “.text” (duh).
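The reason order matters becomes clearer when you look at how rows might end up in the CSV. The sketch below is an assumption about the general approach, not the original code: write_csv is a hypothetical name, and the ".author" selector and the column titles in the example are illustrative.

```python
import csv
import io


def write_csv(rows, data_selectors, data_column_titles, stream):
    """Write a header from data_column_titles, then one CSV line per scraped row.

    Each row holds the values produced by data_selectors, in the same order,
    so the two config lists must have the same length.
    """
    if len(data_selectors) != len(data_column_titles):
        raise ValueError("data_selectors and data_column_titles must have the same length")
    writer = csv.writer(stream)
    writer.writerow(data_column_titles)
    for row in rows:
        if len(row) != len(data_column_titles):
            raise ValueError(f"row has {len(row)} values, expected {len(data_column_titles)}")
        writer.writerow(row)


buffer = io.StringIO()
write_csv([["A quote", "Someone"]], [".text", ".author"], ["text", "author"], buffer)
```

Swapping two entries in only one of the lists would not raise any error here; it would simply put quote text under the author header, which is why the config deserves a second look before a long run.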

After you have your config prepared, you can execute it with:

python -m spider -c "./configs/quotes.yaml" -o "./outputs/quotes/$(date +%Y-%m-%d).csv"

The Bash line above takes the config from the “./configs/quotes.yaml” file and stores the result in a CSV file at “./outputs/quotes/current_date.csv”, where current_date is today’s date in YYYY-MM-DD format.
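On the Python side, the -c and -o flags could be handled with argparse along these lines. Only the two short flags come from the command above; the long option names, the function names and the dated_output_path helper (useful where the shell has no $(date ...) substitution) are assumptions.

```python
import argparse
from datetime import date
from pathlib import Path


def parse_args(argv=None):
    """Parse the config path (-c) and the output CSV path (-o)."""
    parser = argparse.ArgumentParser(prog="spider", description="Config-driven scraper")
    parser.add_argument("-c", "--config", required=True, type=Path,
                        help="path to the YAML config file")
    parser.add_argument("-o", "--output", required=True, type=Path,
                        help="path of the CSV file to write")
    return parser.parse_args(argv)


def dated_output_path(directory, today=None):
    """Build <directory>/<YYYY-MM-DD>.csv, mirroring $(date +%Y-%m-%d) in the shell."""
    today = today or date.today()
    return Path(directory) / f"{today.isoformat()}.csv"
```

Creating the output directory before writing, for example with args.output.parent.mkdir(parents=True, exist_ok=True), spares a failed run on the first execution for a new site.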

Tips on how to improve your scraping process

  • Use proxies

Selenium lets you route traffic through a proxy by passing the proxy address in the options you give to the driver. There is a perfect answer on StackOverflow, so I won’t try to reinvent the wheel.

  • Be gentle with sites you’re parsing

Check out robots.txt and comply. Pause between requests to smooth the load. Use scheduling to run scripts in the evening or whenever you think the site has low incoming traffic.
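The "be gentle" advice can be sketched with nothing but the standard library. The example below checks URLs against robots.txt rules with urllib.robotparser and pauses between the pages it lets through. The function names, the default delay and the injectable sleep argument are assumptions; in a real run you would load the rules from the site's /robots.txt (for instance with RobotFileParser's set_url and read methods) rather than from a hard-coded list.

```python
import time
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt_lines, user_agent="*"):
    """Return a function telling whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return lambda url: parser.can_fetch(user_agent, url)


def polite_urls(urls, is_allowed, delay_seconds=2.0, sleep=time.sleep):
    """Yield only the allowed URLs, pausing delay_seconds between consecutive ones."""
    first = True
    for url in urls:
        if not is_allowed(url):
            continue  # robots.txt says no, so skip the page entirely
        if not first:
            sleep(delay_seconds)
        first = False
        yield url


rules = ["User-agent: *", "Disallow: /private/"]
is_allowed = make_robots_checker(rules)
```

Iterating over polite_urls(page_urls, is_allowed) instead of the raw URL list adds both checks to a scraping loop without changing the loop itself.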

Voice of the crowd

One of the best things about agile scraping bots is that you don’t have to write a new bot for every site you want to parse. You just need one good script that can be tweaked for each site or domain. Think back on all your scraping projects from this year so far — what would you like me to add to my script?







Mykhailo Kushnir | Sciencx (2022-01-28T20:07:55+00:00) How I scrape lots of sites with one python script. Retrieved from https://www.scien.cx/2022/01/28/how-i-scrape-lots-of-sites-with-one-python-script/
