This content originally appeared on DEV Community and was authored by jwc20
In 2019, the International Weightlifting Federation (IWF) changed the bodyweight categories in which athletes competed.
At the time, I was afraid that their official website (iwf.net) would delete results that used the old bodyweight categories, so I decided to scrape the website and create an archive of my own.
The result of this project was a CSV file with more than 50,000 lines of data.
I scraped the website using Python, Scrapy, regular expressions, and Selenium (WebDriver).
Since it was the first programming project I did on my own, I made mistakes both small and big.
One of them was writing a separate script for each specific task without connecting them under a parent script that runs the child scripts in order.
So every time I wanted to test the project, I had to run each Python script by hand, which was very time-consuming.
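In hindsight, a single parent script could have called each step in order. The following is only a minimal sketch of that idea, not the original project's code: `run_pipeline` and the three step functions are hypothetical names standing in for the separate scripts.

```python
from typing import Callable, Iterable, List, Tuple


def run_pipeline(steps: Iterable[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run each named step in order and return the names that completed."""
    completed = []
    for name, step in steps:
        print(f"running {name}")
        step()  # an exception here stops the pipeline at the failing step
        completed.append(name)
    return completed


# Hypothetical stand-ins for the separate scripts; each would hold the
# logic that used to live in its own file.
def collect_links() -> None:
    pass


def download_results() -> None:
    pass


def merge_csv_files() -> None:
    pass


PIPELINE = [
    ("collect_links", collect_links),
    ("download_results", download_results),
    ("merge_csv_files", merge_csv_files),
]
```

Calling `run_pipeline(PIPELINE)` then replaces running every script by hand, and a failure shows which step broke.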
One small mistake that became a significant problem was in how I built the scraper. I used Selenium WebDriver to click the "Export" button provided by the website instead of extracting the data from the page with something like CSS selectors (so this project is honestly a downloader rather than a scraper).
This became a significant issue because after every click the script closed the browser and opened a new one for the next of the hundreds of links I fed it.
Furthermore, I did not set any timeout or delay between opening and closing the browser, so I suspect I caused a sizeable traffic spike that crashed the website.
I checked the website on my phone and my other laptop and confirmed that it was down for 5 minutes.
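What was missing was a pause between requests. Here is a rough sketch of that idea using only the Python standard library rather than Selenium. The function names and the 2-second default delay are my own assumptions for illustration, not values taken from the original scripts or recommended by the website.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download one URL, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(
    urls: Iterable[str],
    delay_seconds: float = 2.0,  # assumed value; tune to what the site tolerates
    fetch_one: Callable[[str], bytes] = fetch,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch URLs one at a time, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch_one(url)
    return results
```

Passing `fetch_one` and `sleep` as parameters is a design choice that lets the loop be tested without touching a real website. Scrapy users would more likely reach for its built-in `DOWNLOAD_DELAY` setting instead.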
Lessons learned
It's pretty obvious what I did wrong in this project. When documentation or tutorials warn you about how to avoid breaking other people's websites, you should listen. But it was a learning experience for me, and also for iwf.net (not iwf.sport), since they have gotten rid of the Export button.
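With the Export button gone, reading the results straight out of the page markup is the remaining option. The sketch below assumes, purely for illustration, that results sit in an ordinary HTML table. I have not checked the site's real markup, and `TableRowParser` and `extract_rows` are hypothetical names. It uses the standard library's `html.parser` as a stand-in for a CSS selector such as `table tr`.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None   # cells of the row being read
        self._cell: Optional[List[str]] = None  # text pieces of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html: str) -> List[List[str]]:
    """Return table rows as lists of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup, not copied from the real site.
SAMPLE = "<table><tr><th>Name</th><th>Total</th></tr><tr><td> Athlete A </td><td>300</td></tr></table>"
```

Here `extract_rows(SAMPLE)` yields one list per table row, which could then be written out with the `csv` module.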
jwc20 | Sciencx (2022-04-08T18:00:03+00:00) Revisiting my old scraping project. Retrieved from https://www.scien.cx/2022/04/08/revisiting-my-old-scraping-project/