This content originally appeared on Level Up Coding - Medium and was authored by Mykhailo Kushnir
Disclaimer: “TrickyCases” is a series of posts with rather short code snippets, useful in day-to-day ML practice. Here you may find something that you would otherwise be searching Stack Overflow for a few days from now.
Recently Medium added a new feature that allows you to distribute your bookmarked posts across several reading lists, whereas previously you had only two: the main list and the archive. I’m organized, and I like to keep things in their place, but I’m also a good example of fear of missing out (FOMO). Right now, I have 612 bookmarked posts in my main list and pretty much no idea how to segment them manually, so I’ve decided to apply my coding skills and a bit of machine learning.
I'll show you how to do it in this and the next few posts of the TrickyCases series.
The high-level algorithm would look like this:
- Manually get a sign-in link from the main page of the site. For that, you’ll have to open medium.com in incognito mode and sign in with your email. Then, store the link you’ll get in the message from Medium, and don’t use it yet.
- Set up a Selenium hub. I’m not going to walk through this in detail because there are many online tutorials about it. I prefer to do it through Docker because I only have to pull a Docker image like standalone-chrome, launch it, and connect to it in my code.
- The coding part of the algorithm starts here. First, we’ll use the login link to obtain a session. Then, we’ll store the session into a cookies file so it can be reused later. Bear in mind that cookies have an expiration date, so you won't be able to use them endlessly.
- The next step is to move to the reading list page and scroll down to load the desired number of posts. Medium shows your list in chunks of 20 posts, so you can decide how many posts you’d like to scrape and then divide this number by 20, rounding up, to get the number of needed scrolls.
- The final part is to scrape post titles. At this point, I believe that titles alone should be enough to segment posts into thematic buckets, but I’ll learn with you along the way.
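The scroll arithmetic from the steps above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the original code: the names `scrolls_needed` and `scroll_reading_list` and the pause length are hypothetical, and it assumes that each scroll to the bottom of the page loads exactly one more chunk of 20 posts. `driver` is expected to be a Selenium WebDriver that is already on the reading list page.

```python
import math
import time

CHUNK_SIZE = 20  # posts Medium loads per chunk, as observed in this post


def scrolls_needed(posts_wanted, chunk_size=CHUNK_SIZE):
    """Return how many chunks (and so scrolls) cover `posts_wanted` posts."""
    if posts_wanted <= 0:
        return 0
    # Round up: 41 posts need 3 chunks, not 2.
    return math.ceil(posts_wanted / chunk_size)


def scroll_reading_list(driver, posts_wanted, pause_seconds=2.0):
    """Scroll to the bottom of the page once per needed chunk.

    Assumption: every scroll triggers loading of one more chunk, and
    `pause_seconds` is long enough for that chunk to arrive.
    """
    count = scrolls_needed(posts_wanted)
    for _ in range(count):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)
    return count
```

For the 612 bookmarks mentioned earlier, `scrolls_needed(612)` evaluates to 31. A fixed pause is the simplest option; waiting until the number of loaded posts grows would be more robust but needs knowledge of the page structure.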
Potential caveats:
- It seems that titles are stored in elements with the “.kb” class, but I’m not sure if that’s persistent over time. The script may have to be changed after Medium’s next update.
- I’m not sure how Medium feels about scraping this part of the site. I’ll talk to the site owners and update you on that.
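One way to soften the first caveat is to keep the selector in a single place and fail loudly when it stops matching. The following is a minimal sketch, assuming a Selenium WebDriver that is already on the scrolled reading list page. `TITLE_SELECTOR` and `scrape_titles` are hypothetical names, and “.kb” is only the class observed at the time of writing.

```python
TITLE_SELECTOR = ".kb"  # observed class; may change between Medium releases


def scrape_titles(driver, selector=TITLE_SELECTOR):
    """Return the text of every element matching `selector`, in page order."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR, used
    # here so the sketch has no import-time dependency on selenium.
    elements = driver.find_elements("css selector", selector)
    if not elements:
        raise RuntimeError(
            f"No elements matched {selector!r}; the class name may have changed."
        )
    # Empty titles are kept so that positions still line up with the page.
    return [element.text.strip() for element in elements]
```

If the class name changes, only `TITLE_SELECTOR` has to be updated, and the error makes the breakage visible instead of silently returning an empty list.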
Code:
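As a rough illustration of steps two and three, and not the author's original listing, connecting to the hub and opening the sign-in link might look like the sketch below. The hub address `http://localhost:4444/wd/hub` assumes a standalone-chrome container running locally with its default port published. `connect_to_hub` and `sign_in` are hypothetical names, and the sketch assumes the sign-in redirect has finished by the time `get()` returns.

```python
def connect_to_hub(hub_url="http://localhost:4444/wd/hub"):
    """Open a Chrome session on a remote Selenium hub (assumed address)."""
    # Imported lazily so the rest of the sketch can be read and tested
    # without selenium installed.
    from selenium import webdriver

    return webdriver.Remote(
        command_executor=hub_url,
        options=webdriver.ChromeOptions(),
    )


def sign_in(driver, login_link):
    """Visit the one-time sign-in link and return the session cookies."""
    driver.get(login_link)
    # get_cookies() returns a list of dicts. No Medium cookie names are
    # assumed here; the whole list is handed back to the caller.
    return driver.get_cookies()
```

A typical call would be `driver = connect_to_hub()` followed by `cookies = sign_in(driver, login_link)`, where `login_link` is the link saved from the email.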
Listing of “helpers” files:
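As a hedged stand-in rather than the author's own helpers, storing and reusing the session cookies from step three could be sketched like this. `save_cookies` and `load_cookies` are hypothetical names and the JSON file format is an assumption. The sketch relies only on Selenium's `get_cookies()` and `add_cookie()` driver methods.

```python
import json
import time
from pathlib import Path


def save_cookies(driver, path):
    """Write the current session's cookies to `path` as JSON."""
    Path(path).write_text(json.dumps(driver.get_cookies()))


def load_cookies(driver, path, now=None):
    """Add stored, unexpired cookies to `driver`; return how many were added.

    Assumption: the driver is already on a medium.com page, since Selenium
    only accepts cookies for the domain that is currently loaded.
    """
    now = time.time() if now is None else now
    added = 0
    for cookie in json.loads(Path(path).read_text()):
        # "expiry" is seconds since the epoch; session cookies have none.
        expiry = cookie.get("expiry")
        if expiry is not None and expiry <= now:
            continue  # expired: skip rather than send a stale cookie
        driver.add_cookie(cookie)
        added += 1
    return added
```

If `load_cookies` returns 0, the stored session has most likely expired and a fresh sign-in link is needed.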
My results:
    Title
0   35 Actionable Tips to Grow Your Medium Blog
1   Most hyped stocks on Reddit
2   How to Build an EDA App in Python
3   How to Scrape Tweets Without Twitter’s API Usi...
4   A Gentle Intro to Time Series Forecasting for ...
..  ...
95  How to deploy your Neural Network Model using ...
96  BERT: Multilabel Text Classification
97  Fine-tuning a BERT model with transformers
98
99  How to implement EWMA plots using Python?
I hope you’ll be able to reproduce this code on your profile. Feel free to ask questions about it and reach out if you need help.
In the next TrickyCases part, I’ll show you how to separate this list of titles into segments, and then we’ll try to assign each post to its corresponding reading list automatically.
TrickyCases #5. Scrape Medium reading list was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.
Mykhailo Kushnir | Sciencx (2021-09-06T17:06:39+00:00) TrickyCases #5. Scrape Medium reading list. Retrieved from https://www.scien.cx/2021/09/06/trickycases-5-scrape-medium-reading-list/