This content originally appeared on DEV Community and was authored by Jonathan Law
Let's start by agreeing that WhatsApp does not have great stickers, but LINE does. For those who have no idea what LINE is, or wonder what is so great about its stickers, check one out here. Apart from that, this project is useful if you work in a field that involves scraping information.
Here is the code at my Github repo
Scraping the web
Most of us know that there are great Python libraries such as BeautifulSoup/bs4 and Selenium, as well as Chrome extensions that can download all the media on a specific page, and I would encourage the use of these popular approaches.
However today, the challenge of this project is to:
- Do it without a library
- Attempt to achieve concurrency/async (Downloading multiple stickers at once without waiting)
- (Optional) Package it so our non-technical friends can use it
What does it mean to download concurrently/async? It means we do not wait for one sticker to finish downloading and converting before moving on to the next one.
Think of a teacher in school, Madam Choo. In a normal (sequential) program, Madam Choo gives one student a task, waits for that student to finish, and only then hands out the second task. In most cases this works fine, since that one student is good at what he does, but Madam Choo realizes that the rest of her class of 12 students is doing nothing all the while. If she instead hands a task to every student in the classroom (threads) and then sits and waits for each of them to hand their work back, the whole job gets done sooner.
Why Golang
The truth is that this can be done in just about any programming language (NodeJS, C++, Python...), so there really should not be any debate about it; this is purely for entertainment and learning purposes :)
However, what makes Golang stand out from the rest is how easily it achieves the 2nd and 3rd points. More importantly, Golang is built so that Madam Choo can easily assign all her students a task at the same time.
Speedrun tutorial
Understanding our target
Before we start, we will need to understand a few things.
- Our sticker shop is fortunately not rendered client-side with JavaScript, so a basic curl will get all the required information without waiting for AJAX calls.
- We could parse the HTML tags to get the stickers we need, but each sticker's raw information sits in a custom HTML attribute called data-preview='{}'. This lets us parse that information as JSON.
data-preview='{ "type" : "popup_sound", "id" : "312149456", "staticUrl" : "https://stickershop.line-scdn.net/stickershop/v1/sticker/312149456/iPhone/sticker@2x.png;compress=true", "fallbackStaticUrl" : "https://stickershop.line-scdn.net/stickershop/v1/sticker/312149456/iPhone/sticker@2x.png;compress=true", "animationUrl" : "", "popupUrl" : "https://stickershop.line-scdn.net/stickershop/v1/sticker/312149456/android/sticker_popup.png;compress=true", "soundUrl" : "https://stickershop.line-scdn.net/stickershop/v1/sticker/312149456/android/sticker_sound.m4a" }'
- Images are in APNG (Animated PNG) format. We will need a GIF version of that.
With that set, let's start
Start
1/ First we create an entry point that takes in a URL
// Entrypoint
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	consoleReader := bufio.NewReader(os.Stdin)
	for {
		fmt.Println("Enter Line Stickershop URL")
		inputUrl, err := consoleReader.ReadString('\n')
		if err != nil {
			log.Fatal(err)
		}
		// Trim the trailing newline ("\n" on Unix, "\r\n" on Windows)
		inputUrl = strings.TrimSpace(inputUrl)
		// Check if input has at least a line store format
		if strings.Contains(inputUrl, "https://store.line.me") {
			if err := scrap(inputUrl); err != nil {
				log.Fatal(err)
			}
		} else {
			fmt.Println("Invalid format")
		}
	}
}
2/ Then we create a scrap function, which uses the built-in net/http package to download the webpage.
resp, err := http.Get(scrapUrl)
if err != nil {
	return err
}
// Close the body once we are done reading it
defer resp.Body.Close()
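As a rough sketch of how the whole fetch step could look, the block below wraps the call in a hypothetical helper named fetchPage (the name, the status check and the returned string are my assumptions, not taken from the repo). It reads the response body into memory so the next step can search it.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchPage is a hypothetical helper: it downloads a page and returns its HTML.
func fetchPage(pageURL string) (string, error) {
	resp, err := http.Get(pageURL)
	if err != nil {
		return "", err
	}
	// Always close the body so the underlying connection can be reused.
	defer resp.Body.Close()

	// Treat anything other than 200 OK as a failure (an added assumption).
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	html, err := fetchPage("https://store.line.me")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println("downloaded", len(html), "bytes")
}
```

Reading the whole page into a string is fine for a single shop page; for very large responses, streaming would be the safer choice.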
3/ Once the download is completed, we parse the body of the downloaded webpage. Remember that the raw information can be found in a custom attribute called data-preview. Without making things too complicated, a regex call can extract each occurrence of that attribute.
var rgx = regexp.MustCompile(`(data-preview='.*?')`)
tmpExtracted := rgx.FindAllStringSubmatch(inputHtml, -1)
for i := 0; i < len(tmpExtracted); i++ {
	// Parse the JSON here
}
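To make the extraction step concrete, here is a small self-contained sketch. It is a variation on the regex above, not the repo's code: the capture group is moved inside the quotes so that only the JSON value is captured. The helper name extractPreviews and the sample markup are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// previewRgx captures the value between the single quotes of data-preview.
var previewRgx = regexp.MustCompile(`data-preview='(.*?)'`)

// extractPreviews is a hypothetical helper returning every data-preview value.
func extractPreviews(html string) []string {
	matches := previewRgx.FindAllStringSubmatch(html, -1)
	values := make([]string, 0, len(matches))
	for _, m := range matches {
		// m[0] is the full match, m[1] is the capture group.
		values = append(values, m[1])
	}
	return values
}

func main() {
	// Made-up markup, for illustration only.
	sample := `<li data-preview='{"id":"1"}'></li><li data-preview='{"id":"2"}'></li>`
	for _, v := range extractPreviews(sample) {
		fmt.Println(v)
	}
}
```

If the real page encodes the quotes inside the attribute as HTML entities, the captured value would likely need to go through html.UnescapeString before it can be parsed as JSON.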
4/ Before parsing the JSON, let us create a struct based on the information we need from the data-preview attribute
type DataPreview struct {
	Id           string `json:"id"`
	StickerType  string `json:"type"`
	PopupUrl     string `json:"popupUrl"`
	StaticUrl    string `json:"staticUrl"`
	AnimationUrl string `json:"animationUrl"`
	SoundUrl     string `json:"soundUrl"`
}
5/ Great, now we just unmarshal the JSON into the DataPreview struct.
6/ Next, we create a function to download the stickers concurrently. With Golang, this can be as easy as a few lines
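The unmarshal step is not shown in the post, so here is a minimal sketch of what it might look like using encoding/json. The helper name parsePreview is hypothetical, and the input is a shortened version of the sample attribute shown earlier.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DataPreview mirrors the struct from step 4.
type DataPreview struct {
	Id           string `json:"id"`
	StickerType  string `json:"type"`
	PopupUrl     string `json:"popupUrl"`
	StaticUrl    string `json:"staticUrl"`
	AnimationUrl string `json:"animationUrl"`
	SoundUrl     string `json:"soundUrl"`
}

// parsePreview is a hypothetical helper that decodes one attribute value.
func parsePreview(raw string) (DataPreview, error) {
	var dp DataPreview
	err := json.Unmarshal([]byte(raw), &dp)
	return dp, err
}

func main() {
	raw := `{ "type" : "popup_sound", "id" : "312149456", "animationUrl" : "" }`
	dp, err := parsePreview(raw)
	if err != nil {
		fmt.Println("invalid JSON:", err)
		return
	}
	fmt.Println(dp.Id, dp.StickerType)
}
```

Fields that the struct does not declare, such as fallbackStaticUrl, are simply ignored by json.Unmarshal, which is why the struct only needs the keys we care about.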
var wg sync.WaitGroup
for i := 0; i < len(result); i++ {
	wg.Add(1)
	go downloadImage(result[i], &wg)
}
WaitGroup basically just tells Madam Choo how many students she has assigned a task to.
7/ We know that there are 3 URLs we can use: PopupUrl for big animated stickers, StaticUrl for stickers that do not move, and AnimationUrl for normal-sized animated stickers. A simple switch rule helps us identify which URL to grab, and then we use the HTTP library again to download the image.
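The post does not show the switch rule itself, so the sketch below is only one plausible version. It assumes that an unused URL is left as an empty string (as animationUrl is in the sample attribute) and prefers the popup, then the animation, then the static image. The function name pickURL and the priority order are my assumptions; the actual repo may switch on the sticker type instead.

```go
package main

import "fmt"

// DataPreview repeats only the fields this sketch needs.
type DataPreview struct {
	PopupUrl     string
	StaticUrl    string
	AnimationUrl string
}

// pickURL is a hypothetical rule: prefer the richest variant that is present.
func pickURL(dp DataPreview) string {
	switch {
	case dp.PopupUrl != "":
		return dp.PopupUrl // big animated sticker
	case dp.AnimationUrl != "":
		return dp.AnimationUrl // normal-sized animated sticker
	default:
		return dp.StaticUrl // sticker that does not move
	}
}

func main() {
	dp := DataPreview{StaticUrl: "static.png", AnimationUrl: "animated.png"}
	fmt.Println(pickURL(dp))
}
```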
8/ After downloading the image, I use APNG2GIF to convert the APNG to a GIF. This is not the most ideal solution, but definitely the easiest.
9/ Before we ask the user for another URL, Madam Choo wants to wait for all the students to complete their work.
We need to add this to the async function (downloadImage) so that each student informs Madam Choo when their work is done:
defer wg.Done()
And Madam Choo will have to wait for everyone's work to be completed, before continuing
wg.Wait()
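Putting Add, Done and Wait together, a self-contained sketch of the pattern could look like this. The function processAll and the stand-in work function are hypothetical; in the real program the work would be the download and conversion of one sticker.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll runs work on every item concurrently and returns the
// results in the same order as the input.
func processAll(items []string, work func(string) string) []string {
	results := make([]string, len(items))
	var wg sync.WaitGroup
	for i, item := range items {
		wg.Add(1) // Madam Choo notes one more student with a task
		go func(i int, item string) {
			defer wg.Done() // the student reports back, even on early return
			results[i] = work(item) // each goroutine writes only its own slot
		}(i, item)
	}
	wg.Wait() // block until every student has handed in their work
	return results
}

func main() {
	stickers := []string{"sticker-1", "sticker-2", "sticker-3"}
	done := processAll(stickers, func(id string) string {
		return id + ".gif" // stand-in for download + convert
	})
	fmt.Println(done)
}
```

Calling wg.Add before starting the goroutine matters: if Add ran inside the goroutine, Wait could return before any task had been counted.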
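10/ And basically we are done! We can quickly package it and send the compiled binary to our friends just by running the command below. Note that go build targets the current operating system by default, so to get a Windows executable, run it on Windows or set GOOS=windows.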
go build main.go
Final
There you go, a quick program to scrape GIFs for your pleasure. Although it may seem like a trivial thing to do, it makes a big difference when you are gathering data for ML training or scraping in bulk. At scale, downloading concurrently noticeably reduces the time it takes to reach our goal.
Thanks for reading up to here, hope you enjoyed this post!
Jonathan Law | Sciencx (2021-05-25T01:01:37+00:00) Scrapping LINE stickers with Golang. Retrieved from https://www.scien.cx/2021/05/25/scrapping-line-stickers-with-golang/