This content originally appeared on DEV Community 👩💻👨💻 and was authored by Mikhail Zub
What will be scraped
Full code
If you don't need an explanation, have a look at the full code example in the online IDE
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteer.use(StealthPlugin());
const searchParams = {
hl: "en", // Parameter defines the language to use for the Google search
gl: "us", // parameter defines the country to use for the Google search
device: "phone", // parameter defines the search device. Options: phone, tablet, tv, chromebook
category: "MOVIE", // you can see the full list of supported categories on https://serpapi.com/google-play-movies-categories
};
const URL = `https://play.google.com/store/movies/category/${searchParams.category}?hl=${searchParams.hl}&gl=${searchParams.gl}&device=${searchParams.device}`;
async function scrollPage(page, scrollContainer) {
let lastHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
while (true) {
await page.evaluate(`window.scrollTo(0, document.querySelector("${scrollContainer}").scrollHeight)`);
await page.waitForTimeout(4000);
let newHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
if (newHeight === lastHeight) {
break;
}
lastHeight = newHeight;
}
}
async function getMoviesFromPage(page) {
const movies = await page.evaluate(() => {
const mainPageInfo = Array.from(document.querySelectorAll("section .oVnAB")).reduce((result, block) => {
const categoryTitle = block.querySelector(".kcen6d").textContent.trim();
const categorySubTitle = block.querySelector(".kMqehf")?.textContent.trim();
const movies = Array.from(block.parentElement.querySelectorAll(".ULeU3b")).map((movie) => {
const link = `https://play.google.com${movie.querySelector(".Si6A0c")?.getAttribute("href")}`;
const movieId = link.slice(link.indexOf("?id=") + 4);
return {
title: movie.querySelector(".Epkrse")?.textContent.trim(),
link,
rating: parseFloat(movie.querySelector(".LrNMN[aria-label]")?.getAttribute("aria-label").slice(6, 9)) || "No rating",
originalPrice: movie.querySelector(".LrNMN .SUZt4c")?.textContent.trim(),
price: movie.querySelector(".LrNMN .VfPpfd")?.textContent.trim(),
thumbnail: movie.querySelector(".TjRVLb img")?.getAttribute("src"),
video: movie.querySelector(".TjRVLb button")?.getAttribute("data-trailer-url") || "No video preview",
movieId,
};
});
return {
...result,
[categoryTitle]: { subtitle: categorySubTitle, movies },
};
}, {});
return mainPageInfo;
});
return movies;
}
async function getMainPageInfo() {
const browser = await puppeteer.launch({
headless: true, // if you want to see what the browser is doing, you need to change this option to "false"
args: ["--no-sandbox", "--disable-setuid-sandbox"],
});
const page = await browser.newPage();
await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);
await page.waitForSelector(".oVnAB");
await scrollPage(page, ".T4LgNb");
const movies = await getMoviesFromPage(page);
await browser.close();
return movies;
}
getMainPageInfo().then((result) => console.dir(result, { depth: null }));
Preparation
First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (or Chrome, or Firefox, but here we work only with Chromium, which is used by default) over the DevTools Protocol in headless or non-headless mode.
To do this, open the command line in the project directory and enter npm init -y, and then npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth.
*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.
📌Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth so that websites are less likely to detect that you are using headless Chromium or a web driver. You can check this on the Chrome headless tests website. The screenshot below shows the difference.
Process
First, we need to scroll through all the movie listings until no more listings load, which is the difficult part described below.
The next step is to extract data from the HTML elements once scrolling has finished. Finding the right CSS selectors is fairly easy with the SelectorGadget Chrome extension, which lets us grab CSS selectors by clicking on the desired element in the browser. However, it does not always work perfectly, especially when the website relies heavily on JavaScript.
We have a dedicated Web Scraping with CSS Selectors blog post at SerpApi if you want to know a little bit more about them.
The GIF below illustrates the approach of selecting different parts of the results using SelectorGadget.
Code explanation
Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to prevent websites from detecting that you are using a web driver:
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
Next, we tell puppeteer to use StealthPlugin, then write the necessary request parameters and the search URL:
puppeteer.use(StealthPlugin());
const searchParams = {
hl: "en", // parameter defines the language to use for the Google search
gl: "us", // parameter defines the country to use for the Google search
device: "phone", // parameter defines the search device. Options: phone, tablet, tv, chromebook
category: "MOVIE", // you can see the full list of supported categories on https://serpapi.com/google-play-movies-categories
};
const URL = `https://play.google.com/store/movies/category/${searchParams.category}?hl=${searchParams.hl}&gl=${searchParams.gl}&device=${searchParams.device}`;
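As a hedged alternative to the template string, the URL can be assembled with the standard URL and URLSearchParams classes, which take care of encoding the query values. The buildCategoryUrl helper below is a hypothetical name, not part of the original code, and the path follows the category URL used in the full code example:

```javascript
// Hypothetical helper: builds the Google Play Movies category URL
// from the same fields as searchParams in the full code example.
function buildCategoryUrl({ category, hl, gl, device }) {
  const url = new URL(`https://play.google.com/store/movies/category/${encodeURIComponent(category)}`);
  // URLSearchParams percent-encodes each query value
  url.search = new URLSearchParams({ hl, gl, device }).toString();
  return url.toString();
}

const exampleUrl = buildCategoryUrl({ category: "MOVIE", hl: "en", gl: "us", device: "phone" });
console.log(exampleUrl);
```

The result can be passed to page.goto() in place of the URL constant.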
Next, we write a function to scroll the page to load all the movies:
async function scrollPage(page, scrollContainer) {
...
}
In this function, we first get the scrollContainer height (using the evaluate() method). Then we use a while loop in which we scroll down the scrollContainer, wait 4 seconds (using the waitForTimeout method), and get the new scrollContainer height.
Next, we check whether newHeight is equal to lastHeight; if it is, we stop the loop. Otherwise, we assign the newHeight value to the lastHeight variable and repeat until the page has been scrolled down to the end:
let lastHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
while (true) {
await page.evaluate(`window.scrollTo(0, document.querySelector("${scrollContainer}").scrollHeight)`);
await page.waitForTimeout(4000);
let newHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
if (newHeight === lastHeight) {
break;
}
lastHeight = newHeight;
}
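The loop above runs until the height stops changing, so a page that keeps growing would keep it running. Here is a minimal sketch of the same idea with an upper bound on the number of rounds. The names scrollUntilStable, delayMs and maxRounds are hypothetical, and a setTimeout-based sleep is swapped in for page.waitForTimeout, which has been deprecated in later Puppeteer versions:

```javascript
// Plain promise-based delay, used instead of page.waitForTimeout
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical variant of scrollPage: scrolls until the container height
// stops changing or maxRounds is reached. Returns the number of rounds performed.
async function scrollUntilStable(page, scrollContainer, { delayMs = 4000, maxRounds = 20 } = {}) {
  const heightExpr = `document.querySelector("${scrollContainer}").scrollHeight`;
  let lastHeight = await page.evaluate(heightExpr);
  for (let round = 1; round <= maxRounds; round++) {
    await page.evaluate(`window.scrollTo(0, ${heightExpr})`);
    await sleep(delayMs); // give the page time to load more listings
    const newHeight = await page.evaluate(heightExpr);
    if (newHeight === lastHeight) return round; // nothing new was loaded
    lastHeight = newHeight;
  }
  return maxRounds; // stopped by the safety limit
}

// Usage inside getMainPageInfo (assumes an open Puppeteer page):
// await scrollUntilStable(page, ".T4LgNb");
```

The function only needs an object with an evaluate() method, so it can be tried against a stub before running it in a real browser.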
Next, we write a function to get movies data from the page:
async function getMoviesFromPage(page) {
...
}
In this function, we get information from the page context and save it in the returned object. First, we get all HTML elements matching the "section .oVnAB" selector (querySelectorAll() method). Then we use the reduce() method (it lets us build the results object) to iterate over the array built with the Array.from() method:
const movies = await page.evaluate(() => {
const mainPageInfo = Array.from(document.querySelectorAll("section .oVnAB")).reduce((result, block) => {
...
}, {});
return mainPageInfo;
});
return movies;
And finally, we need to get categoryTitle and categorySubTitle for each block, and the title, link, rating, originalPrice, price, thumbnail, video and movieId (we can cut it from link using the slice() and indexOf() methods) of each movie from the selected category (using the querySelectorAll(), querySelector(), getAttribute(), textContent and trim() methods).
On each iteration step we return the previous step's result (using spread syntax) and add the new category under the name from the categoryTitle constant:
const categoryTitle = block.querySelector(".kcen6d").textContent.trim();
const categorySubTitle = block.querySelector(".kMqehf")?.textContent.trim();
const movies = Array.from(block.parentElement.querySelectorAll(".ULeU3b")).map((movie) => {
const link = `https://play.google.com${movie.querySelector(".Si6A0c")?.getAttribute("href")}`;
const movieId = link.slice(link.indexOf("?id=") + 4);
return {
title: movie.querySelector(".Epkrse")?.textContent.trim(),
link,
rating: parseFloat(movie.querySelector(".LrNMN[aria-label]")?.getAttribute("aria-label").slice(6, 9)) || "No rating",
originalPrice: movie.querySelector(".LrNMN .SUZt4c")?.textContent.trim(),
price: movie.querySelector(".LrNMN .VfPpfd")?.textContent.trim(),
thumbnail: movie.querySelector(".TjRVLb img")?.getAttribute("src"),
video: movie.querySelector(".TjRVLb button")?.getAttribute("data-trailer-url") || "No video preview",
movieId,
};
});
return {
...result,
[categoryTitle]: { subtitle: categorySubTitle, movies },
};
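The rating and ID extraction above depend on fixed character positions. Below is a slightly more defensive sketch. It assumes the aria-label has a form like "Rated 4.3 stars out of five stars" (inferred from the slice(6, 9) call, not confirmed by the source) and that the ID is carried in the id query parameter. parseRating and extractMovieId are hypothetical helper names:

```javascript
// Hypothetical helper: takes the first number found in the aria-label text.
// The label format is an assumption; adjust the pattern if the page differs.
function parseRating(ariaLabel) {
  const match = /\d+(?:\.\d+)?/.exec(ariaLabel ?? "");
  return match ? parseFloat(match[0]) : "No rating";
}

// Hypothetical helper: reads the "id" query parameter instead of slicing the string,
// so extra parameters after the ID do not end up in the result.
function extractMovieId(link) {
  try {
    return new URL(link).searchParams.get("id");
  } catch {
    return null; // link was not a valid absolute URL
  }
}

console.log(parseRating("Rated 4.3 stars out of five stars"));
console.log(extractMovieId("https://play.google.com/store/movies/details/Sing_2?id=74GR3HZ5fI0.P"));
```

To use these inside page.evaluate(), they would have to be defined within the callback, because that code runs in the browser context.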
Next, write a function to control the browser, and get information:
async function getMainPageInfo() {
...
}
In this function, we first define browser using the puppeteer.launch({options}) method with the current options, such as headless: true and args: ["--no-sandbox", "--disable-setuid-sandbox"].
These options mean that we use headless mode and an array of arguments that allow the browser process to launch in the online IDE. Then we open a new page:
const browser = await puppeteer.launch({
headless: true, // if you want to see what the browser is doing, you need to change this option to "false"
args: ["--no-sandbox", "--disable-setuid-sandbox"],
});
const page = await browser.newPage();
Next, we change the default navigation timeout (30 sec) to 60000 ms (1 min) for slow internet connections with the .setDefaultNavigationTimeout() method, go to URL with the .goto() method and use the .waitForSelector() method to wait until the selector has loaded:
await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);
await page.waitForSelector(".oVnAB");
And finally, we wait until the page has been scrolled, save the movie data from the page in the movies constant, close the browser, and return the received data:
await scrollPage(page, ".T4LgNb");
const movies = await getMoviesFromPage(page);
await browser.close();
return movies;
Now we can launch our parser:
$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file
Output
{
"Popular family films":{
"subtitle":"Perfect for movie night",
"movies":[
{
"title":"Sing 2",
"link":"https://play.google.com/store/movies/details/Sing_2?id=74GR3HZ5fI0.P",
"rating":4.3,
"originalPrice":"$5.99",
"price":"$3.99",
"thumbnail":"https://play-lh.googleusercontent.com/Z94mZzSVqG975oT1dQ7h1Adiql0wAywGbfatetwyv1Bw08KG_CGAzOFAzZ73roku4WGbGWN4SuplfOjNJXc=s256-rw",
"video":"https://play.google.com/video/lava/web/player/yt:movie:j7MgT6LWNEE.P?autoplay=1&embed=play",
"movieId":"74GR3HZ5fI0.P"
},
... and other results
]
},
"Pre-orders":{
"movies":[
{
"title":"Woman King, The",
"link":"https://play.google.com/store/movies/details/Woman_King_The?id=dYKWSdXf6rw.P",
"rating":3.2,
"price":"$19.99",
"thumbnail":"https://play-lh.googleusercontent.com/JGjCj3XixIQg2zXnAbUbSpBFvdp36YyG1couJnUNB9R_YC3I54Dp_iulAID3J_BDUULDbLHZzb8I954yNg=s256-rw",
"video":"https://play.google.com/video/lava/web/player/yt:movie:OfC5HTg2P4E.P?autoplay=1&embed=play",
"movieId":"dYKWSdXf6rw.P"
},
... and other results
]
},
... and other categories
}
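For further processing (for example, exporting to a table), the nested result can be flattened into one row per movie. This is a minimal sketch that assumes the result shape shown above. flattenMovies is a hypothetical helper, not part of the original code, and the sample object is a trimmed copy of the output:

```javascript
// Hypothetical helper: turns { [category]: { subtitle, movies } } into a flat array
// where every movie carries its category name and subtitle.
function flattenMovies(result) {
  return Object.entries(result).flatMap(([category, { subtitle, movies }]) =>
    movies.map((movie) => ({ category, subtitle, ...movie }))
  );
}

// Trimmed sample in the same shape as the output above
const sample = {
  "Popular family films": {
    subtitle: "Perfect for movie night",
    movies: [{ title: "Sing 2", rating: 4.3, price: "$3.99", movieId: "74GR3HZ5fI0.P" }],
  },
  "Pre-orders": {
    movies: [{ title: "Woman King, The", rating: 3.2, price: "$19.99", movieId: "dYKWSdXf6rw.P" }],
  },
};

const rows = flattenMovies(sample);
console.log(rows.map((row) => `${row.category}: ${row.title}`));
```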
Using Google Play Movies Store API from SerpApi
This section compares the DIY solution with our solution.
The biggest difference is that you don't need to create the parser from scratch and maintain it.
There's also a chance that the request might be blocked by Google at some point. We handle that on our backend, so there's no need to figure out how to do it yourself or which CAPTCHA-solving or proxy provider to use.
First, we need to install google-search-results-nodejs:
npm i google-search-results-nodejs
Here's the full code example, if you don't need an explanation:
const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); //your API key from serpapi.com
const params = {
engine: "google_play", // search engine
gl: "us", // parameter defines the country to use for the Google search
hl: "en", // parameter defines the language to use for the Google search
store: "movies", // parameter defines the type of Google Play store
store_device: "phone", // parameter defines the search device. Options: phone, tablet, tv, chromebook, watch, car
movies_category: "MOVIE", // you can see the full list of supported categories on https://serpapi.com/google-play-movies-categories
};
const getJson = () => {
return new Promise((resolve) => {
search.json(params, resolve);
});
};
const getResults = async () => {
const json = await getJson();
const moviesResults = json.organic_results.reduce((result, category) => {
const { title: categoryTitle, subtitle, items } = category;
const movies = items.map((movie) => {
const { title, link, rating = "No rating", original_price, price, video = "No video preview", thumbnail, product_id } = movie;
const returnedMovie = {
title,
link,
rating,
price,
thumbnail,
video,
movieId: product_id,
};
if (original_price) returnedMovie.originalPrice = original_price;
return returnedMovie;
});
return {
...result,
[categoryTitle]: { subtitle, movies },
};
}, {});
return moviesResults;
};
getResults().then((result) => console.dir(result, { depth: null }));
Code explanation
First, we need to declare SerpApi from the google-search-results-nodejs library and define a new search instance with your API key from SerpApi:
const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); // your API key from serpapi.com
Next, we write the necessary parameters for making a request:
const params = {
engine: "google_play", // search engine
gl: "us", // parameter defines the country to use for the Google search
hl: "en", // parameter defines the language to use for the Google search
store: "movies", // parameter defines the type of Google Play store
store_device: "phone", // parameter defines the search device. Options: phone, tablet, tv, chromebook, watch, car
movies_category: "MOVIE", // you can see the full list of supported categories on https://serpapi.com/google-play-movies-categories
};
Next, we wrap the search method from the SerpApi library in a promise to further work with the search results:
const getJson = () => {
return new Promise((resolve) => {
search.json(params, resolve);
});
};
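The wrapper above never rejects, so an await on it would hang if the callback were never called. Here is a hedged sketch that adds a timeout. getJsonWithTimeout and timeoutMs are hypothetical names, and the only library behavior assumed is the search.json(params, callback) call used above:

```javascript
// Hypothetical variant of getJson: rejects if no response arrives within timeoutMs.
// "search" is any object with a json(params, callback) method, such as the
// SerpApi.GoogleSearch instance created earlier.
function getJsonWithTimeout(search, params, timeoutMs = 30000) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`No response after ${timeoutMs} ms`)),
      timeoutMs
    );
    search.json(params, (json) => {
      clearTimeout(timer); // response arrived in time
      resolve(json);
    });
  });
}

// Usage (assumes the search and params constants from the code above):
// const json = await getJsonWithTimeout(search, params, 60000);
```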
And finally, we declare the getResults function that gets the data and returns it:
const getResults = async () => {
...
};
In this function, we first get json with the results, then we need to iterate over the organic_results array in the received json. To do this, we use the reduce() method (it lets us build the results object). On each iteration step we return the previous step's result (using spread syntax) and add the new category under the name from the categoryTitle constant:
const json = await getJson();
const moviesResults = json.organic_results.reduce((result, category) => {
...
return {
...result,
[categoryTitle]: { subtitle, movies },
};
}, {});
return moviesResults;
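The same category-keyed object can also be built without copying the accumulator on every step. This is a sketch using Object.fromEntries on plain data. groupByCategory and mapItem are hypothetical names, and the input mirrors the title, subtitle and items fields destructured in the code:

```javascript
// Hypothetical alternative to the reduce() call: builds [key, value] pairs
// and converts them to an object in one step.
function groupByCategory(categories, mapItem = (item) => item) {
  return Object.fromEntries(
    categories.map(({ title, subtitle, items }) => [title, { subtitle, movies: items.map(mapItem) }])
  );
}

// Hand-written input in the shape of organic_results (values are illustrative only)
const grouped = groupByCategory(
  [
    { title: "New to Rent", subtitle: "Watch within 30 days of rental", items: [{ title: "Top Gun: Maverick" }] },
    { title: "Deals on movie purchases", items: [{ title: "Alita: Battle Angel" }] },
  ],
  (item) => ({ title: item.title })
);
console.log(Object.keys(grouped));
```

As with the reduce() version, a later category with the same title overwrites an earlier one.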
Next, we destructure the category element, rename title to the categoryTitle constant, and iterate over the items array to get all movies from this category. To do this, we destructure the movie element, set the default value "No video preview" for video (because not all movies have a video preview) and "No rating" for rating, and return these constants:
const { title: categoryTitle, subtitle, items } = category;
const movies = items.map((movie) => {
const { title, link, rating = "No rating", original_price, price, video = "No video preview", thumbnail, product_id } = movie;
const returnedMovie = {
title,
link,
rating,
price,
thumbnail,
video,
movieId: product_id,
};
if (original_price) returnedMovie.originalPrice = original_price;
return returnedMovie;
});
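Destructuring defaults apply only when a field is undefined, not when it is null. The standalone sketch below runs the same mapping on hand-written sample objects (the sample data is made up for illustration) to show which fallbacks apply. toMovie is a hypothetical name for the callback body above:

```javascript
// Same mapping as the items.map() callback, extracted into a named function
function toMovie(movie) {
  const { title, link, rating = "No rating", original_price, price, video = "No video preview", thumbnail, product_id } = movie;
  const returnedMovie = { title, link, rating, price, thumbnail, video, movieId: product_id };
  // originalPrice is added only when the API returned a discounted item
  if (original_price) returnedMovie.originalPrice = original_price;
  return returnedMovie;
}

// Missing fields fall back to the defaults
const withDefaults = toMovie({ title: "Example", product_id: "abc" });
// An explicit null is kept as-is, because defaults only replace undefined
const withNull = toMovie({ title: "Example", rating: null, product_id: "abc" });

console.log(withDefaults.rating, withDefaults.video, withNull.rating);
```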
After that, we run the getResults function and print all the received information in the console with the console.dir method, which accepts an options object for changing the default output options:
getResults().then((result) => console.dir(result, { depth: null }));
Output
{
"New to Rent":{
"subtitle":"Watch within 30 days of rental",
"movies":[
{
"title":"Top Gun: Maverick",
"link":"https://play.google.com/store/movies/details/Top_Gun_Maverick?id=PnS5p3AmpRE.P",
"rating":4.8,
"price":"$4.99",
"thumbnail":"https://play-lh.googleusercontent.com/UJHa0DJftoFAt7rj1M8w7OmVoPxcFoRJAAqV2hbbz8QI-p5xHTxbjidNKM7gE-jxKzDfCuCfIJ7VBxQIcQ=s256-rw",
"video":"https://play.google.com/video/lava/web/player/yt:movie:q8CxTfNkwyA.P?autoplay=1&embed=play",
"movieId":"PnS5p3AmpRE.P"
},
... and other results
]
},
"Deals on movie purchases":{
"subtitle":"undefined",
"movies":[
{
"title":"Alita: Battle Angel",
"link":"https://play.google.com/store/movies/details/Alita_Battle_Angel?id=jwlu7jkYI1A",
"rating":4.6,
"price":"$7.99",
"thumbnail":"https://play-lh.googleusercontent.com/cpvUwWnYh5wcz2MVQE2tTJFW8j3nBTzmPvt8QOiE7E8PIe8JEgRs4OymeJbUMg5yPUU=s256-rw",
"video":"https://play.google.com/video/lava/web/player/yt:movie:Yy-NE9hRt20?autoplay=1&embed=play",
"movieId":"jwlu7jkYI1A",
"originalPrice":"$19.99"
},
... and other results
]
},
... and other categories
}
Links
If you want other functionality added to this blog post (e.g. extracting additional categories) or if you want to see some projects made with SerpApi, write me a message.
Add a Feature Request💫 or a Bug🐞
Mikhail Zub | Sciencx (2022-10-07T15:10:45+00:00) Web scraping Google Play Movies & TV with Nodejs. Retrieved from https://www.scien.cx/2022/10/07/web-scraping-google-play-movies-tv-with-nodejs/