This content originally appeared on DEV Community and was authored by Dimitry Zub
Contents: intro, imports, what will be scraped, process, code, code with pagination, links, outro.
Intro
This blog post is a continuation of Google's web scraping series. Here you'll see how to scrape Google News Results using Python with beautifulsoup
, requests
, lxml
libraries. An alternative API solution will be shown.
Imports
import requests, lxml
from bs4 import BeautifulSoup
from serpapi import GoogleSearch
What will be scraped
Process
Selecting container, title, link, source, snippet, published time.
Code
import requests, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "gta san andreas",
"hl": "en",
"tbm": "nws",
}
response = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')
for result in soup.select('.dbsr'):
title = result.select_one('.nDgy9d').text
link = result.a['href']
source = result.select_one('.WF4CUc').text
snippet = result.select_one('.Y3v8qd').text
date_published = result.select_one('.WG9SHc span').text
print(f'{title}\n{link}\n{snippet}\n{date_published}\n{source}\n')
-------------
'''
San Andreas: Cesar & Kendl Is Grand Theft Auto's Best Relationship
https://screenrant.com/gta-san-andreas-cesar-kendl-best-relationship-why/
Many Grand Theft Auto relationships are negative or purely transactional.
That makes Cesar and Kendl's genuine love for each other stand out.
4 hours ago
Screen Rant
'''
Code with pagination
from bs4 import BeautifulSoup
import requests, urllib.parse, lxml
def paginate(url, previous_url=None):
# Break from infinite recursion
if url == previous_url: return
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"
"Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, 'lxml')
# First page
yield soup
next_page_node = soup.select_one('a#pnnext')
# Stop when there is no next page
if next_page_node is None: return
next_page_url = urllib.parse.urljoin('https://www.google.com/', next_page_node['href'])
# Pages after the first one
yield from paginate(next_page_url, url)
def scrape():
pages = paginate("https://www.google.com/search?hl=en-US&q=gta san andreas&tbm=nws")
for soup in pages:
print(f'Current page: {int(soup.select_one(".YyVfkd").text)}\n')
for result in soup.select('.dbsr'):
title = result.select_one('.nDgy9d').text
link = result.a['href']
source = result.select_one('.WF4CUc').text
snippet = result.select_one('.Y3v8qd').text
date_published = result.select_one('.WG9SHc span').text
print(f'{title}\n{link}\n{snippet}\n{date_published}\n{source}\n')
scrape()
-------------------
'''
Current page: 1
San Andreas: Cesar & Kendl Is Grand Theft Auto's Best Relationship
https://screenrant.com/gta-san-andreas-cesar-kendl-best-relationship-why/
Many Grand Theft Auto relationships are negative or purely transactional.
That makes Cesar and Kendl's genuine love for each other stand out.
4 hours ago
Screen Rant
...
Current page: 8
Il recréé des covers d'album sur "GTA : San Andreas" et c'est ...
https://intrld.com/gtpmagazine-le-magazine-parodique-du-jeu-gta/
On a trouvé LE compte parodique à suivre : Grand Theft Parody, le magazine
qui revisite des pochettes mythiques sur GTA : San Andreas.
2 weeks ago
Interlude
'''
Using Google News Result API
SerpApi is a paid API with a free trial that soon will be replaced with a free plan.
The main differences as I usually write in this blog post is that it's much faster and straightforward process rather than tinkering CSS
, XPath
selectors or dealing with Javascript-driven websites e.g. Google Maps which SerpApi scrapes like a charm.
If you want to get things quickly and write code faster, and don't want to maintain the parser then, I believe that API solution is a way to go.
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "gta san andreas",
"gl": "us",
"tbm": "nws"
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['news_results']:
print(result)
-----------------------
'''
{'position': 1, 'link': 'https://www.sportskeeda.com/gta/5-strange-gta-san-andreas-glitches', 'title': '5 strange GTA San Andreas glitches', 'source': 'Sportskeeda', 'date': '9 hours ago', 'snippet': 'GTA San Andreas has a wide assortment of interesting and strange glitches.', 'thumbnail': 'https://serpapi.com/searches/60e71e1f8b7ed2dfbde7629b/images/1394ee64917c752bdbe711e1e56e90b20906b4761045c01a2cefb327f91d40bb.jpeg'}
'''
Google News Results API with Pagination
# https://github.com/serpapi/google-search-results-python
from serpapi import GoogleSearch
import os
def scrape():
params = {
"engine": "google",
"q": "coca cola",
"tbm": "nws",
"api_key": "YOUR_API_KEY",
}
search = GoogleSearch(params)
pages = search.pagination()
for result in pages:
print(f"Current page: {result['serpapi_pagination']['current']}")
for news_result in result["news_results"]:
print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
scrape()
------------------------
'''
Current page: 1
Title: 5 strange GTA San Andreas glitches
Link: https://www.sportskeeda.com/gta/5-strange-gta-san-andreas-glitches
...
Current page: 14
Title: Ambitious Grand Theft Auto: San Andreas Mod Turns It Into A Spider-Man Game
Link: https://gamerant.com/grand-theft-auto-san-andreas-spider-man-game-mod/
...
'''
Links
Code in the online IDE • Google News Result API
Outro
If you want to see how to scrape something using Python/Ruby that I didn't write about yet or you want to see some project made with SerpApi, please write me a message.
Yours, D
This content originally appeared on DEV Community and was authored by Dimitry Zub
Dimitry Zub | Sciencx (2021-07-09T13:15:26+00:00) Scrape Google News with Python. Retrieved from https://www.scien.cx/2021/07/09/scrape-google-news-with-python/
Please log in to upload a file.
There are no updates yet.
Click the Upload button above to add an update.