How to Scrape News Content from Popular News Sites?
Staying informed about a specific business domain is essential for any business that wants to keep pace with its competitors. News is one of the best ways to learn what’s happening worldwide, and for data engineers in particular, news articles are a rich source of data: more data means more insights. However, collecting and reading enough news to stay informed is challenging and time-consuming, and gathering the data manually is impractical. Scraping news sites is therefore the fast route to knowledge of the news industry, and it plays a vital role in getting essential business updates in a short time.
This article explains everything you need to know about news scraping and how to scrape news content quickly and effectively.
What is News Scraping?
News scraping is the automated extraction of news content, such as press releases, articles, and updates, from publicly available news websites. Because these sites contain a wealth of meaningful public data, reviews of newly launched products, and key business announcements, the data they hold can contribute to any business’s success.
Benefits of Scraping News Sites
News aggregation helps you collect important content to attract and grow your target audience and turn your platform into a go-to news outlet. Rather than competing with other brands and sites, a news aggregator gives them additional exposure.
There are several benefits of scraping news. A few of them are listed below:
- It provides updated information about businesses and more.
- Boosts compliance and operations.
- Extracts verified and authentic news.
- It helps in identifying risks and mitigation strategies.
- It provides information about important business announcements.
News scraping services aggregate the most relevant news content from across the web. They spare users the hassle of hunting for articles, relevant reports, interviews, and more by bringing everything together in one place.
Few Considerations for Scraping Different Types of News Websites
Before you scrape news content from popular news sites, keep in mind the following considerations:
- Choose your Niche: Although a news aggregator can collect news on a vast range of topics, it is best to stay ahead by picking a niche. Research which topics get the most clicks; this will keep your platform fresh.
- Use Only Trustworthy Sources: Collect data from credible sources and double-check your facts. Verify all your links and make sure that all the news on your site is current and relevant.
- Choose How to Present the Information on Your Site: Decide how your audience will see your content. You can provide the entire article or only a glimpse of the content before redirecting readers to the source.
List of Data Fields
At iWeb Data Scraping, we provide news website data scraping services for several sites, including Yahoo News, MSN, and more. The data fields we extract include:
- News Category and Sub-Category
- Published Date
- Published Time
- News Author
- News Title
- News Body
Why Is Using a News Scraping API the Best Option?
A dedicated news scraping API handles the hard parts of collecting news at scale, offering:
- Proxy rotation
- Proxy management services
- Specialized modules
- CAPTCHA solving
- Structured and organized results
In this article, we will build a news scraper that collects the latest news articles from various newspapers and stores them as text. We will go through the following two steps:
- Surface-level intro to webpage and HTML
- Scraping using Python and BeautifulSoup
Surface-level intro to webpage and HTML
When we go to any specific URL using a web browser, the particular webpage is a combination of three technologies:
- HTML: It defines the webpage content. It is the standard markup language used to add content to a website.
- CSS: It styles the webpage.
- JavaScript: It handles the logic and interactive functionality of the web page.
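As a minimal illustration, the three layers fit together like this (a simplified, hypothetical fragment for demonstration, not the markup of any real news site):

```html
<html>
  <head>
    <!-- CSS: styles the content -->
    <style>.m-statement__quote { color: navy; }</style>
  </head>
  <body>
    <!-- HTML: the content itself -->
    <ul>
      <li class="o-listicle__item">
        <div class="m-statement__quote"><a href="/story1">Headline text</a></div>
      </li>
    </ul>
    <!-- JavaScript: behaviour and logic -->
    <script>console.log("page loaded");</script>
  </body>
</html>
```

A scraper mostly cares about the HTML layer, which is what BeautifulSoup parses below.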
Scraping News Articles Using Python
Python consists of several packages that help in scraping information from a webpage. Here, we will use BeautifulSoup for web scraping.
Install the library packages using the following command.
!pip install beautifulsoup4
We will also use the requests module to fetch the page’s HTML for BeautifulSoup. Install it using the following command.
!pip install requests
We will also use urllib, Python’s URL-handling module, which helps in fetching URLs. It is part of the Python standard library, so no separate installation is needed.
Importing of Libraries
Now, we will import all the necessary libraries.
Import BeautifulSoup on your IDE using the following command.
from bs4 import BeautifulSoup
This library helps get the HTML structure of the desired page and provides functions to access specific elements and extract relevant information.
Now, import urllib using the following command.
import urllib.request, sys, time
To import requests, type the following:
import requests
This module sends the HTTP requests to a web server using Python.
Import pandas using the following.
import pandas as pd
We will use this library to make DataFrame.
Now, make a simple GET request to fetch a page. We will wrap requests.get(url) in a try-except block so that a failed request does not crash the script, and use a ‘for’ loop to paginate through the listing pages.
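The steps above can be sketched as follows. Note that the listing-page URL pattern used here is an assumption for illustration; verify the real site’s pagination scheme before running a live scrape:

```python
import time

import requests


def page_url(page_no):
    # Assumed listing-page URL pattern (check it against the live site).
    return "https://www.politifact.com/factchecks/list/?page=" + str(page_no)


def fetch_pages(first, last, pause=2):
    """Fetch a range of listing pages, skipping any request that fails."""
    pages = []
    for page_no in range(first, last + 1):
        try:
            response = requests.get(page_url(page_no), timeout=10)
            response.raise_for_status()  # raise on 4xx/5xx status codes
        except requests.RequestException as err:
            print("Skipping page", page_no, "->", err)
            continue
        pages.append(response.text)
        time.sleep(pause)  # be polite between requests
    return pages

# To run the live fetch (network access required):
# pages = fetch_pages(1, 3)
```

Wrapping the request in try-except means one dead page or timeout only skips that page instead of aborting the whole crawl.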
Inspecting the Response Object
See the response code that the server sent back.
page.status_code
Output
200
The HTTP 200 OK status response code shows that the requests have succeeded.
Now, access the complete response as text.
page.text
Output
It will return the HTML content of a response object in Unicode.
To test the response, look for a specific substring within it.
if "Politifact" in page.text:
    print("Yes, scrape it")
Check for the response’s content type.
print(page.headers.get("content-type", "unknown"))
Output
text/html; charset=utf-8
Delaying the Requests Time
To avoid overloading the server, we pause between requests by calling the sleep() function with a value of 2 seconds.
time.sleep(2)
Extracting Content from HTML
It’s time to parse HTML content to extract the desired value.
(a) Using Regular Expression
import re  # put this at the top of the file
print(re.findall(r'\$[0-9,.]+', page.text))
Output
['$294', '$9', '$5.8']
(b) Using BeautifulSoup
soup = BeautifulSoup(page.text, "html.parser")
The command below looks for all <li> tags with the class attribute ‘o-listicle__item’:
links = soup.find_all('li', attrs={'class': 'o-listicle__item'})
Inspecting the Webpage
To understand the above code, inspect the webpage in your browser. Since we need the news section of the page, right-click that article section and choose the inspect-element option; the browser will highlight that section of the web page along with its HTML source.
We will continue with our code.
print(len(links))
This command will print the number of news articles found on the given page.
Finding Elements and Attributes
Look for all anchor tags on the page.
links = soup.find_all("a")
Next, find a division tag under each <li>. Here, ‘j’ is the loop variable iterating over the items in links.
Statement = j.find("div", attrs={'class': 'm-statement__quote'})
The text.strip() call returns the text within this tag and strips any extra spaces.
Statement = j.find("div", attrs={'class': 'm-statement__quote'}).text.strip()
We have scraped our first attribute. In the same division, we will look for the anchor tag and return with the value of the hypertext link.
Link = j.find("div", attrs={'class': 'm-statement__quote'}).find('a')['href'].strip()
To get the Date attribute, we will inspect the web page first.
Date = j.find('div', attrs={'class': 'm-statement__body'}).find('footer').text[-14:-1].strip()
Source = j.find('div', attrs={'class': 'm-statement__author'}).find('a').get('title').strip()
Next, we pass ‘alt’ to get() to read the Label (the fact-check rating) from the rating image’s alt text.
Let’s combine all these concepts and fetch the details for the five attributes of our dataset: Statement, Link, Date, Source, and Label.
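Here is a sketch of the combined extraction loop. To keep it self-contained, it parses a small inline HTML fragment that mimics the class names used above; in particular, reading the rating from the alt attribute of an img tag is an assumption about the markup that you should verify against the live page:

```python
from bs4 import BeautifulSoup

# A simplified stand-in for one listing page (real pages come from requests.get).
html = """
<li class="o-listicle__item">
  <div class="m-statement__author"><a title="Example Source">Example Source</a></div>
  <div class="m-statement__quote"><a href="/factchecks/example/">Example claim text</a></div>
  <div class="m-statement__body"><footer>stated on June 1, 2023:</footer></div>
  <img src="meter.png" alt="false">
</li>
"""

soup = BeautifulSoup(html, "html.parser")
frame = []
for j in soup.find_all('li', attrs={'class': 'o-listicle__item'}):
    Statement = j.find("div", attrs={'class': 'm-statement__quote'}).text.strip()
    Link = j.find("div", attrs={'class': 'm-statement__quote'}).find('a')['href'].strip()
    Date = j.find('div', attrs={'class': 'm-statement__body'}).find('footer').text[-14:-1].strip()
    Source = j.find('div', attrs={'class': 'm-statement__author'}).find('a').get('title').strip()
    Label = j.find('img').get('alt').strip()  # assumed rating location
    frame.append([Statement, Link, Date, Source, Label])

print(frame[0])
```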
Making Dataset
frame.append([Statement, Link, Date, Source, Label])
upperframe.extend(frame)
Visualising Dataset
For visualization, load the scraped rows into a pandas DataFrame.
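A minimal sketch, assuming upperframe holds the rows collected above (illustrated here with dummy rows):

```python
import pandas as pd

# Dummy rows standing in for the scraped results.
upperframe = [
    ["Claim one", "/link1", "June 1, 2023", "Source A", "false"],
    ["Claim two", "/link2", "June 2, 2023", "Source B", "true"],
]

data = pd.DataFrame(upperframe, columns=['Statement', 'Link', 'Date', 'Source', 'Label'])
print(data.head())
```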
Make a CSV File & Save It to Your Machine
Finally, write the data to a CSV file and save it to your machine.
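With pandas this is one line; the filename here is arbitrary:

```python
import pandas as pd

data = pd.DataFrame(
    [["Claim one", "/link1", "June 1, 2023", "Source A", "false"]],
    columns=['Statement', 'Link', 'Date', 'Source', 'Label'],
)
data.to_csv('news_dataset.csv', index=False)  # index=False drops the row numbers
```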
Complete Code
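The sketch below assembles the pieces above into one script. The URL pattern, page range, and the alt-attribute rating location are assumptions for illustration and should be verified against the live site:

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def parse_listing(html):
    """Extract [Statement, Link, Date, Source, Label] rows from one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for j in soup.find_all('li', attrs={'class': 'o-listicle__item'}):
        quote = j.find("div", attrs={'class': 'm-statement__quote'})
        rows.append([
            quote.text.strip(),
            quote.find('a')['href'].strip(),
            j.find('div', attrs={'class': 'm-statement__body'}).find('footer').text[-14:-1].strip(),
            j.find('div', attrs={'class': 'm-statement__author'}).find('a').get('title').strip(),
            j.find('img').get('alt').strip(),  # assumed rating location
        ])
    return rows


def main(first_page=1, last_page=2):
    upperframe = []
    for page_no in range(first_page, last_page + 1):
        url = "https://www.politifact.com/factchecks/list/?page=" + str(page_no)
        try:
            page = requests.get(url, timeout=10)
            page.raise_for_status()
        except requests.RequestException as err:
            print("Skipping", url, "->", err)
            continue
        upperframe.extend(parse_listing(page.text))
        time.sleep(2)  # delay between requests
    data = pd.DataFrame(upperframe, columns=['Statement', 'Link', 'Date', 'Source', 'Label'])
    data.to_csv('news_dataset.csv', index=False)
    return data

# To run the live scrape (network access required):
# data = main()
# print(data.head())
```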
For more information, get in touch with iWeb Data Scraping now! You can also reach us for all your web scraping service and mobile app data scraping requirements.
Know more: https://www.iwebdatascraping.com/how-to-scrape-news-content-from-popular-news-sites.php
Tags: #Scrape News Content From Popular News Sites, #news scraper, #news website data scraping services, #Scrape News data