Web Scraping with Oracle Analytics
OCI Functions integration with data flows is one of my favourite Oracle Analytics features as it gives business analysts full control and flexibility over specific data processing needs. This article illustrates how the feature can be used to implement a web scraper, which allows us to extract data from a website and save the result in a new dataset.
The Website to Scrape
Before starting with the implementation of any web scraper, it's fundamental to figure out the structure of the website you are interested in scraping. To easily achieve this task, I recommend using the Developer Tools built directly into the Google Chrome browser.
Let's take Metacritic (a website that aggregates reviews of films, television shows, music albums, video games, and formerly books) as an example and assume that we want to collect all user reviews for a given game and platform.
I noticed that all pages about user reviews for games share the same URL structure:
https://www.metacritic.com/game/PLATFORM/GAME/user-reviews?page=PAGE
In the above URL, PLATFORM indicates the environment in which games are executed and GAME the name of the reviewed game. Results are paginated and by default each page contains at most 100 reviews (the most helpful ones are displayed first). page=PAGE is an optional parameter which allows us to select a specific page, where 0 indicates the first page (reviews ranked from 1 to 100), 1 the second page (reviews ranked from 101 to 200), and so on.
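For instance, the URL of the first page of reviews can be built like this (the platform and game slugs below are illustrative):
platform = 'playstation-5'  # illustrative slug
game = 'efootball-2023'     # illustrative slug
page = 0                    # first page: reviews ranked 1 to 100
url = f'https://www.metacritic.com/game/{platform}/{game}/user-reviews?page={page}'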
Each user review is in a <li> tag with one of these classes: review user_review, review user_review first_review, or review user_review last_review. For each review it's possible to collect:
- The actual review's text - The location depends on its length: for long reviews (390 or more characters) it is in a <span> tag with blurb blurb_expanded class, otherwise it's in a <span> tag embedded in a <div> with review_body class.
- The reviewer's username - It is in an <a> tag embedded in a <div> with name class.
- The review's date - It is in a <div> tag with date class.
- The review's score - It is in the first <div> tag embedded in a <div> with review_grade class.
- The number of users who found the review helpful - It is in a <span> tag with total_ups class.
- The number of users who rated the review - It is in a <span> tag with total_thumbs class.
- The link to the next page (if any) - It is in an <a> tag with action class and rel attribute equal to next.
Implementing the Web Scraping Function
With the structure of the website clear in mind, it is straightforward to implement an OCI function to scrape it.
Open the Cloud Shell, and initialize a new Python function:
fn init --runtime python scrape-metacritic
Add the definition of the function to the func.py file.
funcDefinition = {
"status": {
"returnCode": 0,
"errorMessage": ""
},
"funcDescription": {
"outputs": [
{"name": "game", "dataType": "varchar(100)"},
{"name": "platform", "dataType": "varchar(100)"},
{"name": "name", "dataType": "varchar(100)"},
{"name": "date", "dataType": "varchar(100)"},
{"name": "rating", "dataType": "varchar(100)"},
{"name": "review", "dataType": "varchar(100)"},
{"name": "total_ups", "dataType": "integer"},
{"name": "total_thumbs", "dataType": "integer"}
],
"parameters": [
{"name": "game",
"displayName": "Game",
"description": "The name of the reviewed game",
"required": True,
"value": {"type": "string"}},
{"name": "platform",
"displayName": "Platform",
"description": "The environment in which games are executed",
"required": True,
"value": {"type": "string"}}
],
"bucketName": "federico-bucket",
"isOutputJoinableWithInput": False
}
}
According to the JSON object above, my function requires two strings as input (game and platform), returns eight output columns (game, platform, name, date, rating, review, total_ups and total_thumbs) that will not be appended to any input dataset in the data flow, and the federico-bucket bucket will be used to share data between Oracle Analytics and OCI Functions.
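For completeness, here is a minimal sketch of how funcDefinition is typically wired into the function's handler with the Python fdk package. The exact contract between Oracle Analytics and the function follows Oracle's documented samples and may differ from this simplified version; scrape_reviews is a hypothetical helper wrapping the scraping code shown below.
import io
import json

from fdk import response

def handler(ctx, data: io.BytesIO = None):
    out = funcDefinition
    try:
        body = json.loads(data.getvalue())
        # Oracle Analytics first invokes the function to read its metadata,
        # then invokes it again with the user-supplied arguments when the
        # data flow actually runs.
        if body.get('funcMode') == 'executeFunction':
            args = body['args']
            # scrape_reviews is a hypothetical helper that runs the scraping
            # code shown below and writes the CSV output to the bucket.
            out = scrape_reviews(args['game'], args['platform'])
    except Exception:
        pass  # no payload: fall back to returning the metadata above
    return response.Response(
        ctx,
        response_data=json.dumps(out),
        headers={"Content-Type": "application/json"}
    )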
To extract data from Metacritic I'm going to use the Beautiful Soup Python package, which provides simple methods for navigating, searching, and modifying a parse tree in HTML and XML files.
Make a URL request, and parse the response into Beautiful Soup:
import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent header is needed, or Metacritic rejects the request
url = f'https://www.metacritic.com/game/{platform}/{game}/user-reviews?page=0'
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers=user_agent)
soup = BeautifulSoup(response.text, 'html.parser')
Create a dictionary to hold the data that you want to scrape:
review_dict = {'game':[], 'platform':[], 'name':[], 'date':[], 'rating':[], 'review':[], 'total_ups':[], 'total_thumbs':[]}
Use Beautiful Soup to find the data that you are interested in, and append it to the dictionary's lists.
for review in soup.find_all('li', class_=['review user_review', 'review user_review first_review', 'review user_review last_review']):
    # Long reviews (390 or more characters) live in a blurb_expanded span,
    # shorter ones in a plain span inside the review_body div
    if review.find('span', class_='blurb blurb_expanded'):
        review_dict['review'].append(review.find('span', class_='blurb blurb_expanded').text)
    else:
        try:
            review_dict['review'].append(review.find('div', class_='review_body').find('span').text)
        except AttributeError:
            review_dict['review'].append("")
    review_dict['game'].append(game)  # append to every list so all columns stay the same length
    review_dict['platform'].append(platform)
    review_dict['name'].append(review.find('div', class_='name').find('a').text)
    review_dict['date'].append(review.find('div', class_='date').text)
    review_dict['rating'].append(review.find('div', class_='review_grade').find_all('div')[0].text)
    review_dict['total_ups'].append(review.find('span', class_='total_ups').text)
    review_dict['total_thumbs'].append(review.find('span', class_='total_thumbs').text)
In the code above, the find_all() method looks through a tag's descendants and retrieves all descendants that match your filters, while find() retrieves only the first one.
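As a quick illustration of the difference (the HTML snippet is made up for the example):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li class="review">first</li><li class="review">second</li></ul>', 'html.parser')
print(soup.find('li', class_='review').text)                     # first
print([li.text for li in soup.find_all('li', class_='review')])  # ['first', 'second']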
Convert the dictionary into a dataframe and write data to a bucket in CSV format:
import pandas as pd

# pandas resolves the oci:// path via ocifs, so list ocifs in requirements.txt
reviews = pd.DataFrame(review_dict)
reviews.to_csv("oci://" + bucketName + "/" + outputFile)
And that's it! Do not forget to loop through all the pages if you want to scrape all the reviews (a sketch of the loop follows below), update requirements.txt with any libraries you will be utilizing, and deploy the function to the registry before using it.
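A minimal sketch of such a pagination loop, following the next-page link described earlier; parse_page is a hypothetical helper wrapping the review-parsing loop shown above, and platform, game, and review_dict are assumed to be already defined:
import requests
from bs4 import BeautifulSoup

user_agent = {'User-agent': 'Mozilla/5.0'}
url = f'https://www.metacritic.com/game/{platform}/{game}/user-reviews?page=0'

while url:
    soup = BeautifulSoup(requests.get(url, headers=user_agent).text, 'html.parser')
    parse_page(soup, review_dict)  # hypothetical helper wrapping the parsing loop above
    # Follow the <a class="action" rel="next"> link until it disappears;
    # its href is assumed to be a site-relative path
    next_link = soup.find('a', class_='action', rel='next')
    url = 'https://www.metacritic.com' + next_link['href'] if next_link else None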
The source code of my scrape-metacritic function is attached below; feel free to use it as a reference to create your own web scraper.
Invoking the Web Scraping Function from a Data Flow
Once the function has been registered in Oracle Analytics, it's possible to invoke it from data flows. Since the data we are interested in is actually on a website, but data flows always require at least one input dataset (the feature has been designed to transform data), it's a good idea to create a dummy dataset from an Excel file with a single value in it (Figure 1). This keeps the input file that Oracle Analytics has to transfer to the OCI bucket small, and hence improves performance.
To scrape all user reviews about eFootball 2023 for PlayStation 5 from Metacritic, I only have to include the Apply Custom Script step in the data flow, select my scrape-metacritic function, and specify values for the required input parameters (Figure 2).
The scrape-metacritic function is executed as soon as any other step is added to the data flow (Figure 3). It's possible to invoke it multiple times and merge the results to collect all user reviews of a game across all platforms, and to automate the execution by scheduling the data flow.
Conclusion
Leveraging OCI Functions in Oracle Analytics can give business analysts and end users full control and flexibility over specific data processing needs. In this context, Python and Beautiful Soup can join the party to extend Oracle Analytics capabilities by allowing us to extract data from websites and save it into datasets.
If you are looking into leveraging OCI Functions in Oracle Analytics and want to find out more, please do get in touch or DM me on Twitter @barretbse.