
Web Scraping with Oracle Analytics

OCI Functions integration with data flows is one of my favourite Oracle Analytics features as it gives business analysts full control and flexibility over specific data processing needs. This article illustrates how the feature can be used to implement a web scraper that extracts data from a website and saves the result in a new dataset.

⚠️
Please note that I already wrote about creating OCI functions and invoking them from data flows in Oracle Analytics, so I'm not going to cover all the details again in this article.

The Website to Scrape

Before starting with the implementation of any web scraper, it's fundamental to figure out the structure of the website you are interested in scraping. To easily achieve this task I recommend using the Developer Tools built directly into the Google Chrome browser.

Let's take Metacritic (a website that aggregates reviews of films, television shows, music albums, video games, and formerly books) as an example and assume that we want to collect all user reviews for a given game and platform.

I noticed that all pages about user reviews for games share the same URL structure:

https://www.metacritic.com/game/PLATFORM/GAME/user-reviews?page=PAGE

In the above URL, PLATFORM indicates the environment in which games are executed and GAME the name of the reviewed game. Results are paginated and by default each page contains at most 100 reviews (the most helpful ones are displayed first). page=PAGE is an optional parameter which allows us to select a specific page where 0 indicates the first page (reviews ranked from 1 to 100), 1 the second page (reviews ranked from 101 to 200) and so on.

Each user review is in a <li> tag with one of these classes: review user_review, review user_review first_review, or review user_review last_review. For each review it's possible to collect:

  • The review text - Its location depends on the review's length: for long reviews (390 or more characters) it is in a <span> tag with the blurb blurb_expanded class, otherwise it's in a <span> tag embedded in a <div> with the review_body class.
  • The reviewer's username - It is in an <a> tag embedded in a <div> with the name class.
  • The review's date - It is in a <div> tag with the date class.
  • The review's score - It is in the first <div> tag embedded in a <div> with the review_grade class.
  • The number of users who found the review helpful - It is in a <span> tag with the total_ups class.
  • The number of users who rated the review - It is in a <span> tag with the total_thumbs class.
  • The link to the next page (if any) - It is in an <a> tag with the action class and a rel attribute equal to next (see the sketch after this list).
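
For instance, once a page has been parsed, the next-page link can be located like this. A minimal sketch, assuming soup holds the parsed page (set up in the next section):

# Locate the link to the next page of reviews, if there is one
next_link = soup.find('a', class_='action', rel='next')
if next_link is not None:
    next_url = 'https://www.metacritic.com' + next_link['href']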

Implementing the Web Scraping Function

With the structure of the website clear in mind it is straightforward to implement an OCI function to scrape it.

Open the Cloud Shell, and initialize a new Python function:

fn init --runtime python scrape-metacritic

Add the definition of the function to the func.py file.

funcDefinition = {
    "status": {
        "returnCode": 0,
        "errorMessage": ""
    },
    "funcDescription": {
        "outputs": [
            {"name": "game", "dataType": "varchar(100)"},
            {"name": "platform", "dataType": "varchar(100)"},
            {"name": "name", "dataType": "varchar(100)"},
            {"name": "date", "dataType": "varchar(100)"},
            {"name": "rating", "dataType": "varchar(100)"},
            {"name": "review", "dataType": "varchar(100)"},
            {"name": "total_ups", "dataType": "integer"},
            {"name": "total_thumbs", "dataType": "integer"}
        ],
        "parameters": [
            {"name": "game", "displayName": "Game",
             "description": "The name of the reviewed game",
             "required": True, "value": {"type": "string"}},
            {"name": "platform", "displayName": "Platform",
             "description": "The environment in which games are executed",
             "required": True, "value": {"type": "string"}}
        ],
        "bucketName": "federico-bucket",
        "isOutputJoinableWithInput": False
    }
}

According to the JSON object above, my function requires two strings as input (game and platform), returns eight output columns (game, platform, name, date, rating, review, total_ups and total_thumbs) that will not be appended to any input dataset in the data flow, and uses the federico-bucket bucket to share data between Oracle Analytics and OCI Functions.
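
For reference, the handler in func.py returns this definition when Oracle Analytics invokes the function without arguments during registration, and runs the scraper otherwise. Below is a minimal sketch of that dispatch, following the pattern from my earlier article; scrape_reviews and the exact payload keys are illustrative assumptions:

import io
import json
from fdk import response

def handler(ctx, data: io.BytesIO = None):
    # With no arguments (registration), reply with the function definition
    response_data = json.dumps(funcDefinition)
    try:
        args = json.loads(data.getvalue()).get("args", {})
        if args:
            # scrape_reviews is a hypothetical helper wrapping the scraping
            # code shown below; payload keys are assumptions
            response_data = scrape_reviews(args.get("game"), args.get("platform"),
                                           args.get("bucketName"), args.get("outputFile"))
    except Exception as ex:
        response_data = json.dumps({"status": {"returnCode": -1,
                                               "errorMessage": str(ex)}})
    return response.Response(ctx, response_data=response_data,
                             headers={"Content-Type": "application/json"})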

To extract data from Metacritic I'm going to use the Beautiful Soup Python package which provides simple methods for navigating, searching, and modifying a parse tree in HTML and XML files.
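
Beautiful Soup, like the other third-party packages used below, has to be listed in the function's requirements.txt. A plausible set for this scraper (ocifs backs the oci:// writes performed by pandas later on; pinning versions is up to you):

fdk
requests
beautifulsoup4
pandas
ocifs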

Make a URL request, and parse the response into Beautiful Soup:

import requests
from bs4 import BeautifulSoup

url = f'https://www.metacritic.com/game/{platform}/{game}/user-reviews?page=0'
# Metacritic rejects requests that lack a browser-like User-Agent header
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers=user_agent)
soup = BeautifulSoup(response.text, 'html.parser')

Create a dictionary to hold the data that you want to scrape:

review_dict = {'game':[], 'platform':[], 'name':[], 'date':[], 'rating':[], 'review':[], 'total_ups':[], 'total_thumbs':[]}

Use Beautiful Soup to find the data that you are interested in, and append it to the dictionary's lists.

# Each <li> holds one user review; the three class variants cover the
# first, middle, and last reviews on the page
for review in soup.find_all('li', class_=['review user_review',
                                          'review user_review first_review',
                                          'review user_review last_review']):
    # Long reviews (390 or more characters) keep their full text in a dedicated tag
    if review.find('span', class_='blurb blurb_expanded'):
        review_dict['review'].append(review.find('span', class_='blurb blurb_expanded').text)
    else:
        try:
            review_dict['review'].append(review.find('div', class_='review_body').find('span').text)
        except AttributeError:
            review_dict['review'].append("")
    # The game and platform input parameters apply to every review on the page
    review_dict['game'].append(game)
    review_dict['platform'].append(platform)
    review_dict['name'].append(review.find('div', class_='name').find('a').text)
    review_dict['date'].append(review.find('div', class_='date').text)
    review_dict['rating'].append(review.find('div', class_='review_grade').find_all('div')[0].text)
    # Cast to match the integer output columns declared in funcDefinition
    review_dict['total_ups'].append(int(review.find('span', class_='total_ups').text))
    review_dict['total_thumbs'].append(int(review.find('span', class_='total_thumbs').text))

In the code above, the find_all() method looks through a tag's descendants and retrieves all descendants that match your filters while find() retrieves only the first one.
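
As a quick illustration of the difference on a toy fragment (not Metacritic markup):

from bs4 import BeautifulSoup

html = '<ul><li class="review">first</li><li class="review">second</li></ul>'
demo = BeautifulSoup(html, 'html.parser')

print(demo.find('li', class_='review').text)                     # first
print([li.text for li in demo.find_all('li', class_='review')])  # ['first', 'second']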

Convert the dictionary into a dataframe and write the data to a bucket in CSV format (pandas delegates oci:// paths to the ocifs library):

import pandas as pd

reviews = pd.DataFrame(review_dict)
reviews.to_csv("oci://" + bucketName + "/" + outputFile)

And that's it! Do not forget to loop through all the pages if you want to scrape all the reviews (see the sketch below), update requirements.txt with any libraries you will be utilizing, and deploy the function to the registry before using it.
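
A minimal sketch of such a loop, reusing the next-link lookup shown earlier (illustrative, with error handling omitted):

# Follow the "next" links until the last page of reviews is reached
url = f'https://www.metacritic.com/game/{platform}/{game}/user-reviews?page=0'
while url:
    response = requests.get(url, headers=user_agent)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ...append this page's reviews to review_dict as shown above...
    next_link = soup.find('a', class_='action', rel='next')
    url = 'https://www.metacritic.com' + next_link['href'] if next_link else None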

The source code of my scrape-metacritic function is attached below; feel free to use it as a reference to create your own web scraper.

Invoking the Web Scraping Function from a Data Flow

Once the function has been registered in Oracle Analytics, it's possible to invoke it from data flows. The data we are interested in is actually on a website, but data flows always require at least one input dataset (the feature was designed to transform data), so it's a good idea to create a dummy dataset from an Excel file with a single value in it (Figure 1). This minimizes the size of the input file that Oracle Analytics has to transfer to the OCI bucket, and hence improves performance.

Figure 1. Using a dummy dataset as input in a data flow

To scrape all user reviews about eFootball 2023 for PlayStation 5 from Metacritic, I only have to add the Apply Custom Script step to the data flow, select my scrape-metacritic function, and specify values for the required input parameters (Figure 2).

Figure 2. Invoking the scrape-metacritic function from a data flow

The scrape-metacritic function is executed as soon as any other step is added to the data flow (Figure 3). It's possible to invoke it multiple times and merge the results to collect all user reviews of a game across platforms, and to automate the execution by scheduling the data flow.

Figure 3. The scrape-metacritic function in action

Conclusion

Leveraging OCI Functions in Oracle Analytics can give business analysts and end users full control and flexibility over specific data processing needs. In this context, Python and Beautiful Soup can join the party to extend Oracle Analytics capabilities by allowing us to extract data from websites and save it into datasets.

If you are looking into leveraging OCI Functions with Oracle Analytics and want to find out more, please do get in touch or DM me on Twitter @barretbse.