Project Review: Scraping

Examples of scraping content from external websites

Summary

Introduction

The Scraping project uses the Requests library to handle the request/response cycle for the GET requests that retrieve data from another website.

Beautiful Soup, a third-party Python library, is used to parse the required HTML from the response object. Python string methods then target, strip and split the required text so that it can be rendered in a human-readable format.
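The fetch, parse and clean-up steps described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the markup in SAMPLE_HTML, the class names "result" and "votes", and the function names are all assumptions made for the example.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example; real pages will differ.
SAMPLE_HTML = """
<div class="result">
  <h3>  Example Area </h3>
  <p class="votes">Leave: 1,200 | Remain: 800</p>
</div>
"""


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """GET a page and return its HTML, raising on HTTP error statuses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_result(html: str) -> tuple[str, dict[str, int]]:
    """Return the area name and a {label: votes} mapping from one result block."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", class_="result")

    # strip() removes the padding whitespace around the heading text.
    area = block.find("h3").get_text().strip()

    votes = {}
    # split("|") separates the two results; split(":") separates label and count.
    for part in block.find("p", class_="votes").get_text().split("|"):
        label, count = part.split(":")
        votes[label.strip()] = int(count.strip().replace(",", ""))
    return area, votes


area, votes = parse_result(SAMPLE_HTML)
print(area, votes)
```

In a live run, the string returned by fetch_html(url) would be passed to the parser in place of SAMPLE_HTML, with the selectors adjusted to whatever the browser's inspector shows for the target page.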

Finally, the Django component of the application handles the URL routing to a view, which renders the required response through a Django template. In other words, there is no 'M' component within this app; it uses only the 'V' and 'T' components of the MVT (Model-View-Template) architectural paradigm that Django uses.

The examples demonstrate scraping through string manipulation of the text in DOM elements.

Features

  • The most comprehensive scraping example is from the BBC website for the results of the EU Referendum. The BBC presents this data as one page for each letter of the alphabet.
  • The results are rendered in a formatted Bootstrap table
  • The table uses jQuery to enable fast pagination, sorting, and filtering of results
  • Calculations for percentages and totals are performed within the Python script, and they reconcile with the values on the BBC website
  • Django is used to handle the URL routing and the generation of the views. There are no models within the app
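The percentage and total calculations mentioned in the list above might look something like the sketch below, which uses the decimal and collections modules. The figures are made up purely for illustration and are not real referendum results; the field names, the one-decimal-place rounding and the helper names are likewise assumptions.

```python
from collections import Counter
from decimal import Decimal, ROUND_HALF_UP

# Made-up figures purely for illustration; not real referendum results.
SAMPLE_RESULTS = [
    {"area": "Area A", "leave": 600, "remain": 400},
    {"area": "Area B", "leave": 250, "remain": 750},
]


def percentage(part: int, whole: int) -> Decimal:
    """Return part/whole as a percentage rounded to one decimal place."""
    if whole == 0:
        return Decimal("0.0")
    value = Decimal(part) * 100 / Decimal(whole)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def summarise(results):
    """Add per-area percentages and return (rows, overall vote totals)."""
    totals = Counter()
    rows = []
    for result in results:
        votes = result["leave"] + result["remain"]
        totals["leave"] += result["leave"]
        totals["remain"] += result["remain"]
        rows.append({
            **result,
            "leave_pct": percentage(result["leave"], votes),
            "remain_pct": percentage(result["remain"], votes),
        })
    return rows, totals


rows, totals = summarise(SAMPLE_RESULTS)
overall_leave_pct = percentage(totals["leave"], totals["leave"] + totals["remain"])
print(rows[0]["leave_pct"], totals["leave"], overall_leave_pct)
```

Using Decimal with an explicit rounding mode avoids binary floating-point surprises, which helps when the computed values need to match figures published elsewhere.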

Objectives

I set out to scrape interesting content from the web in order to learn skills with the Requests and Beautiful Soup libraries. I started with a couple of famous speeches, which I selected while browsing goodreads.com.

The Approach & Solution

The main tool for this activity is Firefox Developer Tools, which is needed to inspect the DOM elements that are to be targeted.

It is also useful to use Jupyter Notebooks for this type of activity, as they often present this kind of data better than Python's interactive interpreter, even if you use bpython.

Evaluation

There are a couple of enhancements that I would like to add to this project in order to improve its utility and to demonstrate further skills.

The scraping process takes about 15 seconds because it has to request the data from an external website and inspect the DOM elements of each page. In any case, the process should be respectful of the website: each page is a separate request, and sending too many requests could result in my site being viewed negatively.
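One simple way to keep the scraping respectful is to request the pages one at a time with a short pause in between. The sketch below assumes that every letter has a page following the same URL pattern as the 'a' page listed under Sources, and that a one-second delay is acceptable; both are assumptions, as are the function names. A letter without a page would surface as an HTTP error from raise_for_status().

```python
import string
import time

import requests

# Pattern taken from the 'a' page listed under Sources; assumed to hold
# for the other letters as well.
BASE_URL = "https://www.bbc.co.uk/news/politics/eu_referendum/results/local/"


def build_letter_urls(letters: str = string.ascii_lowercase) -> list[str]:
    """Return one results URL per letter."""
    return [BASE_URL + letter for letter in letters]


def get_page(url: str) -> str:
    """GET a single page and return its HTML."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def fetch_politely(urls, fetch=get_page, delay_seconds=1.0, sleep=time.sleep):
    """Fetch URLs sequentially, pausing between consecutive requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index:  # no need to wait before the first request
            sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages
```

Calling fetch_politely(build_letter_urls()) would then request the 26 pages in order. With the default delay this adds about 25 seconds of waiting, so the delay is a trade-off between run time and load on the remote site.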

To inform the user of progress, I would like to add a progress bar using Celery and Django.

I would also like to scrape the content using multiprocessing to speed it up a little, which would also give me an opportunity to delve further into the multiprocessing library.
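A possible starting point for this enhancement is sketched below. Note that it swaps in multiprocessing.pool.ThreadPool, the thread-backed pool from the multiprocessing package, rather than separate processes: fetching pages is mostly waiting on the network, and threads avoid the pickling and start-up requirements of multiprocessing.Pool while offering the same map() interface. The function names, the worker count and the stand-in fetch function are assumptions for the example.

```python
from multiprocessing.pool import ThreadPool


def fetch_all(urls, fetch, workers=4):
    """Fetch every URL concurrently and return a {url: result} mapping.

    `fetch` is any callable taking a URL and returning its content.
    `workers` is kept small so the remote site is not flooded.
    """
    urls = list(urls)
    if not urls:
        return {}
    with ThreadPool(processes=min(workers, len(urls))) as pool:
        # map() blocks until all results are in and preserves input order.
        results = pool.map(fetch, urls)
    return dict(zip(urls, results))


def fake_fetch(url):
    """Stand-in for a real HTTP request, so the sketch runs offline."""
    return url.upper()


print(fetch_all(["a", "b", "c"], fake_fetch, workers=2))
```

In practice fake_fetch would be replaced by a function that performs the GET request with Requests. Switching to multiprocessing.Pool would additionally require a picklable, module-level worker function and an if __name__ == "__main__" guard.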

Languages, Technologies & Skills Used

In approximate order of frequency used...
Languages: Python, jQuery, HTML, Sass
Frameworks / Services: Django, Bootstrap, Font Awesome
Software: VS Code, Firefox Dev Tools, Jupyter Notebooks
Libraries: decimal, collections, string
Notable Packages: Requests, Beautiful Soup
Infrastructure: GitHub, Docker, Poetry

Sources

The following link represents the EU Referendum Results for all areas beginning with 'A'.

https://www.bbc.co.uk/news/politics/eu_referendum/results/local/a

Disclaimer

The choice of topics/items to scrape does not in any way reflect political opinions or affiliations. They were chosen merely as interesting items to produce as part of the project.