Wayne Lambert
Examples of scraping content from external websites
The Scraping project demonstrates the use of the Requests library to handle the request/response cycle for the GET requests needed to retrieve data from an external website.
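As a rough sketch, the request/response handling looks something like the following (the timeout value is an illustrative choice rather than necessarily what the project uses):

```python
import requests

# Fetch one of the pages to be scraped; raise early if the GET request fails.
response = requests.get(
    "https://www.bbc.co.uk/news/politics/eu_referendum/results/local/a",
    timeout=10,  # illustrative timeout, not necessarily the project's setting
)
response.raise_for_status()
html = response.text  # raw HTML handed on to the parser
```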
Python's third-party library Beautiful Soup is used to parse the required HTML from the response object. Python string functions are then used to target, strip and split the required text so that it's in a suitable format to render for human readability.
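A minimal sketch of that parsing and clean-up step, assuming a hypothetical div.story-body element as the target (the real targets differ per page):

```python
from bs4 import BeautifulSoup

# 'html' would be the response text from the Requests call above; a tiny
# dummy document is used here so the snippet runs on its own.
html = "<div class='story-body'><p>  First line  </p>\n<p>Second line</p></div>"

soup = BeautifulSoup(html, "html.parser")
element = soup.select_one("div.story-body")  # hypothetical element to target
text = element.get_text(separator="\n")
# Strip and split so each line renders cleanly for human readability
lines = [line.strip() for line in text.split("\n") if line.strip()]
print(lines)  # ['First line', 'Second line']
```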
Finally, the Django component of the application handles the URL routing to a view, which returns the required response to a Django template. In other words, there is no 'M' (model) component used within this app; it only uses the 'V' (view) and 'T' (template) components of the MVT architectural pattern that Django follows.
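A minimal sketch of that 'V' and 'T' wiring, using hypothetical module, view, template and helper names:

```python
# scraping/urls.py -- hypothetical URL configuration
from django.urls import path
from . import views

urlpatterns = [
    path("speeches/", views.speeches, name="speeches"),
]

# scraping/views.py -- no model involved: the view scrapes and renders a template
from django.shortcuts import render
from .scrapers import scrape_speeches  # hypothetical helper that does the scraping

def speeches(request):
    context = {"speeches": scrape_speeches()}
    return render(request, "scraping/speeches.html", context)
```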
The examples demonstrate scraping by targeting DOM elements and manipulating the extracted strings.
I set out to scrape interesting content from the web in order to build skills with the Requests and Beautiful Soup libraries. I started with a couple of famous speeches that I selected while browsing goodreads.com.
The main tool for this activity is Firefox Developer Tools, which is used to inspect the DOM elements that need to be targeted.
It is also useful to use Jupyter Notebooks for this type of activity, as their ability to present data like this is often superior to that of Python's interpreter, even if you use bpython.
There are a couple of enhancements that I would like to add to this project in order to improve its utility and to demonstrate further skills.
The scraping process takes about 15 seconds because it needs to retrieve the data from an external website and parse the DOM elements of each page. In any case, the process should be respectful of the target website, since each scraped page is an additional request and too many could result in my site being viewed negatively.
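One simple way to keep the crawl polite is to pause between page requests; a sketch with an illustrative one-second delay, assuming the BBC results pages follow the same URL pattern for each letter:

```python
import time
import requests

# Assumed URL pattern for the per-letter results pages
urls = [
    f"https://www.bbc.co.uk/news/politics/eu_referendum/results/local/{letter}"
    for letter in "abc"
]

pages = []
for url in urls:
    pages.append(requests.get(url, timeout=10).text)
    time.sleep(1)  # be respectful: avoid hammering the external site
```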
To inform the user of progress, I would like to add a progress bar using Celery and Django.
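A rough sketch of how such a Celery task might report progress for the front end to poll; the task name, arguments and metadata are illustrative, since this enhancement isn't built yet:

```python
from celery import shared_task
import requests

@shared_task(bind=True)
def scrape_pages(self, urls):
    results = []
    for i, url in enumerate(urls, start=1):
        results.append(requests.get(url, timeout=10).text)
        # Expose progress so a view can poll the task and drive a progress bar
        self.update_state(state="PROGRESS", meta={"done": i, "total": len(urls)})
    return results
```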
I would also like to scrape the content using multiprocessing to speed it up a little. This would also give me an opportunity to delve further into Python's multiprocessing library.
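A sketch of how the page fetches could be parallelised with a multiprocessing pool; the pool size and URL list are illustrative:

```python
from multiprocessing import Pool
import requests

def fetch(url):
    """Fetch a single page's HTML."""
    return requests.get(url, timeout=10).text

if __name__ == "__main__":
    urls = [
        "https://www.bbc.co.uk/news/politics/eu_referendum/results/local/a",
        # ...further pages to scrape
    ]
    with Pool(processes=4) as pool:  # illustrative pool size
        pages = pool.map(fetch, urls)
```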
The following link points to the EU Referendum results for all areas beginning with 'A'.
https://www.bbc.co.uk/news/politics/eu_referendum/results/local/a
The choice of topics/items to scrape does not in any way reflect political opinions or affiliations; they were merely chosen as interesting items to produce as part of the project.