Project Review: Data Science

Analysing Interesting Datasets with Python

Summary

Introduction

This data science project demonstrates the use of Python's data science libraries to analyse interesting datasets.

I wanted to revisit, using Python and Pandas, the kinds of datasets I had analysed in previous roles, so I created a sample 'UK Salaries' dataset, drawing on the compensation analysis I did for Deutsche Bank.

I also wanted to analyse some additional datasets that I find personally fascinating, so I have included those as well.

Jupyter notebooks are a good medium for consuming an external API or scraping content from the web: they offer fast feedback and present the deeply nested structures of JSON objects and HTML documents tidily.

Features

  • Specifies versions of Python and the used libraries to aid reproducibility
  • Reports various summaries of the data for headline interrogation
  • Visualises the data with various choices of graphs to understand and interpret the data
  • Scrapes content from the web using the Requests and Beautiful Soup libraries
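The scraping workflow can be sketched as follows. This is a minimal illustration, not the project's actual code: in the notebooks the HTML would come from a live page fetched with Requests (e.g. `requests.get(url).text`), but here a static snippet with a hypothetical salaries table stands in for the response so the sketch runs offline.

```python
from bs4 import BeautifulSoup

# A static snippet stands in for a fetched page so the sketch runs offline;
# in a notebook the HTML would come from requests.get(url).text.
html = """
<html><body>
  <table id="salaries">
    <tr><th>Role</th><th>Median</th></tr>
    <tr><td>Analyst</td><td>35000</td></tr>
    <tr><td>Engineer</td><td>52000</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Skip the header row, then map each role to its median salary.
rows = soup.find("table", id="salaries").find_all("tr")[1:]
data = {cells[0].text: int(cells[1].text)
        for cells in (row.find_all("td") for row in rows)}
print(data)  # {'Analyst': 35000, 'Engineer': 52000}
```

The fast feedback of a notebook is particularly useful here: each cell can inspect one level of the parsed tree before committing to a final extraction.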

Objectives

I set out to analyse a number of interesting datasets to demonstrate data analysis skills with Python and its associated data analysis and visualisation libraries.

The Approach & Solution

For analysing datasets, the approach is more fluid than for software or web development projects.

For most projects, it consists of two components: data analysis and data visualisation.

For data analysis, this means using stock Python and its standard library alongside Pandas and NumPy.
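A typical Pandas pass over the salaries data looks something like this. The rows and column names below are illustrative stand-ins for the sample dataset, not its actual schema:

```python
import pandas as pd

# Hypothetical rows standing in for the 'UK Salaries' sample dataset;
# the column names are illustrative, not the project's actual schema.
df = pd.DataFrame({
    "region": ["London", "London", "North West", "North West"],
    "role":   ["Analyst", "Engineer", "Analyst", "Engineer"],
    "salary": [42000, 60000, 33000, 48000],
})

# Headline interrogation: median salary by region.
median_by_region = df.groupby("region")["salary"].median()
print(median_by_region)
```

Grouped summaries like this are the "headline interrogation" the feature list refers to: a few lines give an immediate feel for how the data is distributed.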

It's usually best to use a Conda distribution for this, as it handles the requirements of the main data science libraries. The trade-off is that you often cannot use the most up-to-date version of Python, and so miss out on the new features and speed improvements it brings.
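Whichever distribution is used, recording the interpreter and library versions at the top of a notebook supports the reproducibility goal from the feature list. A minimal sketch:

```python
import sys

import numpy as np
import pandas as pd

# Record the interpreter and key library versions at the top of a
# notebook so results can be reproduced against the same environment.
print(f"Python {sys.version.split()[0]}")
print(f"pandas {pd.__version__}")
print(f"NumPy  {np.__version__}")
```

A reader re-running the notebook years later can then recreate a matching environment before comparing results.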

For data visualisation, I always use Matplotlib and Seaborn, though other libraries such as Altair can also be useful.

It's best to begin any analysis-and-visualisation project with the data analysis, as it quickly yields initial insights into the data and reveals any problems it may have. This lets you correct the data before errors affect the overall accuracy of the project.
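That early pass is mostly a handful of quick checks. A sketch with hypothetical data, showing the kind of problems it catches (a missing value and an implausible outlier) and one way to correct them:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the kind of problems an early pass catches:
# a missing salary and an implausibly small value.
df = pd.DataFrame({
    "role":   ["Analyst", "Engineer", "Manager", "Analyst"],
    "salary": [35000, np.nan, 5, 38000],
})

print(df["salary"].isna().sum())   # count the missing values
print(df["salary"].describe())     # the minimum exposes the outlier

# Correct the data before it skews later analysis: drop missing rows
# and filter out values below a plausibility threshold.
clean = df.dropna(subset=["salary"])
clean = clean[clean["salary"] >= 1000]
```

The threshold here is arbitrary and only for illustration; in a real project the cut-off would come from domain knowledge of the dataset.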

The choice of graphic or metric defines the success of a data visualisation project. It feels good for a data scientist to present a violin plot, but if your audience isn't used to interpreting that sort of graph, the message will not have the desired impact. A simpler graphic such as a box plot would be a better choice.
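A box plot of the salary data takes only a few lines of Matplotlib. The samples below are randomly generated stand-ins, not real figures, and the headless backend is set only so the sketch runs anywhere:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical salary samples for two regions (illustrative only).
rng = np.random.default_rng(0)
london = rng.normal(52000, 9000, 200)
north_west = rng.normal(38000, 6000, 200)

fig, ax = plt.subplots()
ax.boxplot([london, north_west])
ax.set_xticklabels(["London", "North West"])
ax.set_ylabel("Salary (£)")
ax.set_title("Salary distribution by region")
fig.savefig("salaries_boxplot.png")
```

The same two arrays passed to Seaborn's `violinplot` would show the full density shape, but the box plot's median, quartiles, and whiskers are familiar to a far wider audience.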

Evaluation

I would like to provide interactive notebooks; however, even amongst the array of online notebook services, there doesn't appear to be one where I can simply push a notebook to GitHub and have an automated GitHub Action build an interactive notebook.

When GitHub releases Codespaces, I will be interested to see whether anyone with a GitHub account will be able to run a notebook from within GitHub itself, since VS Code has evolving native support for notebooks.

My journey in Python started with a lot of OS scripting and data analysis work, so I eventually thought that I should formalise the work that I had undertaken and create some structured notebooks to analyse some interesting datasets.

I am pleased with the skills I have developed as a data scientist using Python, as they aid my work in web development. In the same way, I feel my skills in web development and software engineering make me a better data scientist.

Languages, Technologies & Skills Used

In approximate order of frequency used...
Languages: Python
Frameworks / Services: Pandas, NumPy, Matplotlib, Seaborn
Software: VS Code, Jupyter Notebooks
Libraries: collections, decimal, string, json
Notable Packages: Requests, Beautiful Soup
Infrastructure: GitHub, Poetry