A few months ago, before starting a Master’s in Data Science, I experimented with a bit of data visualisation. I am an avid reader of Hacker News, a great site for programmers to frequent. If you don’t already know it, I can’t recommend it highly enough if you want to spend your procrastinating time more wisely (that’s my main motivation for visiting, anyway…). An ad-hoc survey on there led me to make this little visualisation, and I’d like to share the end result, as well as the process behind it, for two reasons. First, it could give some interesting insight into how an amateur approaches a data science problem using nothing but programming knowledge and some common sense. Second, in a few months I will revisit this exact dataset and analyse/visualise it again, to see how I can apply what I learned on my course to improve upon my first attempt.
All I can say is: I anticipate the second attempt will be recognisably… better.
So, without further ado I present:
The (sort of) layman’s guide to a basic data visualisation
About the dataset: this was an anonymous survey of around 400 developers and designers, run on Hacker News, gathering data about their job title, seniority, location and salary. The dataset is inherently ‘dirty’ and contains some of the usual suspects: salaries are given in different currencies, seniority levels and job titles aren’t standardised, and there are multiple different entries for the same location.
The original Hacker News discussion for those interested is here.
As with any data science task, I needed to start with a hypothesis or a purpose. Essentially, I wanted to use this mini-project to find a nice way to plot data on a map, so the purpose was to clean the data and find a good library to visualise it geographically. I’ll be honest: being a data science layman, I didn’t even have a hypothesis about the data.
I just wanted to make a pretty map.
Data Wrangling

Data wrangling is my favourite of the many phrases (closely followed by ‘data jiu-jitsu’) used to describe the act of ‘cleaning’ a dataset, i.e. getting it into a shape that’s suitable for analysis. It is apparently what data scientists spend most of their time doing, so it’s an important step. It includes dealing with missing values, correcting duplicate values (e.g. making sure we treat “London” and “London, England” as the same location) and so on.
Let’s have a look at some of the issues with the dataset after an initial inspection:
- The figures are in different currencies; they will need to be converted to a single currency
- The job titles need to be merged into one column
- The locations contain duplicates (e.g. ‘US’, ‘USA’, ‘United States’ all appear)
Before I can do any meaningful analysis or visualisation, these issues need to be ironed out. Being a data layman at the time, my data wrangling tool of choice was Excel and I did most of my work manually. My data wrangling consisted of:
- Looking up the most up-to-date exchange rates online and converting all figures to USD (yes, I did this manually…)
- Merging the job titles into one column
- Manually merging duplicate locations (on the country column only)
Doing all this by hand was pretty tedious, and not something I wish to repeat. It’s also impractical for larger datasets, so when I revisit this dataset I’ll look at what I could have automated using something like Python or R.
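To give a flavour of what that automation might look like, here is a rough pandas sketch of the manual steps above. The exchange rates, alias table and column names are illustrative assumptions for this post, not the survey’s actual schema or real rates:

```python
import pandas as pd

# Illustrative exchange rates to USD -- assumed values, not real rates
RATES_TO_USD = {"USD": 1.0, "GBP": 1.25, "EUR": 1.10}

# Map duplicate spellings of a country to one canonical name
COUNTRY_ALIASES = {"US": "United States", "USA": "United States",
                   "UK": "United Kingdom"}

def wrangle(df):
    """Convert all salaries to USD and normalise country names."""
    df = df.copy()
    df["SalaryUSD"] = df["Salary"] * df["Currency"].map(RATES_TO_USD)
    df["Country"] = df["Country"].replace(COUNTRY_ALIASES)
    return df

# Tiny stand-in for the real survey CSV
raw = pd.DataFrame({
    "Salary":   [100000, 40000, 50000],
    "Currency": ["USD", "GBP", "EUR"],
    "Country":  ["USA", "UK", "Germany"],
})
clean = wrangle(raw)
```

A real version would of course need a fuller alias table and up-to-date rates, but even this beats converting 400 rows by hand in Excel.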
You can download my final version of the dataset here (hopefully you’ll agree it’s better!): Salary Data
Summarisation and Aggregation
Now it was time to do some basic statistics and aggregation so I could see the average salary by country on a map.
To do this, I used some basic R, a popular data science language that I recommend as a good starting point for aspiring data scientists. My R script was quite simple; the only really important line was the one that took the salaries and averaged them by country:
# load the data
SalaryData <- read.csv("SalaryData.csv")

# get mean values in a table
byCountry <- aggregate(SalaryUSD ~ Country, SalaryData, mean)

# write new data to csv
write.csv(byCountry, "ByCountry.csv")
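Since take 2 will use Python, it’s worth noting the same aggregation is a one-line pandas groupby. The toy data below stands in for the real cleaned CSV:

```python
import pandas as pd

# Small stand-in for the cleaned survey data
SalaryData = pd.DataFrame({
    "Country":   ["United States", "United States", "Germany"],
    "SalaryUSD": [100000, 120000, 60000],
})

# Equivalent of R's aggregate(SalaryUSD ~ Country, SalaryData, mean)
byCountry = SalaryData.groupby("Country", as_index=False)["SalaryUSD"].mean()

# byCountry could then be written out, mirroring the R script:
# byCountry.to_csv("ByCountry.csv", index=False)
```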
Once this was done, it was time to get this onto a map!
If you want to view it in the browser, here’s the final product, otherwise, you can see it in all its glory below.
All the data I wrangled and code I wrote is available on GitHub.
The Future, i.e. Take 2
So what am I aiming to do when I return to this? My next pass at the same dataset will be more structured, including:
- Statistical analysis and hypothesis testing
- Automated data wrangling using Python
- A more interactive visualisation
If you’re interested in data analysis but have no idea where to start, I hope this post inspires you to attempt your own visualisation without fear of getting it wrong. Data science is all about trial and error, so I say just go for it! I used nothing but my own intuition to come up with my process; there was no knowledge of best practices behind it, and I still came up with a colourful visual that communicates something about the data.
I hope you enjoyed this, thanks for reading!