If you’re interested in data science, I have some potentially bad news – you need to know some maths.
The good news is that despite what some resources might suggest, you don’t need that much. You still need more than zero, but chances are you’ll have seen a lot of it in high/secondary school.
Here’s an overview of what topics you should brush up on.
I’ve always thought the words “linear algebra” sound more intimidating than they need to.
Basically, you need to know what a vector and a matrix are, the notation to represent them, and how to do basic operations with them (addition, multiplication, transposing, dot products, that sort of thing).
Datasets are usually represented as matrices where each row is a data point and each column is a feature. When you talk about single data points or parameters to machine learning algorithms, they’re typically vectors. You can avoid a lot of confusion by having your vector/matrix knowledge up to date.
Arguably, if you only brush up on one thing, it should be this.
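To make the rows-and-columns picture concrete, here’s a minimal sketch using numpy. The dataset and the weight vector are made up purely for illustration; the point is only to show the basic operations mentioned above.

```python
import numpy as np

# Hypothetical dataset: 3 data points (rows) x 2 features (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Hypothetical parameter vector: one weight per feature.
w = np.array([0.5, -1.0])

first_point = X[0]               # a single data point is a vector
dot = np.dot(first_point, w)     # dot product of two vectors
predictions = X @ w              # matrix-vector product: one value per row
X_t = X.T                        # transpose: shape goes from (3, 2) to (2, 3)
gram = X.T @ X                   # matrix-matrix product: shape (2, 2)

print(X.shape, X_t.shape)
print(dot)
print(predictions)
print(gram)
```

If you can predict the shape of each result before running it, you’re most of the way there.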
You should know your means from your medians, your Gaussian distribution from your multinomial, and you should know about the Central Limit Theorem.
Understanding sampling and hypothesis testing is also important.
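Two of these ideas are easy to poke at with a few lines of Python. This is only a sketch: the salary figures are invented, and the sample size and number of repetitions are arbitrary choices. The first part shows how a single outlier moves the mean but barely touches the median; the second averages uniform random numbers many times over, which is the Central Limit Theorem in miniature.

```python
import random
import statistics

# Hypothetical salaries (in thousands); one outlier drags the mean upwards.
salaries = [30, 32, 35, 38, 40, 1000]
mean_salary = statistics.mean(salaries)      # pulled towards the outlier
median_salary = statistics.median(salaries)  # middle of the sorted values

# Central Limit Theorem sketch: average 30 uniform(0, 1) draws, 5000 times.
rng = random.Random(0)  # fixed seed so the run is repeatable
sample_size = 30
sample_means = [
    statistics.mean(rng.random() for _ in range(sample_size))
    for _ in range(5000)
]

# Theory: uniform(0, 1) has mean 1/2 and variance 1/12, so the sample means
# should cluster around 0.5 with standard deviation sqrt(1/12) / sqrt(n).
expected_sd = (1 / 12) ** 0.5 / sample_size ** 0.5

print(mean_salary, median_salary)
print(statistics.mean(sample_means), statistics.stdev(sample_means), expected_sd)
```

Plot a histogram of `sample_means` and you should see something close to a bell curve, even though the underlying draws are uniform.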
“Data scientists are statisticians because being a statistician is awesome and anyone who does cool things with data is a statistician.”
Robert Rodriguez, President, American Statistical Association
OK so the head of the American Statistical Association might not be the most reliable source on how useful statistics is.
Overlooking that, the point is that data science is in many ways computational statistics. You can’t get away from the fact that understanding fundamental statistical concepts is essential to make any sense of data.
Being comfortable with how probabilities are written down and being able to read a probability distribution is enough to cover your bases. You shouldn’t be thrown off by phrases like “conditional probability” or “random variable”.
Oh, also learn and understand Bayes’ Theorem.
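Bayes’ Theorem says P(A|B) = P(B|A) × P(A) / P(B). Here’s a small worked example as a sketch; the numbers (a 1% base rate, a test that catches 90% of cases and gives a false positive 5% of the time) are made up for illustration.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only to illustrate the calculation.
p_disease = Fraction(1, 100)             # P(disease): the prior
p_pos_given_disease = Fraction(9, 10)    # P(positive | disease)
p_pos_given_healthy = Fraction(5, 100)   # P(positive | no disease)

# P(positive), summing over both ways a positive result can happen.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' Theorem: P(disease | positive).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(p_disease_given_pos, float(p_disease_given_pos))
```

With these numbers, a positive result means only about a 15% chance of actually having the disease, which is much lower than most people guess.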
Understanding probabilities is a useful life skill anyway. Humans are typically not wired to intuitively understand probabilities (I recommend The Drunkard’s Walk on this subject). Being able to do it is a good skill for a data scientist. Also, many machine learning algorithms deal with probabilities and probability distributions one way or another.
Calling calculus “optional” might be controversial among some data scientists. I’d argue that you can go a long way in data science without ever calculating a partial derivative.
Having said that, knowing what a partial derivative is and what it’s used for can’t hurt. Some machine learning algorithms (neural networks, linear regression) require that understanding if you want to go into the details. Don’t go anywhere near “gradient descent” until you understand why you’d want to set a partial derivative to zero.
So the concepts from high school level calculus (derivatives, integrals, the chain rule) are useful to know, but don’t start there. The other things I’ve mentioned above are more important.
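To show why setting a derivative to zero matters, here’s a minimal sketch that fits a one-parameter line y = w·x by least squares in two ways. The data points, learning rate and step count are all arbitrary choices for illustration. The loss is L(w) = Σ (w·xᵢ − yᵢ)², so dL/dw = 2 Σ xᵢ(w·xᵢ − yᵢ); setting that to zero gives w = Σ xᵢyᵢ / Σ xᵢ². Gradient descent reaches the same place by repeatedly stepping against the derivative.

```python
# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def gradient(w):
    """Derivative of the squared-error loss with respect to w."""
    return 2 * sum(x * (w * x - y) for x, y in zip(xs, ys))

# Option 1: set the derivative to zero and solve for w directly.
w_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Option 2: gradient descent, taking small steps downhill from a guess.
w_gd = 0.0
learning_rate = 0.01  # arbitrary, but small enough to converge on this data
for _ in range(100):
    w_gd -= learning_rate * gradient(w_gd)

print(w_closed, w_gd)
```

Both approaches land on the point where the derivative is zero. Gradient descent earns its keep on problems where you can’t solve for that point directly.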
Khan Academy has been my favourite resource for brushing up on maths subjects. Something about Sal Khan’s teaching style really resonates with me.
For more advanced topics, the YouTube channel mathematicalmonk is also excellent.
For linear algebra, getting familiar with the numpy library is also a good idea if you already know some Python, as numpy encourages you to deal with vectorised operations.
If you already have programming experience, check out Project Euler – it’s a series of mathematical challenges you solve by writing code. It might not be immediately related to data science, but it’s a great way to get a bit more motivated about maths.
Hopefully I’ve convinced you that you don’t need a PhD in maths to embark on the road to data science!
Footnote: This is the 22nd entry in my 30-day blog challenge.