SQL For Data Scientists

SQL is a useful part of a data scientist's toolkit and it can feel like an intimidatingly big area to try and learn alongside all the other data science concepts. I want to present a few key concepts that are enough to get you up and running with SQL!

From what I can tell, the biggest difference between data science curricula and data science job postings is usually knowledge of SQL. I assume most businesses want a data scientist who knows SQL because a lot of corporate data is stored in some sort of relational database. For some reason though, data science courses don't always tend to teach it explicitly.

I wanted to collect some of the concepts which I think are useful for aspiring data scientists to learn about databases and SQL. I'll also link to appropriate parts of the w3schools SQL tutorials along the way.

Query Syntax

Obviously the first step is to understand how to write a SQL query.

SQL is a declarative language. All that means is that when you write a SQL query, you're expressing what the result should look like, rather than how to achieve it.

Let's look at a basic SQL query and see how that's the case.

SELECT
    name, age, height
FROM
    people
WHERE
    job = "data scientist"

SQL is not case sensitive, but I capitalised the keywords (which is a typical thing to do anyway).

If you know that you have a table of data, called people, you can pretty much work out what this query will do. The declarative syntax means you can specify the data source (the people table), what you want to extract (name, age and height) and any filters you want to apply (only get the data for people who are data scientists).

There is a lot going on under the hood in terms of the computer deciding how best to store and index the data, but when you write queries you don't want to have to care about that, you just want your results.

The main keywords you need to know are:

  • SELECT...FROM (to select rows from a specific table)
  • WHERE (to filter rows - optional)
  • INSERT (to insert new rows)
  • UPDATE (to update existing rows)
  • DELETE (to remove rows)
  • GROUP BY (to group data into... well, groups)
  • CREATE TABLE (to create new tables)
  • ALTER TABLE (to make changes to tables like adding new columns)
  • JOIN (to join multiple tables together)

The JOIN keyword

I left the JOIN keyword until last in that list because it warrants its own section.

Merging multiple data sources is a staple data science operation, and that's no different when working with SQL. If you've used the merge function in pandas you'll have seen this already, but let's see how they compare.

Let's take the example of joining two data sources with pandas. One of them is a csv of people, with names, ages, heights and jobs. The other is a csv of phone numbers linked to people's names. The name column is common between both data sources.

import pandas as pd

df = pd.read_csv("people.csv")
phone_numbers = pd.read_csv("phone_numbers.csv")

merged = pd.merge(df, phone_numbers, on="name", how="inner")

That how keyword corresponds to the type of join in SQL. The same operation in SQL looks like this (assuming we have a people and phone_numbers table in a database, rather than csv files):

SELECT
  people.name, people.age, people.height, phone_numbers.phone_number
FROM
  people
  INNER JOIN phone_numbers ON people.name = phone_numbers.name

I've specifically stated in the SELECT clause where the columns come from, because both tables have a name column and SQL would have gotten confused otherwise.

The types of SQL join correspond to the valid values of the "how" keyword in the pandas merge function. They are:

  • Inner Join - only rows where both tables have a value are returned
  • Left Outer Join - only rows where the table on the left of the statement has a value are returned
  • Right Outer Join - only rows where the table on the right of the statement has a value are returned
  • Full Outer Join - all rows are returned from both tables

The outer joins let you keep rows from either table if there are no corresponding rows in the other table.

So a left join in the previous statements would have shown all people, regardless of whether they had a phone number in the second data source. The rows of people who don't have a phone number would have shown a NULL value for the phone number. Using an inner join wouldn't have returned them at all.

It can be helpful to see this visually, and the w3schools pages do that already, but here's another good example.

SQL Tools for Data Science

If you know the basic query syntax and the various join types, you're probably equipped enough to start pulling data out of any SQL database. Programmers working with SQL often use specific tools to access databases, such as Microsoft's SQL Server Management Studio.

However, as a data scientist you may want to do this straight from your code instead. You have a few options for this.

Also, if you're familiar with pandas but not with SQL, the pandas documentation has a section with pandas commands and the associated SQL queries.

Conclusion

I'd argue that's all you need to get up and running.

W3schools is a great interactive resource to test your sql queries, but there are plenty of other good learning resources like Codecademy.

There is more to know of course about relational databases. I haven't covered the concepts of primary keys, foreign keys, or indexes because these are more important for database design rather than data retrieval. Designing a relational database has its own set of skills and required knowledge, but if your only interaction is retrieving data, you shouldn't have to worry about it.

I may write a post about database design in the future, but I'd strongly argue that it's an optional skill for most data scientists.

About David

I'm a freelance data scientist consultant and educator with an MSc. in Data Science and a background in software and web development. My previous roles have been a range of data science, software development, team management and software architecting jobs.

Related articles tagged Programming for Data Scientists

Method Chaining in Pandas

Programming for Data Scientists

A discussion of "method chaining" in pandas. Used for better readability, or harder debugging, depending on how you look at it.

SQL For Data Scientists

Programming for Data Scientists

SQL is a useful part of a data scientist's toolkit and it can feel like an intimidatingly big area to try and learn alongside all the other data science concepts. I want to present a few key concepts that are enough to get you up and running with SQL!

Turning Jupyter Notebooks into Reusable Scripts

Programming for Data Scientists

As part of my commitment to occasionally talk about "programming for data scientists", I want to share ideas that will facilitate this to help data scientists focus on important stuff. In this post I want to share some thoughts on how to make your Jupyter notebooks easier to "productionise".

Contact me

Please give me a brief idea of what you're looking for help with, and I'll get back to you very soon. If you're having trouble with the form below, you can also email me at hello@davidasboth.com

Twitter LinkedIn