Introduction
I have been away from SQL for quite some time, so I decided I was due for a refresher in data analytics, and I will be adding short blogs on the topic as I go through it. I have neglected my Hashnode blog like a stepchild I never wanted; I'm so sorry, Hashnode. Since I am covering this for data analysis work, I won't go over dropping databases, creating tables, yada yada; this will be basic querying and general knowledge.
Here is what I will cover in this write-up:
Rules to follow
Table Naming
Field Naming
Basic SQL
SELECT
FROM
WHERE
AS
Arithmetic
DISTINCT
VIEW
What I am using
I am using a dataset from Kaggle, which can be found at this link here: Latest Netflix TV shows and movies | Kaggle. The database is a PostgreSQL database, though I don't believe there will be any meaningful differences if you decide to use MySQL for this write-up. (One difference worth knowing: SQL Server uses TOP to cap the number of results, while PostgreSQL and MySQL use LIMIT, which is what we will use here.) For the IDE I am using JetBrains DataGrip. I don't think there is a free tier, but there is a free 30-day trial so you can work along with me if you like, or you can use MySQL Workbench, pgAdmin for PostgreSQL, or one from a whole host of other alternatives.
Rules to Follow
When working with databases (I will probably use 'db' at least once or twice in the article in case it trips you up), it is generally good practice to use all lowercase in your table naming as well as your field naming. If your field name would have a space in it, use an underscore to separate the words, e.g. field_naming (this is referred to as 'snake case'). Some people use Pascal case or camel case in their databases, so when in Rome, good folks, when in Rome. Column names should also be singular, and unique from others you have already made.
Setting up
Install Postgres
- The installation of Postgres is easy. Head on over to PostgreSQL: The world’s most advanced open-source database, pick the latest release in your favorite flavor, be it Linux, Windows, or Mac, and follow the prompts.
Install DataGrip (optional: you can use another SQL client instead)
Datagrip is a Jetbrains product that can be found here: DataGrip: The Cross-Platform IDE for Databases & SQL by JetBrains. Datagrip was an easy choice for me since I already use JetBrains products, and I really love them.
You will want to start up a project, make a repo or destination folder, and save your project there. Pretty self-explanatory.
In the new project, there is a little plus button with a down arrow that you will need to click to select your data source.
Click on the Postgres option.
Whatever password you set up during the PostgreSQL installation is the password you will be using here.
Bingo-bango-bongo. When your database is created, you will need to create a table to load your .csv file into. With DataGrip, it's as easy as taking that .csv file, placing it in your project folder, and dragging it over onto your database in DataGrip's database explorer.
IMPORTANT NOTE: I don't think there has ever been a time when I didn't need to clean up data; you literally can't avoid it. Even if you had a clean set of data, you might have to set it up a little differently to run things the way you want. What I did with the dataset from Kaggle was take it into Excel and clean up the date so I could use it in a DATE format. I saved the .csv with a pipe delimiter instead of a comma due to how many commas there were inside the cast column. I then imported that into my project folder and simply dragged it over into the database, where it created the table as well as all the columns. After the import, I also changed release_year to SMALLINT instead of TEXT.
After I got everything set up, I realized upon export that Excel set the date_added column as a 5-digit number. Whoops. Since this is fixable, but not needed in this run-through, I am going to drop that column.
Now that we have all the data imported into our locally hosted table, we can start having some fun with learning basic queries.
Basic SQL
SELECT and FROM
SELECT and FROM are what you are going to use to choose the column and the table you are working with.
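Here is a first query to run against our table:

```sql
-- Pull every column of every row from the table
SELECT *
FROM netflixschema.tbl_netflix_tv_and_movies;
```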
Let’s break this query down.
SELECT is saying "I want this." The '*' in the query is a wildcard (we will go over these a little bit more later); it is saying 'all of it.' FROM is where we are saying "here is where this information is at." 'netflixschema' is the schema (the structure the tables live under), the '.' points to the table under that specific schema, and 'tbl_netflix_tv_and_movies' is our actual table where our data is contained. It is good practice to finish off a query with a semicolon.
In total, and in plain English, we are saying "I want all the things from this table." Easy enough, right?
Datagrip automatically gives us the top 10 results in our query. We can change the default by using the dropdown:
Or we can change our query itself by adding LIMIT followed by the desired number of rows; I used 5.
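As a sketch:

```sql
-- Return only the first 5 rows of the result
SELECT *
FROM netflixschema.tbl_netflix_tv_and_movies
LIMIT 5;
```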
We can also get two specific columns and 'trim' out the rest of the content by separating the column names with a comma. Let's say we want to get the title and the type only.
Pretty cool, right? If you look at the way the database is set up, the type comes before the title. We can change this by how we construct our query, simply by saying we want the title first.
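Here's the sketch, with title listed first so it comes back first:

```sql
-- Only these two columns, in the order we name them
SELECT title, type
FROM netflixschema.tbl_netflix_tv_and_movies;
```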
WHERE
WHERE is a filter that we can apply to get to a certain name, number, date, etc. that we are looking for. In the above snip, we have a type column that has a TV Show called Frasier, let's see if we can get to that with our query by using the WHERE clause:
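Assuming the show's name lives in the title column, a query like this does the trick:

```sql
-- Filter the rows down to the single show we are after
SELECT *
FROM netflixschema.tbl_netflix_tv_and_movies
WHERE title = 'Frasier';
```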
Fun, yeah? Even simple queries like this can get to records so much quicker. This is especially useful with massive datasets where things like Excel would fall way short.
AS
AS creates an alias for a column. Let's pull the title and type of show, but give the columns an alias.
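A sketch, where the alias names (show_title and show_type) are just examples of my own choosing:

```sql
-- AS renames the column in the result set only; the table is untouched
SELECT title AS show_title,
       type  AS show_type
FROM netflixschema.tbl_netflix_tv_and_movies;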
By doing this, we can make different column views, but we can go over that later.
Arithmetic
Math is something we have all been through, so I am just going to run through the operators that SQL uses quickly. The dataset we are working with really isn't suited for math, but we can use it in a way that gets the point across. We will use our 'release_year' and 'duration' columns for this. Since these were both set up as ints (integers), we can do the math stuffies with them.
Addition
Here we are showing the original release_year and the duration, and then adding them together with a new alias in the third column.
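As a sketch (the alias name is mine):

```sql
-- Third column is the sum of the first two
SELECT release_year,
       duration,
       release_year + duration AS add_result
FROM netflixschema.tbl_netflix_tv_and_movies;
```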
Subtraction
Here we are showing the original release_year and the duration, and then subtracting one from the other with a new alias in the third column.
Multiplication
Here we are showing the original release_year and the duration, and then multiplying them together with a new alias in the third column.
Division
Here we are showing the original release_year and the duration, and then dividing one by the other with a new alias in the third column.
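Subtraction, multiplication, and division all follow the same pattern as addition, so here they are combined into one sketch (the alias names are mine). Note that dividing one integer by another in PostgreSQL truncates the result to a whole number:

```sql
-- Each derived column applies one operator to the same two inputs
-- (assumes no rows have a duration of 0, which would break the division)
SELECT release_year,
       duration,
       release_year - duration AS sub_result,
       release_year * duration AS mult_result,
       release_year / duration AS div_result
FROM netflixschema.tbl_netflix_tv_and_movies;
```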
Modulus
You may not have seen a modulus if you aren't a programmer or if this is the first time you have dived into SQL. The modulus is the remainder left over after a division. Let's look at the result:
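A sketch using the same two columns (the alias is mine):

```sql
-- % gives the remainder of release_year divided by duration
SELECT release_year,
       duration,
       release_year % duration AS mod_result
FROM netflixschema.tbl_netflix_tv_and_movies;
```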
That would have given you quite a different answer than you were looking for if you had used it as a division operator.
Greater Than, Less Than, Equal To
These operators do what they say on the tin: greater than (>), less than (<), and their 'or equal to' variants (>=, <=) filter rows against a value in a WHERE clause. Since greater than or equal to is just the mirror image of less than or equal to, I won't show every combination. One fun sanity check: compare a column against itself with less than or equal to, which should come back true for every row.
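Sketches of each, with the comparison values (2020, 2000, 2015) chosen arbitrarily:

```sql
-- Greater than
SELECT title FROM netflixschema.tbl_netflix_tv_and_movies WHERE release_year > 2020;

-- Less than
SELECT title FROM netflixschema.tbl_netflix_tv_and_movies WHERE release_year < 2000;

-- Less than or equal to
SELECT title FROM netflixschema.tbl_netflix_tv_and_movies WHERE release_year <= 2015;

-- Comparing a column to itself is always true, so every row comes back
SELECT title FROM netflixschema.tbl_netflix_tv_and_movies WHERE release_year <= release_year;
```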
DISTINCT
DISTINCT is a great little statement that takes a dataset and filters out any duplicates. Let's say we want to pull all the unique values out of the duration column.
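As a sketch:

```sql
-- One row per unique duration value; duplicates are collapsed
SELECT DISTINCT duration
FROM netflixschema.tbl_netflix_tv_and_movies;
```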
COUNT
COUNT is pretty cool. You can easily use it with DISTINCT to get a count of all of those unique values above.
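For example (the alias is mine):

```sql
-- How many unique duration values exist in the table
SELECT COUNT(DISTINCT duration) AS unique_durations
FROM netflixschema.tbl_netflix_tv_and_movies;
```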
We can also use WHERE to filter for how many records are in the category of 'Not Rated'.
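Assuming the ratings live in a column named rating (the alias is mine):

```sql
-- Count only the rows whose rating is exactly 'Not Rated'
SELECT COUNT(*) AS not_rated_count
FROM netflixschema.tbl_netflix_tv_and_movies
WHERE rating = 'Not Rated';
```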
VIEW
Creating a view is like having an up-to-date dataset that you can look at very quickly. Let's say the db I am pulling from is constantly being updated with new TV shows, and I want to find something suitable for my kids to watch, within the parameters of it being suitable for the eyeholes of ages 14 and under. Easy peasy with a view.
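A sketch of such a view; the view name, the three-column list, and the set of ratings I am treating as suitable for 14-and-under are all assumptions you would adjust to your own data:

```sql
-- The view re-runs its SELECT each time you query it, so it stays current
CREATE VIEW kid_friendly_shows AS
SELECT title, type, rating
FROM netflixschema.tbl_netflix_tv_and_movies
WHERE rating IN ('TV-Y', 'TV-Y7', 'TV-G', 'TV-PG', 'TV-14');
```

After creating it, you can query it like any other table: `SELECT * FROM kid_friendly_shows;`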
You can see that we now have a new view with the three columns that we asked SQL to create for us in the left sidebar.
How cool is that?
Conclusion
SQL is such an awesome and powerful tool to know how to use, especially when you get into things like joins and subqueries. I will say, it has been fun to be back in the saddle again with it even though I have been super rusty. I will put together another more advanced write-up at a later date. If you are reading this and you are new to SQL, I hope that you found this helpful!