Stop using spreadsheets for deodorizing that data and clean your .CSV file with Pandas

[save yourself some headaches and time]


Featured on Hashnode

Sure, spreadsheets are clean, they look sexy, and we all love clean, easy-to-read information. But what happens when that data is fouled, or someone forgot to put in a proper NULL/NaN value? It can get smelly very quickly and becomes no fun to deal with (especially when you are dealing with lots and lots of data).

A little background, and a little why

I was a data analyst some time ago. During that time, I would use pandas to read .CSV files received from clients. Sometimes this data would come in with errors, whether it be a missing column of data, an extra comma in a row, escaped characters, and the list goes on. The job needed constant supervision of file-upload failures and quick data manipulation to ensure that hospitals got what they needed from us, and got it as quickly as possible. That is where your friendly neighborhood bearded Oscar came in.

I have been meddling around in Machine Learning as of late, giving myself migraines, endless frustration, and ramping up that impostor syndrome (because we don't all have enough of that, amirite?), when I realized that my 'data cleaning' was a little worse for wear and I needed to 'clean up my act' badumdum-tiss. So here are some of the things I used to tear through my daily analyst work so I could spend my time helping other departments.

Import that data set, yo!

I will be using a data set from Kaggle here. It has plenty of columns and rows to work with and manipulate. I have placed my data (in CSV form, of course) inside a folder labeled 'data' because I'm no hippie, and I like my directories organized. We will also be using Jupyter Notebooks for running through this; you can download that here. (You can also use something like Google's Colaboratory, which I sometimes use as well.) Enough of that; let's import pandas:

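The import itself is one line; aliasing pandas as `pd` is the standard convention:

```python
# Import pandas under its conventional alias
import pandas as pd
```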

When we import pandas, we alias it as 'pd' so we don't have to type out 'pandas' all the time. Easy enough. Now all we have to do is store the file path for our data set in a variable, read the .CSV with pandas, and print out the head (in this case the first 10 rows, but you can use another number) to see what we have going on:

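A sketch of the read-and-peek step. Since I can't bundle the Kaggle file here, this uses a tiny inline sample with the same headers; in practice you would pass your actual file path (something like 'data/your_download.csv') straight to pd.read_csv():

```python
import pandas as pd
from io import StringIO

# Tiny inline stand-in for the Kaggle CSV (same column headers);
# swap StringIO(csv_text) for your real file path in practice.
csv_text = """PAYS,Code,Location,Longitude,Latitude,Date,I131(Bq/m3),Cs134(Bq/m3),Cs137(Bq/m3)
SE,1,STOCKHOLM,18.06,59.33,86/04/29,0.01,0.02,0.05
DK,2,RISOE,12.09,55.69,86/04/30,0.03,0.01,0.04
"""

df = pd.read_csv(StringIO(csv_text))
print(df.head(10))  # peek at the first 10 rows
```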

Overview of the data

Let's take a look at our data set to see what kind of trouble we can get ourselves into. We can work out what each column header means simply by looking at the raw data:

  • PAYS = Country code
  • Code = Description code
  • Location = Location (city or state)
  • Longitude = X coordinate
  • Latitude = Y coordinate
  • Date = Date
  • I131(Bq/m3) = Iodine-131
  • Cs134(Bq/m3) = Caesium-134
  • Cs137(Bq/m3) = Caesium-137

Let's decipher what we might not need in this data set. Say we want to eliminate all columns except for Location, Date, and the Iodine and Caesium readings. We will need to drop PAYS, Code, Longitude, and Latitude. Let's throw these in a list and use df.drop():
Note: 'df' is just the name I gave my data frame; if you like, you can call it 'BillyJack,' and it will work the same.

Note: keep in mind that you aren't deleting the data from the file, just dropping it from the data frame you are working with. You can use df.columns to list your columns and prove this:

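A minimal sketch of the drop, using a one-row stand-in frame with the data set's headers (the values are made up for illustration):

```python
import pandas as pd

# Minimal stand-in frame with the same headers as the data set
df = pd.DataFrame({
    "PAYS": ["DK"], "Code": [1], "Location": ["RISOE"],
    "Longitude": [12.09], "Latitude": [55.69], "Date": ["86/04/30"],
    "I131(Bq/m3)": [0.03], "Cs134(Bq/m3)": [0.01], "Cs137(Bq/m3)": [0.04],
})

# Throw the unwanted columns in a list and drop them
to_drop = ["PAYS", "Code", "Longitude", "Latitude"]
df = df.drop(columns=to_drop)

print(df.columns)  # the dropped columns are gone from the frame
```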

Single record lookup

You can look at a single record by using integer location (iloc[]; it supports list-style indexing, so you can slice it as well):

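A quick sketch of iloc[] on a small stand-in frame (sample values are my own):

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["RISOE", "STOCKHOLM", "PARIS"],
    "I131(Bq/m3)": [0.03, 0.01, 0.02],
})

print(df.iloc[1])    # single record at integer position 1
print(df.iloc[0:2])  # list-style slicing works too
```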

Filtering the data

Looking up specific values under specific columns

If we wanted to look up anything that is under the Location 'RISOE':

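The lookup is a boolean mask over the Location column; here sketched on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["RISOE", "STOCKHOLM", "RISOE"],
    "I131(Bq/m3)": [0.03, 0.01, 0.05],
})

# Boolean mask: keep only rows where Location is 'RISOE'
print(df[df["Location"] == "RISOE"])
```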

If we wanted to make this its own data frame that we could reference again instead of doing things over, we could assign a variable to it.

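Same filter as before, but assigned to its own variable (the name 'risoe_df' is my own choice) so it can be reused:

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["RISOE", "STOCKHOLM", "RISOE"],
    "I131(Bq/m3)": [0.03, 0.01, 0.05],
})

# Store the filtered frame so we can reference it again later
risoe_df = df[df["Location"] == "RISOE"]
print(risoe_df)
```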

Let's say we want to change those filthy radiation column names; that can easily be done using .rename(columns={}) like this:

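A sketch of the rename; the new names here are my own picks, and the old names must match your frame's headers exactly:

```python
import pandas as pd

# Stand-in frame with just the radiation columns
df = pd.DataFrame({
    "I131(Bq/m3)": [0.03], "Cs134(Bq/m3)": [0.01], "Cs137(Bq/m3)": [0.04],
})

# Map old column names to friendlier ones
df = df.rename(columns={
    "I131(Bq/m3)": "Iodine131",
    "Cs134(Bq/m3)": "Caesium134",
    "Cs137(Bq/m3)": "Caesium137",
})
print(df.columns.tolist())
```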

Conclusion

There are tons of things that you can do with Pandas; it is pretty handy, and this is in no way an exhaustive list of what it can be used for. It is good at getting your data sets cleaned up so you can use libraries like matplotlib, seaborn, or plotly for visualizations, make quick calculations on your data sets, and edit them quickly. In my opinion, it's much better than using some spreadsheet program. I will probably release an actual series going through data sets at a later date and get more exhaustive with the content, but for now, download it and mess around.

Is there anything you specifically love about Pandas or even Jupyter Notebooks? Maybe some cool tips that I probably skimmed over that someone new would appreciate? Let us know in the comments!