Learn Data Analysis with Python: A Case Study
In this post, I’ll walk through a brief tutorial on Python for data analysis. Analyzing data can be an enjoyable, well-paying job: Payscale estimates put the average entry-level salary above USD 60,000. However, there are hard-skill requirements that you must meet, such as knowledge of SQL or of a data analysis programming language like Python or R.
To illustrate using Python for data analysis, we’ll be studying infidelity in marriages, which should be an interesting project. Let’s get started!
How to begin a data analysis project using Python
First, we need an environment in which to write and run Python code, typically an integrated development environment (IDE). My choice is Google Colaboratory (Colab), which I feel provides the most frictionless experience. You can use it directly from your Google Drive. In the video below, I walk you through how to set up a Google Colab notebook, and then through the tutorial itself.
Now that we’re set up, we need tools. In Python, these are called libraries. Here, we will add Pandas and Seaborn, which Python will use to manipulate, analyze, and visualize the data.
- Pandas is an open-source library that assists Python in manipulating and analyzing data.
- Seaborn is a library that Python uses to create meaningful data visualizations.
We need to let Python know we’re going to use these data analysis libraries, which we do by importing them:
# Import libraries
import pandas as pd
import seaborn as sns
In Python, it’s common to give a library a short alias with “as” when you import it, such as pd for Pandas and sns for Seaborn. This keeps the code shorter whenever you call the library’s functions.
Next, we need data. For this tutorial, I made the data available online, which you can get by running the following code:
# Load data
data = pd.read_csv("https://bit.ly/udemy_dataset")
We always like to have a look at the data before analyzing it. The head() command displays the first 5 rows of the data set:
# Look at the first 5 rows of the data
data.head()
We have a lot of information here, and some of it is not obvious at first. For instance, why is occupation ranked from 1 to 7? The original study that created this data set has an extensive description of each variable, and it’s a data analyst’s role to master the data. Here’s an overview of the variables:
- Affairs: How often respondents engaged in infidelity during the past year. 0 = none, 1 = once, 2 = twice, 3 = three times, 7 = 4 – 10 times, 12 = monthly or more.
- Gender: The person’s gender.
- Age: How old the person is.
- Yearsmarried: How many years the person has been married.
- Children: Whether or not the person has children.
- Religiousness: How religious the person says they are, on a scale of 1 to 5, with 5 being very religious.
- Education: Level of education. 9 = grade school; 12 = high school; 14 = some college; 16 = college; 17 = some graduate work; 18 = master’s degree; 20 = PhD or MD.
- Occupation: The person’s job. 7 = physician or CEO of large company; 6 = professional with advanced degree; 5 = managerial, administrative, business; 4 = teacher, counselor, social worker, or nurse; 3 = white-collar, like sales or clerical; 2 = farming, semi-skilled or unskilled worker; 1 = student.
- Rating: How high the person rates their satisfaction in their marriage on a scale of 1 to 5, with 5 being very satisfied.
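Since several variables are stored as numeric codes, it can help to translate them into readable labels while exploring. The following is a minimal sketch, assuming lowercase column names such as affairs and occupation; the small sample table and the variable names are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical mini-sample for illustration; not rows from the real data set.
sample = pd.DataFrame({
    "affairs": [0, 0, 7, 12, 1],
    "occupation": [1, 5, 6, 4, 7],
})

# Code-to-label lookups taken from the variable overview.
affairs_labels = {
    0: "none", 1: "once", 2: "twice", 3: "three times",
    7: "4-10 times", 12: "monthly or more",
}
occupation_labels = {
    1: "student",
    2: "farming, semi-skilled or unskilled worker",
    3: "white-collar (sales or clerical)",
    4: "teacher, counselor, social worker, or nurse",
    5: "managerial, administrative, business",
    6: "professional with advanced degree",
    7: "physician or CEO of large company",
}

# Series.map replaces each code with its label; codes missing from the
# dictionary would become NaN.
sample["affairs_label"] = sample["affairs"].map(affairs_labels)
sample["occupation_label"] = sample["occupation"].map(occupation_labels)

print(sample)
```

Keeping the original numeric columns next to the labels means the statistics later on can still use the numbers.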
Now we analyze the data using Python’s libraries
One of the most common data analysis commands is to look at the summary statistics of the numeric variables. Luckily, this is very simple.
# Summary statistics
data.describe()
That is a lot of statistics and information. We have the number of observations per variable (count), the mean (mean), the standard deviation (std), the minimum (min), the maximum (max), and the quartiles (25%, 50%, and 75%). The 50% value is also called the median. Here are some examples of what we see:
- We have 601 observations.
- On average, respondents report about 1.5 affairs, are 32.5 years old, have been married for 8 years, rate themselves 3 in religiousness, have studied for 16 years, and rate their marriage 3.9 out of 5. I omitted occupation because its codes are job categories, so an average does not make sense.
- For affairs specifically, we see that even at 75%, the value is 0. This means at least three quarters of respondents report no affairs, so no more than a quarter of the sample admits to cheating on their spouse. This gives me a new idea to analyze!
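To make the 75% reading concrete, here is a small sketch with made-up numbers (not the real data set). It shows that when most values are 0, the 75th percentile is 0 as well, and how to compute the share of zeros directly.

```python
import pandas as pd

# Hypothetical sample of 10 respondents, 8 of whom report no affairs.
affairs = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 3, 12], name="affairs")

# 75th percentile: the value below which 75% of the observations fall.
q75 = affairs.quantile(0.75)

# Share of respondents reporting no affairs: the mean of a True/False
# column is the fraction of True values.
share_no_affairs = (affairs == 0).mean()

print(f"75th percentile: {q75}")
print(f"Share reporting no affairs: {share_no_affairs:.0%}")
```

On the tutorial's data, the same two lines could be applied to data["affairs"], assuming the column carries that name.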
Visualizing the data with histograms
If you want to see the distribution of affairs in the sample, the simplest way is to draw a histogram using the Seaborn library. A histogram groups the data into buckets (bins) and shows how many observations fall into each one. You can specify the bins yourself if needed.
# Histogram of the affairs variable
sns.histplot(data = data, x = "affairs")
Because the survey grouped the higher frequencies into single codes (7 for 4 to 10 times, 12 for monthly or more), the insights we get are limited. We confirm that most people do not report affairs, but we cannot really see a pattern among those who are unfaithful. So, it is time to move on.
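Since the affairs variable only takes a handful of coded values, a plain frequency table can complement the histogram. This is a hedged sketch on made-up numbers; the series below is hypothetical and stands in for the affairs column.

```python
import pandas as pd

# Hypothetical coded responses for illustration only.
affairs = pd.Series([0, 0, 0, 0, 0, 0, 1, 3, 7, 12], name="affairs")

# Count how often each code occurs, ordered by code rather than by frequency.
counts = affairs.value_counts().sort_index()

# The same table expressed as shares of the sample.
shares = affairs.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```

Unlike a histogram with automatic bins, this table keeps each code separate, so the gap between 3 and 7 is not hidden.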
What correlates with affairs?
Correlation measures the strength and direction of the linear relationship between 2 variables. The value of the correlation metric varies between -1 and 1. A value of 1 is a perfect positive relationship: as the value of one variable increases, the value of the other variable also increases. A value of -1 is a perfect negative relationship: as the value of one variable increases, the value of the other variable decreases. The closer the value is to either extreme, the stronger the relationship.
An example of a strong positive relationship would be rain and umbrella use: the more it rains, the more people use umbrellas. An example of a strong negative relationship would be sunshine and umbrella use: the sunnier the day, the fewer umbrellas you see. If the correlation is zero, there is no linear relationship. An example of zero correlation would be eating chocolate and swimming in the pool. There is no connection between the two, though the idea does not seem bad at all!
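The two extremes can be reproduced with a toy example. The sketch below uses invented numbers and hypothetical column names (rain_mm, umbrellas, sunglasses); it relies on the Pandas corr() method, which computes the Pearson correlation by default.

```python
import pandas as pd

# Invented toy data: umbrellas rise with rain, sunglasses fall with rain.
toy = pd.DataFrame({
    "rain_mm": [0, 2, 4, 6, 8],
    "umbrellas": [1, 3, 5, 7, 9],
    "sunglasses": [10, 8, 6, 4, 2],
})

# Series.corr returns the Pearson correlation between two columns.
r_positive = toy["rain_mm"].corr(toy["umbrellas"])
r_negative = toy["rain_mm"].corr(toy["sunglasses"])

print(f"rain vs umbrellas:  {r_positive:.2f}")
print(f"rain vs sunglasses: {r_negative:.2f}")
```

Because both toy relationships are exactly linear, the results sit at the two ends of the scale. Real survey data will almost always land somewhere in between.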
For a meaningful correlation analysis, it’s wise to choose the variables first, and then perform the analysis. Here I’ll choose affairs, age, religiousness, and rating since I am a fan of simple statistics.
Because I like to visualize the correlation using a heatmap, we are going to learn how to do one. A heatmap is particularly useful for those who think visually, since the color of each cell changes with its value.
# Pick the variables
data_correlation = data[["affairs", "age", "religiousness", "rating"]]

# Correlation heatmap - the command for a correlation is corr()
sns.heatmap(
    data = data_correlation.corr(),
    annot = True,
    fmt = '.2g',
    center = 0,
    cmap = 'coolwarm',
    linewidths = 1,
    linecolor = 'black',
)
There is quite a bit of information here. Affairs correlates negatively with rating. This means that happier marriages are associated with less infidelity, which makes sense to me, although a correlation alone does not tell us which one causes the other. It’s also interesting that religiousness has a negative correlation with affairs. Finally, age and rating also have a negative correlation, which would hint towards love not increasing with age.
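If you prefer numbers to colors, the same correlation matrix can be read directly. The following sketch uses a small invented table (not the real data set) with the same four column names, and pulls out the correlations with affairs as a sorted list.

```python
import pandas as pd

# Invented values for illustration only; they are not rows from the real data.
sample = pd.DataFrame({
    "affairs": [0, 0, 0, 1, 3, 7, 12],
    "age": [22, 27, 32, 37, 42, 47, 52],
    "religiousness": [5, 4, 5, 3, 2, 2, 1],
    "rating": [5, 5, 4, 4, 3, 2, 1],
})

# corr() returns a table with one row and one column per variable.
corr_matrix = sample.corr()

# A single pair can be looked up by row and column name.
rating_vs_affairs = corr_matrix.loc["affairs", "rating"]

# All correlations with affairs, excluding affairs itself, sorted from
# most negative to most positive.
corr_with_affairs = corr_matrix["affairs"].drop("affairs").sort_values()

print(rating_vs_affairs)
print(corr_with_affairs)
```

On the tutorial's data, data_correlation.corr() would take the place of sample.corr().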
Python makes data analysis fun
The possibilities are endless. We could test whether there are statistically significant differences in having affairs between genders. We could also see whether having children lowers or raises the likelihood of people cheating on their partners. However, I’ll stop here. Using Python for data analysis is fun and easy, but you do need to practice. I hope you start with this project and build from there. If you’re serious about becoming a data analyst, I highly recommend this post on how to become a data analyst from scratch.