Diogo Alves de Resende

In this post, I’ll walk through a brief tutorial on Python for data analysis. Analyzing data can be an enjoyable, well-paying job: Payscale estimates put the average entry-level salary above $60k (USD). However, there are hard-skill requirements you must meet, such as knowledge of SQL or of a programming language used for data analysis, such as Python or R.

To illustrate using Python for data analysis, we’ll be studying infidelity in marriages, which should be an interesting project. Let’s get started!

How to begin a data analysis project using Python

First, we need software in which to write and run Python code, often called an integrated development environment (IDE). My choice is Google Colaboratory (Colab), which I feel provides the most frictionless experience, and you can use it directly from your Google Drive. In the video below, I walk you through how to set up a Google Colab notebook and then through the rest of the tutorial.

Now that we’re set up, we need tools. In Python, these are called libraries. Here, we will use Pandas to manipulate and analyze the data, and Seaborn to visualize it.

First, we tell Python that we’re going to use these data analysis libraries by importing them:

#import libraries
import pandas as pd
import seaborn as sns

In Python, it’s common to give a library a short alias with “as” when you import it. The alias keeps the code shorter every time you call the library: for example, you can write pd.read_csv instead of pandas.read_csv.

Next, we need data. For this tutorial, I made the data available online, which you can get by running the following code: 

#load data
data = pd.read_csv("https://bit.ly/udemy_dataset")

It’s always a good idea to look at the data before analyzing it. The head() command shows the first 5 rows. Below is a snippet of the data set:

#Looking at the data
data.head()
[Image: the first five rows of the data set]

We have a lot of information here, and some of it is not obvious at first. For instance, why is occupation ranked from 1 to 7? The original study that created this data set has an extensive description of each variable, and it’s a data analyst’s role to master the data before drawing conclusions from it.
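A quick way to build your own overview of the variables is to check each column's data type and how many distinct values it takes. The sketch below does this on a small made-up table: the column names follow the tutorial, but the values and the names sample and overview are assumptions for illustration, not the real data set.

```python
import pandas as pd

# Hypothetical mini data set: column names follow the tutorial,
# but the values are invented for illustration only.
sample = pd.DataFrame({
    "affairs": [0, 0, 3, 0, 7],
    "age": [37.0, 27.0, 32.0, 57.0, 22.0],
    "occupation": [7, 6, 1, 6, 5],
    "rating": [4, 4, 2, 5, 3],
})

# Number of rows (observations) and columns (variables)
n_rows, n_cols = sample.shape

# One row per variable: its data type and how many distinct values it takes
overview = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "distinct_values": sample.nunique(),
})

print(n_rows, n_cols)
print(overview)
```

A variable with only a handful of distinct values, such as occupation here, is usually a coded category or scale, which is a hint to look up its meaning in the original study.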

Now we analyze the data using Python’s libraries

One of the most common first steps in data analysis is to look at the summary statistics of the numeric variables. Luckily, this takes a single command.

#Summary Statistics
data.describe()
[Image: table of summary statistics for the data set]

That is a lot of statistics and information. For each variable, we have the number of observations (count), the mean (mean), the standard deviation (std), the minimum (min), the maximum (max), and the quartiles (25%, 50%, and 75%). The 50% value, or second quartile, is the median.
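To see how these rows relate to each other, here is a minimal sketch on a single made-up column. The values and the name sample are assumptions for illustration, not the real data set.

```python
import pandas as pd

# Hypothetical values, invented for illustration (not the real data set)
sample = pd.DataFrame({"age": [22.0, 27.0, 32.0, 37.0, 57.0]})

summary = sample.describe()

# The "50%" row of describe() is the median of the column
median_from_describe = summary.loc["50%", "age"]
median_direct = sample["age"].median()

# The mean is pulled upward by the one large value (57); the median is not
mean_age = summary.loc["mean", "age"]

print(median_from_describe, median_direct, mean_age)
```

Comparing the mean with the median is a quick check for skew: when a few large values are present, the mean moves toward them while the median stays put.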

Visualizing the data with histograms

If you want to see the distribution of affairs in the sample, the simplest way is to draw a histogram with the Seaborn library. A histogram is a graph that groups the data into buckets (ranges of values) and shows how many observations fall into each one. Seaborn picks the buckets automatically, but you can specify them yourself if needed.

#Histogram
sns.histplot(data = data.affairs)
[Image: histogram of the affairs variable]

Because the data set records affairs in a few grouped values rather than exact counts, the insights we get from the histogram are limited. We confirm that most people do not have affairs, but we cannot really see a pattern among those who are unfaithful. So, it is time to move on.
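When a variable takes only a handful of distinct values, a plain frequency table can complement the histogram. The sketch below uses made-up responses (the values and the names counts and share_none are assumptions for illustration, not the real data set).

```python
import pandas as pd

# Hypothetical responses, invented for illustration (not the real data set)
affairs = pd.Series([0, 0, 0, 0, 1, 0, 3, 0, 7, 12], name="affairs")

# Count how often each distinct value occurs, ordered by value
counts = affairs.value_counts().sort_index()

# Share of respondents reporting no affairs
share_none = (affairs == 0).mean()

print(counts)
print(share_none)
```

On the real data, the same two commands applied to data["affairs"] would give the exact numbers behind each bar of the histogram.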

What correlates with affairs?

Correlation measures the strength and direction of the linear relationship between 2 variables. Its value ranges from -1 to 1. A value close to 1 indicates a strong positive relationship: as the value of one variable increases, the value of the other also increases. A value close to -1 indicates a strong negative relationship: as the value of one variable increases, the value of the other decreases.

An example of a strong positive relationship would be rain and people using umbrellas. An example of a strong negative relationship would be sunshine and people using umbrellas. If the correlation is zero, there is no linear relationship between the variables. An example of zero correlation would be eating chocolate and swimming in the pool. There is no connection between the two, though the idea does not seem bad at all!
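To make the number less abstract, here is a minimal sketch that computes the Pearson correlation by hand and compares it with the result from Pandas. The two columns hold made-up values (an assumption for illustration, not the real data set), chosen so that one goes down as the other goes up.

```python
import numpy as np
import pandas as pd

# Hypothetical values, invented for illustration (not the real data set)
rating = pd.Series([5, 4, 4, 3, 2, 1], dtype=float)
affairs = pd.Series([0, 0, 1, 3, 7, 12], dtype=float)

# Pearson correlation: the covariance of the two variables divided by
# the product of their standard deviations
dx = rating - rating.mean()
dy = affairs - affairs.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Pandas computes the same quantity with Series.corr()
r_pandas = rating.corr(affairs)

print(round(r_manual, 3), round(r_pandas, 3))
```

Both numbers agree and are negative, which matches the description above: as one variable increases, the other decreases.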

For a useful correlation analysis, it’s wise to choose the variables first and then run the analysis. Here I’ll choose affairs, age, religiousness, and rating, since I am a fan of keeping the statistics simple.

I like to visualize correlations with a heatmap, so we are going to learn how to make one. A heatmap shades each cell according to its value, which makes strong positive and strong negative correlations easy to spot at a glance.

#Picking variables
data_correlation = data[["affairs", "age", "religiousness", "rating"]]

#Correlation heatmap - the command for a correlation is corr()
sns.heatmap(data = data_correlation.corr(),
            annot = True,
            fmt = '.2g',
            center = 0,
            cmap = 'coolwarm',
            linewidths = 1,
            linecolor = 'black')
[Image: correlation heatmap of affairs, age, religiousness, and rating]

There is quite a bit of information here. Affairs correlates negatively with rating. This means that happier marriages tend to go together with less infidelity, which makes sense to me, although a correlation alone does not show that one causes the other. It’s also interesting that religiousness has a negative correlation with affairs. Finally, age and rating also have a negative correlation, which would hint that marital happiness does not increase with age.
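If you prefer reading numbers to reading colors, the same correlation table can be sorted directly. The sketch below uses a made-up table (the column names follow the tutorial, while the values and the names sample and with_affairs are assumptions for illustration), so its numbers do not reflect the real data set.

```python
import pandas as pd

# Hypothetical mini data set, invented for illustration (not the real data set)
sample = pd.DataFrame({
    "affairs": [0, 0, 1, 3, 7, 12],
    "age": [42, 27, 32, 37, 22, 47],
    "religiousness": [5, 4, 3, 3, 2, 1],
    "rating": [5, 4, 4, 3, 2, 1],
})

# Pairwise correlations between all columns
corr = sample.corr()

# Correlation of every other variable with affairs, most negative first
with_affairs = corr["affairs"].drop("affairs").sort_values()

print(with_affairs)
```

Dropping the affairs entry removes the trivial correlation of the variable with itself, which is always 1.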

Python makes data analysis fun

The possibilities are endless. We could test whether there are statistically significant differences in having affairs between genders. We could also check whether having children lowers or raises the likelihood of people cheating on their partners. However, I’ll stop here. Using Python for data analysis is fun and easy, but you do need to practice. I hope you start with this project and build from there. If you’re serious about becoming a data analyst, I highly recommend this post on how to become a data analyst from scratch.

Page Last Updated: February 2022