The Why and How of Teaching Yourself Data Science
Before you decide to become a data scientist, a lot of questions might be on your mind. Is it a great career option? Can I learn data science on my own? Is it worth all the investment and time? I will try and answer all of your questions.
What is data science?
Wikipedia defines data science as,
“A concept to unify statistics, data analysis, machine learning, domain knowledge, and their related methods to understand and analyze actual phenomena with data. It uses techniques and theories drawn from many fields within the context of Mathematics, Statistics, Computer Science, domain knowledge, and information science.”
Last Updated December 2020
Is Data Science a good career?
Data science is a great career option. Data scientists make an average of $113,000 to $140,000 per year. Learning data science can help you work in this rewarding field, but acquiring the skills on your own may sound hard, and it can sometimes be expensive.
Is it worth all the investment?
To avoid the expense and time of top university courses, you can learn data science on your own. The biggest challenge in self-learning is not knowing what to learn, how much to learn, and the sequence of topics to learn. So what should you do?
5 skills you need to become a data scientist
Apart from subject matter expertise, which differs for every individual, let us see why you need to learn each of the following skills.
- Why math?
Most algorithms in machine learning will be expressed in mathematical terms. Hence, it’s an absolute must to familiarize yourself with and brush up on some of the basic concepts of exponents, logarithms, polynomial equations, factoring, quadratic equations, and functions.
Basic probability knowledge is good enough: the common terms used in probability, conditional probability, random processes, and random variables.
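To make conditional probability concrete, here is a minimal sketch that counts outcomes for two fair dice. The dice scenario and the helper name `probability` are illustrative assumptions, not part of the original plan.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def probability(event, space):
    """P(event) when every outcome in the space is equally likely."""
    favorable = [o for o in space if event(o)]
    return Fraction(len(favorable), len(space))

def sum_is_eight(o):
    return o[0] + o[1] == 8

def first_is_even(o):
    return o[0] % 2 == 0

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_b = probability(first_is_even, outcomes)
p_a_and_b = probability(lambda o: sum_is_eight(o) and first_is_even(o), outcomes)
p_a_given_b = p_a_and_b / p_b

print(probability(sum_is_eight, outcomes))  # 5/36
print(p_a_given_b)                          # 1/6
```

Knowing that the first die is even raises the chance of a total of eight from 5/36 to 1/6, which is the kind of reasoning the probability topics above prepare you for.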
- Why statistics?
Descriptive statistics give us simple but crucial summaries of the data. Measures of central tendency identify one central value around which the majority of the data is located, and measures of dispersion help us understand the spread of the data.
As the name suggests, inferential statistics help us draw inferences or conclusions about an entire population based on samples. Some of the key concepts are probability distributions, the bell curve or normal distribution, the central limit theorem, confidence intervals, and hypothesis testing.
Inferential statistics use probability distributions to draw inferences about the data, so it is essential to have a basic understanding of the probability terms.
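The two kinds of statistics can be sketched with Python's standard `statistics` module. The sample values are hypothetical, and the confidence interval uses a normal approximation as a simplifying assumption; for a sample this small, a t-distribution would usually be the more careful choice.

```python
import statistics
from statistics import NormalDist

# Hypothetical sample: units sold per day over ten days.
sample = [12, 15, 11, 18, 14, 16, 13, 17, 15, 19]

# Descriptive statistics: central tendency and dispersion.
mean = statistics.mean(sample)      # 15
median = statistics.median(sample)  # 15.0
stdev = statistics.stdev(sample)    # sample standard deviation (divides by n - 1)

# Inferential statistics: approximate 95% confidence interval for the
# population mean, using the normal distribution (an assumption here).
z = NormalDist().inv_cdf(0.975)     # two-sided 95% critical value, about 1.96
margin = z * stdev / len(sample) ** 0.5
confidence_interval = (mean - margin, mean + margin)

print(f"mean={mean}, median={median}, stdev={stdev:.2f}")
print(f"95% CI: {confidence_interval[0]:.2f} to {confidence_interval[1]:.2f}")
```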
- Why data visualization?
Sometimes, a simple visualization of the data can help us draw inferences or identify patterns in the data. Some of the basic plots that help visualize data are histograms, bar charts, line plots, scatter plots, and boxplots. Focus on drawing the charts with default parameters first, and then progress to customization.
Qualitative data is alphanumeric or text data, such as the color of a car, gender, or marital status. As with numerical features, visualizing qualitative data graphically can help us identify similarities and relationships among different data elements.
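As a starting point, the sketch below draws one chart for numerical data and one for qualitative data with Matplotlib, mostly with default parameters. The data values and the output file name `charts.png` are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt

# Hypothetical data: car prices (numerical) and car colors (qualitative).
prices = [12, 15, 11, 18, 14, 16, 13, 17, 15, 19, 22, 14]
color_counts = {"red": 5, "blue": 8, "white": 12}

fig, (ax_num, ax_cat) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: shows how the numerical values are distributed.
ax_num.hist(prices, bins=5)
ax_num.set_title("Price (numerical)")

# Bar chart: one bar per category of the qualitative feature.
ax_cat.bar(list(color_counts.keys()), list(color_counts.values()))
ax_cat.set_title("Car color (qualitative)")

fig.tight_layout()
fig.savefig("charts.png")
```

Once the default charts look right, titles, axis labels, and colors are natural next steps for customization.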
- Why Python?
Get a basic understanding of the data types, with particular emphasis on string, numeric, and list variables. Get hands-on practice with if-else statements, for loops, and while loops. Remember, you don’t need to be an expert in Python programming; you need to know the basics used in data science and machine learning. One of the biggest strengths that Python has over other languages is the availability of modules and functions needed for data science. Focus on learning the math functions, the date module, string functions and methods (particularly length, slicing and indexing, split, and strip), and list functions and methods for search and length, as well as how to handle multidimensional lists.
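The basics listed above fit in one short script. This is only a sketch with made-up values, using the standard library's `math` and `datetime` modules.

```python
import math
from datetime import date

# String methods: strip whitespace, then split on commas.
raw = "  Alice,34,New York  "
fields = raw.strip().split(",")          # ['Alice', '34', 'New York']
name, age, city = fields[0], int(fields[1]), fields[2]

# Length, indexing, and slicing.
print(len(name), name[0], name[1:3])     # 5 A li

# for loop with if-else.
scores = [72, 88, 95, 61]
labels = []
for score in scores:
    if score >= 80:
        labels.append("high")
    else:
        labels.append("low")

# while loop: count how many times 100 can be halved before reaching 1.
n, halvings = 100, 0
while n > 1:
    n //= 2
    halvings += 1

# List functions and methods: sorting, searching, membership.
print(sorted(scores), scores.index(95), 88 in scores)  # [61, 72, 88, 95] 2 True

# Multidimensional (nested) lists: grid[row][column].
grid = [[1, 2, 3], [4, 5, 6]]
first_column = [row[0] for row in grid]  # [1, 4]
print(grid[1][2])                        # 6

# math and date modules.
print(math.sqrt(16))                                  # 4.0
print((date(2021, 1, 1) - date(2020, 12, 1)).days)    # 31
```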
- Why machine learning?
Machine learning is extremely efficient, easy to learn, and versatile in processing all kinds of data. Pandas is a must-have tool in every data scientist’s tool kit, and it is important to know how to read data from different types of sources. You may have to convert text values to numerical data, so learn the modules for preprocessing and for splitting the data into train and test sets.
One of the first models everyone learns during their machine learning journey is regression. Regression helps us understand the relationship between different types of variables. It is used for predicting numerical values, such as the future price of a stock or the sales in the next quarter.
Like regression, classification methods are popular; they predict categorical outcomes. We can predict outcomes like “Will this customer buy my product?” or “Will this customer default on the loan repayments?” Logistic regression, decision trees, and support vector machines are important methods to learn for solving classification problems.
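To show what these preparation steps involve, here is a sketch that uses only the standard library. In practice pandas (`read_csv`) and scikit-learn (`model_selection.train_test_split`) provide ready-made versions of these steps; the CSV content, the column names, and the helper `split_train_test` below are hypothetical and are not those libraries' implementations.

```python
import csv
import io
import random

# Hypothetical CSV content; in practice this would come from a file.
raw_csv = """age,city,bought
34,Paris,yes
51,Rome,no
,Paris,yes
45,Rome,no
29,Paris,yes
38,Rome,no
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))

# 1. Replace missing ages with the mean of the known ages.
known_ages = [float(r["age"]) for r in rows if r["age"]]
mean_age = sum(known_ages) / len(known_ages)

# 2. Convert categorical text to integer codes.
city_codes = {c: i for i, c in enumerate(sorted({r["city"] for r in rows}))}

features = [
    [float(r["age"]) if r["age"] else mean_age, city_codes[r["city"]]]
    for r in rows
]
labels = [1 if r["bought"] == "yes" else 0 for r in rows]

# 3. Shuffle the row indices and hold out a fraction for testing.
def split_train_test(X, y, test_fraction=0.33, seed=0):
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [X[i] for i in train_idx], [X[i] for i in test_idx],
        [y[i] for i in train_idx], [y[i] for i in test_idx],
    )

X_train, X_test, y_train, y_test = split_train_test(features, labels)
print(len(X_train), len(X_test))  # 4 2
```

Holding out test rows that the model never sees during training is what lets you estimate how well it will do on new data.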
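As an illustration, simple linear regression with one input can be written out by hand using the least-squares formulas. The quarterly sales figures are hypothetical, and `fit_line` is an illustrative helper; scikit-learn's `linear_model.LinearRegression` is the usual library route.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sales for quarters 1 to 5.
quarters = [1, 2, 3, 4, 5]
sales = [108, 122, 129, 141, 150]

slope, intercept = fit_line(quarters, sales)
forecast = slope * 6 + intercept  # predicted sales for quarter 6

print(f"slope={slope:.1f}, intercept={intercept:.1f}, forecast={forecast:.1f}")
# slope=10.3, intercept=99.1, forecast=160.9
```

The slope says that sales grow by about 10.3 units per quarter in this made-up data, and the fitted line extends that trend to the next quarter.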
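To see what a classifier does, here is a hand-written sketch of logistic regression trained with plain gradient descent on one feature. The data is hypothetical (a centered risk score, with 1 meaning the customer defaulted), and the learning rate and epoch count are arbitrary choices; scikit-learn's `linear_model.LogisticRegression` is what you would normally reach for.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict(x, w, b):
    """Return class 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Hypothetical centered risk scores and whether each customer defaulted.
scores = [-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0]
defaulted = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(scores, defaulted)
print(predict(-2.0, w, b), predict(3.0, w, b))  # 0 1
```

Unlike regression, the output here is a category: the model estimates a probability and then picks the more likely class.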
Deep learning is one of the most important topics to learn in data science and machine learning. With deep learning, one can process a large number of features, and the ability to create large neural networks can increase accuracy. Its most significant advantage is the ability to learn features incrementally from the data, which reduces the need for domain expertise.
So how do you do it?
Use the chart below for guidance on what to learn and how much time to spend on it.
|Section||Topic||Subtopic/Library/Module||Min Time (hrs)||Max Time (hrs)|
|Mathematics||Basic Algebra||Basic Concepts of Exponents, Log, Polynomials, Quadratic equations, and Functions||1||2|
|Mathematics||Calculus||Rate of Change; Limits of a function; Eigen Vectors and Eigen Values|| || |
|Mathematics||Probability||Basic terms of Probability|| || |
|Statistics||Descriptive Statistics||Central Tendency of Data; Measure of Dispersion; Correlation among variables|| || |
|Statistics||Inferential Statistics||Probability Distributions; Central Limit Theorem|| || |
|Data Visualization||Charts for Numerical Data||Matplotlib Library|| || |
|Data Visualization||Charts for Categorical Data||Matplotlib Library|| || |
|Data Visualization||Chart Customization||Matplotlib Library; Editing chart elements|| || |
|Python Programming||Data Types||String; Integer and Float|| || |
|Python Programming||File Processing||Processing of various file types like csv, tsv and text files||2||4|
|Python Programming||Modules and Functions||Math; String functions of Split, Strip; List sort, len|| || |
|Machine Learning||Data Processing||Read Dataset using Pandas; Check for and replace missing values; Convert categorical to numeric; scikit-learn preprocessing; scikit-learn model_selection.train_test_split|| || |
|Machine Learning||Regression||scikit-learn linear_model.LinearRegression; scikit-learn preprocessing.PolynomialFeatures|| || |
|Machine Learning||Classification||scikit-learn linear_model.LogisticRegression; scikit-learn svm.SVC; scikit-learn tree.DecisionTreeClassifier; scikit-learn ensemble.RandomForestClassifier|| || |
|Machine Learning||Feature Selection||scikit-learn feature_selection.RFE; scikit-learn feature_selection.GenericUnivariateSelect|| || |
|Machine Learning||Model Tuning and Model Selection||scikit-learn model_selection.cross_val_score; scikit-learn model_selection.GridSearchCV; scikit-learn model_selection.RandomizedSearchCV|| || |
|Machine Learning||Deep Learning||Keras Model Building; Compile the Keras Neural Network|| || |
|Practice Projects||Project 1||Boston House Price Predictions||8||16|
|Practice Projects||Project 2||Bike Demand Predictions||16||24|
|Practice Projects||Project 3||Automobile Price Predictions||8||16|
|Practice Projects||Project 4||Iris Species Classification||4||8|
|Practice Projects||Project 5||Pima Indians Diabetes Classification||4||8|
|Practice Projects||Project 6||Wine Quality Predictions||4||8|
|Practice Projects||Project 7||Bank Telemarketing||8||16|
|Approximate Total Hours||183||278|
|Total weeks with 20 hours per week||9 Weeks||12 Weeks|
|Total weeks with 40-50 hours per week||4 Weeks||6 Weeks|
How deep should you go with each topic?
Within 4 to 12 weeks, you would have acquired enough skills to start your journey in the field of data science. Enter the dates in the plan shared earlier and get started. Let’s build some positive pressure. So, don’t forget to print it and pin it. My course on Udemy helps you understand each of these topics in detail and will help you get started on your data science journey with great confidence. Learn how to make your own data science portfolio in this blog article.