Why and How to Learn Data Science
Before you decide to become a data scientist, a lot of questions might be on your mind. Is it a great career option? Can I learn data science on my own? Is it worth all the investment and time? I will try to answer all of these questions.
What is data science?
Wikipedia defines data science as,
“A concept to unify statistics, data analysis, machine learning, domain knowledge, and their related methods to understand and analyze actual phenomena with data. It uses techniques and theories drawn from many fields within the context of Mathematics, Statistics, Computer Science, domain knowledge, and information science.”
Last Updated October 2021
Learn and master the Data Science, Python for Machine Learning, Math for Machine Learning, Statistics for Data Science | By Jitesh Khurkhuriya, Python, Data Science & Machine Learning A-Z Team
Is Data Science a good career?
Data science is a great career option. Data scientists make an average of $113,000 to $140,000 per year. Learning data science can help you work in this rewarding field. Picking up the required skills may sound hard, and it can sometimes be expensive.
Is it worth all the investment?
To avoid the expense and time of top university courses, you can learn data science on your own. The biggest challenge in self-learning is not knowing what to learn, how much to learn, and the sequence of topics to learn. So what should you do?
5 skills you need to become a data scientist
Subject matter expertise differs for every individual, so set it aside for now. Let us see why you need to learn each of the following skills.
- Why math?
Most algorithms in machine learning will be expressed in mathematical terms. Hence, it’s an absolute must to familiarize yourself with and brush up on some of the basic concepts of exponents, logarithms, polynomial equations, factoring, quadratic equations, and functions.
Basic knowledge of probability is good enough: the common terms used in probability, conditional probability, random processes, and random variables.
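To make conditional probability concrete, here is a minimal sketch that enumerates the outcomes of two dice. The dice scenario and the variable names are illustrative assumptions, not part of the learning plan itself.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Event B: the first die is even. Event A: the two dice sum to 8.
b = [o for o in outcomes if o[0] % 2 == 0]
a_and_b = [o for o in b if sum(o) == 8]

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_b = Fraction(len(b), len(outcomes))
p_a_and_b = Fraction(len(a_and_b), len(outcomes))
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)  # 1/6
```

Counting outcomes by hand like this is a useful check on the formula before moving on to random variables.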
- Why statistics?
Descriptive statistics summarize some simple but very crucial aspects of the data. Measures of central tendency identify one central value around which the majority of the data is located, and measures of dispersion help us understand the spread of the data.
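As a small sketch of central tendency and dispersion, the snippet below uses Python's built-in statistics module on a made-up list of scores. The data and names are assumptions chosen only for illustration.

```python
import statistics

# Made-up scores used only for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency: one value around which most of the data sits.
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Dispersion: how widely the data is spread.
value_range = max(scores) - min(scores)
std_dev = statistics.pstdev(scores)  # population standard deviation

print(mean, median, mode, value_range, std_dev)
```

Comparing the mean with the median is a quick way to spot skew, since a few extreme values pull the mean more than the median.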
As the name suggests, inferential statistics help us draw inferences or conclusions about the entire data based on samples. Some of the key concepts are probability distributions, bell curve or normal distribution, central limit theorem, confidence interval, and hypothesis testing.
Inferential statistics rely on probability distributions to draw inferences about the data, so it is essential to have a basic understanding of the probability terms.
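The idea of a confidence interval can be sketched with the standard library alone. The sample below is hypothetical, and the snippet assumes a normal approximation; for a sample this small, a t-distribution would usually be the more careful choice.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample: sales figures from 10 randomly chosen days.
sample = [102, 98, 110, 105, 95, 101, 99, 107, 103, 100]

n = len(sample)
sample_mean = mean(sample)
standard_error = stdev(sample) / math.sqrt(n)  # sample standard deviation / sqrt(n)

# z-value that leaves 2.5% in each tail of the standard normal curve.
z = NormalDist().inv_cdf(0.975)

ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error
print(f"mean = {sample_mean:.1f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```

The interval gives a range of plausible values for the true average based on the sample, which is the kind of inference the central limit theorem makes possible.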
- Why data visualization?
Sometimes, a simple visualization of the data can help us draw inferences or identify patterns in the data. Some of the basic plots that can help to visualize the data are histograms, bar charts, line plots, scatter plots, and box plots. Focus on drawing the charts with default parameters first, and then progress to customization.
Qualitative data comes in alphanumeric or text form, like the color of a car, gender, marital status, and so on. As with numerical features, graphical visualization of qualitative data can help us identify similarities or relationships among different data elements.
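The sketch below draws three of these charts with Matplotlib's default parameters: a histogram and a scatter plot for numerical data, and a bar chart for categorical data. The data is made up, the output file name is an assumption, and the non-interactive Agg backend is chosen only so that the script runs without a display.

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # assumption: render to a file so no display is needed
import matplotlib.pyplot as plt

# Made-up numerical and categorical data for illustration.
heights_cm = [150, 155, 160, 162, 165, 168, 170, 172, 175, 180]
weights_kg = [50, 54, 58, 61, 63, 66, 70, 72, 76, 82]
car_colors = ["red", "blue", "red", "white", "blue", "red", "white", "white", "white"]

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Numerical data: distribution of one variable.
axes[0].hist(heights_cm, bins=5)
axes[0].set_title("Histogram of heights")

# Numerical data: relationship between two variables.
axes[1].scatter(heights_cm, weights_kg)
axes[1].set_title("Height vs. weight")

# Categorical data: count each category, then draw one bar per category.
color_counts = Counter(car_colors)
axes[2].bar(list(color_counts.keys()), list(color_counts.values()))
axes[2].set_title("Cars per color")

fig.tight_layout()
fig.savefig("basic_charts.png")  # hypothetical output file name
```

Once the default versions look right, titles, colors, and subplot layout are natural first steps in customization.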
- Why Python?
Get a basic understanding of all the data types, with particular emphasis on string, numeric, and list variables. Get some hands-on practice with if-else statements, for loops, and while loops. Remember, you don’t need to be an expert in Python programming. You should know the basics needed for data science and machine learning.
One of the biggest strengths that Python has over other languages is the availability of various modules and functions needed for data science. Focus on learning the math functions, the date module, string functions and methods (particularly length, slicing and indexing, split, and strip), and list functions and methods for search and length, as well as how to handle multidimensional lists.
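The basics listed above fit in a short script. This is a minimal sketch using only the standard library; the strings, lists, and dates are made-up examples.

```python
import math
from datetime import date

# Strings: strip whitespace, split on a delimiter, then measure and slice.
raw = "  data science,machine learning,statistics  "
topics = raw.strip().split(",")
first = topics[0]
print(len(first), first[:4], first[-7:])  # 12 data science

# Lists: search by value and access a multidimensional (nested) list.
position = topics.index("statistics")  # 2
grid = [[1, 2, 3], [4, 5, 6]]
bottom_right = grid[1][2]  # row 1, column 2 -> 6

# Control flow: a for loop with if-else, and a while loop.
labels = []
for topic in topics:
    if len(topic) > 12:
        labels.append("long")
    else:
        labels.append("short")

countdown = 3
while countdown > 0:
    countdown -= 1

# Math and date modules.
root = math.sqrt(16)  # 4.0
days_left = (date(2021, 12, 31) - date(2021, 10, 1)).days  # 91
```

If each line here reads comfortably, that is roughly the level of Python needed before moving on to the data science libraries.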
- Why machine learning?
Machine learning starts with data processing. Pandas is efficient, easy to learn, and versatile in processing all kinds of data, which makes it a must-have tool in every data scientist’s tool kit. It is important to be able to read data from different types of sources. You may have to convert text values to numerical data, so learn the modules for that as well as the ones for splitting the data into train and test sets.
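A typical data processing pass can be sketched as follows: read the data with Pandas, replace missing values, convert categorical text to numbers, and split into train and test sets. The inline CSV, the column names, and the choice to fill missing ages with the mean are all assumptions made for illustration.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up dataset; in practice this would come from a file or database.
csv_text = """age,city,bought
25,Pune,0
32,Delhi,1
,Pune,0
47,Mumbai,1
51,Delhi,1
38,Mumbai,0
29,Pune,0
44,Delhi,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Check for and replace missing values (here: fill with the column mean).
missing_before = int(df["age"].isna().sum())
df["age"] = df["age"].fillna(df["age"].mean())

# Convert the categorical column to numeric indicator columns.
df = pd.get_dummies(df, columns=["city"])

# Separate features from the target, then split into train and test sets.
X = df.drop(columns="bought")
y = df["bought"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, which helps when comparing models later.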
One of the first models everyone learns during their machine learning journey is regression. Regression helps us understand the relationship between variables and is used for predicting numerical values, such as the future price of a stock or the sales in the next quarter.
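Here is a minimal regression sketch with scikit-learn's LinearRegression. The data is made up to follow y = 3x + 2 exactly, so the fitted slope and intercept are easy to check; real data would include noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data following y = 3x + 2 exactly (no noise).
X = np.array([[1], [2], [3], [4], [5]])  # one feature, five samples
y = np.array([5, 8, 11, 14, 17])

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_

# Predict the value for an unseen input, x = 6.
prediction = model.predict(np.array([[6]]))[0]
print(round(slope, 2), round(intercept, 2), round(prediction, 2))
```

The same fit and predict pattern carries over to the other scikit-learn models named in the plan.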
Just as regression predicts numerical values, classification methods are popular for predicting categorical outcomes. We can predict outcomes like, “Will this customer buy my product?” or “Will this customer default on the loan repayments?” Learning logistic regression, decision trees, and support vector machines is important for solving classification problems.
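The three classifiers mentioned above can be compared in a few lines. This sketch assumes the Iris dataset bundled with scikit-learn; the split size, random seeds, and max_iter value are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Iris: 150 flowers, 4 numeric features, 3 species to predict.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "support vector machine": SVC(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the fraction of test samples classified correctly.
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2f}")
```

Accuracy on a single split is only a rough guide; cross-validation gives a steadier comparison.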
Deep learning is one of the most important topics to learn in data science and machine learning. With deep learning, one can process a large number of features, and larger neural networks can often improve accuracy. Its most significant advantage is the ability to learn features incrementally from the data, which reduces the need for domain expertise.
So how do you do it?
Use the chart below for guidance on what to learn and how much time to spend on it.
| Section | Topic | Subtopic/Library/Module | Min Time (hrs) | Max Time (hrs) |
|---|---|---|---|---|
| Mathematics | Basic Algebra | Basic concepts of exponents, log, polynomials, quadratic equations, and functions | 1 | 2 |
| Mathematics | Calculus | Rate of change, Limits of a function, Derivative, Partial derivative | 1 | 2 |
| Mathematics | Linear Algebra | Vectors, Matrix, Vector transformation, Eigenvectors and eigenvalues | 4 | 6 |
| Mathematics | Probability | Basic terms of probability, Conditional probability, Random processes, Random variables | 1 | 2 |
| Statistics | Descriptive Statistics | Central tendency of data, Measure of dispersion, Correlation among variables | 2 | 4 |
| Statistics | Inferential Statistics | Probability distributions, Normal distributions, Central limit theorem, Confidence interval, Hypothesis testing | 12 | 16 |
| Data Visualization | Charts for Numerical Data | Matplotlib library: scatter plot, line plot, histogram, bar chart, box plot | 2 | 4 |
| Data Visualization | Charts for Categorical Data | Matplotlib library: histogram, pie charts | 1 | 2 |
| Data Visualization | Chart Customization | Matplotlib library: figures, subplots, editing chart elements | 2 | 4 |
| Python Programming | Data Types | String, Integer and float, List, Tuples, Dictionary | 1 | 2 |
| Python Programming | Control Flow | If-else, For loops, While loops | 2 | 4 |
| Python Programming | File Processing | Processing of various file types like csv, tsv, and text files | 2 | 4 |
| Python Programming | Modules and Functions | Math, Date, String functions (split, strip), List functions (sort, len) | 4 | 6 |
| Machine Learning | Data Processing | Read dataset using Pandas, Access data, Check for and replace missing values, Convert categorical to numeric, scikit-learn preprocessing, scikit-learn model_selection.train_test_split | 16 | 20 |
| Machine Learning | Regression | scikit-learn linear_model.LinearRegression, scikit-learn preprocessing.PolynomialFeatures | 12 | 16 |
| Machine Learning | Classification | scikit-learn linear_model.LogisticRegression, scikit-learn svm.SVC, scikit-learn tree.DecisionTreeClassifier, scikit-learn ensemble.RandomForestClassifier | 16 | 20 |
| Machine Learning | Feature Selection | scikit-learn feature_selection.RFE, scikit-learn feature_selection.GenericUnivariateSelect | 12 | 16 |
| Machine Learning | Model Tuning and Model Selection | scikit-learn model_selection.cross_val_score, scikit-learn model_selection.GridSearchCV, scikit-learn model_selection.RandomizedSearchCV | 16 | 20 |
| Machine Learning | Deep Learning | Keras: model building, layers, activation functions, loss functions, optimization, initializers, compile the Keras neural network | 24 | 32 |
| Practice Projects | Project 1 | Boston House Price Predictions | 8 | 16 |
| Practice Projects | Project 2 | Bike Demand Predictions | 16 | 24 |
| Practice Projects | Project 3 | Automobile Price Predictions | 8 | 16 |
| Practice Projects | Project 4 | Iris Species Classification | 4 | 8 |
| Practice Projects | Project 5 | Pima Indians Diabetes Classification | 4 | 8 |
| Practice Projects | Project 6 | Wine Quality Predictions | 4 | 8 |
| Practice Projects | Project 7 | Bank Telemarketing | 8 | 16 |
| Approximate Total Hours | | | 183 | 278 |
| Total weeks with 20 hours per week | | | 9 Weeks | 12 Weeks |
| Total weeks with 40-50 hours per week | | | 4 Weeks | 6 Weeks |
How deep should you go with each topic?
Within 4 to 12 weeks, you will have acquired enough skills to start your journey in the field of data science. Enter the dates in the plan shared earlier and get started. Let’s build some positive pressure, so don’t forget to print it and pin it. My course on Udemy covers each of these topics in detail and will help you get started on your data science journey with great confidence. Learn how to make your own data science portfolio in this blog article.