Frank Kane

From natural language processing to sentiment analysis, data science is now in demand across many industries. One of the best ways to master this skill is to work through a variety of data science projects.

What is a typical data science project?

A typical data science project will take a data set and analyze it for a specific purpose, such as taking a list of transactions and identifying their cost-basis and ROI. When you develop a data science project, you need to consider both the data you have and the desired results.
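As a rough illustration of the transaction example above, here is a minimal Python sketch. The function names and the choice to include fees in the cost basis are assumptions made for this example; real cost-basis rules vary by asset type and jurisdiction.

```python
def cost_basis(quantity, price, fees=0.0):
    """Total amount paid to acquire a position, including fees (assumed rule)."""
    return quantity * price + fees


def roi(basis, proceeds):
    """Return on investment, expressed as a fraction of the cost basis."""
    return (proceeds - basis) / basis


# Hypothetical transaction: 10 units bought at 20.00 each with 5.00 in fees,
# later sold for 246.00 in total.
basis = cost_basis(10, 20.0, fees=5.0)
print(basis, roi(basis, 246.0))
```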


How do I start a data science project?

A data science project always begins by collecting and validating data. It can be impossible for a data science project to yield the right results with incorrect data, regardless of how rigorous the methodology might be otherwise.
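What "validating data" looks like depends on the data set, but a first pass can be as simple as the sketch below. The field names and rules (required fields, non-negative amounts, ISO dates) are hypothetical and would need to be adapted to your own data.

```python
from datetime import datetime

# Hypothetical rules for a transaction record; adjust to your own data set.
REQUIRED_FIELDS = ("id", "amount", "date")


def validate_record(record):
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    amount = record.get("amount")
    if amount not in (None, ""):
        try:
            if float(amount) < 0:
                problems.append("negative amount")
        except (TypeError, ValueError):
            problems.append("amount is not a number")
    date = record.get("date")
    if date not in (None, ""):
        try:
            datetime.strptime(date, "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append("date is not YYYY-MM-DD")
    return problems


def split_valid_invalid(records):
    """Separate clean records from those that need attention."""
    valid, invalid = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            invalid.append((record, problems))
        else:
            valid.append(record)
    return valid, invalid
```

Keeping the rejected records together with the reasons they failed makes it easier to decide whether to repair them or drop them before analysis.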

Consider trying out some of these data science projects to expand your portfolio, test your data science skills, and get practice in a fast-growing field.


Beginner data science projects

1. Customer segmentation

One of the first things a business learns to do is segment its customers. Customer segmentation means splitting customers along major demographics so that each group can be targeted more precisely and served better.

Customer segmentation is performed on a company-by-company basis because every company has different demographics. So, it’s not enough to simply create a known template for customer segmentation; you have to analyze your customers and split them into reasonably accurate groups.

Customer segmentation involves mining your data for customer demographics and customer behaviors.
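One common way to find such groups is k-means clustering. The sketch below is a plain-Python version for illustration only; the two features (annual spend and visits per month) are hypothetical, and in practice you would scale the features first and likely use a library implementation.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points (tuples of numbers) into k groups with plain k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing, so the clusters are stable
        labels = new_labels
    return labels, centroids


# Hypothetical customers: (annual spend, visits per month).
customers = [(200, 1), (220, 2), (210, 1), (1500, 12), (1600, 10), (1550, 11)]
labels, centroids = kmeans(customers, k=2)
print(labels, centroids)
```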


View code: customer-segmentation-python / Hari365


2. Credit card fraud detection

Credit card fraud detection requires basic pattern recognition, making it an excellent beginner data science project. It’s also a perfect practical addition to your data science portfolio because it’s something that most websites and apps need today. 

To detect fraud, you can analyze previous transactions, determine whether a client is in the location that you expect, and otherwise identify whether the transaction appears abnormal (such as occurring at an unusual time for the customer’s time zone). All of these factors need to be weighted and adjusted so that your system can reliably detect credit card fraud without a detrimental number of false positives.
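To make the idea of weighting factors concrete, here is a minimal rule-based sketch. The signal names, weights, and thresholds are all hypothetical; a real system would learn them from labeled transactions rather than set them by hand.

```python
# Hypothetical signals and integer weights; in practice these are tuned on labeled data.
WEIGHTS = {
    "unusual_location": 5,
    "unusual_hour": 2,
    "amount_above_typical": 3,
}


def fraud_score(signals):
    """Sum the weights of every signal that fired for a transaction."""
    return sum(WEIGHTS[name] for name, fired in signals.items() if fired)


def evaluate(transactions, threshold):
    """Count true and false positives for a given flagging threshold."""
    true_pos = false_pos = 0
    for signals, is_fraud in transactions:
        if fraud_score(signals) >= threshold:
            if is_fraud:
                true_pos += 1
            else:
                false_pos += 1
    return true_pos, false_pos


# Toy labeled history: (signals, was it actually fraud?).
transactions = [
    ({"unusual_location": True, "unusual_hour": True, "amount_above_typical": True}, True),
    ({"unusual_location": True, "unusual_hour": False, "amount_above_typical": False}, False),
    ({"unusual_location": False, "unusual_hour": True, "amount_above_typical": False}, False),
    ({"unusual_location": True, "unusual_hour": False, "amount_above_typical": True}, True),
]
print(evaluate(transactions, threshold=5))  # (2, 1): one traveling customer is wrongly flagged
print(evaluate(transactions, threshold=6))  # (2, 0)
```

Raising the threshold removes the false positive in this toy data, but on real data it would eventually start missing genuine fraud, which is the trade-off you have to tune.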

This project is a great way to get started in data analysis.


View code: credit-card-fraud-detection-using-R-UI-and-model.github / Vineeta12345

3. Natural disaster prediction systems

Wouldn’t it be great if you could detect fire, flood, or earthquakes before they ever occurred? There’s a lot of data out there now for exactly that purpose. 

While you can’t always detect a natural disaster, you can predict the chances of things such as massive wildfires. A natural disaster prediction system would be fed historical attributes of the days on which that disaster occurred. The system would then identify when those disasters are most likely to occur in the future.
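The simplest version of this idea is to group historical days by their conditions and measure how often the disaster occurred in each group. The sketch below does that for wildfires; the field names and cutoffs are hypothetical, and a serious system would use a trained model rather than hand-picked buckets.

```python
from collections import defaultdict


def bucket(day):
    """Reduce a day's raw readings to coarse, hypothetical categories."""
    return (
        "hot" if day["temp_c"] >= 30 else "mild",
        "dry" if day["humidity"] < 30 else "humid",
        "windy" if day["wind_kph"] >= 25 else "calm",
    )


def fire_rates(history):
    """Fraction of historical days with a wildfire, per bucket of conditions."""
    totals, fires = defaultdict(int), defaultdict(int)
    for day in history:
        key = bucket(day)
        totals[key] += 1
        fires[key] += 1 if day["fire"] else 0
    return {key: fires[key] / totals[key] for key in totals}


def fire_risk(day, rates):
    """Look up the historical rate for today's conditions (None if never seen)."""
    return rates.get(bucket(day))
```

Given a history of labeled days, `fire_risk(today, fire_rates(history))` returns the share of similar past days on which a fire occurred.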

But, as with any data-based system, it requires accurate data. That means you would need to mine a large volume of historical disaster-related data to gather your information.

This is a challenging problem. Beginners need to know enough to use existing systems. More advanced users will dig into technologies such as convolutional neural networks and recurrent neural networks for making predictions from time-series data.


View code: NaturalDisasterPredict.github / sarthak-srivastava

4. Recommendation systems

When Netflix starts playing another movie after the one you just watched, what’s happening? Netflix uses a complicated algorithm to determine what you’d most like to see. And it’s basing this on all the information it has about what you’ve watched before.

A recommendation system can be used for television shows, movies, books, music, etc. You simply need to gather information about the things you have liked in the past, and then the system can compare this with other things that are available.

Of course, you will need an archive of data about what the user has liked. This presents an opportunity to build a front-end interface that other people can use, backed by a back-end database.
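One simple way to "compare" likes is user-based collaborative filtering: find people whose ratings resemble yours and suggest what they rated highly. The sketch below is only an illustration with made-up users and items; it is not how Netflix or the linked project works.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two {item: rating} dicts (unrated items count as 0)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target, ratings, top_n=3):
    """Rank items the target has not rated by similarity-weighted ratings of others."""
    scores = {}
    for user, items in ratings.items():
        if user == target:
            continue
        sim = cosine_similarity(ratings[target], items)
        for item, rating in items.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ana": {"film_a": 5, "film_b": 4},
    "ben": {"film_a": 5, "film_b": 5, "film_c": 5},
    "cat": {"film_d": 5, "film_a": 1, "film_e": 4},
}
print(recommend("ana", ratings))  # ['film_c', 'film_d', 'film_e']
```

Here ben's tastes closely match ana's, so the film only ben has rated rises to the top of her list.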

This project arms you with a tool chest of recommender system algorithms to try in the R programming language. If Python code is your preference, you might want to check out the SurpriseLib project.


View code: recommenderlab.github / cran


5. Sentiment analysis

Everyone from politicians to brands loves sentiment analysis. Sentiment analysis looks for mentions of a given person, product, or place and then identifies the general tone of the comment. Sentiment analysis is frequently used to look through hundreds or even thousands of product reviews to see whether those reviews are generally positive.

For eCommerce stores, sentiment analysis can determine which products are more popular. For politicians, sentiment analysis can determine general popularity rates and identify those who may be less than pleased. And for brands, sentiment analysis shows what customers think of them.

But sentiment analysis isn’t easy. You need to develop a rubric for determining whether someone is sad, happy, angry, or pleased with a given concept.

Think about:

This particular project uses a dictionary-based approach using R. You might also explore the use of recurrent neural networks and other machine learning techniques for sentiment analysis. Data sets that include review text and explicit positive or negative ratings can train a neural network and perform sentiment analysis without a dictionary.
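For a feel of what a dictionary-based approach involves, here is a minimal Python sketch. The word list and the negation rule are toy assumptions for illustration and are not taken from the linked R package.

```python
import re

# Tiny illustrative lexicon; a real project would use a published sentiment dictionary.
LEXICON = {
    "great": 1, "love": 1, "excellent": 1, "good": 1,
    "bad": -1, "terrible": -1, "hate": -1, "broken": -1,
}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum word polarities, flipping the sign of a word that directly follows a negation."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, word in enumerate(words):
        polarity = LEXICON.get(word, 0)
        if i > 0 and words[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score


def label(text):
    """Map a numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label("I love this phone, the camera is excellent"))  # positive
print(label("Not good. The screen arrived broken."))        # negative
```

Even this toy version shows why the task is hard: sarcasm, longer-range negation, and words missing from the dictionary all slip straight through.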

View code: SentimentAnalysis.github / cran

6. Design a chatbot

Just a decade ago, designing a chatbot was quite complex. It’s still complicated, but there are better tools for it now. In the last decade, we’ve learned that machine learning is one of the best methods of developing a chatbot: by analyzing examples of conversation, chatbots can figure out what it means to sound human.

You can begin with publicly accessible source code and start “training” your chatbot. You can control what your chatbot knows and even how it sounds in terms of tone. From there, you can start to understand how machine learning works and how the input is analyzed. 

Through a chatbot, you can learn more about artificial intelligence, narrow AI in particular, and its applications.


View code: Chatterbot.github / gunthercox

Intermediate data science projects

7. Fake news detection

Fake news is everywhere today, and most people have no idea when they see it. But there’s a reason why Twitter, Facebook, and other social media platforms have been able to flag fake news, or at least news that you should be skeptical about.

These systems use certain signals, such as the apparent authenticity of the account sharing the information and the “trust” they place in certain domains. Fake news is usually distributed by the same individuals and usually on untrustworthy websites.

Using machine learning, you may need to build an archive of what you should or shouldn’t trust. And you may need to have some component of user feedback to continue to refine and fine-tune the system.

Think about:

This project lets you experiment with several classification algorithms, including naive Bayes, random forest, SVM, SGD, and logistic regression. You’ll gain experience with a wide variety of machine learning and data processing techniques.
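Of the algorithms listed, naive Bayes is the easiest to write from scratch. The sketch below is a minimal multinomial naive Bayes text classifier in plain Python; the headlines are invented toy data, and the class is an illustration rather than the linked project's code.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            logp = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    logp += math.log((counts[word] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Invented toy headlines, far too few for a real model.
texts = [
    "shocking miracle cure doctors hate",
    "you will not believe this shocking secret",
    "city council approves annual budget",
    "central bank holds interest rates steady",
]
labels = ["fake", "fake", "real", "real"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("shocking secret cure"))     # fake
print(model.predict("council approves budget"))  # real
```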

View code: fake_news_detection.github / nishitpatel01

8. Gender detection, age prediction, and other demographics

Did you know that you can use a sample of someone’s writing to determine gender, age, and other demographics?

Whether they write circuitously or more directly can sometimes indicate gender socialization. Whether they use certain words or slang can determine age and location. Do you use “pop” or “soda”? Would you say that you “were mad” or “were a little mad”? 

These systems are a great introduction to “training” an AI. You don’t tell the AI that, for instance, women tend to write indirectly (“I think I…” instead of “I”). Instead, you feed the AI multiple samples and let it make its own conclusions. Once it has enough samples, it should be able to determine who is talking to it with a certain degree of accuracy.

(Of course, there aren’t many innate differences in language; these are mostly learned. So the AI won’t always be correct!)


View code: age-gender-estimation.github / yu4u


9. Speech emotion analyzer

There’s a reason this is an intermediate project: It’s hard even for people to identify emotions. But the speech emotion analyzer is another machine learning system that’s very clever. You don’t need to be able to identify the hallmarks of emotion because machine learning systems can compare large sets of data and draw conclusions about patterns.

You would feed audio into the machine and tell the machine whether it was displaying happiness, sadness, anger, etc. And then the system itself would start to learn what that sounded like, as a human might, without you having to say that “loud means angry” or “sadness is quieter.”

Over time, you’d develop a pretty accurate method of identifying emotion.

This project lets you practice with convolutional neural networks and LSTM (Long Short Term Memory) neural networks, as well as the particulars of dealing with audio data.

View code: Speech-Emotion-Analyzer / MITESHPUTHRANNEU

10. Handwriting recognition

Data science projects can be complex in two ways. One, they may require highly complex algorithms. Two, they may require extensive data sets. Many data science projects aren’t very difficult algorithmically, but they are held back by the sheer volume of data they require.

Handwriting recognition is similar. You need to feed a lot of data into a system, and you also need to be able to show the system what makes a handwriting sample “belong” to another person — that’s the most challenging aspect. 

You can’t just feed any handwriting samples into this system; you need handwriting samples of the people you hope to match. So, if you had, for instance, a “suspect,” you would need quite a few samples of their handwriting to even start to compare. The easiest way to start this project is by using signatures specifically.

Think about:

This project uses TensorFlow to construct artificial neural networks that decode handwriting images into text. It goes well beyond the standard “MNIST” example, which is limited to recognizing handwritten numerical digits: it can recognize words and lines of text, which places this activity firmly within the “intermediate” category.

View code: SimpleHTR / githubharald

Advanced data science projects

11. Cancer cell identification

A lot of projects are just for fun. But cancer cell detection is an example of how machine learning is truly advancing the human race. Most cancer cells look very similar — and very different from other cells. So, artificial intelligence can automatically detect these cells, potentially leading to better health results.

Interestingly, one of the first cancer detection systems was actually meant to identify types of bread. It sounds like an amusing anecdote, but it highlights something important about machine learning. Whenever a machine learning system is comparing images to identify them, it’s essentially doing the same thing. That’s true, whether it’s cancer or bread!

So, this project can teach you a lot about image recognition.

View code: Breast-Cancer-Detection-using-Deep-Learning.github / sayakpaul

12. Driver drowsiness detection

It is often said that driving while drowsy is actually more dangerous than driving while drunk.

But people may not always know how tired they are. A drowsiness detection app is very clever. It looks at a person and where their eyes are and notes if they are having trouble keeping their eyes on the road.

Today, these types of apps actually appear in many cars; they warn a driver if they’ve looked away from the road. These apps are great practice at mining video data for information, such as where someone’s looking. They’re also great practice for eye tracking.

But it’s also challenging because it does require inputting and analyzing video — something that can come in handy in many applications.
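One widely used signal for this kind of detector is the eye aspect ratio (EAR), which drops toward zero as the eye closes. The sketch below assumes a separate face-landmark detector has already produced six (x, y) points per eye for each video frame; the threshold and frame count are hypothetical values, and this is not necessarily the method used by the linked project.

```python
import math


def eye_aspect_ratio(eye):
    """eye: six (x, y) landmarks p1..p6, with p1 and p4 at the eye corners."""
    p1, p2, p3, p4, p5, p6 = eye
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


def drowsy_frames(ear_values, threshold=0.25, min_consecutive=3):
    """Return indices of frames where the eye has stayed closed long enough to alert."""
    alerts, run = [], 0
    for i, ear in enumerate(ear_values):
        run = run + 1 if ear < threshold else 0  # length of the current closed-eye streak
        if run >= min_consecutive:
            alerts.append(i)
    return alerts


# Hypothetical per-frame EAR values: a blink would be short, this closure is not.
print(drowsy_frames([0.3, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]))  # [3, 4]
```

Requiring several consecutive low-EAR frames is what separates a normal blink from eyes that are actually drifting shut.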


View code: Drowsiness_Detection.github / akshaybahadur21

13. Automated image captions

Did you know that many leading news sites have images captioned by AI? 

It’s not always feasible to have a human comb through this information. Sometimes it’s just better for it to be done automatically. But it’s also more challenging than it seems.

An image caption generator has to do multiple things to be successful. First, it has to identify all the items that are in the image. Second, it has to create a sentence that is grammatically correct (“child on bike”). Third, it has to superimpose that sentence onto the image.

And if anything goes wrong, you’ll end up with a nonsensical caption (“human by wheels”).


View code: Image-Caption-Generator.github / dabasajay

14. Traffic sign recognition

Why is traffic sign recognition harder than identifying cancer cells?

When identifying cancer cells (or bread), the image is usually isolated. Traffic signs are generally inundated with other information: other signs, business signs, people, trees, equipment, and more. 

Further, traffic signs vary around the world; you need to be able to identify a wide range of them. And many things look like traffic signs, such as business signs. Finally, traffic signs usually need to be recognized quickly.

With a cancer cell, you’re only looking at a few options, you have time, and you’re looking at a single image — so, in some ways, it’s easier to implement.

Nevertheless, traffic sign recognition has become extremely popular as autonomous vehicles have become more common. It’s just one example of the wider field of computer vision that autonomous vehicles rely on.


View code: traffic-sign-recognition.github / wolfapple


Learning data science

These are just a few exciting data science project ideas. As you start working as a data scientist in the real world, you’ll undoubtedly find new, interesting applications for data science. Today, a data scientist can work in fraud detection, deep learning, or exploratory data analysis. Whatever your field of interest, you can likely find a data science career in it.

But because data science is such a broad and versatile field, prospective data scientists will want to develop their expertise and keep up to date with data science news and applications. As you build your data science portfolio, consider learning another programming language, such as R. The R programming language is designed specifically for statistical computing and data analysis, so you’ll get extra practice with data visualization.

After you’ve finished a few data science projects, you should consider looking at some data science interview questions. Together, these projects and questions can help you prepare for a career in data science.
