Find Duplicates in Excel: Remove the doubles from your Database

excelfindduplicatesWithout a doubt, Microsoft Excel is the number one spreadsheet program. It’s so easy to build reports – users of all skill levels can perform simple analyses with very little training. But to progress, you need to learn the tricks that make your life easier. To become a master of Excel, you have to spend time studying, learning the in’s and out’s of the program. This guide will cover what it takes to find and remove duplicates from your data. There are many ways you can do this, and we cover them all.

First time users must do an introductory course. It accelerates your learning process having an experienced pro run you through the foundation of this program, and means you can start applying your training faster! If you are an advanced user, I highly recommend studying with eduCBA. Their course covers many of the items I discuss today, and will help you build a stronger understanding of the formula’s we will be working with.

When I started using Excel, I had no idea of the functions it could perform. This was a huge road block in my development, and it wasn’t until I had training did I rapidly improve. Once I have seen useful formula’s demonstrated, I may forget the actual formulas, but I always remember what Excel can do. If I need to do a task again in the future, it’s simply a matter of finding the right formula, which is easy through Google or Microsofts own Help function.

Our task today is to identify duplicate information. There are many ways you can do this, but it can be very frustrating. Excel doesn’t have a single tool that can do this for you in one go. It’s different depending on how your data is laid out. Finding duplicates in a single column of information requires different steps to finding duplicates across multiple columns. In a large database, you need a different approach again. To really master this technique, you need to be flexible, and learn a variety of techniques. Today we will run though:

  • Excel’s Remove Duplicates Tool

  • Advanced Filters

  • Conditional Formatting: Highlight Duplicates

  • Conditional Formatting: Creating your own Formula

  • Pivot Tables

In our example, this is the information we work with today:

There’s 9 data records, across 3 related columns of information. There are headers for each column, and we have a wide range of duplicate entries in the table. One of the most common uses for Excel is to manage databases, whether it’s names in a mailing list, inventory, or sales transactions. A big problem that arises is that as a database grows, records get duplicated. This can be from many reasons, the most common being:

  • Entire records are re-entered in the database, creating multiple (identical) records.

  • The records have more than one field that contains the same data (like a name field or address)

Duplicated data is bad. First, it takes twice as much space. In small excel files this isn’t going to be a problem, but it will impact performance if you are working with very large Excel files. Second, when updating records, you will have to update these in multiple places. It’s more complex and has greater room for error if you miss updating one of the duplicates. This leads to different copies of records with different information, causing confusion further down the line.

Excel’s Remove Duplicates Tool

The first method uses a built in data tool. Taking our dataset, I click cell A2. In the ribbon, select Data, and click Remove Duplicates. It looks like this:

At this point a window will pop up for you to add in additional information about the dataset.

You have two options here:

  1. Select the columns from which you want to remove data from.

  2. Select the check box to indicate if your data has headers.

Click OK and we get the following results:

We click OK again and our updated dataset looks like this:

Success! No more duplicates.

For users of Excel 2007 and 2010 that have this function, this is the fastest, easiest and most commonly used way to remove duplicate values from a dataset.

Advanced Filters

Another way to generate the same results is to use Advanced Filters. I click in Cell A2 again, and then navigate to: Data > Filter > Advanced Filter.

The following window will pop-up:

You have two choices here,

  • Filter the list in place: This will remove all duplicates from the list you are working from.

  • Copy to another location.

To filter out the duplicates, make sure you select the tick box “Unique Records Only.”

In our tutorial, let’s keep our original dataset intact. I choose “Copy to another location” for the corrected dataset. I have selected to use unique records only, and “Copy to” Cell A13

Click OK, and the results are pasted into Cell A3 with all duplicates removed:

These first two methods are great ways of removing duplicate information, but what if we simply want to see the records that are duplicated in the dataset without deleting them?

Conditional Formatting: Highlight Duplicates

Conditional formatting allows you to apply different formatting options to cells. Creating a function, you can change a cell’s format if certain conditions are met. You can change the background colour, borders, or even the formatting of the font in the cell.

This is a great option because you can highlight duplicate information without deleting it! The highlights make it easy to check the data, without removing any records. It’s also easy to apply conditional formatting to multiple cells using the format painter.

Let’s run through this process:

Starting in cell A2 of my dataset, I navigate to the ribbon and select:

Home > Conditional Formatting > Highlight Cell Rules > Duplicate Values

A window will pop-up, for the example I have left the formatting as is prompted:

Click OK. This doesn’t have an impact yet on my dataset, as we have only created the formatting in cell A2.

While I have cell A2 selected, I click the Format Painter and paste the formatting into all the cells in column A. The result is this:

Now this is great, but I notice a problem here. All duplicates in Column A are highlighted, but there is one record showing as a duplicate that shouldn’t be.

“Bradley Stuart 21” only appears once. Why is it highlighted?

The formatting is looking at every value in column A, regardless of the other data in the table. As Bradley is repeated, Excel thinks this is a duplicate value. This simple method works best if you have only one column of data.But what do you do if you have an example like mine?

Let’s use a neat function called CONCATENATE. It allows you to join the text together from a number of cells. It works with up to 255 text entries – we only have three so this will be a breeze.

The formula I am entering into cell D2 is

=CONCATENATE(A2,B2,C2)

Lets copy this formula down for every record,

I run the format painter trick again, copying the conditional formatting we set up originally in Column A to the new data in Column D.

Success! Bradley Stuart 21 is no longer highlighted as a duplicate:

Conditional Formatting: Creating your own Formula

Your version of Excel may not have the ability to highlight duplicate values straight from the menu. Or perhaps you would like to learn the formula operating in the background. No problem, it’s easy to use a formula to create these results.

First, I clear the conditional formatting we have just created. Select the data in the table, and navigate to:

Home > Conditional Formatting > Clear Rules > Clear Rules from Selected Cells

Now the formatting is clear, lets create our own. I click cell A2, and navigate to: 

Home > Conditional Formatting > New Rule > and click OK

Select the option “Use a formula to determine which cells to format” and enter this in the box:

=COUNTIF(A:A,A2)>1

Next, we need to choose a format to identify the duplicates, click Format. I’ve chosen a bright red background.

Click OK to close the Format Cells window, and then OK again to apply the Formatting Rule. You should now be back looking at your worksheet. Mine looks like this:

We’ve just created a formula that has the same function as the menu item “Highlight Duplicates” using the COUNTIF logic.

COUNTIF is a great function. It’s quite simple and simply tells Excel to count the number of cells within a certain range of data that meets a criteria. Our formula has told conditional formatting to apply to all cells with a value greater than 1 (i.e. cells that have multiple entries).

If we use format painter to apply this new conditional formatting to Column A, we get the following:

Now we know the logic, we can also apply this to the Concatenate Column D and stop the Bradley Stuart entry from appearing as a duplicate. I simply copy the formatting from cell A2 using the format painter and apply this to Column D.

Success! We have just recreated the results with our own formula.

Let’s do something more fun. Using the COUNTIF function, we can do interesting adaptations to the formula, and see how the results display. If you want Excel to highlight only the duplicate records in your database, use the following formula in your conditional formatting: =COUNTIF($A$2:$A2,A2)>1

This excludes the first instances of Renee and Bradley from being highlighted, valuable if you want to go through and identify only the duplicate values:

So now you have a couple of methods to find duplicate information, but what about triplicates? How would you find only the multiple occurrences of a certain data point? Excel doesn’t build this function in, but it’s simple to create. I’m not going to hard code this one into the formula of the conditional formatting. Instead I’ll update so that my formula refers to a cell that I can edit as required for my data analysis. First, I enter the value “1” into Cell E2.

Updating the conditional formatting formula, I now use this in place of my original formula:

=COUNTIF(A:A,A2)>$E$2

You can see that this is simply showing us results similar to our first two examples, as E2 is using the value of 1 in the formula (i.e. the logic of the formula hasn’t changed):

Let’s change Cell E2 to the value 3 and the formula to:

=COUNTIF(A:A,A2)>=$E$2

Can you see the difference?

The highlighted cells in column A remain unchanged (as there are three entries of Renee and Bradley), but Column D is now showing something different. Only ReneeSharp31 is highlighted, as it is the only entrant in this column that has three entries. Our formula has adjusted to count only records that have three or more entries.

You can use this method to highlight duplicates (using a value of 2), or any combination of multiple data entries that your data-set has.

Flipping the formula, we can use it to identify unique values as well. Put a value of 1 in Cell E2 and try this new formula in your conditional formatting logic:

=COUNTIF(A:A,A2)=$E$2

Only the unique values are showing up!

Pivot Tables

The final method I’m going to teach today requires Pivot Tables. If you haven’t used one before, no problem. Kieran Luke’s course will train you not only on how to use pivot tables, but also applying them to practical situations.

This last method is straightforward, first we are add a new Column F called Count. I am entering 1 into every cell in Column F, so that my table now looks like this:

Creating a Pivot Table is easy, simple click Cell A2, and navigate to: Insert > Pivot Table.

It should automatically pop-up with a window like this:

Double check “Select a table or range” covers all data you want (in our case up to Column F), and you select place the Pivot Table in a New Worksheet. Click OK, and you should now have a blank pivot table screen that looks like this:

In the section on the right that says choose fields to add to report, select First Name and Count. This shows you how many First Names are repeated in your dataset:

If we choose Concatenate instead of First Name, we can see the true duplicate values:

Pivot tables are a great way to interact with your data, you can summarize large amounts of information very fast, in a concise overview. In our example we can easily see the duplicated values. In a larger list you might want to sort the information. Click the drop down arrow next to row labels and click sort. I’ve changed it to Descending by Sum of Count, and click OK

The duplicates are now arranged by the total number of occurrences:

Easy!

Now you have learnt the many ways to identify, highlight and remove duplicate values from your data sets in Excel. Don’t forget to practise, and if you would like to learn more I recommend taking the Advanced Course by Infinite Skills. They run through everything you need to know to really master Excel. It’s definitely a skill worth having, ensuring you stay competitive in today’s workplace.