Working in an office, it’s inevitable you will come across Microsoft Excel. It’s a simple program on the surface; typically used to do arithmetic, create budgets, and manage data. However, once you wrap your head around the basics, there is no end to the functions Excel can perform. In this article, you will learn how to clean imported data. The focus is on TRIM, and its sister functions CLEAN and SUBSTITUTE. If you have never opened Excel before, I recommend doing a quick course to learn the basics. It’s not difficult to get a general understanding in under two hours, complete Kieran Luke’s lessons and ‘become an Excel Rockstar’.
Personally, I used Excel everyday in my previous career. As an analyst, I combined data from a variety of sources looking for trends and insights in the information. I learnt many tricks that sped up my ability to clean and manage data, as small inconsistencies mess with your ability to analyze. To learn what it takes to be a fantastic analyst, check out The Zen of Data. It gives a great background on statistics, and what you can interpret your data.
The biggest strength you need in Excel is troubleshooting. There are many reasons a formula won’t work, identifying what is going wrong makes it simple to correct. One of the tasks I did everyday was importing data into Excel. This causes problems as the imported text often behaves differently to what you expect. The words may look odd (non-printing characters may have been imported as well), or there are additional spaces all through your text, making it impossible to match in your database.
The first cleaning step uses a simple function called TRIM. It’s a tool that helps you remove unwanted spaces from your data. I am talking about cells “ like this ”. TRIM takes spaces before the start of the text and removes them, leaves only one space between words and removes any additional spaces at the end of your text. That’s it.
To better understand TRIM’s functionality, lets do an example. I’m using imported data that is not consistent with my existing data-set – even though it should be. In Excel the ribbon is where we start. I click “Data” in the top of the ribbon, and it shows all of the options for bringing data into your spreadsheet. Sources of information are other programs, the internet or a via text files.
One of the easiest ways to import data, is “From Text.” Almost every data source will allow you to download information as a .csv file (Comma Separated Values). Once you have your source, simply click:
Data > From Text > Select your .csv file > Ok
A text import wizard is going to pop open here, you will see a box similar to this:
Select Next, and you receive a range of options in how the .csv file should display your data.
99% of the time choose comma (.csv stands for comma separated values), and click Next.
Finally, select the format of your data and click Finish. You will be asked where you want the data to be placed, let’s put it in cell A1 of a new worksheet. All you need to do is click ok.
So now our dataset has been successfully imported, let’s check it out. At first glance, it looks great. Do you see any problems?
To match this data with my existing database, I’m going to have problems. VLOOKUP is a strong function that allows you to cross reference one data point with a data-set and return a value if a match is found. If you have no idea what a VLOOKUP is, no problem – we are getting into complex Excel here. To learn, complete a more advanced course geared towards power users and you will learn in no time. Infinite Skills offers my favourite course on Advanced Excel Training. It covers the details of the functions you need to really master Excel.
Back to VLOOKUP. Putting my two databases together, I am going to test if there is any problems in matching the data. Columns F, G and H contain the information in my old customer database, whereas A to D is the newly imported data. I’ve entered a simple VLOOKUP in cell H3 =VLOOKUP(F3,$B$2:$C$4,1,0).
Problem #1. Why isn’t the data finding a match? The email listed in cell F3 appears to be the same as we just imported in B3. That can’t be right… Let’s look closer:
Highlighting B3’s email address in the formula bar, it seems my import has brought me an email address with an additional space at the end. You can see this as the black highlighted text in the formula bar goes one character past the end of .com. To get the two cells to match, we need to remove this space. Manually, its simple. Click the cell, and click delete to remove the extra space. But what if you had 100 emails to format? Or 1000? Or 100,000? Manually isn’t going to cut it.
That’s where TRIM comes in.
TRIM is designed to remove the ASCII “space” character when it has been incorrectly used in a text cell. Specifically, this is character value 32. In English, it simply means all incorrect spaces (the blank spot we leave between words when we press the spacebar) will be deleted if they are in the wrong position.
ASCII stands for the American Standard Code for Information Interchange, and is essentially a set of codes for 255 characters and symbols used in computer programming. If you want to learn more about programming languages, fundamentals for non programmers is a great start. It covers everything a non-programmer needs to know, so you can talk the talk with a developer. Check it out if you have never learned a computer programming language before.
Back to our example, lets use the TRIM function to correct the text entry in cell B3. Enter the formula =TRIM(B3) in cell E3.
Press enter and the text value “email@example.com” will be displayed in cell E3.
Now let’s try our VLOOKUP again, this time using the cleaned data in cell E3, instead of the imported value in cell B3. My VLOOKUP has been modified slightly to use the trimmed data: =VLOOKUP(F3,$E$2:$E$4,1,0)
Success! The VLOOKUP has found a match.
TRIM is a fantastic formula to use when cleaning data. Not only does it remove extra spaces that are hiding at the end of a text cell (like at the end of the email address in our example). It can also remove spaces that are in front of the text. Cell B4 also looks a little funny, the text is not in line with the other data. When we have a look at the value, it seems that there are two spaces before the email begins.
Using TRIM again, we can easily remove these unwanted spaces, and get the resulting VLOOKUP to match the firstname.lastname@example.org email address.
*Remember, the formula in cell E4 is simply =TRIM(B4)
Another great use of the TRIM function, is when you have too many spaces in a sentence. Perhaps the data import has put two or even three spaces between every word. TRIM allows us to easily remove the unnecessary spaces, rapidly improving the quality in your data for analysis.
Enter =TRIM(A1) in cell B1, and you will easily clean up this sentence.
What if it doesnt work?
TRIM was designed to remove the ASCII character created when you hit the spacebar on your keyboard. Because of its special design, it only works on the ASCII character 32. This means that TRIM doesn’t always work. There are a number of other characters that either look like a space, or are not typical letters you need to remove from your data.
CLEAN is a great function, removing the first 32 non-printable characters in the ASCII code (from zero to 31). When you import data from other applications, low level computer programming language may also come with the text. Use =CLEAN( text ) to remove these characters.
I also need to give you a heads up, there is another space character. Called a non breaking space, it’s most commonly used in web pages as a HTML entity, coded  . This character has a decimal value of 160. This is a tricky one as you cannot visibly see a difference to a normal space. You may also find that there are other non-printing characters giving you grief, values 127, 129, 141, 143, 144 and 157 also import from the web and affect data quality.
Let’s do a new example and clean data with a variety of ASCII characters. It’s only one additional step, but we need to use a couple of functions together. Cell A2 contains a number of ASCII characters. Some are easy to see (like the circle), however I have also included the character 160 that we cannot see.
What happens if you don’t know which characters are in a cell? Input the following formula in cell A3 =CODE(A2). The CODE function will show the value of the ASCII number of the first occurring character in the cell. Using this for a quick test, we see that this gives us 32. Because it only works on the first character, click cell A2 and start deleting the text. As we delete (starting at the beginning of the cell A2), we discover that the character 160 is causing the spaces between the letters ABC and 122, and the character 7 is the circle at the end.
We now know we need to remove these three characters to return only the text. Individually, the following functions can assist, but it takes a second step to finish this example:
Cell B2 contains the formula =TRIM(A2)
The result removes all non regular occurrences of character 32 from the data in A2
Notice there are no more spaces before the letters ABC in cell B2
Cell C2 contains the formula =CLEAN(A2)
CLEAN removes the first 32 ASCII characters.
In our example it removes the ASCII character 7 (also known as BEL) from the data in A2
Notice the circle has disappeared from the text in cell C2
Results in B2 and C2 have taken us partway to our result, however there is still a large gap between the letters ABC and the numbers 122. We know this is because of character 160, from our CODE test, so let’s remove it. Here we need to use SUBSTITUTE. This function allows you to swap characters for another. Simply put, we change the character 160 into a normal space (character 32), and remove it with TRIM.
This is a two step process. Take the results in C2 with the BEL character removed. We add the following formula in C3 =TRIM(SUBSTITUTE(C2,CHAR(160),CHAR(32)))
Success! Cell C3 has the final result “ABC 122”
What this formula is doing, is performing the TRIM function on an updated set of data.
We have first removed all occurrences of ASCII characters 0-31 using CLEAN.
The SUBSTITUTE function has swapped the ASCII character 160 for normal spaces 32
The TRIM function removes all (non regular) normal spaces in the text.
Simple! You can adapt this process to remove any troublesome characters that appear in your imported data.
As an analyst, you need to think on your feet. When using imported data, many inconsistencies and errors occur. The simplest way to remove unwanted spaces is to use TRIM, a purpose built function that does exactly this. If the results are not as you expect, there may be other characters or programming you need to clear. Follow the steps to determine first what characters are causing the problem, and then systematically remove these with CLEAR, SUBSTITUTE and TRIM. This process lets you clean any data-set, making it ready for analysis in Excel.
Now you have a good grasp of this function, try your hand cleaning that set of data you have in your to-do list to finish…but you have been putting off. These functions require practice, and the more time you spend using these, the better you will be in Excel.