
The Importance of Clean Data

My name is Austin Mandus and I am a database consultant. I help people move, optimize, connect, and learn how to use databases. I have done a few database cutovers in my time, and time and again the most common pain point clients experience comes down to bad data.


Most people know that clean data is important, but that doesn’t mean they know how important it truly is or what to do about it. Before we get into how to keep data clean, let me explain what I mean by it. The data within your database is the information that tells you who, what, when, and where. So when we talk about clean data, we want the who, what, when, and where to be accurate, free of duplicate values, and labeled with a logical naming convention. It’s always easier to keep your data clean than it is to clean it up later.



Accuracy


What do I mean when I say accuracy? Well… pretty much exactly what it sounds like. You want your invoices to have the correct dates, customers to have the correct addresses, employees to have the correct supervisors, items to have the correct prices, and so on. If the data in your database is not accurate, every report and transaction built on it will be wrong too. This is the most important part. If you find something inaccurate, stop what you are doing and fix it, because leaving it will only cause problems down the line.


Duplicates

Duplicates are when data points that should be unique repeat: two invoices with the same number, two items with the exact same name, two customers with the same name, and so on. Duplicates cause confusion because it’s very difficult to tell them apart. Things that are supposed to be unique need to actually be unique. If you find duplicate data, identify the incorrect record, move every record linked to it over to the correct one, and then delete the incorrect one. There are some instances where two things are similar but not identical. In that case, it is important to differentiate them in an easy-to-understand way, and the best way to do that is with a naming convention.
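To make the cleanup steps above concrete, here is a minimal Python sketch. It assumes your records are simple dictionaries; the field names (id, name, customer_id) and the helpers find_duplicates and merge_duplicate are made up for illustration, and in a real database you would do the same work with queries inside a transaction.

```python
from collections import defaultdict


def find_duplicates(records, key):
    """Group records by `key` and return only the groups with more than one record."""
    groups = defaultdict(list)
    for record in records:
        # Ignore case and stray spaces so "Acme Corp" and "acme corp " count as the same value.
        groups[str(record[key]).strip().lower()].append(record)
    return {value: group for value, group in groups.items() if len(group) > 1}


def merge_duplicate(customers, invoices, keep_id, drop_id):
    """Move invoices from the duplicate customer to the correct one, then drop the duplicate."""
    for invoice in invoices:
        if invoice["customer_id"] == drop_id:
            invoice["customer_id"] = keep_id
    return [customer for customer in customers if customer["id"] != drop_id]


# Hypothetical sample data: customers 1 and 3 are the same company entered twice.
customers = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "Beta LLC"},
    {"id": 3, "name": "acme corp "},
]
invoices = [
    {"number": "Invoice #1", "customer_id": 1},
    {"number": "Invoice #2", "customer_id": 3},
]

duplicates = find_duplicates(customers, "name")
customers = merge_duplicate(customers, invoices, keep_id=1, drop_id=3)
```

Deciding which record is the correct one (keep_id here) is still a human judgment call; the script only finds the candidates and does the relinking.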


Naming Convention

A naming convention is the set of rules that determines how your data is labeled and why. A simple example would be Invoice #1, Invoice #2, Invoice #3, etc. Here the naming convention is the type of transaction (Invoice), a # sign, and then the next number in the sequence. This can get much more complicated if you want the name itself to convey basic information.


For example, let’s say you have three items called Widget 1, Widget B, and Widget A. The first one is small and red, the second is large and red, and the third is small and blue. This naming convention is difficult to follow because the names don’t tell you which widget has the size or color you are looking for. As a result, you may pull the wrong item out of your warehouse to send to the customer. A better naming convention for this example would be Type-Size-Color: Widget-SM-Red, Widget-LG-Red, and Widget-SM-Blue. This is a much more legible way to name your items, so you can tell exactly which item you are referring to. It can get as detailed as you need it to be. You can include the warehouse, subsidiary, subtype, department, etc. in your naming convention.
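One way to keep names like these consistent is to generate them from the item's attributes instead of typing them by hand. The sketch below is only an illustration: the item_name helper and the SIZE_CODES lookup are my own assumptions, not part of any particular system.

```python
# Assumed lookup of size abbreviations, matching the widget example above.
SIZE_CODES = {"Small": "SM", "Large": "LG"}


def item_name(item_type, size, color):
    """Build a name in the Type-Size-Color format, e.g. Widget-SM-Red."""
    # An unknown size raises KeyError, so nobody can invent a new abbreviation by accident.
    return f"{item_type}-{SIZE_CODES[size]}-{color}"


names = [
    item_name("Widget", "Small", "Red"),
    item_name("Widget", "Large", "Red"),
    item_name("Widget", "Small", "Blue"),
]
```

With this data, names comes out as Widget-SM-Red, Widget-LG-Red, and Widget-SM-Blue.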

The important thing to note is that every name needs to follow the same format. So, if you name your items Country_Type-Subtype-Color-Size, every name has to follow that exact format with those exact separators. For example, USA_Computer-Laptop-Blue-15" follows the naming convention, but USA-Computer-Laptop-Blue-15" and USA_Computer-Laptop-15"-Blue do not. The first uses a hyphen instead of the underscore after the country, and the second has the size and color switched.
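A format like this can be checked automatically with a regular expression. This is a rough sketch under my own assumptions: the country is two or three capital letters, the type, subtype, and color are plain letters, and the size is a number followed by an inch mark. Adjust the pattern to whatever your convention actually allows.

```python
import re

# Country_Type-Subtype-Color-Size, under the assumptions stated above.
NAME_PATTERN = re.compile(r'[A-Z]{2,3}_[A-Za-z]+-[A-Za-z]+-[A-Za-z]+-\d+"')


def follows_convention(name):
    """Return True only if the whole name matches the assumed format."""
    return NAME_PATTERN.fullmatch(name) is not None


results = {
    name: follows_convention(name)
    for name in [
        'USA_Computer-Laptop-Blue-15"',   # follows the convention
        'USA-Computer-Laptop-Blue-15"',   # hyphen instead of underscore after the country
        'USA_Computer-Laptop-15"-Blue',   # size and color switched
    ]
}
```

Only the first name passes; the other two fail for the same reasons described in the text.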

You might have noticed that in the previous examples I used abbreviations for sizes. You can use abbreviations in a naming convention as well (and I strongly suggest that you do), but you have to make sure those abbreviations are consistent. If S = Small, you shouldn’t also use SM as an abbreviation for small. In the following example, let’s say that PUR = Purple. USA_Hair-Comb-S-PUR is correct, but USA_Hair-Comb-S-PURP, USA_Hair-Comb-S-PRP, and USA_Hair-Comb-S-Purple are all incorrect because they do not use the agreed abbreviation.
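Abbreviations are easiest to keep consistent when there is one approved list to check against. Here is a small sketch of that idea; the code lists and the uses_approved_codes helper are hypothetical, and the Country_Type-Subtype-Size-Color order is simply taken from the comb example above.

```python
# Assumed approved abbreviations: one code per value, nothing else allowed.
SIZE_CODES = {"S": "Small"}
COLOR_CODES = {"PUR": "Purple"}


def uses_approved_codes(name):
    """Check that the size and color parts of a name use the approved abbreviations."""
    country, separator, rest = name.partition("_")
    parts = rest.split("-")
    if not separator or len(parts) != 4:
        return False  # the name does not even have the expected shape
    item_type, subtype, size, color = parts
    return size in SIZE_CODES and color in COLOR_CODES


checked = {
    name: uses_approved_codes(name)
    for name in [
        "USA_Hair-Comb-S-PUR",
        "USA_Hair-Comb-S-PURP",
        "USA_Hair-Comb-S-PRP",
        "USA_Hair-Comb-S-Purple",
    ]
}
```

Only USA_Hair-Comb-S-PUR passes, since PURP, PRP, and Purple are not on the approved list.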

Conclusion

Hopefully, this gives you a better idea of how to keep your data clean. Remember, it is always easier to keep your data clean than it is to clean it up afterward, so keep your data accurate, duplicate-free, and properly named. Doing this will prevent all sorts of errors and make transitioning between systems much easier. If you need help coming up with a naming convention, or have any questions about your databases in general, please reach out to us at info@unityconsultingfirm.com.


Need more advice or assistance? Your product strategy and business transformation experts at Unity Consulting are here to help!

