What is data cleaning?

The problem today is not a lack of data but bad data, and knowing which data to trust. Problems generally fall into one of three categories: erroneous, incomplete, or inaccurate data. And rather than being an exception, receiving erroneous data is the norm.

The root of the problem starts with bad data collection. Data that does not comply with the schema, or that contains typos, gets collected alongside correct data, and the dirty data contaminates the rest, making the whole set unreliable for analytics.

The problem is so big that data scientists spend about 60% of their time cleaning and organizing data. Add the roughly 19% of their time that goes to data collection, and data scientists spend almost 80% of their time preparing data for analysis.

What exactly is data cleaning?

It is the process of preparing data for analysis by removing data that is incorrect, incomplete, irrelevant, or duplicated. Essentially, it is the task of removing errors and anomalies, or replacing wrong values with correct ones, so that analytics produces more value.
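As a minimal sketch of that definition, the snippet below drops duplicated, incomplete, and incorrect records from a small dataset. The field names and validity rule are invented for illustration, not taken from any particular source:

```python
# Hypothetical raw records: the field names and values are made up.
raw_records = [
    {"name": "Ana",   "age": 34},
    {"name": "Ana",   "age": 34},    # exact duplicate
    {"name": "Luis",  "age": None},  # incomplete: missing age
    {"name": "Marta", "age": -5},    # incorrect: impossible value
    {"name": "Pedro", "age": 41},
]

def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec["name"], rec["age"])
        if key in seen:
            continue                       # remove duplicated data
        if rec["age"] is None:
            continue                       # remove incomplete data
        if not 0 <= rec["age"] <= 120:
            continue                       # remove incorrect values
        seen.add(key)
        cleaned.append(rec)
    return cleaned

print(clean(raw_records))
# [{'name': 'Ana', 'age': 34}, {'name': 'Pedro', 'age': 41}]
```

A real pipeline would typically log or quarantine the rejected rows instead of silently dropping them, so the errors can be traced back to their source.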

There are different ways to do data cleaning. The most conventional is fixing errors manually, which can be an overwhelming task even when the errors are easy to identify, given the amount of data you have to handle.

Even with specialized software and machine learning at your disposal, it is important to follow up: monitor the process and review inconsistencies to ensure that the results are correct.

In the images below we can see records that refer to the same thing but, because of typos or bad collection, are not categorized the same as the rest. When you try to analyze them together, they will create discrepancies if they are not cleaned first.
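To make that concrete, here is a small sketch of the same category typed three different ways: without normalization it would be counted as three separate groups. The city names and amounts are invented for illustration:

```python
# The same city recorded with inconsistent spacing and casing.
sales = [("new york", 100), ("New York ", 250), ("NEW YORK", 75), ("Boston", 90)]

def normalize(city):
    # Trim and collapse whitespace, then fix the casing.
    return " ".join(city.split()).title()

totals = {}
for city, amount in sales:
    key = normalize(city)
    totals[key] = totals.get(key, 0) + amount

print(totals)  # {'New York': 425, 'Boston': 90}
```

Without the `normalize` step, the aggregation would report three small "New York" groups instead of one, which is exactly the discrepancy described above.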

What steps to follow for a good data cleaning?

To start, you should know what data is relevant, what is not, and how to categorize it.

You’ll find duplicate observations, which are common when you scrape data or combine datasets from multiple sources. You’ll also find irrelevant observations: for example, if you were building a model for single-family homes only, you wouldn’t want observations for apartments in there.
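A sketch of dropping those irrelevant observations, reusing the housing example; the `property_type` field name and the listing values are assumptions for illustration:

```python
# Hypothetical listings combined from multiple sources.
listings = [
    {"id": 1, "property_type": "Single-Family", "price": 320_000},
    {"id": 2, "property_type": "Apartment",     "price": 180_000},
    {"id": 3, "property_type": "Single-Family", "price": 410_000},
]

# Keep only the observations relevant to a single-family model.
single_family = [row for row in listings
                 if row["property_type"] == "Single-Family"]

print([row["id"] for row in single_family])  # [1, 3]
```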

The best practices you can follow to get your data clean are:

-Standardize your processes: define how data should be compiled and make that format mandatory, so that typos and bad wording do not creep in and the risk of duplication is avoided.

-See where the most common errors are coming from: if you receive data from several departments and different software, it is important to find the root cause, so that it does not spread to the other departments.

-Analyze your data once it is “clean”: after everything is cleaned, check a second time that it is all correct. Since you will likely make many decisions based on this information, it is worth having a third party verify that everything is in order.
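The last practice, re-checking the “clean” data, can be sketched as a validation pass that runs after cleaning and reports anything suspicious. The rules here (unique ids, positive prices) are assumed examples, not a fixed standard:

```python
# Sanity-check a supposedly clean dataset before anyone analyzes it.
def validate(records):
    errors = []
    ids = [r["id"] for r in records]
    if len(ids) != len(set(ids)):
        errors.append("duplicate ids remain")
    for r in records:
        if r.get("price") is None:
            errors.append(f"record {r['id']}: missing price")
        elif r["price"] <= 0:
            errors.append(f"record {r['id']}: non-positive price")
    return errors

cleaned = [{"id": 1, "price": 320_000}, {"id": 2, "price": 0}]
print(validate(cleaned))  # ['record 2: non-positive price']
```

An empty list of errors is the signal that the dataset is ready for analysis; anything else goes back to the cleaning step.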

In conclusion, it is important to know which data to trust and to filter it from the beginning, before it affects the rest. It is also important to know which data matters: store it with the rest if it is consistent, and discard it if it is not what you are looking for.

