Saturday, December 17, 2022

How To Implement A Data Cleansing Process: 4 Point Checklist

 


Today, data is the lifeblood of any business, no matter how large or how small it may be.  You depend upon it to keep in touch with your customers, and you examine the traffic to your website to see which pages seem to bring the most attention. 

Also, you keep track of buying patterns so that you can use them for market intelligence when you launch new products and services.  This makes you the steward, or the custodian, of this data.  The law makes you this steward as well, especially under the provisions of the GDPR and the data privacy and financial regulations that are in place here in the United States.

In fact, the role of being a data steward falls under what is known as “Data Management”.  This is not really a new topic per se, but in the world of Cyber, it is certainly making news headlines.  So, how do you go about making sure that your data is actually being managed properly?

Here are some key tips that you can put to use rather quickly:

1)     Data Cleansing:

In general terms, this simply means that you are keeping your data “clean” as you take in new information, whether it is from your website or other sources.  You want your database(s) to have the cleanest, best organized data possible.  This involves:

*Having a complete record set for your employees, customers, and prospects.  This means that all fields are filled out, and there are no null values.

*Making sure that each record has at least one unique ID attached to it.  If you are dealing with people, this could perhaps be their Social Security numbers, or in a worst-case scenario, you can create a unique ID from a random number generator.  If you do find two records with the same ID#, then you know something is not quite right, and it needs to be investigated further.

*These datasets should be available to those employees in your business who need them, based upon the principle of Least Privilege.  Not everybody needs to have this kind of access, so make sure that you delegate the rights, privileges, and permissions accordingly.
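
To make the first two checks above a bit more concrete, here is a minimal sketch in Python.  It assumes your records are simple dictionaries, and the field names (id, name, email) and the helper names are made up purely for illustration.  It flags records with empty fields, finds IDs that show up more than once, and generates a random unique ID with the standard uuid module.

```python
import uuid
from collections import Counter

# Hypothetical field names, chosen only for this illustration.
REQUIRED_FIELDS = ("id", "name", "email")


def find_incomplete(records):
    """Return records where a required field is missing, None, or blank."""
    incomplete = []
    for record in records:
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if value is None or str(value).strip() == "":
                incomplete.append(record)
                break  # one bad field is enough to flag the record
    return incomplete


def find_duplicate_ids(records):
    """Return the IDs that appear on more than one record."""
    counts = Counter(r["id"] for r in records if r.get("id") is not None)
    return sorted(id_ for id_, n in counts.items() if n > 1)


def new_record_id():
    """Generate a random unique ID (a version 4 UUID) for a new record."""
    return str(uuid.uuid4())


# Made-up sample data.
records = [
    {"id": "A1", "name": "Pat Lee", "email": "pat@example.com"},
    {"id": "A2", "name": "Sam Roy", "email": None},
    {"id": "A1", "name": "Kim Fox", "email": "kim@example.com"},
]

print(find_incomplete(records))     # the "A2" record, which has no email
print(find_duplicate_ids(records))  # ['A1']
```

In a real database you would more likely enforce these rules with NOT NULL and UNIQUE constraints, but the idea is the same.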

2) Making it all centralized:

Back in the days before the Cloud came about, many businesses had their IT and Network Infrastructures On Prem.  This means that all of the servers and databases were held in some room, and locked.  Because of this, many databases were held on different servers, and trying to find which one was where was a pain, especially for audit purposes.  But now with everybody going into the Cloud, centralization has become a key trend, even for datasets.  Therefore, if and when you make your move to the Cloud, you should seriously consider centralizing all of your data into one major database.  You might view this as a security risk, but remember that the major Cloud Providers (such as Microsoft Azure) have many tools that you can use to protect that database, and even provide real time alerts and warnings if any suspicious activity is detected.  If you still keep your data in disparate locations, this will not only lead to sprawl, but will also increase the attack surface for the Cyberattacker.

3)     Get rid of the siloes:

Many companies today, unfortunately, still work in “siloes”.  This simply means that the different departments work independently of each other, with no communication.  And, if a security incident were to actually happen, nobody would know how to react to it.  This is where the siloes can become a huge impediment.  Therefore, it is time to break them down, and centralizing data is a huge step forward.  So for example, if HR and Accounting need access to these datasets, they should be able to get it readily and easily.  In fact, you can even create what are known as “Federated Accounts”, in which an employee can use the same login credentials to gain access to different datasets.

4)     Data backup:

This is probably the biggest one here, and it is the one that has been repeated so many times.  ALWAYS BACK UP YOUR DATA!!! Preferably, you should be backing up your data in the Cloud, so that you can get access to it at any time and from any location.  Again, Cloud Providers like Azure have made it very easy to back up all of your data, so that you really don’t even need a database administrator to do this for you.  Heck, you can even automate this process so that you don’t even have to give a second thought to it.  In this regard, there are three types of backup you can choose from:

*Full backup:  A 100% new backup is made, in its entirety.

*Incremental backup:  Only the data that has changed since the last backup of any type (full or incremental) is backed up.

*Differential backup:  All of the data that has changed since the last full backup is backed up, so each differential backup keeps growing until the next full backup is made.
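
If it helps, here is a small Python sketch of how the three types differ in what they actually copy.  It uses the common definitions: an incremental backup copies what has changed since the last backup of any type, while a differential backup copies what has changed since the last full backup.  The function name, the file names, and the simple integer timestamps are all made up for illustration, and real backup tools track changes in far more sophisticated ways.

```python
def files_to_back_up(files, mode, last_full, last_backup):
    """Pick which files a backup run would copy.

    files:       mapping of file name -> last-modified timestamp
    mode:        "full", "incremental", or "differential"
    last_full:   timestamp of the most recent full backup
    last_backup: timestamp of the most recent backup of any type
    """
    if mode == "full":
        return sorted(files)  # everything, regardless of when it changed
    if mode == "incremental":
        cutoff = last_backup  # changes since the last backup of any type
    elif mode == "differential":
        cutoff = last_full    # changes since the last full backup
    else:
        raise ValueError(f"unknown backup mode: {mode}")
    return sorted(name for name, modified in files.items() if modified > cutoff)


# Made-up example: a full backup ran at t=10 and an incremental at t=15.
files = {"customers.db": 5, "orders.db": 12, "logs.db": 17}

print(files_to_back_up(files, "full", 10, 15))          # ['customers.db', 'logs.db', 'orders.db']
print(files_to_back_up(files, "incremental", 10, 15))   # ['logs.db']
print(files_to_back_up(files, "differential", 10, 15))  # ['logs.db', 'orders.db']
```

The tradeoff shows up at restore time: with incrementals you need the last full backup plus every incremental made since, while with differentials you only need the last full backup plus the latest differential.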

My Thoughts On This:

These are just some of the steps that you can take to maintain an overall Data Governance Strategy for your company.  But remember, keeping datasets “clean” serves other purposes as well.  For example, if your line of business makes heavy use of both AI and ML, you have to make sure from the very beginning that your datasets are cleaned and optimized.

If not, you will get results that are not accurate. While this might sound like a Herculean task in the very beginning, it is not.

There are many automated tools out there that help you mine your data and flag the records that seem to be outliers.  Also, by keeping your data cleansed, you will be in a much better position to come into compliance with the data privacy laws, such as the GDPR, the CCPA, HIPAA, etc.  This will help you to avoid costly fines, and to get through any audits far more easily.
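
As a rough idea of what flagging outliers can look like under the hood, here is a minimal Python sketch that uses the interquartile range (IQR) rule.  This is just one common rule of thumb and not how any particular tool works.  The function name and the sample order totals are made up for illustration.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the middle 50%."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up order totals, with one suspicious entry.
order_totals = [42, 38, 45, 40, 39, 41, 44, 950]

print(flag_outliers(order_totals))  # [950]
```

A flagged value is not necessarily wrong.  It is simply a record that deserves a second look before it goes into your reports or your AI and ML models.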

Finally, if you are ever impacted by a security breach, having the right data strategy in place will help you to recover quickly, without too much of an impact.
