Sunday, June 8, 2025

How Not To Use Synthetic Data In Generative AI

 


As I have described in my previous blogs about Generative AI models, the fuel that keeps their engine running is data, and lots of it.  But, just like gas for your car, the datasets that are fed into the model must be filtered, cleansed, and optimized.  If not, the model will be trained on skewed data, and it will generate the wrong kind of output that you really cannot make any use of.

But the datasets that you need for your model may not be readily available right when you need them the most.  So, the best resolution to this dilemma is to create what is known as “Synthetic Data”.  It can be technically defined as follows:

“Synthetic data is information that's been generated on a computer to augment or replace real data to improve AI models, protect sensitive data, and mitigate bias.”

(SOURCE :  https://research.ibm.com/blog/what-is-synthetic-data)

So, as you can see, you are using an algorithm (which could also be powered by Generative AI) to create “fake” data that has not yet been found in the real world, but could make its way there eventually.  Using Synthetic Data is a fantastic way to create the datasets that you need to at least start training your models on.

The one caveat here is that it is always best to use real world data first.

But the scary part of this is that it is expected that by the year 2030, most of the datasets used in Generative AI models will be synthetic.  It is also important to keep in mind that the actual concept of creating and using Synthetic Data is not anything new; in fact, it dates back many years.  It is only now that its popularity has really picked up.

But as we all know, anything that is data related is a prime target for the Cyberattacker.  This even includes Synthetic Data.  You might be asking this question:  “If it is fake data, why would they be interested in it?”

Well, the truth of the matter is that even if the Cyberattacker were only able to heist some fake data, they can still use it to try to extrapolate what the real data could look like.  Then, if the real data seems valuable enough, they will pursue it.

So, while Synthetic Data theoretically may have no value to it, it is still particularly important to try to keep it as secure as possible.  Here are some key ways in which you can do this:

1)     The Outliers:

Even within your Synthetic Data, you will want to make sure that there are no outliers.  Apart from screwing up the outputs, outliers are something the Cyberattacker will take quick notice of, and pounce upon.  That is why even your real-world data needs to be thoroughly checked for this.
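As a rough illustration of what such an outlier check could look like, here is a minimal Python sketch that uses the common interquartile range (IQR) rule.  The function name find_outliers, the multiplier of 1.5, and the sample ages are all assumptions made up for this example; your own datasets may call for a different test altogether.

```python
from statistics import quantiles

def find_outliers(values, k=1.5):
    """Return values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 4:
        return []  # too few points to estimate quartiles meaningfully
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical synthetic "age" column with one implausible value.
synthetic_ages = [34, 29, 41, 38, 36, 31, 33, 240, 35, 37]
print(find_outliers(synthetic_ages))  # prints [240]
```

Anything the check flags should be reviewed and either corrected or removed before the dataset goes anywhere near your model.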

2)     The Risks:

To make sure that you have not contaminated any of the algorithms by using Synthetic Data, you will want to run, each and every time, at least a Vulnerability Scan (preferably a Penetration Test) to make sure that no vulnerabilities have come about as a result.  If they have, you need to remediate them quickly, as they will be an easy backdoor for the Cyberattacker to penetrate through and totally wreak havoc on your models.

3)     Longevity:

You do not want to keep Synthetic Data any longer than you absolutely need to.  This even holds true for real world data.  By keeping both kinds of datasets for an extended period, you are only exposing yourself to becoming the victim of a security breach.  Remember, Synthetic Data can be created very quickly if you ever need to have it again.  So, there should be no questions asked about discarding it.
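To make the retention idea concrete, here is a minimal Python sketch that flags datasets that have outlived a retention window.  The function name expired_datasets, the 90-day window, and the dataset names are hypothetical and only there for illustration; the right window depends on your own policies.

```python
from datetime import datetime, timedelta, timezone

def expired_datasets(created_at, max_age_days, now=None):
    """Return the names of datasets older than max_age_days, sorted by name."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, created in created_at.items() if created < cutoff)

# Hypothetical inventory: dataset name -> creation time (UTC).
inventory = {
    "synthetic_customers_v1": datetime(2025, 1, 10, tzinfo=timezone.utc),
    "synthetic_customers_v2": datetime(2025, 5, 20, tzinfo=timezone.utc),
}
as_of = datetime(2025, 6, 8, tzinfo=timezone.utc)
print(expired_datasets(inventory, max_age_days=90, now=as_of))
# prints ['synthetic_customers_v1']
```

The sketch only reports what is due for deletion.  Actually discarding the files, and any backups of them, is a separate step.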

My Thoughts on This:

Here are some other things to keep in mind when creating and making use of Synthetic Data:

Ø  Never, ever rely 100% on Synthetic Data to train your Generative AI models.  By doing so, they will become “out of touch” with reality, and when the time comes that you feed real-world data into them, you could quite easily cause your model and its algorithms to completely crash.

 

Ø  It is always best to use real world data.  But be careful about the sources where you get it from.  Always vet your suppliers, because if they provide you with something that has been trademarked or copyrighted (such as content from a manuscript), you could very well be facing a serious lawsuit.

 

Ø  The Data Privacy laws, such as the CCPA and the GDPR, also have tenets and provisions about using Synthetic Data.  They treat any misuse of it in the same way as misuse of real-world data.  Therefore, you will always want to make sure that the controls that you have over it are optimized all the time.

 

Ø  Do not even think about combining Synthetic Data and real-world data together.  Not only will this mess up the models, but if you mix some real data about your customers in with fake data, you will be brewing a lot of trouble for yourself.  In other words, decide which one to use, and stick with only that.
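One way to catch accidental mixing of the two kinds of data is to label every record with where it came from, and then check a dataset before it is used.  The following is only a sketch under assumptions: the field name "source" and the labels "synthetic" and "real" are made up for this example, and your records may be structured quite differently.

```python
from collections import Counter

def provenance_counts(records, field="source"):
    """Count records per provenance label; unlabeled records count as 'unknown'."""
    return Counter(r.get(field, "unknown") for r in records)

def is_single_source(records, field="source"):
    """True only if every record carries the same, known provenance label."""
    counts = provenance_counts(records, field)
    return len(counts) == 1 and "unknown" not in counts

# Hypothetical records, each tagged with its origin.
clean = [{"id": 1, "source": "synthetic"}, {"id": 2, "source": "synthetic"}]
mixed = [{"id": 1, "source": "synthetic"}, {"id": 2, "source": "real"}]

print(is_single_source(clean))  # prints True
print(is_single_source(mixed))  # prints False
```

A record with no label at all also fails the check, since there is no way to tell which kind of data it is.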

 

Finally, keep in mind that it is particularly important that you keep an overall eye on your models and algorithms.  You will always want to make sure that they are optimized not only to give you the best results possible, but to also mitigate the risks of a security breach happening to them.

 
