As I have described in my previous blogs about Generative
AI models, the fuel that keeps their engine running is data, and lots of
it. But, just like gas for your car, the
datasets that are fed into the model must be filtered, cleansed, and optimized. If not, the model will be working from skewed data,
which will generate the wrong kind of output that you really cannot make
any use of.
But the datasets that you need for your model may not be
readily available right when you need them the most. So, the best resolution to this dilemma is
to create what is known as “Synthetic Data”.
It can be technically defined as follows:
“Synthetic data is information that's been generated on a
computer to augment or replace real data to improve AI models, protect
sensitive data, and mitigate bias.”
(SOURCE: https://research.ibm.com/blog/what-is-synthetic-data)
So, as you can see, you are using an algorithm (which could
also be powered by Generative AI) to create “fake” data that has not yet been found
in the real world, but could make its way there eventually. Using Synthetic Data is a fantastic way to
create the datasets that you need to at least start training your models on.
The one caveat here is that it is always best to use
real-world data first.
But the scary part of this is that it is expected that by
2030, most of the data used in Generative AI models will be
synthetic. It is also important to keep
in mind that the actual concept of creating and using Synthetic Data is
not anything new; in fact, it dates back many years. It is only now that its popularity has
really picked up.
But as we all know, anything that is data related is a
prime target for the Cyberattacker. This
includes Synthetic Data. You might
be asking this question: “If it is fake
data, why would they be interested in it?”
Well, the truth of the matter is that even if the
Cyberattacker were only able to steal some fake data, they can still use it to try
to extrapolate what the real data could look like. Then, if it seems valuable enough, they
will pursue it.
So, while Synthetic Data theoretically may have no value
to it, it is still particularly important to try to keep it as secure as possible. Here are some key ways in which you can do this:
1) The Outliers:
Even within your Synthetic Data,
you will want to make sure that
no outliers exist. Apart from
skewing the outputs, outliers stand out, so the Cyberattacker will take quick notice of them and
pounce. That is why even your real-world
data needs to be thoroughly checked for this.
2) The Risks:
To make sure that you have
not contaminated any of the algorithms by using Synthetic Data, you will want
to run, each and every time, at least a Vulnerability Scan (preferably a
Penetration Test) to make sure that no vulnerabilities have come
about as a result. If they have, you
need to remediate them quickly, as they will be an easy backdoor for the Cyberattacker to penetrate through and totally
wreak havoc on your models.
3) Longevity:
You do not want to keep
Synthetic Data any longer than you absolutely need to. This even holds true for real-world
data. By keeping both kinds of datasets
for an extended period, you are only exposing yourself to becoming
the victim of a security breach. Remember,
Synthetic Data can be created very quickly if you ever need it
again. So, there should be no questions asked about discarding it.
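The outlier check in the first point and the retention rule in the third can both be sketched in a few lines of Python. This is only an illustration under some loud assumptions: the outlier test used here is the median-based "modified z-score" with its customary cutoff of 3.5 (my choice, there are many other methods), the 90-day retention window is an arbitrary example figure, and the dataset names and dates are made up.

```python
import statistics
from datetime import date, timedelta


def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    The score is based on the median and the median absolute deviation (MAD),
    so one extreme value cannot hide itself by distorting the average.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # At least half the values equal the median; this test cannot score them.
        return []
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


def expired_datasets(created_on, today, max_age_days=90):
    """Return the names of datasets created before the retention cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, created in created_on.items() if created < cutoff)


# Hypothetical synthetic readings with one value that clearly does not belong.
readings = [10.1, 9.8, 10.4, 10.0, 9.9, 250.0]
suspicious = find_outliers(readings)

# Hypothetical inventory of datasets and the dates they were generated.
inventory = {
    "synthetic_orders_v1": date(2024, 1, 5),
    "synthetic_orders_v2": date(2024, 5, 20),
}
to_discard = expired_datasets(inventory, today=date(2024, 6, 1))
```

Here `suspicious` ends up holding only the 250.0 reading, and `to_discard` holds only the older dataset. In practice, anything flagged should be reviewed by a human before it is removed or deleted.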
My Thoughts on This:
Here are some other things to keep in mind when creating
and making use of Synthetic Data:
Ø Never, ever rely 100% on Synthetic Data to train your Generative AI models. By doing so, they will become “out of touch”
with reality, and when the time comes that you feed real-world data into them,
you could quite easily cause your model and its algorithms to completely crash.
Ø It is always best to use real-world data.
But be careful about the sources where you get it from. Always vet your suppliers, because if
they provide you with something that has been trademarked or copyrighted (such
as content from a manuscript), you could very well be facing a serious lawsuit.
Ø The Data Privacy laws, such as the CCPA and the GDPR, can also apply to
Synthetic Data, especially if it was derived from, or can be linked back to, real individuals. In those cases,
any misuse of it is treated in the same way as misuse of real-world data. Therefore, you will always want to make sure
that the controls that you have over it are optimized all the time.
Ø Do not even think about combining Synthetic Data
and real-world data together. Not only will this mess up the models, but if you
mix real data about your customers in with fake data, you will
be brewing a lot of trouble for yourself.
In other words, decide which one to use, and stick with only that.
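One simple way to guard against that last point is to label every record with where it came from, and to refuse to build a dataset that contains both kinds. The sketch below is only an illustration of that idea. The `Record` class, the `source` labels, and the `assert_single_source` helper are all hypothetical names of my own.

```python
from dataclasses import dataclass


@dataclass
class Record:
    payload: dict
    source: str  # hypothetical label: "real" or "synthetic"


def assert_single_source(records):
    """Return the one source shared by all records, or raise if they are mixed."""
    sources = {record.source for record in records}
    if len(sources) > 1:
        raise ValueError(f"Dataset mixes sources: {sorted(sources)}")
    return sources.pop() if sources else None


# A purely synthetic batch passes the check.
batch = [
    Record({"customer": "fake-001"}, "synthetic"),
    Record({"customer": "fake-002"}, "synthetic"),
]
batch_source = assert_single_source(batch)
```

If a real customer record were ever slipped into `batch`, the call would raise a `ValueError` instead of quietly passing the mixed dataset along to the model.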
Finally, keep in mind that it is particularly important
that you keep an overall eye on your models and algorithms. You will always want to make sure that they
are optimized not only to give you the best results possible, but also to
mitigate the risks of a security breach happening to them.