Sunday, May 26, 2024

What Is a Large Language Model, And How Do I Secure Them? Find Out Here

 


In the world of AI today, we are certainly hearing a lot of buzzwords that are floating around today.  A lot of them of them come from the vendors themselves, most notably those of Google, Microsoft, and OpenAI.  But on a technical level, the only one that most people have at least heard of is that of “Generative AI”. 

Simply put, this is where you submit a query to ChatGPT, and the output to it (which is actually the answer you are looking for) can come in a wide variety of formats, ranging from the simple text answer to even an audio or video file.

But another integral part of AI that is going to also take the world by “storm” is that of Large Language Models, also known as “LLMs” for short.  But before we go any further on this, it is first important to define it, which as follows:

“Large language models (LLMs) are a category of foundation models trained on immense amounts of data making them capable of understanding and generating natural language and other types of content to perform a wide range of tasks.”

(SOURCE:  https://www.ibm.com/topics/large-language-models)

So while you think that ChatGPT already uses large amounts of data for it to learn, and answer your queries, the LLM can take datasets that are at least 100X as large and still have the ability to generate the right outputs.  Some differentiating factors between this and other areas of AI, such as Machine Learning and Neural Networks include:

*It needs to be hosted on several Virtual Machines given the size of the datasets that they process.

*It also tries to comprehend the human language that is spoken to it, and even tries to create the output in the same way.

But given its sheer power, LLMs are also prone to be in the cross hairs of the Cyberattacker.  For example, if an LLM is used in a Chatbot (or “Digital Personality”), it can actually be quickly manipulated in such a way that it can easily launch a Social Engineering Attack.  For instance, after the tool has developed a good, and trusting rapport with the end user, the conversation can then shift to him or her giving away their confidential information.

So in order to help mitigate this risk of happening, it is very important to establish a set of best practices and standards that you should follow.  Here are some starting points:

1)     Always keep an eye:

One of the cardinal rules in Cybersecurity is to always keep tabs on abnormal behavior.  But if your organization is large enough in terms of endpoints and network security devices, this can be an almost task to do for your IT Security team to accomplish in a timely fashion.  Therefore, for the purposes of automation, and to only provide those messages and warnings that are truly legitimate, you should seriously consider using a Generative AI based tool in this regard.  But keep in mind that that this too will have to be trained, so it can learn what to look out for in the future with regard to unusual trends.

2)     Create solid prompts:

The advent of ChatGPT has created a new field called “Prompt Engineering”.  This is the art of writing queries that will guide the Generative AI model or LLM into giving you the most specific answer possible.  For example, when you type in keywords in Google, within seconds, you get a list of a ton of resources that you can use to find the answer to your question.  But this is not the case with Generative AI.  The goal of it is not to give you a list of resources to use (unless you actually ask for that), its objective is to give you the best possible answer the first time around.  But in order to do this, at the sending end, you need to craft a query that will allow for it to happen.  This is not something that you can learn from taking an online class, it comes with lots of time as well as practice.  There are tools available to help you to do this, and I know for a fact that CoPilot from Microsoft, has a library of prompts that you can use and even further customize to your own needs.  But, creating open ended prompts can also pose a security risk to the LLM.  Therefore, if you are going be using something like ChatGPT quite heavily, it is highly recommended that you get better at “Prompt Engineering”.

3)     Keep training ‘em:

Unfortunately, many people think that once you have an AI model in hand, it will always work forever.  While this is true, the performance of it will degrade over time quickly if you don’t keep optimizing it.  By this I mean that you are constantly giving it datasets for it to keep on learning.  But keep in mind also that these datasets have to be cleansed and optimized, to make sure that there are no levels of skewness or outliers that persist.  Remember in the end, all AI is “Garbage In And Garbage Out”.  In other words, the outputs that you get from it are only as good as the datasets that you feed into it.

4)     Keep ‘em safe:

Not everybody in your organization needs to know the proverbial “Secret Sauce” that creates the foundation for your Generative AI model or LLM.  Therefore in this regard, access should be highly restricted to those who need  to have it.  Even in these cases, make sure that you are following the concepts of “Least Privilege” which explicitly states that the rights, privileges, and permissions that have been assigned are no longer what needs to be done in terms of the job tasks.

5)     Find the holes:

Just like anything else in Cybersecurity, even Generative AI models and LLMs are prone to having their fair share of weaknesses and gaps.  Therefore, you need to be able to find and  remediate them quickly.  Some of the best ways to do this are through Penetration Testing and Vulnerability Scanning.  Also, you can implement a methodology called “Adversarial Testing”.  In this scenario, you are taking the mindset of a Cyberattacker, and breaking down your models to see where all of the weak points are at.

My Thoughts On This:

The above list is to get you started on thinking about how important it is to secure your Generative AI models and LLMs.  If you don’t take this seriously, you could be facing a huge Data Exfiltration Attack.  Also, it is very important to keep in mind that all of the datasets you use and store for the purposes of AI now also come under the data privacy laws, those of the GDPR, CCPA, HIPAA, etc. 

If you don’t have the right controls in place and face a security breach, you could be prone to a very exhaustive audit and even face very harsh penalties as a result.  For more details on this, click on the link below:

https://www.darkreading.com/vulnerabilities-threats/bad-actors-will-use-large-language-models-defenders-can-too

No comments:

Post a Comment

How To Launch A Better Penetration Test In 2025: 4 Golden Tips

  In my past 16+ years as a tech writer, one of the themes that I have written a lot about is Penetration Testing.   I have written man blog...