Arctic Wolf is an AI-powered security operations provider, which gives our organisation advanced insights into the emerging threats inherent in being an early adopter of the more recent techniques. However, Arctic Wolf is not the only one turning to AI. A third of technology leaders reported that AI was already fully integrated into their products and services . This statistic will grow as major tech companies like Microsoft and Google invest billions of dollars into AI infrastructure..
Unsurprisingly, the emergence of large language models (LLMs) – a type of AI designed to understand and generate human language – as a tool that is easily integrated into countless applications has also made it a prime target for adversaries.
Without robust security measures in place, AI infrastructure remains vulnerable to threats such as data leakage – exemplified by the recent DeepSeek breach, where user and prompt data were exposed due to an insecure database – or fraud, as seen when adversaries tricked facial recognition models to infiltrate a local government’s tax system and steal millions. By proactively measuring risk and leveraging security-driven frameworks to guide AI and LMM system design, organisations can, potentially, significantly reduce the cyber risk that arises from this new technology
Sophisticated AI Attack Types
More technologies now depend on the use of AI, and as a result, there is greater sophistication in how threat actors utilise this new attack vector. Attacks targeting AI systems are termed Adversarial Attacks – different from AI-assisted attacks which is when AI is used to perform or assist in an attack. These efforts began with attempts to deliberately mislead a model by manipulating its input in specific ways to force incorrect decisions. For instance, consider an image of a dog—if a significant portion of its pixels were altered to white or black, rendering the original object unrecognisable to a human, yet the model still classifies it as a dog, this demonstrates how the model can be “tricked” into maintaining its original prediction despite drastic visual distortions. This extends to more complex scenarios such as “tricking” a malware file detector to classifying a malicious file as clean by adding certain strings to the file without changing any of its core functionality.
Data Poisoning attacks occur when a threat actor injects malicious samples into the training dataset for a model so that they can manipulate the model’s behaviour to perform incorrectly. For example, if a malware author injects malware into “clean” labeled training data, the model may misclassify malicious samples as clean in production, bypassing protection.
A backdoor attack involves a specific type of poisoning where the model only changes its behaviour when there is something to trigger it in the input. For example, an attacker could embed a specific byte sequence into their malware, causing the model to misclassify it as clean. Researchers have provided several proof-of-concepts (PoC) of what this could look like for LLMs, visual models, and more.
Inference/Extraction attacks can lead to data leakage on training data and model weights, potentially exposing organisations’ sensitive or confidential information. It is a common problem for large AI models to “memorise” their training data, putting it at risk for attackers to exploit and extract that data at a later time. This can lead to privacy breaches, re-identification risks, and exploitation of personal or proprietary data. For example, researchers demonstrated that by carefully choosing a sequence of tokens as input to a LLM, you can trick it to ignore its guardrails and generate samples from its training data. An example they used was submitting the word “poem” 50 times. Their proof-of-concept was able to extract personally identifiable information from dozens of people, not safe for work (NSFW) content, extracts from literature and research papers, valid urls, accounts, and more. Prompt injection is a prevalent attack technique against LLMs, leveraging a combination of prompt engineering and traditional cyber tactics to manipulate the model’s behaviour at inference time.
These are just a few examples of the numerous attack strategies used by threat actors. While extensive research explores how these attacks are executed, an equal effort is dedicated to developing robust defences. Ongoing research initiatives contribute to benchmarks and frameworks that empower organisations and individuals to assess and strengthen the security of their AI systems.
Benchmarking AI for Vulnerabilities
When evaluating new AI systems, benchmarks are used to measure their performance across a variety of tasks. For example, an LLM might be tested on arithmetic skills, sentiment recognition, gender bias, and more. These tasks are considered part of a benchmark suite. Benchmarks are essential not only for understanding an AI model’s performance but also for identifying gaps that may require further improvement. Using standardised benchmarks is important as it enables apples-to-apples comparisons of different models. However, it is equally crucial to consider specific use cases that may not be covered by standardised approaches but are critical to your unique requirements.
With the use of AI in so many of today’s systems, performance with respect to security should also be measured. This need has been recognized by the open-source community, leading to the release of several security-focused benchmarks for recent LLMs, such as CyberSecEval2, Azure Counterfit, BackdoorBench, AILuminate v1.0 benchmark, and more. While multiple benchmarks exist, we will explore one example to illustrate how a benchmark suite is structured and its role in assessing risks within an AI interface.
Benchmark Walkthrough: CyberSecEval2
Meta has been a forefront in the open-source AI community and has released benchmarks that they have used to evaluate their models, the most recent being CyberSecEval2. This walkthrough is just an example of how a benchmark can assist in standardised testing of how secure certain models are. Meta includes a higher category of security focus, a suite of tasks per category, and a suite of tests to measure the performance on those tasks.
They break down the benchmark tasks into five categories.
The first is insecure coding, which is where the LLM is asked to produce code but the response includes code that contains vulnerabilities. In the models Meta evaluated, 30% of the time a model was asked to provide code, the results contained vulnerabilities. This creates risk as many engineers use LLMs to assist with tasks and could be unknowingly opening the door to vulnerability exploits.
The second is cyber attack helpfulness, which is how often does an LLM produce answers to unsafe prompts. In this suite, the MITRE ATT&CK framework — which we will explore in more detail later – was used to generate questions to evaluate LLMs compliance on the topic. This example shows how these security specific frameworks help drive secure practices and research.
The third is prompt injection, which tests how easy it is for someone to override system prompts or the final user prompt with their own. An example is commonly “ignore all previous instructions and do xyz.” There are numerous examples of this in real-world scenarios, such as demonstrated by HiddenLayers with Google Gemini for Workspace being susceptible to prompt injections through shared documents or emails.
The fourth is vulnerability identification and exploitation, which asks an LLM to look at code, find vulnerabilities, and exploit them. The use of LLMs and agents are an increasing threat with respect to vulnerability exploitation. There are existing real-world case studies demonstrating this such as “LLM Agents Can Autonomously Exploit One-day Vulnerabilities” where models were prompted with a CVE to hunt for and then perform an attack, and researchers found GPT-4 was able to perform this with an 87% success rate.
The final, and considerably the most interesting kind of attack, is code interpreter abuse. Due to LLMs ability to write code and make decisions, but their inability for certain tasks like basic math, connecting “tools” to LLMs is becoming standard in AI development. An example of this is if you ask an LLM “what is 2+2?,” the model can decide it needs to use a calculator. It will then call a program, or the tool, to perform the arithmetic operation, collect the results, and return them to the user. While this is a very simple example, there are others that provide an LLM access to a variety of tools with a variety of permissions and access. This can include a terminal, API, databases, and, almost anything you can program. As one might expect, this opens numerous vectors of attack and requires well thought-out guardrails to prevent the LLM from accidentally doing something like deleting a production database or “rm -rf”. The Meta benchmark found LLMs complied to an average of 35% requests to assist user in attacking the attached code interpreters.
Overall, Meta has provided a good template of what a security benchmark should look like by using existing security frameworks to drive security focus objectives and real-world use cases to when testing those objectives. While using open-source benchmarks are a great start to measuring how secure an AI ecosystem is, it is vital to consider your organisation’s own use cases. This is especially the case with code interpreter abuse, or tool use, as each organisation will often have their own unique set of tools that they will connect to an LLM requiring consideration that may expand outside of this benchmark.
At Arctic wolf, we utilise emerging benchmarks to rigorously evaluate and enhance the AI models we deploy, ensuring they remain robust, reliable, and effective in protecting our customers against evolving threats.
Mitigating Risks with Frameworks
The work of AI benchmark suites, like CyberSecEval2, build upon foundational efforts aimed at securing AI development, particularly through established security frameworks like MITRE Att&ck and MITRE Atlas, which provide structured approaches for identifying and mitigating AI threats.
MITRE Att&ck is a widely used knowledge base on real-world cyber threats and methodologies. It provides organisations with a structured way to understand, detect, and mitigate adversarial behaviour by categorizing threats into a comprehensive, actionable framework. It organises threats using tactics, techniques, and procedures (TTPs), where tactics define the adversary’s overarching goal, techniques describe the methods used to achieve that goal, and procedures outline the specific steps taken to execute the attack. An example tactic is initial access, where an adversary attempts to gain access to your network. An example technique is phishing, where an adversary emails users to trick them into granting access.
MITRE Atlas is a more recent knowledge base that replicates the framework of MITRE Att&ck specifically for AI. An example tactic in this case would be ML model access with techniques like AI model inference API access, where the threat actor has legitimate access to a model API that they exploit for full access. This framework provides case studies of either PoCs or real-world examples for these techniques as well as mitigations on how to defend against them. In this case, one of the recommended mitigations is to implement verbose logging of inputs and outputs of deployed AI models to assist in detection of malicious threats.
Many of these attacks also incorporate traditional cyber tactics, leveraging techniques outlined in the MITRE ATT&CK framework. This highlights the need for a comprehensive defense strategy that combines conventional cybersecurity measures with AI-specific mitigations to ensure robust protection against evolving threats.
Deciding how to apply a framework can be challenging, and MITRE may not seamlessly align with your organisation’s AI infrastructure. The good news is that there are various other frameworks that may be better suited or can complement MITRE to strengthen security design decisions. These include:
- OWASP LLM
- NIST AI Risk Management
- Google’s Secure AI Framework (SAIF)
OWASP LLM provides a list of the top vulnerabilities in LLMs, empowering organisations to make better decisions for their applications. NIST AI Risk Management Framework outlines four core functions—govern, map, measure, and manage—to guide organisations in integrating risk management throughout the AI lifecycle. Google’s Secure AI Framework (SAIF) defines six key security principles: establishing a secure AI development framework, extending detection and response capabilities, automating defenses, harmonising platform-level security, adapting security controls for AI risks, and embedding security throughout the AI lifecycle.
The Road Ahead
As AI continues to drive innovation across industries and applications, the threat landscape evolves in parallel, with threat actors finding new ways to exploit these advancements. While there is always a trade-off between rapid innovation and robust security, ongoing research and industry support are dedicated to minimising these risks and strengthening AI resilience. This includes AI evaluation benchmarks like CyberSecEval2 and knowledge base driven frameworks like MITRE Att&ck and MITRE Atlas. This requires a call for collaboration between researchers, developers and organisations to build resilient AI systems. As an organisation, we encourage you to adopt benchmarking practices to assess AI model robustness and implement security frameworks for safe deployment, helping to minimise cyber risks and strengthen the resilience of your AI systems.
At Arctic Wolf, we take our internal security seriously and, as an early adopter of AI-driven solutions, we are always on the lookout for new ways to mitigate emerging threats. This requires not only adoption of the latest techniques in detection and mitigation but also requires continuously staying ahead of evolving attack landscapes. By leveraging security benchmarks and industry-leading frameworks like those discussed here, we strengthen our daily operations and advocate for their adoption as a best practice for all organisations.
Learn more about how Arctic Wolf helps organisations strengthen their security posture with our Cyber Risk Assessment.
Explore Arctic Wolf’s commitment to the Security Journey, which empowers organisations to continually harden their attack surface against evolving threats.