Written by Artem Kobrin, AWS Ambassador and Head of Cloud at Neurons Lab
In this article, we will explore advanced attack techniques against large language models (LLMs) in detail, then explain how to mitigate these risks using external guardrails and safety procedures.
This is Part 2 in our guide on understanding and preventing LLM attacks. If you haven’t seen it yet, take a look at Part 1 of this article – there we outline the main attack types on LLMs and share some key overarching principles for defending AI models against such threats.
Safety measures have a vital role in protecting AI models against threats – just as any type of computer software, system, or network requires cybersecurity.
Prompt injection, hijacking, and leaking
Following on from a brief description of prompt injection in Part 1, here is a deep dive into what it involves. Prompt injection is a type of vulnerability that targets LLMs.
This technique involves creating specially designed inputs to manipulate the AI model’s behavior, aiming to bypass its security controls and cause it to produce unintended outputs.
Prompt injections try to exploit the way LLMs process and generate text based on inputs. By inserting carefully worded text, without enough safeguards in place, an attacker could trick the model into:
- Producing unauthorized content
- Ignoring previous instructions or security measures
- Accessing restricted data
The attack takes advantage of the inherent trust these models place in their inputs. The diagram below is a visualization from an AWS re:Inforce presentation covering Cloud security, demonstrating the principles behind a prompt injection:
Building on this, there are several other types of prompt-based attacks including:
- Prompt hijacking: This technique aims to override the original prompts with new instructions controlled by the attacker. The goal is to make the LLM produce unintended or malicious outputs. Unlike jailbreaking, which targets the model’s safety filters directly, prompt hijacking focuses on manipulating the application’s intended behavior.
- Prompt leaking: This is an attack that attempts to reveal the hidden instructions or system prompts that developers use to guide LLM behavior. The goal is to explore the developer’s prompts to find potential weaknesses. Attackers are aiming to obtain sensitive information or understand the underlying logic of an AI system.
Jailbreaking
Jailbreaking methods often involve complex prompts or sneaking in hidden instructions to try confusing an LLM, exploiting weaknesses in how it understands and processes language.
By using clever phrasing or unconventional requests, users attempt to lead the model to provide outputs that developers programmed it to avoid.
Let’s look at a few examples, again with the help of visualizations from AWS re:Inforce. This is a simple example showing how attackers could try using affirmative instructions:
The next example involves an adversarial suffix:
One of the most well-known prompts is ‘do anything now’ or DAN, as visualized in a recent research paper by Xinyue Shen et al. The full DAN prompt is omitted in the diagram below:
Research from Anthropic Claude has also explored many-shot jailbreaking, involving faux dialogue. In this article, Anthropic also explains steps the company has taken to prevent such attacks:
It’s important to note that jailbreaking attempts are constantly evolving, and LLM developers are working continuously to keep their models protected against such manipulations.
Guardrails to prevent LLM attacks
Guardrails introduce safeguards to block such attacks and implement responsible AI principles. They provide safety controls and filter out content that is harmful or against policies, customized to bespoke applications.
Here are some illustrative examples from an AWS Machine Learning blog article by Harel Gal et al, showing how such guardrails work after the initial user input.
Example 1:
Example 2:
Among the best safeguards in the industry feature in Amazon Bedrock Guardrails. Building on native protections already in foundation models, these guardrails are capable of blocking the vast majority of more harmful content.
Amazon Bedrock Guardrails can also filter at least 75% of AI hallucinated responses for retrieval-augmented generation (RAG) workloads. AI hallucinations are instances where LLMs generate and present false, misleading, or inaccurate information as if it were factual, producing outputs that are not based on their training data.
Other benefits of using Amazon Bedrock Guardrails include abilities to:
- Redact sensitive or personally identifiable information (PII) to protect privacy
- Block offensive content with a custom word filter
- Build a consistent level of AI safety across applications
There are other guardrail options available to use in addition to those included in Amazon Bedrock, further increasing the level of protection against attacks:
They include, but are not limited to:
- Amazon Comprehend: This application uses machine learning (ML) to produce insights from text and classify user intent. For example, its trust and safety feature can detect toxic content from adversarial prompts and redact PII that users may inadvertently provide.
- NVIDIA NeMo: This open-source toolkit provides several key rails for LLM-powered conversational AI models. These include rails for fact-checking, keeping conversations on topic, moderating content, preventing hallucinations, and restricting jailbreaking attacks.
- LLM Guard with Regex Scanner: This scanner cleans prompts using predefined keywords, patterns, and regular expressions to identify and undesired content.
Ensuring security in AI projects – a case study
Neurons Lab ensures privacy and security for clients in highly regulated industries such as banking which require the utmost diligence.
Clients retain full ownership of proprietary cloud environments or can choose on-premises deployment, providing complete control within your secure infrastructure.
For more details, find out how we enhanced compliance and data management in financial marketing with an LLM constructor for Visa:
Our locally deployed LLMs operate entirely within clients’ ecosystems, eliminating the risk of data breaches or unauthorized access. All data stays within the client’s controlled environment, guaranteeing compliance with data protection regulations and internal security policies.
With models running locally, there is no need for external data transfers, ensuring that proprietary information is never exposed to third-party systems. Bespoke security protocols tailored to specific requirements increase the protection of sensitive information.
Recommendations for preventing and mitigating LLM attacks
Some of the main attack types against LLMs include model extraction and evasion, prompt injection, data poisoning, inference, and exploiting supply chain vulnerabilities.
Protecting your AI model against such attacks requires implementing responsible principles and safety mechanisms, such as external guardrails.
Here are several recommendations for preventing and mitigating attacks against LLMs:
- Follow the safety requirements in the OWASP Application Security Verification Standard
- Create a secure MLOps strategy
- Implement data cleaning methods early in the training model
- Verify security procedures of third parties in your supply chain
- Fine-tune the model to improve accuracy
- Set limits on the user input length and format – many prompt attacks are long
- Ringfence your application – only provide the LLM with the access it needs
- Test your model thoroughly before launch – use a team to try ‘breaking’ it and fix any faults you find
- Use data filtering to remove harmful or sensitive content – from both inputs and outputs
- Identify and block any malicious users – they will often try and fail with their initial attacks, so monitor usage patterns and anomalies and take action quickly
For more recommendations, read our tips for effective risk management in AI delivery.
About Neurons Lab
Neurons Lab is an AI consultancy that provides end-to-end services – from identifying high-impact AI applications to integrating and scaling the technology. We empower companies to capitalize on AI’s capabilities.
As an AWS Advanced Tier Services Partner and Generative AI Competency holder, we have successfully delivered tailored solutions to over 100 clients, including Fortune 500 companies and governmental organizations.
Our global team comprises data scientists, subject matter experts, and cloud specialists supported by an extensive talent pool of 500 experts. Get in touch with the team here.
References
Characterizing and Evaluating In-The-Wild Jailbreak Prompts on LLMs; Xinyue Shen et al, October 2024
Protect your generative AI applications against jailbreaks; AWS, June 2024
Build safe and responsible generative AI applications with guardrails; AWS, June 2024
Many-shot jailbreaking; Anthropic, April 2024